DOI : https://doi.org/10.5281/zenodo.18901261
- Open Access
- Authors : Sai Puneeth P, Varshini T, Rishita S, Bharathi K, Ravi S, Venkataramana B. V.
- Paper ID : IJERTV15IS020339
- Volume & Issue : Volume 15, Issue 02, February 2026
- Published (First Online): 07-03-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
A Hybrid DistilBERT-GRU Based Regression System for Automated Student Answer Scoring
Sai Puneeth P, Varshini T, Rishita S, Bharathi K, Ravi S, Venkataramana B. V.
Holy Mary Institute of Technology and Science, Hyderabad, Telangana, India
Abstract – Grading student responses manually is a time-consuming process, particularly in resource-constrained schools. Current automated solutions employ keyword matching or classification, which does not account for semantic meaning, coherence, and partial correctness. The regression approach proposed here overcomes the need for strict categories and more accurately represents the gradual differences in answer quality. The problem of scoring student responses is framed as a regression problem, predicting scores ranging from 0 to 100. The proposed hybrid model integrates a pretrained DistilBERT encoder for semantic encoding with a one-way GRU to represent the coherence of the sequence and produce score predictions. Experiments were carried out with both frozen and fine-tuned encoder configurations to assess the performance under computational constraints. The model performed well on short and long-answer datasets across various subjects, and it made consistent predictions. Freezing the encoder decreased training time but impacted generalization performance. A major drawback is that the model does not handle implicit reasoning and rubric-based scoring criteria well.
Keywords – Automated Answer Scoring, Regression-Based Evaluation, Semantic Answer Evaluation, Hybrid Neural Architecture, DistilBERT, Gated Recurrent Unit, rural student evaluation
-
INTRODUCTION
Evaluating student answers is important but time-consuming, especially in low-resource environments [6], [13]. Current automated systems rely on keyword matching, rule-based methods, or classification, which cannot capture the semantic meaning, coherence, or partial correctness of responses, especially when answers are phrased in diverse ways [6], [7]. A sound evaluation must account for semantic meaning and partial correctness without depending on surface keywords. A learning-based method built on text representations can provide such a solution, but it must also remain computationally efficient in low-resource settings [13], [16]. This paper presents a learning-based automated answer scoring system for such environments. The method takes the question, any context, and the student answer as input and predicts a continuous score between 0 and 100; it is evaluated across varying answer lengths and topics, and it performs inference on new answers independently, without access to the training data. The novelty of the paper lies in combining semantic representations and sequence models in a regression-based answer scoring system. The rest of the paper covers related work, methodology, experiments, and results.
-
LITERATURE REVIEW
-
Traditional Automated Short Answer Grading Approaches
Early automated grading methods compared student answers with reference answers by searching for keywords, matching word overlap, or applying predefined templates. These methods worked well when answers matched what was expected, but they operated only on surface-level word forms: a student who expressed the same idea in different words would not receive credit, and partially correct answers that still demonstrated understanding could not be graded fairly. These limitations motivated a shift toward learning-based methods that can capture meaning and the relationships between concepts rather than surface overlap [6], [7], [8].
-
Learning-Based and Neural Approaches to Answer Scoring
Learning-based models assess answer quality directly from the text, which makes them more robust to varied phrasing. Alikaniotis et al. found that representations learned from text outperformed hand-crafted features. However, early neural approaches required precise right-or-wrong labels and could not produce fine-grained scores. Farag et al. introduced deep sequence models that capture how well essays flow and cohere without manually designed features, but their focus was on essay scoring rather than on efficient, practical deployment [9].
Transformer architectures advanced automated scoring through contextual embeddings produced by self-attention, as in BERT and its variants. Embedding-based approaches such as Sentence-BERT, the Universal Sentence Encoder, and ALBERT try to balance representational quality against computational cost.
Despite these improvements, open problems remain. It is still hard to predict scores that reflect partial correctness rather than a right-or-wrong judgment, and these models often require substantial computing power, which limits deployment in some settings. Transformer architectures and models like BERT continue to be refined to address these problems.
-
Limitations and Open Challenges in Automated Answer Scoring
Many automated scoring systems still treat evaluation as a binary task, sorting answers into fixed categories. As a result, they struggle to recognize small differences in quality and cases where an answer is only partly correct. Concise but correct answers may not receive the credit they deserve, while keyword-heavy explanations of little substance may be scored too high.
These challenges are compounded by limited labeled educational data and the high computational require- ments of modern pretrained models, restricting deploy- ment in low-resource environments [13], [14]. Therefore, there is a clear need for automated scoring approaches that support continuous evaluation, handle partial cor- rectness, and remain feasible under data and computa- tional constraints [16].
-
-
PROBLEM FORMULATION AND METHODOLOGY
This section describes how student answers are scored through a continuous prediction approach. The scoring range is defined first, followed by how inputs enter the system, the structure of the framework, and the objective it minimizes during learning. Design choices are guided by practicality rather than novelty, with each component building toward consistent outcomes.
-
Problem Formulation
Some grading systems sort answers into discrete categories, which can obscure differences in answer quality. The regression formulation instead treats the score as a continuous value, which better reflects partially correct answers and gradual improvements in writing over time. The assessment input is simple: the question, the student's answer, and any supplementary context, concatenated into a single piece of text, so that scoring can consider how well the answer addresses the question. The goal is to predict a score between 0 and 100, where higher values indicate better accuracy and completeness. Rather than assigning rigid labels, the system learns a function f(x) = y that predicts a continuous score for each answer independently, without comparing it to others or referencing past data during evaluation.
-
Input Representation
The system operates on text only. It concatenates three components into one input: the question, any background context that is given, and the student's response, in that order, so the model reads the full exchange and can judge the answer in context.
The model uses no labels, difficulty tags, or scoring rubrics; it relies on the language itself, including the context and how the answer is expressed. When the combined input is too long, the background context is truncated first and the question second, while the student response is always kept intact, since it is the part being scored.
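The truncation policy above can be sketched as follows. This is a minimal illustration on token lists; the function name and the 512-token budget are assumptions, not details from the paper.

```python
def build_input(question_toks, context_toks, answer_toks, max_len=512):
    """Concatenate question, context, and answer tokens in that order,
    truncating context first and the question second, while always
    keeping the student answer intact (illustrative sketch)."""
    budget = max_len - len(answer_toks)          # the answer is never cut
    if budget < 0:
        raise ValueError("student answer alone exceeds the model limit")
    keep_q = min(len(question_toks), budget)             # trim question only if needed
    keep_c = min(len(context_toks), budget - keep_q)     # context absorbs the cut first
    return question_toks[:keep_q] + context_toks[:keep_c] + answer_toks
```

With a tight budget the context disappears first and the question shrinks next, but the response always survives whole.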
Table 1: Input components used by the proposed model

Input Component    Description                          Provided to Model
Question           Problem statement to be answered     Yes
Context            Optional supplementary information   Yes (if available)
Student Answer     Response to be evaluated             Yes
Subject Label      Curriculum or subject metadata       No
Topic / Chapter    Human-defined taxonomy               No
Difficulty Level   Manual annotation                    No
Table 1 summarizes which elements are included and excluded, ensuring a consistent and content-focused in- put setup. This structured text representation provides
a stable foundation for the modeling steps that follow.
-
Model Architecture
The automatic answer grading setup is built on a hybrid design: a pretrained language-understanding module, a lighter sequence-processing layer, and a prediction head. Rather than relying on heavy computation, it extracts meaning, coherence, and overall response quality while keeping processing demands modest, balancing depth of insight with practical speed.
The input, comprising the question, any background details, and what the student wrote, is first tokenized and embedded using representations the model acquired in pretraining. These contextual encodings then pass through layers tuned to follow how ideas connect across the response, and finally each full attempt is reduced to one score.
-
Architecture Overview
The architecture has three main stages. First, an embedding layer turns the input text into contextual word representations: each token's encoding is shaped by the surrounding words rather than being treated in isolation, so meaning stays tied to how tokens appear together. Second, a sequence layer tracks how the response builds from one part to the next, attending to order and checking that ideas follow a coherent path rather than jumping around. Third, a prediction layer pools this sequence information and returns a single number. Figure 1 shows how data moves through the system.
-
Concretely, the setup combines a trimmed-down version of BERT (DistilBERT) with a GRU network. The distilled model replaces full BERT to save time and memory; the GRU processes the encoded answer step by step; and a regression head turns the resulting patterns into point values. Each part connects to the next in one pipeline designed for speed without losing accuracy.
-
-
Semantic Encoding with DistilBERT
DistilBERT, a compact version of BERT, begins the pipeline by turning words into contextual vector representations that carry meaning. This leaner model trades some capacity for speed, keeping performance sharp while using far fewer resources than its larger counterparts.

Figure 1: Architecture Diagram

Starting with the full input string, the encoder builds a representation for every word piece. Because context shapes understanding, each encoding carries clues about meaning and how words relate. Rather than making final task decisions, this stage extracts deep features that reflect what the text actually means.
Because the training data is small and computing power is limited, the main experiments keep the DistilBERT encoder frozen. This keeps its core pretrained understanding stable, speeds up training, and helps prevent memorizing noise, while the model still draws on the deep semantic knowledge it learned before.
-
GRU-Based Sequence Modeling
While transformers contextualize each word, pooling strategies such as averaging or using a single special token flatten details evenly across positions, losing track of where things happen in the sequence. Recurrent models instead build up understanding one step at a time, preserving shifts in focus and structure as they move forward. When judging written explanations, how thoughts unfold matters as much as what those thoughts contain.
Transformer encoders capture deep meaning at each token, but they stop short of pulling everything together into one overall score. The model therefore feeds the stream of encoded features into a forward-moving GRU layer, which ties the scattered signals into a unified summary ready for numerical prediction. Processing token embeddings one step at a time while tracking how elements relate throughout the sequence, the GRU makes patterns such as repeated points, sudden stops, shifts in reasoning, and the grouping of ideas visible; structure emerges not from order alone but from these linked movements within the student's reply.
It is worth noting that the recurrent component is not modeling time-based patterns. Rather than handling sequences across moments, it combines token embeddings in a position-aware way, picking up the flow of ideas inside a single answer. The final hidden state of the GRU serves as a compact summary of the full input, capturing meaning and form together, which suits judging whole answers at once.
-
Regression Head
A linear regression head at the end of the architecture maps the GRU's final sequence embedding to a single scalar: the predicted numerical rating of the student response. The output is left unbounded during training so the system can adjust its estimates freely; results are rescaled into the expected scoring window afterwards. Leaving the output unconstrained early in training follows standard practice for open-ended regression problems.
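The GRU-plus-head portion of the architecture can be sketched in PyTorch as below. To keep the sketch self-contained, token embeddings are passed in directly; in the paper they would come from the (frozen) DistilBERT encoder, whose hidden size of 768 is assumed here, and the class name and hidden dimension are illustrative choices.

```python
import torch
import torch.nn as nn

class GRUScoringHead(nn.Module):
    """Sequence model plus regression head of the hybrid architecture.
    Token embeddings are assumed to come from a frozen DistilBERT
    encoder (hidden size 768); dimensions are illustrative."""
    def __init__(self, embed_dim=768, hidden_dim=128):
        super().__init__()
        # One-way (unidirectional) GRU summarises the token sequence.
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Linear head: unbounded scalar, rescaled to 0-100 only after training.
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_embeddings):        # (batch, seq_len, embed_dim)
        _, h_n = self.gru(token_embeddings)     # final hidden state: (1, batch, hidden)
        return self.head(h_n[-1]).squeeze(-1)   # one raw score per answer: (batch,)

# Example: two answers of 50 tokens each, with random stand-in embeddings.
scorer = GRUScoringHead()
scores = scorer(torch.randn(2, 50, 768))
```

The final hidden state `h_n[-1]` is exactly the "compact summary of all input steps" described above; only that vector reaches the regression head.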
-
Summary
In summary, the architecture ties a pretrained text encoder to a lightweight recurrent unit inside one prediction model, handling word meaning and sentence flow without slowing down. This design underpins the training and evaluation procedures described next.
-
-
Training Goals and Improvements

The scoring system learns by comparing its predictions to grades assigned by people. Human graders do not always agree, so the system should not be thrown off when grades differ slightly; the goal is not exact agreement but closeness to the human consensus.
Training uses the Huber loss, which penalizes small errors quadratically but large errors only linearly. The system therefore learns from its mistakes without being dominated by occasional large ones, and it keeps improving even when the human grades are somewhat inconsistent.
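The behavior of the Huber loss can be shown in a few lines. This is a minimal sketch on scalar values; the threshold `delta` is an illustrative choice, not a value from the paper.

```python
def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones,
    so occasional large grader disagreements do not dominate training.
    The threshold `delta` is illustrative."""
    e = abs(pred - target)
    if e <= delta:
        return 0.5 * e * e               # gentle squared penalty near the target
    return delta * (e - 0.5 * delta)     # linear penalty for outliers
```

A small error of 0.1 yields 0.5 * 0.01 = 0.005, while with delta=0.5 an error of 0.8 yields only 0.5 * (0.8 - 0.25) = 0.275 rather than the 0.32 a pure squared loss would give.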
Training runs under AdamW, which balances learning speed and weight updates while limiting changes to only the necessary layers. Early stopping halts training once performance on held-out data stops improving, avoiding overfitting and wasted cycles. Predictions are trained within a fixed normalized range for stability, then rescaled to real-world scores after training, ensuring outputs match familiar grading standards without losing accuracy.
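The normalize-then-rescale step can be sketched as follows, assuming the 0-100 grade scale described earlier; the function names and the clamping of out-of-range predictions are illustrative assumptions.

```python
def normalize(score, lo=0.0, hi=100.0):
    """Map a raw grade on [lo, hi] into [0, 1] for training stability."""
    return (score - lo) / (hi - lo)

def rescale(pred, lo=0.0, hi=100.0):
    """Map a model output back to the grading scale, clamping because
    the regression head itself is unbounded."""
    return min(hi, max(lo, lo + pred * (hi - lo)))
```

Training targets pass through `normalize`; at prediction time `rescale` returns a familiar 0-100 grade even when the raw head output drifts slightly outside [0, 1].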
-
How Limits Shape Choices
The system targets schools where computing hardware and labeled data are scarce. To keep training fast and hardware requirements low, the language model is frozen during training: extensive fine-tuning might perform slightly better, but it would require far more powerful machines. Freezing lets the system train quickly while retaining the pretrained knowledge that makes it effective in such settings. Student responses vary in length, so inputs are truncated to retain key ideas and context within model limits. Batch size and optimization settings are selected to balance memory use and training stability. Overall, the approach prioritizes efficiency and robustness, enabling effective learning even with limited data and modest computing resources.
-
-
EXPERIMENTAL SETUP AND RESULTS
-
Experimental Setup
The experiments test whether the model can grade student answers reliably when little data is available. The goal is not to beat existing systems but to see whether the model finds consistent connections between what students write and the grades they receive in realistic conditions. Each experiment uses human-assigned grades as supervision, with the question, any context, and the student answer combined into one input. One data split is used for training; a separate split tracks progress and decides when the model has learned enough; and performance on unseen examples shows whether the model generalizes beyond memorization. Because the datasets are small, the emphasis is on learning behavior, stability, and practicality rather than absolute scores. All experiments run on standard consumer hardware with identical preprocessing and training settings, ensuring results are comparable and reflect real-world classroom conditions rather than idealized setups.
-
Datasets and Their Features

Two datasets are used: one with short answers and one with longer responses from different subjects. Together they test how the scoring system handles different writing styles without requiring style-specific labeled data.
The main dataset contains 777 student answers from mathematics, science, English, and computer studies, each graded by humans on a point scale. The answers vary widely in length, structure, and quality of reasoning, which helps the model learn to predict scores across subjects.
A smaller dataset of brief answers supports testing, offering faster evaluation and showing the system re- mains reliable across simpler inputs. Both datasets re- flect the limited, uneven data typical of real classrooms. All inputs follow the same preprocessing and trunca- tion strategy, without dataset-specific rules or scoring hints. After training, model parameters are fixed and each new answer is graded independently, relying only on learned patterns rather than stored examples.
-
Training Setup and Execution Details

This section describes the training setup shared by all experiments. The model learns by matching its predictions to human-assigned scores, and both datasets use identical preprocessing, input construction, and evaluation procedures to keep the comparison fair.
Training uses the AdamW optimizer with a fixed learning rate, keeping updates stable and reproducible. The batch size is chosen to fit within memory and keep gradients stable on standard GPUs. Scores are normalized during training for stability and rescaled back to the original range at prediction time.
Early stopping prevents overfitting by halting train- ing when performance on unseen data stops improv- ing. Two strategies are compared: freezing the encoder to reduce computation and full fine-tuning for deeper task adaptation. All experiments run under the same software and controlled conditions to ensure that per- formance differences reflect design choices rather than setup variations.
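The early-stopping criterion can be sketched as a small patience counter. This is a generic illustration; the class name and the patience value of three checks are assumptions rather than details from the paper.

```python
class EarlyStopping:
    """Halt training once validation loss has not improved for
    `patience` consecutive evaluations (patience value illustrative)."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_rounds = 0

    def step(self, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss       # new best: reset the counter
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1       # no improvement this round
        return self.bad_rounds >= self.patience
```

Calling `step` after each validation pass halts training as soon as the loss stalls, which is the behavior the loss curves in Figure 2 reflect.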
-
How Training Progresses and Reaches Stability

This part examines whether the regression scoring method settles into a stable pattern during training. Rather than zeroing in on exact numbers, it watches for smooth progress: whether jumps or drops appear too often, and whether the model adapts without wild swings despite limited data and computing power. Stability is judged over many training cycles, where small consistent shifts matter more than peak scores.

Figure 2: Training and validation loss curves illustrating convergence behavior during model optimization.

-

Training and Validation Loss Patterns

The loss patterns in Figure 2 show how the model stabilizes over time. The curves track progress on both training and validation data, falling steadily before levelling off; a slow drop in loss marks each round of learning, and tuning is complete once improvements stop on both sets.
From start to finish, the shapes of the two curves stay aligned: stability appears early, and wild swings or sharp jumps in validation loss never occur. Even though limited data could cause problems, performance holds steady, and the closeness of the two curves suggests the model avoids memorizing too much. When resources are tight, pushing too hard or using a bulky model often ends in fast overfitting and unstable progress; the smooth loss curves here point to a balanced design in which the frozen semantic layers, lightweight sequence handling, and a robust loss function keep training on track. Early stopping also matters: as seen in Figure 2, training halts when gains on validation data stall across several rounds, preventing weights from drifting once learning offers no real benefit.

-

Evaluation Metric: Mean Absolute Error

Prediction quality is measured with the Mean Absolute Error (MAE): the typical size of the gap between the system's predicted scores and the human scores in the data. The calculation sums the absolute differences and divides by the number of answers:

    MAE = (1/n) * sum_{i=1}^{n} |y_i - yhat_i|

MAE fits this task well because it stays interpretable: errors accumulate linearly rather than quadratically, so results remain grounded when measuring student work, where small mistakes matter less than overall fairness and predictable behavior beats dramatic reactions to odd cases. The error is computed on the actual score values, so results match the grading scale exactly. On average, predictions drift about 0.23 points from the real marks, meaning most answers land within a small margin of the expected grade despite messy open-ended replies. Although synthetic examples were used, the fit between the system's outputs and human-style ratings stays tight. The MAE gives a clear number to work with when deciding where each student fits, whether low, medium, or high; what those numbers mean in real classroom terms is examined in the analysis that follows.
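The metric itself is a one-liner. A minimal sketch, assuming plain Python lists of predicted and human scores on the same grade scale:

```python
def mean_absolute_error(predictions, targets):
    """MAE = (1/n) * sum |y_i - yhat_i|: the average absolute gap between
    predicted and human-assigned scores, on the original grade scale."""
    pairs = list(zip(predictions, targets))
    return sum(abs(p - t) for p, t in pairs) / len(pairs)
```

For example, predictions of 80, 60, and 95 against human marks of 82, 59, and 95 give an MAE of (2 + 1 + 0) / 3 = 1.0 points.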
-
-
Ablation Study and Design Sensitivity
Figure 3: Ablation study loss curves comparing the full hybrid architecture with a reduced variant
The loss curves in Figure 3 contrast the complete hybrid setup with a reduced variant in which sequence-level modeling is removed; everything else is held constant, and their training loss trends appear side by side over time.
The stripped-down version learns faster at first, but it fluctuates more later and scores worse at test time. The complete hybrid settles into a steadier path, edging toward better performance over time. Predicting scores straight from context vectors captures shallow patterns well enough, but weaving in how words follow one another adds backbone, making judgments less erratic across runs; the gap is clearest on longer answers, where structure and flow matter most. The ablation thus shows that keeping the sequence model helps turn local signals into a sensible judgment of the full response while remaining light on resources.
It is worth remembering that the point of an ablation study is not to prove one model beats another, but to show how design changes affect performance. The results here tie closely to architectural details: when data is sparse, handling sequences well matters more, and structural tweaks shift outcomes noticeably. This kind of test reveals which components actually pull their weight and how stable predictions remain when information runs low.
-
Convergence Patterns Noted
Across every run, from initial tests to stripped-down variants, the system settled into similar convergence rhythms and held steady during validation checks. The loss curves show that learning stays on track and rarely slips into memorization, despite tight computing limits and small data pools. The sections that follow move beyond overall error patterns to examine how the model actually behaves in realistic scoring situations.
-
-
Ablation and Design Exploration
The ablation study examines how each part of the design affects performance when data and computing power are limited. The hybrid model is compared against a simpler variant without the sequence processor, which predicts scores directly from the language model outputs; everything else is identical in both configurations.
The simpler model learns quickly at first but makes more mistakes at test time. Because it is poor at modeling how ideas connect, it can reward answers that make sense on the surface but not on closer reading. The hybrid setup, by processing the answer in order, learns more steadily and grades more accurately, especially for detailed answers. These findings highlight the value of combining contextual encoding with lightweight sequence modeling to improve regression-based scoring while maintaining computational efficiency.
-
Single-Sample Analysis and Qualitative Review
This section examines how the system behaves in actual use rather than during training or design. In a classroom, each answer must be evaluated on its own, without reference to other submissions or to previously graded work. After training, the system operates purely in inference mode: it takes one question, optional supporting context, and one student answer, and returns a score between 0 and 100.
Each incoming response passes through the same pipeline used during training: the semantic encoder produces contextual representations of the text, the sequence model captures how the answer is structured, and the regression head outputs a value that is rescaled to the original grade range. Because all parameters are fixed after training, predictions are deterministic and unaffected by new inputs.
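The final rescaling step can be sketched as a small helper. The paper does not specify the exact mapping, so the normalized-output-to-[0, 100] formula and the clamping below are assumptions for illustration.

```python
def rescale_score(y_norm, lo=0.0, hi=100.0):
    """Map a normalized regression output back to the grade range.

    Clamping keeps occasional out-of-range predictions inside the
    valid [lo, hi] interval, so downstream grade handling stays safe.
    """
    y = lo + (hi - lo) * y_norm
    return max(lo, min(hi, round(y, 1)))
```

Because this mapping is a fixed function of the model output, it preserves the determinism noted above: the same answer always yields the same grade.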
Table 2: Illustrative example of single-sample inference

Question:        (Example question text)
Student Answer:  (Example student response)
Predicted Score: (Predicted score)

These observations suggest that adding a small recurrent component is a sound choice: it is lightweight yet expressive enough, and the design holds up under scrutiny. The findings show how the method behaves when data and computing power are limited and highlight its consistent behavior, but they do not claim that one system outperforms another; they record what held up under tight constraints rather than delivering a final verdict on scoring tools. The stability of the predicted scores supports the regression formulation, since differences in answer quality emerge clearly without the fixed categories that classification systems require. Once training finishes, the model handles student replies one at a time, producing sensible results without access to past data or comparisons across students, which matches how real grading usually happens. The method is therefore not tied to batched submissions and functions well in everyday scoring tasks.
One thing becomes clear from the data: the proposed regression-based hybrid design works well enough for grading written responses when real-world limits exist.
The example in Table 2 illustrates a single post-training prediction, demonstrating how the system operates on one answer at a time. It is intended to show the workflow, not to claim perfect accuracy; scores may differ from human judgment where rubrics are vague or grading is subjective. Overall, the system reliably scores individual responses without relying on batches or stored data, aligning with real-world grading conditions.
DISCUSSION AND FUTURE RESEARCH DIRECTIONS
Discussion of Results
Error rates decline smoothly across both training and validation, indicating steady learning without sudden jumps. Notably, the system holds up even with limited data, relying on distilled semantic representations combined with sequential tracking of the answer. Rather than collapsing or fluctuating wildly, performance improves gradually with each epoch. Behind this lies the combination of rich pretrained understanding feeding a compact sequence model, guided by a penalty signal designed to resist noise. The evidence from the experiments in Section 4 shows that scores form reliably under tight computing limits; stability comes not from brute force but from balancing a light design with targeted adaptation. The convergence behavior also shows that, even with the language encoder frozen, the model still acquires useful scoring behavior. Although adapting only part of the system limits deep customization, it improves training stability and reduces overfitting, a real advantage when data are scarce. The close agreement between training and validation loss across epochs suggests this setup fits well when resources are thin.
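The "penalty signal built to resist noise" is consistent with the Huber objective the paper cites; a minimal scalar sketch, with the transition point delta treated as a free parameter, looks like this:

```python
def huber_loss(pred, target, delta=1.0):
    """Quadratic for small residuals, linear for large ones, so a few
    noisily labeled answers cannot dominate the gradient."""
    r = abs(pred - target)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)
```

Near the target this behaves like mean squared error; far from it, like mean absolute error, which is what damps the influence of outlier labels during training.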
The ablation study further clarifies how sequence modeling affects the system's performance. Compared with relying on raw embedding inputs alone, the recurrence-based approach keeps predictions steadier during testing. Longer answers benefit most, especially when structure matters as much as meaning.
Although it does not surpass current state-of-the-art systems, the approach performs consistently across tests. What matters most is that it functions as intended, stays reliable, and meets the core goals set at the start: success here is measured not by speed or novelty, but by steady, correct operation under constrained conditions.
Limitations and Future Research Directions
The proposed system achieves its goals but has several clear limitations. It handles only typed text, so handwritten or scanned responses are out of scope. Freezing the language model speeds up training but weakens handling of subject-specific terminology. Because the system learns patterns rather than applying explicit rubric rules, its grading can be hard to interpret and may be inconsistent across similar answers. It scores each answer in isolation, without tracking a student's progress over time, and it provides no estimate of its own confidence. Finally, the evaluation rests on a limited amount of labeled data, so broader testing is needed before the results can be trusted to generalize across subjects and settings.
Future work can improve transparency by integrating rubric-aware scoring and aligning predictions with instructional goals. Adapting the model to new subjects using minimal labeled data, incorporating teacher feedback, adding uncertainty estimation for human review, and extending to multilingual scoring with efficient models would enhance reliability while maintaining suitability for limited-resource settings.
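One way to realize the uncertainty estimation suggested above is to repeat a stochastic forward pass (for example, with dropout left active at inference) and route high-variance answers to a teacher. The helper below is a hypothetical sketch: the sampling mechanism and the spread threshold are assumptions, not part of the proposed system.

```python
import statistics

def review_flag(sample_scores, spread_threshold=8.0):
    """Summarize repeated stochastic predictions for one answer.

    Returns the mean score and True when the spread across samples is
    large enough that a human should review the grade.
    """
    mean = statistics.fmean(sample_scores)
    spread = statistics.pstdev(sample_scores)
    return round(mean, 1), spread > spread_threshold
```

Tightly clustered samples pass through automatically, while disagreement among passes signals an answer the model is unsure about, keeping human effort focused where it matters.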
CONCLUSION
This paper presented a regression-based approach for automated scoring of descriptive student answers using a hybrid DistilBERT-GRU architecture. The proposed system formulates answer evaluation as a continuous prediction task, enabling fine-grained scoring that better reflects partial correctness, semantic relevance, and explanation depth than discrete classification-based approaches. The design emphasizes practical feasibility under constrained data and computational resources, which are common in many educational settings.
The experimental analysis demonstrated that the proposed model exhibits stable training behavior and controlled convergence when optimized using a robust regression objective. Ablation results highlighted the contribution of sequence-level modeling in improving validation stability and scoring consistency, particularly for longer responses. Qualitative evaluation further confirmed that, after training, the model can operate independently to score individual student responses without reliance on the training dataset or batch-level processing.
While the evaluation does not aim to establish state-of-the-art performance, the results support the feasibility of combining pretrained semantic representations with lightweight sequence modeling for automated answer scoring in low-resource scenarios. The explicit separation between training and inference, along with the regression-based formulation, ensures that the system aligns with realistic usage conditions in educational assessment workflows. Several limitations remain, including the absence of rubric-aware scoring, uncertainty estimation, and support for non-textual inputs. Addressing these challenges offers promising directions for future research and would further enhance the applicability of automated scoring systems in diverse educational contexts.
Overall, this work contributes a practical and extensible framework for automated student answer scoring that balances semantic expressiveness, computational efficiency, and interpretability, providing a foundation for future developments in learning-based educational assessment systems.
REFERENCES

- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention Is All You Need," Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5998–6008, 2017.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Proceedings of NAACL-HLT, pp. 4171–4186, 2019.
- V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter," Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2), 2019.
- J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," NIPS 2014 Workshop on Deep Learning, 2014.
- D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," Proceedings of the International Conference on Learning Representations (ICLR), 2015.
- S. Burrows, I. Gurevych, and B. Stein, "The Eras and Trends of Automatic Short Answer Grading," International Journal of Artificial Intelligence in Education, vol. 25, no. 1, pp. 60–117, 2015.
- M. Mohler and R. Mihalcea, "Text-to-Text Semantic Similarity for Automatic Short Answer Grading," Proceedings of the 12th Conference of the European Chapter of the ACL (EACL), pp. 567–575, 2009.
- D. Alikaniotis, H. Yannakoudakis, and M. Rei, "Automatic Text Scoring Using Neural Networks," Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 715–725, 2016.
- A. Farag, M. Elsherif, and A. M. Riad, "A Deep Learning Architecture for Automatic Essay Scoring," arXiv preprint arXiv:2206.08232, 2022.
- N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," Proceedings of EMNLP, 2019.
- D. Cer, Y. Yang, S. Kong, et al., "Universal Sentence Encoder," Proceedings of EMNLP, pp. 169–174, 2018.
- P. J. Huber, "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.
- M. Hedderich, D. Adelani, D. Zhu, et al., "A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios," Proceedings of the ACL, 2021.
- P. Ganesh, Y. Chen, Y. Lou, et al., "Compressing Large-Scale Transformer-Based Models: A Case Study on BERT," Transactions of the Association for Computational Linguistics, vol. 9, pp. 1061–1080, 2021.
- R. Pascanu, T. Mikolov, and Y. Bengio, "On the Difficulty of Training Recurrent Neural Networks," Proceedings of the International Conference on Machine Learning (ICML), pp. 1310–1318, 2013.
- Y. Zhang, A. Shah, and M. Chi, "Deep Learning and Student Modeling for Personalized Learning," International Journal of Artificial Intelligence in Education, vol. 30, pp. 101–125, 2020.