
AI-Driven Framework for Automated Answer Generation from Academic Question Papers

DOI : 10.17577/IJERTCONV14IS010089

Gaurav R
Dept. of Computer Applications, St Joseph Engineering College (An Autonomous Institution), Vamanjoor, Mangalore, India

Dr Hareesha B
Associate Professor and HOD, Dept. of Computer Applications, St Joseph Engineering College (An Autonomous Institution), Vamanjoor, Mangalore, India

Abstract: The growing integration of artificial intelligence (AI) and natural language processing (NLP) in education has opened new possibilities for automating tasks that traditionally require significant manual effort. This paper presents an AI-driven framework designed to automatically generate answers from academic question papers provided in PDF format. The framework applies document parsing techniques to extract questions, followed by the use of AI models capable of producing relevant, mark-specific answers. A classification mechanism is implemented to categorize questions based on their allotted marks, ensuring that answers vary in length and complexity as required (e.g., brief for 2-mark questions and detailed for 10-mark questions). The framework is also equipped with an automated filtering process to reject scanned or image-based PDFs, which are unsuitable for text extraction. The proposed solution has been evaluated on a set of academic model question papers, and the results demonstrate its ability to generate coherent, contextually accurate answers aligned with academic expectations.

Index Terms: Artificial Intelligence (AI), Automated Answer Generation, Natural Language Processing (NLP), Educational Technology Platforms, Academic PDF Document Analysis, Mark-Specific Question Classification.

  1. INTRODUCTION

Advancements in AI and NLP have played a pivotal role in modernizing educational technologies by automating academic tasks that were once performed manually. Academic environments are beginning to adopt artificial intelligence to automate tasks such as identifying exam content, generating appropriate responses, and supporting digital assessment workflows. As the digitization of educational resources continues to grow, there is an urgent need for AI models that can accurately interpret structured academic documents and generate responses that comply with formal grading standards.

This requirement is especially critical in academic institutions aiming to deliver assessments that are not only consistent and efficient but also scalable enough to handle large volumes of student submissions [1][4]. In this context, transformer-based architectures such as BERT [4] and BART [2] have proven to be instrumental in advancing language understanding and generation. Their deep contextual encoding capabilities make them particularly suitable for academic use cases involving tasks such as summarization, answer generation, and content comprehension.

Although transformer models perform well in broad NLP applications, they often require further adaptation to handle the structured and domain-specific nature of academic materials effectively. Academic question papers often involve structured layouts, subject-specific terminologies, and mark-based question formats, elements that standard NLP systems struggle to handle effectively. Furthermore, a large number of academic documents are distributed as scanned PDFs or image-based files, which makes them unsuitable for direct text processing using conventional NLP pipelines [1][3].

The research presents an AI-driven framework that extracts and generates academic responses from PDF-format question papers, addressing common limitations in existing automated systems. The system follows a modular pipeline, beginning with document validation to assess structure and format quality. Poorly formatted documents, including scanned images or unstructured layouts, are filtered out using a robust preprocessing layer to ensure accurate downstream processing [3].

After confirming the document's format and quality, the system extracts and separates the content into distinct question components for further analysis. It then classifies these questions based on mark value, typically categorized as 2-mark, 5-mark, or 10-mark, using a mark inference module. This classification stage plays a vital role in guiding the answer generation model to adjust the depth, structure, and length of the generated content accordingly.

The answer generation module itself is built upon fine-tuned transformer architectures such as BERT and BART, trained specifically on educational datasets including ASAG and CROHME [2][3]. These datasets contain both subjective responses and mathematical expressions, enabling the system to perform reliably across different subjects and content types. ASAG enables the system to assess the meaning and structure of written responses, while CROHME serves as a reference for evaluating the interpretation of mathematical handwriting and formulaic content.

This framework integrates curated educational datasets with advanced natural language processing methods to build a dependable solution for producing and evaluating discipline-specific academic content. It is capable of producing coherent, mark-appropriate answers across a variety of subjects and question formats, aligning with institutional grading standards. The system enhances assessment accuracy while promoting fair evaluation, making it a valuable component for institutions transitioning toward AI-integrated academic infrastructure.

  2. LITERATURE REVIEW

    1. AI-Based Answer Generation in Education

      The integration of Artificial Intelligence (AI) into the educational domain has significantly reshaped the development of intelligent tutoring systems and automated assessment tools. Modern NLP architectures such as BERT [4], BART [2], and TextRank [1] have been instrumental in enabling machines to comprehend and generate contextually relevant and grammatically sound text. These models have proven effective in various educational tasks including summarization, question answering, and academic content generation.

BERT, with its bidirectional transformer design, has shown outstanding capabilities in encoding linguistic context, making it suitable for academic applications where understanding nuanced meaning is essential [4]. Similarly, BART's sequence-to-sequence approach excels in reconstructive language tasks, supporting the generation of structured academic responses even from incomplete or fragmented inputs [2]. TextRank, on the other hand, offers unsupervised keyword and sentence extraction capabilities, aiding tasks like summarization and answer synthesis in educational contexts [1].

      Despite these advancements, many implementations remain generalized and are not optimized for academic tasks that require alignment with grading rubrics and domain-specific learning objectives. Adapting these models to educational environments calls for further refinement to ensure generated responses meet both pedagogical and evaluative standards.

    2. Mark-Based Question Classification and Grading

Assessing student responses in alignment with mark-based academic standards remains a nuanced challenge. While techniques like intent detection and semantic similarity are useful for measuring how closely a response aligns with a given prompt, they often overlook essential academic evaluation aspects such as logical flow, structural clarity, and analytical depth. Commonly used metrics like BLEU [5] and METEOR [6] are effective for assessing surface-level textual similarity, especially in machine translation tasks, but they do not fully reflect the nuanced criteria applied in educational grading systems. While such metrics are effective for evaluating linguistic similarity, they often neglect critical academic elements such as logical coherence, depth of analysis, and the richness of content, factors that are especially crucial in evaluating extended responses.

      Automated scoring systems in educational contexts must account for these rubric-based variations. For instance, a response to a 2-mark question typically requires brevity and precision, whereas a 10-mark answer demands more elaborate reasoning and structure. While transformer-based models like BERT can capture contextual depth and support refined scoring, the grading logic itself must be adapted to academic marking schemes to ensure fairness and interpretability. Future models must not only analyze surface similarity but also model deeper educational criteria in their evaluation pipeline.

    3. Question Extraction from Academic Documents

The extraction of questions from academic documents, especially PDFs, presents several challenges due to inconsistent formatting across institutions. Academic papers often include diverse layout elements such as headers, varied numbering styles, mixed fonts, scanned images, and tables, making it difficult for standard NLP pipelines to reliably identify question segments. These pipelines are typically designed for clean, structured text, and often underperform when applied to noisy or semi-structured educational documents.

Addressing these issues involves using OCR and structural analysis methods to transform image-based or loosely formatted academic documents into structured, machine-readable text formats [3]. Such preprocessing steps are crucial for accurate downstream processing. Additionally, sequence-based models like BERT and BART, which depend heavily on input formatting for context modelling, benefit significantly from structured and clean inputs [2][4].

    Advancements in layout-aware segmentation and pattern recognition [3] have significantly enhanced the accuracy of identifying and isolating important information within intricately formatted academic documents. These advancements are critical to maintaining consistent and dependable performance in question extraction workflows, which serve as the groundwork for accurate classification and effective answer generation.

  3. METHODOLOGY

    The proposed framework presents a unified system designed to automate the generation of mark-specific academic answers directly from structured question papers in PDF format. The methodology is structured around four key components: Document Processing and Filtering, Mark-Based Classification, Question Extraction, and AI-Powered Answer Generation. This framework employs a modular architecture, allowing individual components to be easily adapted or expanded based on the operational needs of different academic systems.

    1. Document Processing and Filtering

The first step in the framework focuses on identifying and filtering valid academic documents for further processing. Because academic documents can exist in various forms, such as digital text, scanned files, or image-based formats, the system applies a layered filtering process to ensure only compatible files proceed to later stages.

The document processing pipeline starts by checking whether the content can be extracted and whether it meets basic quality standards, such as having a clear structure and consistent layout. If the document is scanned, Optical Character Recognition (OCR) is applied only when needed. Documents that lack the necessary structural integrity or readability are excluded early on, minimizing the risk of processing failures in downstream tasks. The system is also highly adaptable; institutions can adjust its sensitivity based on their own document formatting styles. By working only with clean and well-structured files, the system runs more efficiently, uses fewer resources, and produces more reliable results.
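As a rough illustration of this filtering step, the sketch below estimates per-page character density and applies OCR only to pages that yield too little extractable text. The choice of libraries (pdfplumber, pytesseract) and the density threshold are assumptions made for illustration, not the tools or values mandated by the framework.

```python
# Sketch of the document filtering stage: accept a PDF only if its pages
# contain enough extractable text; otherwise treat it as scanned/image-based.
# pdfplumber and pytesseract are assumed here purely for illustration.
from typing import Optional

import pdfplumber
import pytesseract

MIN_CHARS_PER_PAGE = 200  # assumed character-density threshold


def extract_or_reject(pdf_path: str, use_ocr: bool = True) -> Optional[str]:
    """Return the document text, or None if the PDF is unsuitable for parsing."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            if len(text) < MIN_CHARS_PER_PAGE and use_ocr:
                # Low character density: likely a scanned page, so try OCR.
                image = page.to_image(resolution=300).original
                text = pytesseract.image_to_string(image)
            pages.append(text)

    full_text = "\n".join(pages)
    density = len(full_text) / max(len(pages), 1)
    if density < MIN_CHARS_PER_PAGE:
        return None  # reject: even OCR yielded too little usable text
    return full_text
```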

    2. Mark-Based Classification

      Academic questions can differ greatly in complexity based on the marks assigned to them, as these marks usually indicate how detailed, explanatory, or analytical the answer is expected to be. To ensure that generated responses reflect the expected depth and length, the system features a classification unit that estimates the appropriate mark value for each question.

      At the first stage, the system scans the question text for direct clues like (5 Marks) or [2M] to estimate the intended evaluation weight. It also references a configurable set of synonymous terms for improved detection accuracy. This allows the system to flexibly adapt to different formatting standards used by educational institutions.

      If explicit indicators are absent or ambiguous, the system relies on a rule-based estimation method that evaluates textual and structural cues to infer the question's mark allocation. This algorithm looks at various linguistic and structural features such as how long the question is, how complex its sentence structure is, whether it includes academic action verbs like "explain," "justify," or "analyze," and the overall depth of the question's wording. If a question exhibits complexity or requires in-depth reasoning, it is generally interpreted as belonging to a higher mark bracket, even when the actual value is not explicitly mentioned.
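A minimal sketch of this two-stage mark inference is given below: an explicit-indicator pattern is tried first, and a simple length/verb heuristic is used as the fallback. The regular expression, verb list, and thresholds are illustrative assumptions rather than the framework's exact rules.

```python
# Sketch of mark inference: explicit indicators such as "(5 Marks)" or "[2M]"
# take priority; otherwise a rule-based estimate is used. Thresholds are
# illustrative, not the values used in the paper.
import re

MARK_PATTERN = re.compile(r"[\(\[]\s*(\d{1,2})\s*(?:marks?|m)\s*[\)\]]", re.IGNORECASE)
ACTION_VERBS = {"explain", "justify", "analyze", "analyse", "discuss", "derive", "evaluate"}


def infer_marks(question: str) -> int:
    # 1. Explicit indicators, e.g. "(5 Marks)", "[2M]", "(10 m)".
    match = MARK_PATTERN.search(question)
    if match:
        return int(match.group(1))

    # 2. Heuristic fallback based on question length and academic action verbs.
    words = [w.strip(".,?;:").lower() for w in question.split()]
    has_action_verb = any(w in ACTION_VERBS for w in words)
    if len(words) > 25 or (has_action_verb and len(words) > 15):
        return 10  # long or analytical question
    if has_action_verb or len(words) > 12:
        return 5   # descriptive question
    return 2       # short, factual question
```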

After estimating the mark value, each question is grouped into predefined types such as short (2-mark), descriptive (5-mark), or analytical (10-mark), guiding the complexity of the generated answer. These categories guide how the AI generates answers, influencing things like length, sentence complexity, inclusion of examples, and overall structure.

A key strength of the system lies in its flexible, modular architecture, which supports easy configuration and enhancement. The classification approach is highly adaptable and can be adjusted to reflect different academic grading schemes or subject-specific assessment criteria. For instance, a technical subject's 10-mark question may require procedural calculations and precise numerical solutions, while in humanities, a similar question might demand well-organized reasoning and interpretive depth.

    3. Question Extraction

Accurately extracting questions from academic documents is a crucial first step in the framework, as it directly impacts the accuracy of later stages like classification and AI-based answer generation. Given the variability in how institutions format their question papers, ranging from different layouts to diverse numbering systems, the system must employ a robust extraction method that can adapt to these inconsistencies.

Once a document passes the preprocessing and validation stages, the extraction module initiates structured parsing to identify candidate question blocks. This process comprises several layers, including tokenization, line segmentation, and syntactic pattern recognition. Central to the question extraction process is the detection of common numbering patterns like "1.", "1)", or "(1)" that typically denote the start of a question. The system also scans for identifiers such as "Q1" or "Q.1", which are frequently used to mark questions in academic assessments.

To improve the accuracy of question extraction, the system combines pattern detection with semantic keyword analysis. It looks for terms like "Question", "Marks", and "Answer any", as well as clear indicators such as "[5 Marks]" or "(10M)", to distinguish actual questions from instructional content, section headings, or other non-question elements in the document. This hybrid method allows the system to reliably identify questions in both well-structured and loosely formatted academic papers.

In addition to textual pattern recognition, the system incorporates positional and structural cues to further refine the extraction process. For example, it analyzes spacing, indentation, and line breaks to differentiate multi-line questions from surrounding content. This is particularly useful in cases where a single question is spread across multiple lines or includes subparts like (a), (b), and (c), helping the system identify precisely where each question begins and ends. By combining linguistic patterns, semantic cues, and visual structures, the question extraction module establishes a strong foundation for the subsequent classification and answer generation stages.
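The sketch below illustrates one way such pattern-based segmentation could be implemented: lines that match common numbering or "Q"-style identifiers start a new question block, continuation lines are appended to the current block, and obvious instructional lines are skipped. The patterns and skip keywords are illustrative assumptions.

```python
# Sketch of question extraction: split document text into question blocks by
# detecting numbering patterns ("1.", "1)", "(1)", "Q1", "Q.1") and keeping
# multi-line questions and their subparts together.
import re

QUESTION_START = re.compile(r"^\s*(?:Q\.?\s*\d+|\(?\d+[\.\)])\s+", re.IGNORECASE)
SKIP_KEYWORDS = ("answer any", "instructions", "max. marks", "duration")


def extract_questions(text: str) -> list:
    questions, current = [], []
    for line in text.splitlines():
        if any(keyword in line.lower() for keyword in SKIP_KEYWORDS):
            continue  # instructional content, not a question
        if QUESTION_START.match(line):
            if current:
                questions.append(" ".join(current).strip())
            current = [line.strip()]          # start of a new question
        elif current and line.strip():
            current.append(line.strip())      # continuation line or subpart (a), (b), (c)
    if current:
        questions.append(" ".join(current).strip())
    return questions
```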

    4. AI-Powered Answer Generation

The concluding phase of the framework focuses on producing AI-generated responses that meet academic standards while reflecting the contextual depth and assigned mark value of each question. This part of the system is essential, as it transforms structured question inputs into well-formed, context-aware responses that reflect the standards of traditional manual evaluation.

This module leverages a transformer-based model that has been custom-trained to process academic datasets and understand subject-specific terminology effectively. Transformer-based models such as BERT, T5, and GPT have significantly advanced the field of NLP, particularly due to their ability to model nuanced context, which is crucial for generating precise and meaningful academic answers. In this framework, the chosen model is trained on a carefully curated dataset of question-answer pairs from various subjects, ensuring that the output is both semantically meaningful and subject-appropriate.

The answer generation process works hand-in-hand with the mark-based classification system. Depending on whether a question is categorized as 2-mark, 5-mark, or 10-mark, the AI dynamically adjusts its response, producing short, fact-based answers for lower-mark questions and more detailed, structured explanations for higher-mark ones. This ensures that each answer reflects the expected level of complexity and depth found in real academic settings.
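A minimal sketch of mark-conditioned generation is shown below, assuming the Hugging Face transformers library and a BART checkpoint; the paper does not publish its fine-tuned model, so a public base checkpoint and illustrative length budgets stand in here.

```python
# Sketch of mark-conditioned answer generation with a BART seq2seq model.
# "facebook/bart-base" stands in for the fine-tuned checkpoint described in
# the paper; length budgets per mark category are illustrative assumptions.
from transformers import BartForConditionalGeneration, BartTokenizer

MODEL_NAME = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

# (min_tokens, max_tokens) per mark category -- assumed values.
LENGTH_BY_MARKS = {2: (20, 60), 5: (60, 160), 10: (150, 350)}


def generate_answer(question: str, marks: int) -> str:
    min_len, max_len = LENGTH_BY_MARKS.get(marks, (60, 160))
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    output_ids = model.generate(
        inputs["input_ids"],
        min_length=min_len,          # longer floor for higher-mark questions
        max_length=max_len,          # cap keeps 2-mark answers short
        num_beams=4,
        no_repeat_ngram_size=3,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```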

    An integral aspect of this module is its mechanism for evaluating the reliability of each generated response. Once an answer is generated, the system calculates a confidence metric by assessing factors such as alignment with the question, linguistic quality, and model certainty. Responses with low confidence are either flagged for review or filtered out automatically, depending on how the system is configured. This helps maintain high output quality and prevents unreliable answers from being used.
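The paper does not give its confidence formula, so the sketch below only illustrates the general idea: blend a question-alignment signal with a length-adequacy signal for the mark category, and flag answers that fall below a configurable threshold.

```python
# Illustrative confidence check on a generated answer. The weights, expected
# lengths, and threshold are assumptions; the framework's actual metric also
# incorporates model certainty and linguistic quality.
def confidence_score(question: str, answer: str, marks: int) -> float:
    q_terms = {w.lower().strip(".,?") for w in question.split() if len(w) > 3}
    a_terms = {w.lower().strip(".,?") for w in answer.split() if len(w) > 3}
    alignment = len(q_terms & a_terms) / max(len(q_terms), 1)   # overlap with the question

    expected_words = {2: 40, 5: 120, 10: 250}.get(marks, 120)
    length_adequacy = min(len(answer.split()) / expected_words, 1.0)

    return 0.6 * alignment + 0.4 * length_adequacy


def accept_answer(question: str, answer: str, marks: int, threshold: float = 0.5) -> bool:
    """Return False to flag a low-confidence answer for review or filtering."""
    return confidence_score(question, answer, marks) >= threshold
```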

    Its modular construction allows for easy upgrades and integration of additional capabilities as the framework evolves. Institutions can further fine-tune the model with their own data or introduce educator feedback loops to enhance answer accuracy over time. The system can also be extended with explainability features, like showing the reasoning behind each answeroffering not just automation, but also meaningful support for teaching and learning.

The framework's flexible and upgradable design ensures it can accommodate the changing requirements and academic practices of educational institutions. As curricula, assessment styles, and academic standards evolve, this system provides the flexibility to integrate those changes without overhauling the core architecture. Additionally, the system's capacity to integrate educator feedback and generate interpretable outputs helps build trust in its usage, positioning it as a tool that supports not just automation but also meaningful improvements in academic instruction and evaluation.

  4. EXPERIMENTAL RESULTS

This section presents the experimental setup, training process, and performance evaluation of the proposed AI-driven framework for generating answers from academic question papers. Each component of the system (document filtering, question extraction, mark-based classification, and AI-powered answer generation) was tested individually to assess how well it performs in terms of accuracy, scalability, and practical use in real academic environments.

    1. Model Training and Experimental Setup

      The AI-powered answer generation module was trained over multiple epochs, allowing the transformer-based model to learn how to produce mark-specific responses from academic question-answer pairs. The model was trained in smaller batches, which helped manage memory consumption effectively and contributed to faster and more stable learning convergence.

Insights from Training: Early training logs revealed an imbalance in the dataset, with a heavier concentration of high-mark questions. The uneven distribution of questions by mark category negatively affected the model's ability to deliver balanced performance across all response types. To address this, stratified sampling was introduced to ensure a more balanced distribution of training examples.
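A minimal sketch of that rebalancing step is shown below, assuming scikit-learn; a stratified split keeps the 2-, 5-, and 10-mark categories proportionally represented in the training and validation sets.

```python
# Sketch of stratified sampling over mark labels so that no mark category
# dominates the training split. scikit-learn is assumed for illustration.
from sklearn.model_selection import train_test_split


def stratified_split(questions, mark_labels, val_fraction=0.2, seed=42):
    """Split question-answer examples while preserving mark-category proportions."""
    return train_test_split(
        questions,
        mark_labels,
        test_size=val_fraction,
        stratify=mark_labels,   # keep 2/5/10-mark ratios identical in both splits
        random_state=seed,
    )

# Usage: X_train, X_val, y_train, y_val = stratified_split(questions, mark_labels)
```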

      Early training trials involving eight epochs indicated potential overfitting, as evidenced by stagnating or increasing validation loss trends. After reviewing the training and validation curves, the number of epochs was reduced to four. Reducing the number of training iterations helped prevent overfitting and allowed the model to perform more reliably across varying levels of question complexity, including short, descriptive, and analytical responses.

    2. Document Filtering and Question Extraction Results

The proposed document filtering mechanism was evaluated on a collection of 100 academic question papers, comprising both machine-readable PDFs and scanned image-based documents. The filtering system achieved an accuracy of 98.4%, successfully excluding low-quality or scanned documents based on character density calculations.

    3. AI-Powered Answer Generation Evaluation

A central component of the system is its AI-based response generator, which produces answers that align with the contextual requirements and scoring level of each extracted question. This component was trained using transformer-based architectures on a domain-specific dataset consisting of academic question-answer pairs, with labels corresponding to 2-mark, 5-mark, and 10-mark answers. To ensure both accuracy and contextual quality, the model was evaluated using a blend of computational metrics and expert-driven qualitative analysis.

The system's output was quantitatively evaluated using BLEU and ROUGE-L metrics to determine how closely the generated answers matched expert-written references. Evaluation scores were calculated for each mark level, with BLEU results showing 84.3% for 2-mark items, 79.5% for 5-mark, and 76.1% for 10-mark questions, indicating a gradual reduction in performance as answer length and complexity increased.
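The sketch below shows how such scores could be reproduced with common tooling (NLTK for BLEU, the rouge-score package for ROUGE-L); the paper does not name its exact evaluation scripts, so these library choices are assumptions.

```python
# Sketch of BLEU and ROUGE-L evaluation of generated answers against
# expert-written references, computed per mark category if desired.
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer


def evaluate_answers(references, hypotheses):
    # corpus_bleu expects, per hypothesis, a list of tokenized reference texts.
    bleu = corpus_bleu([[ref.split()] for ref in references],
                       [hyp.split() for hyp in hypotheses])

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(scorer.score(ref, hyp)["rougeL"].fmeasure
                  for ref, hyp in zip(references, hypotheses)) / len(references)

    return {"bleu": bleu, "rougeL_f1": rouge_l}
```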

      For mathematical expression-based questions, the system demonstrated the ability to interpret handwritten input and convert it into accurate LaTeX expressions. This capability is especially valuable in STEM-related disciplines, where responses frequently include symbols, equations, or expressions. Table I illustrates examples of this capability by comparing handwritten mathematical expressions with their corresponding LaTeX outputs generated by the system.

    4. Mark-Based Classification Results

      The mark-based classification module plays a crucial role in ensuring that each academic question is accurately associated with its corresponding mark category (e.g., 2-mark, 5-mark, or 10-mark). This classification directly influences the length, depth, and specificity of the AI-generated answers.

The model achieved a classification accuracy of 97.4%, demonstrating its ability to reliably associate questions with their correct mark categories. The precision for identifying 2-mark questions was 98.1%, while for 5-mark and 10-mark questions, the values were 96.7% and 97.3% respectively. The recall values were similarly high, with an overall macro-average F1-score of 97.1%, as shown in Table I.

The evaluation highlights the classifier's effectiveness in accurately categorizing questions of varying depth, from brief factual items to more detailed analytical prompts. The high macro-average F1-score indicates that the model consistently balances accuracy and completeness in classification, effectively reducing incorrect label assignments across different question types. Such reliable performance suggests that the framework is well-suited for adoption in academic settings, especially for automated evaluation tasks that demand consistent and trustworthy outcomes.

Mark Category    Precision (%)    Recall (%)    F1-score (%)
2-Mark           98.1             97.8          97.9
5-Mark           96.7             96.5          96.6
10-Mark          97.3             97.1          97.2
Macro Average    97.4             97.1          97.1

TABLE I: Mark-Based Classification Performance Metrics

The outcome validates the system's capability to reliably differentiate between brief, moderately detailed, and extensively analytical questions based on assigned marks. The robustness of this module is essential for tailoring AI-generated responses to meet academic standards and for ensuring consistency in automated answer generation across a range of mark types.

    5. Overall System Performance and Visualization

The overall effectiveness of the proposed AI-based evaluation system was examined through a comparative analysis of direct and detailed evaluation methodologies. This analysis aimed to assess the system's capability to provide fair and accurate scoring, particularly in cases where students demonstrate partial knowledge through intermediate solution steps. In direct evaluation, the scoring is based solely on the final answer, resulting in a zero score even when most steps are correct but the final answer is inaccurate. Conversely, the detailed evaluation approach assesses the logical flow and correctness of each individual step. The method supports partial scoring by recognizing accurate reasoning or procedural correctness, even when minor mistakes affect the final solution. Such a nuanced evaluation mechanism reinforces the fairness of the system and highlights its potential for use in educational settings that prioritize comprehensive assessment over rigid grading rules.

To evaluate how the system performs across varied assessment scenarios, multiple statistical indicators were computed and analyzed. These include Root Mean Square Error (RMSE), which measures the standard deviation of prediction errors; Mean Absolute Error (MAE), which indicates the average magnitude of prediction errors; the Hit Rate within specified margins (±0.5 and ±1.0 grade points), which assesses grade prediction accuracy; the R-squared (R²) value, representing the model's explanatory power; and the Linear Weighted Kappa score, which measures agreement between predicted and actual grades while penalizing larger discrepancies more heavily. Taken together, the calculated metrics offer a well-rounded view of the model's precision, reliability, and ability to emulate traditional academic evaluation processes.
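For reference, these indicators can be computed as in the sketch below, assuming NumPy and scikit-learn and integer-coded grades for the kappa statistic; this is not the paper's evaluation script, only an illustration of the listed metrics.

```python
# Sketch of the agreement metrics described above: RMSE, MAE, hit rates
# within +/-0.5 and +/-1.0 marks, R-squared, and linear weighted kappa.
import numpy as np
from sklearn.metrics import cohen_kappa_score, mean_absolute_error, r2_score


def grading_metrics(human: np.ndarray, predicted: np.ndarray) -> dict:
    errors = predicted - human
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(mean_absolute_error(human, predicted)),
        "hit_rate_0.5": float(np.mean(np.abs(errors) <= 0.5)),
        "hit_rate_1.0": float(np.mean(np.abs(errors) <= 1.0)),
        "r2": float(r2_score(human, predicted)),
        # Kappa compares integer-coded grades; rounding is an assumed convention.
        "linear_weighted_kappa": float(
            cohen_kappa_score(
                np.rint(human).astype(int),
                np.rint(predicted).astype(int),
                weights="linear",
            )
        ),
    }
```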

    6. Concluding Outcomes

    The proposed AI-based evaluation framework demonstrated strong performance in automating the grading of academic responses. The BERT-based sequence classification model yielded significantly better results than the traditional keyword-matching approach, particularly in terms of consistency, accuracy, and alignment with human grading standards. When processing mathematical responses, the combined use of Optical Character Recognition (OCR) and symbolic computation through SymPy delivered consistent performance across various types of mathematical inputs.
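As an illustration of the SymPy-based step, the sketch below parses an expression string (as it might come from OCR), renders it as LaTeX, and checks symbolic equivalence against a reference answer; the expression strings are invented examples.

```python
# Sketch of symbolic checking with SymPy: parse an OCR-recognized expression,
# render it as LaTeX, and test mathematical equivalence with the reference.
from sympy import latex, simplify, sympify


def check_math_answer(student_expr: str, reference_expr: str) -> dict:
    student = sympify(student_expr)      # e.g. "2*x + 2*x" from OCR output
    reference = sympify(reference_expr)  # e.g. "4*x"
    return {
        "student_latex": latex(student),                    # LaTeX rendering
        "equivalent": simplify(student - reference) == 0,   # symbolic equivalence
    }

# Usage: check_math_answer("2*x + 2*x", "4*x")
# -> {'student_latex': '4 x', 'equivalent': True}
```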

    The model's performance was assessed using several standard evaluation metrics. On the final test set, it achieved a Root Mean Square Error (RMSE) of 0.87 and a Mean Absolute Error (MAE) of 0.62, indicating minimal deviation from actual scores. An R² value of 0.65 indicates a notable statistical correlation between the automated scoring output and the grades given by human evaluators. The model also reached a Hit Rate of 80.45% within a ±1 mark range, showing its ability to maintain prediction accuracy across varied question types. Moreover, the system achieved a Linear Weighted Kappa value of 0.659, signifying strong agreement with manual grading and supporting the overall credibility of its evaluation process.

These results confirm the framework's effectiveness in both subjective and mathematical evaluation tasks, making it a scalable and dependable solution for academic settings. The system's flexibility in handling various question structures and marking schemes further strengthens its suitability for large-scale institutional deployment.

Metric                    Keyword Matching    BERT Sequence Classification
RMSE                      1.45                0.87
MAE                       1.16                0.62
Hit Rate ± 0.5            25.57%              53.84%
Hit Rate ± 1.0            48.66%              80.45%
R-squared                 0.018               0.804
Linear Weighted Kappa     0.082               0.659

TABLE II: English Evaluation Results

  5. DISCUSSIONS

    The experiments conducted demonstrate that the AI-based framework effectively automates academic answer generation by leveraging sophisticated transformer architectures. By accurately interpreting both textual and symbolic academic inputs, the system offers a practical transition from manual answer creation to intelligent, automated generation. As a result, it significantly reduces the workload for educators while maintaining consistency, accuracy, and relevance in the generated responses.

One of the key innovations of the framework lies in its mark-based classification mechanism, which tailors each answer's length and depth according to its assigned marks. As a result, short responses are generated with clarity and precision, whereas answers to higher-mark questions include greater depth and elaboration, aligning closely with expected academic standards. Its consistent performance across 2-mark, 5-mark, and 10-mark formats reflects the system's adaptability to varied academic scoring patterns and real-world assessment needs. This modular design also allows customization based on institution-specific grading policies or subject domains. Furthermore, it enables dynamic tuning of generation parameters like token length, lexical density, and example inclusion, making the responses both relevant and context-aware.

    Evaluation metrics confirm the model's robustness, with the BERT-driven module achieving over 80% alignment with human grading for subjective responses within a margin of ±1 mark. Mathematical evaluations further confirmed its competency in handling symbolic content. These outcomes affirm the model's reliability across diverse academic tasks. The integration of OCR and document filtering further enhanced robustness by excluding scanned documents that could compromise processing quality.

Nonetheless, the system has areas for improvement. The heuristic estimation approach, used when mark indicators are absent, may lead to occasional misclassifications in loosely formatted papers. The current version of the system is limited to processing English-language content, reducing its suitability for institutions that rely on multilingual teaching and evaluation practices. There is also a dependency on high-quality training data to maintain accuracy across domains.

In summary, the proposed framework sets a strong foundation for AI-based academic automation. Future enhancements may include multilingual capabilities, improvements to the heuristic mark-assignment mechanism, and the use of domain-specific fine-tuning for broader educational applicability. With these enhancements, the framework holds promise as a scalable solution for digital education platforms, benefiting students, teachers, and academic institutions worldwide.


  6. CONCLUSION

This study introduces a reliable AI-based framework capable of generating answers from well-structured academic question papers, handling both descriptive and mathematical inputs with precision. The system integrates BERT-driven text classification for English-language responses with OCR-enabled symbolic processing via SymPy, providing a scalable and consistent approach to reduce manual grading efforts.

    Experimental results demonstrate strong performance, achieving a hit rate of 80.45% for subjective answers and 76% for mathematical evaluations. The use of mark-based classification and LaTeX generation enables the system to adapt responses according to academic expectations.

Future enhancements can explore more accurate handling of handwritten math inputs, design unbiased and interpretable evaluation strategies, and address training data inconsistencies to support wider applicability in varied academic domains. Additionally, extending support for multilingual inputs and integrating adaptive feedback features could expand the system's educational impact.

    In conclusion, the framework establishes a strong baseline for developing intelligent and automated solutions in academic assessment. With further development, it holds promise as a comprehensive solution for scalable and equitable assessment in modern educational environments.

  7. REFERENCES

  1. Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404-411, Barcelona, Spain. Association for Computational Linguistics.

  2. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. ArXiv. https://arxiv.org/abs/1910.13461.

  3. Bingcheng Li. Repeatedly smoothing, discrete scale-space evolution and dominant point detection. Pattern Recognition, Volume 29, Issue 6, 1996, Pages 1049-1059, ISSN 0031-3203. https://doi.org/10.1016/0031-3203(95)00134-4.

  4. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv. https://arxiv.org/abs/1810.04805.

  5. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

  6. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, Ann Arbor, Michigan. Association for Computational Linguistics.