DOI : 10.17577/IJERTCONV14IS010004- Open Access

- Authors : Nikitha Nayak, Ms. Rakshitha P, Mr. Hareesh B
- Paper ID : IJERTCONV14IS010004
- Volume & Issue : Volume 14, Issue 01, Techprints 9.0
- Published (First Online) : 01-03-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
A Comparative Study of Traditional, Deep Learning, and Multimodal OCR Systems for Document Digitization and Information Extraction
Nikitha Nayak Department of Computer Applications
St Joseph Engineering College Vamanjoor, Mangalore, Karnataka
Ms. Rakshitha P Assistant Professor Department of Computer Applications
St Joseph Engineering College Vamanjoor, Mangalore, Karnataka
Mr. Hareesh B Associate Professor Department of Computer Applications
St Joseph Engineering College Vamanjoor, Mangalore, Karnataka
Abstract – This paper presents a comparative study of traditional Optical Character Recognition (OCR) systems, deep learning-based models, and large multimodal models (LMMs) for document digitization and structured information extraction. The study compares OCR techniques across different document types, including scanned printed text, handwritten documents, multilingual documents, and scene text such as street signs and vehicle number plates. Both open-source and commercial systems were considered, including Tesseract, TrOCR, PP-OCR, DocTR, PaddleOCR, and EasyOCR, along with large multimodal models such as GPT-4V and Claude 3.5. All document types were evaluated on three factors: accuracy, speed, and cost. The results indicate that traditional OCR models perform well on clean, printed documents, while deep learning models and LMMs perform better on documents with complex layouts, multiple languages, and handwritten content. LMMs achieved the highest accuracy overall, while deep learning models offered good accuracy at lower cost than LMMs.
Index Terms: Optical Character Recognition, EasyOCR, Large Language Models, Document Digitization.
INTRODUCTION
Optical Character Recognition (OCR) technology allows digital devices such as smartphones, scanners, and cameras to detect text in various types of documents, including printed, handwritten, and scanned materials, and to convert it into editable text that can be copied, edited, or reused.
Traditional OCR models work well on clean, printed documents, but they struggle when a document contains multiple languages, blurred text, or handwritten content [1, 6, 8, 9]. With the development of deep learning and large multimodal models (LMMs), OCR systems have become far more capable: they can understand not only individual letters or words but also the overall structure and meaning of a document [1, 5, 7, 8, 9].
This paper provides a comparative study evaluating traditional OCR engines, deep learning models, and LMMs. The goal is to inform researchers and practitioners about the effectiveness of these models under various document conditions.
LITERATURE REVIEW
Several studies have examined different OCR systems for varied document types. Paper [1] explored digitization using Tesseract, EasyOCR, DocTR, and LLMs like GPT-4 for structured extraction, reporting that Google Vision and GPT-4 achieved the highest accuracy across structured and handwritten documents. In [2], Baidu's PP-OCR showed strong performance on multilingual and printed business forms, with emphasis on lightweight deployment. EffOCR in [3] focused on historical and low-resource language documents, achieving 95-99% accuracy. Paper [4] introduced PDTNet, a hybrid model outperforming EasyOCR and Tesseract on scene text (96%). In [5], OCRBench compared GPT-4V, TrOCR, and DocTR, highlighting LMMs' strengths. Study [6] reported that ABBYY FineReader had the top accuracy for structured and low-quality scans. The CC-OCR benchmark [7] showed GPT-4o (~94%) leading across multilingual tasks. Paper [8] emphasized LLM superiority on cursive handwriting. Lastly, [9] reported that GPT-4o-mini and Google Vision performed best on meme-style and noisy multilingual images.
METHODOLOGY
Categorization of OCR Models
Traditional OCR: Tesseract v3/v4, OCRopus
Deep Learning OCR: EasyOCR, DocTR, PP-OCR, PaddleOCR, TrOCR, Calamari
Large Multimodal Models: GPT-4V, Claude 3.5, Gemini, BLIP-2, InternVL, Bard
Traditional OCR
Conventional OCR systems, such as Tesseract and OCRopus, provide fast, open-source, cost-free solutions that work well on clear, printed documents. However, they struggle with intricate layouts, handwritten text, and noisy scans. Their speed and ease of use make them well suited to simple digitisation tasks where the highest accuracy is not crucial.
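As a concrete illustration, a minimal Tesseract call through the pytesseract wrapper might look as follows. This is a sketch rather than part of the study's experimental setup; it assumes the pytesseract package, Pillow, and the Tesseract engine are installed, and any image path passed in is hypothetical:

```python
from typing import Optional

def tesseract_ocr(image_path: str, lang: str = "eng") -> Optional[str]:
    """Return text extracted by Tesseract, or None if the toolchain is missing."""
    try:
        import pytesseract        # Python wrapper around the Tesseract engine
        from PIL import Image     # Pillow, used to load the scanned page
    except ImportError:
        return None               # Tesseract toolchain not available
    # image_to_string runs the full detection + recognition pipeline on the page
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

Because the whole pipeline is one library call, such engines are trivial to deploy, which is a large part of their appeal for simple digitisation tasks.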
Deep Learning OCR
Deep learning-based models such as DocTR, PP-OCR, TrOCR, and PaddleOCR achieve strong accuracy (~90-95%) on handwritten, multilingual, and structured documents. Because they balance computational efficiency with performance, they are appropriate for academic, mobile, and edge deployments. These models outperform classic OCR in a variety of complicated conditions without the high cost of LMMs.
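A deep learning engine such as EasyOCR is invoked in much the same way as a traditional one; the sketch below assumes the easyocr package is installed (its neural detection and recognition models are downloaded on first use), and the image path is hypothetical:

```python
from typing import List, Optional

def easyocr_read(image_path: str, langs: Optional[List[str]] = None) -> Optional[List[str]]:
    """Return recognized text lines from EasyOCR, or None if it is unavailable."""
    try:
        import easyocr            # deep learning OCR: neural text detector + recognizer
    except ImportError:
        return None
    reader = easyocr.Reader(langs or ["en"])  # loads models for the requested languages
    # detail=0 returns plain strings instead of (bounding box, text, confidence) tuples
    return reader.readtext(image_path, detail=0)
```

Passing several language codes to the reader is what gives these models their multilingual advantage over classic engines.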
Large Language Models
Models such as GPT-4V, Claude 3.5, and Gemini interpret semantically rich documents with intricate layouts, handwriting, and noise well. Although they are slower and require premium API access, they can perform reasoning and structured extraction and attain the best accuracy (93-96%). Their power is most effectively utilised in enterprise or high-stakes use cases where deep document understanding is needed.
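Accessing an LMM for OCR typically means sending the page image to a paid API. The sketch below uses the OpenAI Python SDK's chat completions interface with a base64-encoded image; the model name "gpt-4o" and the prompt are illustrative assumptions, not the exact configuration used in the surveyed papers:

```python
import base64
from typing import Optional

def lmm_extract(image_path: str,
                prompt: str = "Transcribe all text in this image.") -> Optional[str]:
    """Send a document image to a multimodal model; None if the SDK is missing."""
    try:
        from openai import OpenAI    # official OpenAI SDK
    except ImportError:
        return None
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()                # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",              # illustrative model choice
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content
```

The network round trip and per-token billing in this path are exactly the latency and cost drawbacks noted above.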
Document Types Evaluated [1-9]
Scanned printed documents, Receipts and invoices, Structured forms, Handwritten content, Multilingual texts, Scene and meme-style images.
Evaluation Metrics [1-9]
Accuracy (Character Error Rate, Precision, Recall), Speed (local inference time vs. API latency), Cost (free, open-source, or commercial).
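As an illustration of the accuracy metric, Character Error Rate can be computed as the Levenshtein edit distance between the recognized text and the ground truth, normalized by the reference length. A minimal pure-Python sketch (the example strings are hypothetical):

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein edit distance / length of the reference string."""
    prev = list(range(len(hypothesis) + 1))   # dynamic-programming first row
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (r != h)))     # substitution (0 if match)
        prev = cur
    return prev[-1] / len(reference)

# A single substituted character in a five-character reference gives CER = 0.2
print(char_error_rate("hello", "hxllo"))  # → 0.2
```

Lower CER means better recognition; a CER of 0 is a perfect transcription.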
Experimental Consolidation [1-9]
To guarantee consistency and comparability across methodologies, we consolidated the findings reported in all nine research publications, paying particular attention to evaluations carried out on controlled datasets.
Additional Contributions [1-9]
To support the analysis in the Results section, we created a number of comparison visualisations. These show the practical efficacy of each approach through sample text extraction outputs, pipeline comparisons that highlight structural differences between the techniques, and accuracy comparisons across models and methodologies. By contrasting these perspectives, we aimed to give a clearer picture of the advantages, disadvantages, and practicality of the assessed approaches.
RESULTS AND ANALYSIS
Visualizations
Fig. 1. Performance comparison by accuracy, cost, and speed.
Fig. 2. OCR model comparison by document type.
Fig. 3. Accuracy comparison of OCR models.
Fig. 4. Pipeline comparison across OCR models.
Fig. 5. Sample bill.
Fig. 6. Sample text extraction outputs from various OCR model categories.
CONCLUSION
The study shows that traditional OCR systems like Tesseract are effective for clean printed documents due to their speed and open-source availability, but they falter with complex layouts, handwriting, or multilingual texts [1, 6]. Deep learning OCR models, such as DocTR, PP-OCR, and TrOCR, strike a robust balance between accuracy and computational efficiency, making them ideal for students and academic users as they are free and perform well on structured forms and multilingual documents [1, 2, 3, 5]. Large multimodal models (LMMs) like GPT-4V and Claude 3.5 lead the field, excelling at extracting structured information from handwritten, noisy, or semantically complex documents, though their slower API-based inference and costs limit their use to enterprise or high-stakes applications [5, 7, 8, 9]. Future work should focus on hybrid pipelines combining traditional OCR's speed with LMMs' semantic capabilities, alongside real-time, privacy-preserving OCR systems to support historical, handwritten, and low-resource language documents [1-9].
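The hybrid direction suggested above can be sketched in a few lines: route each page through a fast traditional engine first, and escalate to an LMM only when recognition confidence is low. The `fast_ocr` and `lmm_ocr` callables and the 0.9 threshold below are hypothetical stand-ins, not components evaluated in this study:

```python
from typing import Callable, Tuple

def hybrid_ocr(image_path: str,
               fast_ocr: Callable[[str], Tuple[str, float]],
               lmm_ocr: Callable[[str], str],
               min_confidence: float = 0.9) -> str:
    """Try the cheap traditional engine first; escalate to an LMM on low confidence."""
    text, confidence = fast_ocr(image_path)   # e.g. Tesseract with a page confidence
    if confidence >= min_confidence:
        return text                           # clean scan: the cheap path suffices
    return lmm_ocr(image_path)                # hard case: pay for the multimodal model

# Stub engines to illustrate the control flow
result = hybrid_ocr("page.png",
                    fast_ocr=lambda p: ("garbled", 0.4),
                    lmm_ocr=lambda p: "clean transcription")
print(result)  # → clean transcription
```

Such routing keeps per-page cost close to that of traditional OCR while reserving LMM accuracy for the documents that need it.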
REFERENCES
[1] Sinha, Rasha. "Digitization of Document and Information Extraction using OCR." arXiv preprint arXiv:2506.11156 (2025).
[2] Du, Yuning, et al. "PP-OCR: A practical ultra lightweight OCR system." arXiv preprint arXiv:2009.09941 (2020).
[3] Carlson, Jacob, Tom Bryan, and Melissa Dell. "Efficient OCR for building a diverse digital history." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
[4] Prakash, Puneeth, and Sharath Kumar Yeliyur Hanumanthaiah. "A comparative analysis of optical character recognition models for extracting and classifying texts in natural scenes." Int J Artif Intell, ISSN 2252-8938: 1291.
[5] Liu, Yuliang, et al. "OCRBench: on the hidden mystery of OCR in large multimodal models." Science China Information Sciences 67.12 (2024): 220102.
[6] Jain, Pooja, Kavita Taneja, and Harmunish Taneja. "Which OCR toolset is good and why: A comparative study." Kuwait Journal of Science 48.2 (2021).
[7] Yang, Zhibo, et al. "CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy." arXiv preprint arXiv:2412.02210 (2024).
[8] Kim, Seorin, et al. "Early evidence of how LLMs outperform traditional systems on OCR/HTR tasks for historical records." arXiv preprint arXiv:2501.11623 (2025).
[9] Singh, Iknoor, Miguel Colom, and Kalina Bontcheva. "A Comparative Analysis of OCR Models on Diverse Datasets: Insights from Memes and Hiertext Dataset." Proceedings of the Winter Conference on Applications of Computer Vision. 2025.
