Information Extraction from Unstructured data using Augmented-AI and Computer Vision


Aditya N. Parikh Department of Electronics and Telecommunication Engineering

Vishwakarma Institute of Technology, Pune

Abstract: Information extraction (IE) is often used to extract meaningful information from unstructured and unlabeled data. Conventional methods of data extraction, which apply OCR and then pass the output through an extraction engine, are inefficient on large data and have their limitations. In this paper, a distinctive technique of information extraction is proposed using A2I and computer vision technologies, which also includes NLP.

Keywords: AWS, Augmented, OCR, Textract, NLP


    The information extraction (IE) process extracts meaningful structured information, in the form of entities, relations, objects, events, and other types, from unstructured data. The extracted information is then used to prepare the data for analysis. As a result, IE improves data analysis by transforming unstructured data efficiently and accurately. A variety of approaches have been proposed for different kinds of data, such as text, image, audio, and video.

    Scale is a further issue, because the volume of communication to process can be enormous. Processing this data manually is very expensive, and business and market demands have prompted a significant amount of research and development in the automated processing of administrative documents.

    PDF documents are commonly used in today's workplace for transferring business information, both internally and with trading partners. Typical examples are invoices, purchase orders, shipping notes, pricing lists, and similar documents. Despite their use as a digital substitute for paper, PDF documents are difficult to process automatically. Because many PDFs are meant to communicate information to humans rather than machines, their content is only as accessible to a machine as material printed on paper. Such PDFs contain unstructured information that lacks a pre-defined data model or is not arranged in a pre-defined manner. They are usually text-heavy, with a combination of figures, dates, and statistics.


    Due to the large volume and complexity of unstructured data, extracting valuable information from its various forms has become a difficult undertaking. A thorough literature study was carried out to identify the current difficulties. Many commercial systems that provide such services are based on pre-set spatial templates, which essentially map the regions where the OCR must read the fields to be extracted. Such simple tactics may work flawlessly in small settings, but they begin to present complications in large-scale scenarios or when the document layout changes over time.

    Such methods use a sample document to learn a local layout structure, which is then registered to the test images to locate the fields to extract. However, these systems require the user to designate multiple semantically important layout entities in order to construct a suitable model.


    1. Data conversion, segmentation and pre-processing

      Input data can be in any format, including PNG, JPEG, and PDF. Converting it from its original format to a standard format, preferably an image, will inevitably be the first step. Numerous libraries are available for this; PyPDF and poppler are two of the popular ones. This process can be automated using scripting.
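The conversion step can be scripted in a few lines. The following is a minimal sketch, assuming the poppler command-line tool `pdftoppm` is installed; the function names, the output directory, and the 300 dpi default are hypothetical choices for illustration, not part of the method described in the paper.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the poppler `pdftoppm` command that renders each page as a PNG."""
    # -png selects PNG output, -r sets the resolution in dots per inch.
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def convert_pdf_to_png(pdf_path, out_dir, dpi=300):
    """Render every page of `pdf_path` into `out_dir` and return the PNG paths."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm (poppler) is not installed")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    # pdftoppm appends "-<page number>.png" to the prefix it is given.
    subprocess.run(build_pdftoppm_command(pdf_path, out_dir / stem, dpi), check=True)
    return sorted(out_dir.glob(f"{stem}-*.png"))
```

Calling `convert_pdf_to_png("invoice.pdf", "pages")` would then yield one image per page, ready for the pre-processing stage.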

      Image segmentation is the digital image processing step that partitions an image into regions. Pre-processing of the input, which is mostly images, is done using computer vision libraries. Changing the resolution and resizing are also part of this step.
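To make the pre-processing step concrete, here is a small dependency-free sketch of two typical operations, resizing and binarization, on a grayscale image stored as a list of rows. In practice a computer vision library would do this work; the function names and the fixed threshold of 128 are assumptions made for illustration.

```python
def resize_nearest(image, new_height, new_width):
    """Resize a grayscale image (list of rows) with nearest-neighbour sampling."""
    old_height, old_width = len(image), len(image[0])
    return [
        [image[r * old_height // new_height][c * old_width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]


def binarize(image, threshold=128):
    """Map each pixel to 1 (ink, darker than the threshold) or 0 (background)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


# A 2x2 checkerboard upscaled to 4x4 and then binarized.
tiny = [[0, 255],
        [255, 0]]
print(binarize(resize_nearest(tiny, 4, 4)))
```

A fixed threshold is the simplest choice; scanned documents with uneven lighting usually need an adaptive one.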

    2. Using Deep Learning for object detection

      Objects in the input can take the form of tables, indexes, embedded images, etc. This method focuses mainly on detecting tabular objects and extracting them.

      TableNet is one such model.

      1. In 2019, a team from TCS Research proposed this deep learning architecture. The major objective was to retrieve data from tables in images captured with mobile phones or cameras. The technique first accurately detects the tabular region inside an image, and then recognises and extracts information from the rows and columns of the detected table.

      2. Dataset Used and Architecture:

        The dataset used was Marmot. It has 2000 documents in PDF format, all of which are accompanied by ground truths. The design is based on the encoder-decoder paradigm for semantic segmentation proposed by Long et al. Table extraction uses the same encoder-decoder network as the FCN design. Tesseract OCR is used to pre-process and modify the images.

        DeepDeSRT is another model, used for table detection and structure recognition.

        1. DeepDeSRT is a neural network system for detecting and understanding tables in document images. The dataset used is the ICDAR 2013 table competition dataset, which has 67 documents totalling 238 pages.

        2. After a table has been detected and its position determined, the next step in understanding its contents is to recognise and locate the rows and columns that make up the table's physical structure. For this, the authors employed a fully convolutional network with VGG-16 weights to extract data from the rows and columns.
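Segmentation-based table detectors of this kind output pixel masks, which still have to be turned into coordinates before rows and columns can be read. The sketch below shows one simple way to do that, assuming the model has already produced a binary table mask and a binary column mask as lists of rows; the function names are hypothetical and this is not the post-processing used by the cited authors.

```python
def mask_bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero region, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def column_spans(column_mask):
    """Return (start, end) x-ranges of consecutive columns containing mask pixels."""
    width = len(column_mask[0])
    occupied = [any(row[c] for row in column_mask) for c in range(width)]
    spans, start = [], None
    for c, filled in enumerate(occupied):
        if filled and start is None:
            start = c                      # a new column region begins
        elif not filled and start is not None:
            spans.append((start, c - 1))   # the region ended at the previous x
            start = None
    if start is not None:
        spans.append((start, width - 1))
    return spans
```

The bounding box can be used to crop the table from the page image, and each column span gives the horizontal slice to pass to OCR.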


    3. Post-processing and validation

    Post-processing is the process of modifying the acquired data to improve the image. The better the data acquired from the camera, the greater the possibilities for improvement. An increasing number of cameras on the market can capture RAW data. RAW files hold much more data per pixel, which helps with post-processing and image enhancement.

    Although post-processing can help to improve an image, it may not be able to turn a poor exposure into a good one. Depending on the desired result, several levels of post-processing are available.

    • Unsupervised image classifications, supervised image classifications, neural network classifiers, simulated annealing classifiers, and fuzzy logic classification systems are all examples of image processing approaches.

    • The most extensively used indices and categorization techniques for land use and land cover

    • Filtering and change detection are examples of post-processing methods.

    • Validation of results and assessment of accuracy

    • Integration of remotely sensed data with other traditional survey and map form data for Earth observation objectives is an example of data integration and spatial modelling.


    1. Named Entity recognition

      Named entity recognition (NER) is one of the key functions of IE systems that extract descriptive entities. It identifies generic or domain-independent entities such as locations, people, and organisations, as well as domain-specific entities such as diseases, drugs, chemicals, and proteins. In this procedure, entities are recognised and semantically categorised into pre-defined classes. Traditional NER systems used Rule-Based Methods (RBM), Learning-Based Methods (LBM), and hybrid methods. IE, in conjunction with NLP, plays a critical role in language modelling and context. Language analysis is carried out using morphological, syntactic, phonetic, and semantic methods. The IE process is made easier by morphologically rich languages such as Russian and English. For morphologically poor languages IE is more challenging, because extra morphological rules are needed to extract nouns.
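A rule-based recogniser of the kind mentioned above can be sketched with regular expressions. The three patterns and the entity labels below are hypothetical examples chosen for business documents; a real rule-based system would need many more rules, and learning-based methods replace them with a trained model.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_entities(text):
    """Return (label, value) pairs in the order they appear in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    # Sorting by start offset restores reading order across all patterns.
    return [(label, value) for _, label, value in sorted(found)]


sample = "Invoice dated 2021-03-15 for $1,250.00, contact billing@example.com."
print(extract_entities(sample))
# -> [('DATE', '2021-03-15'), ('MONEY', '$1,250.00'), ('EMAIL', 'billing@example.com')]
```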

    2. Relation Extraction

      RE (relation extraction) is an IE subtask that extracts significant relationships between entities. By examining the semantic and contextual aspects of data, entities and relations are employed to accurately annotate the data. For RE, supervised methods employ feature-based and kernel-based methodologies.

      To extract one-to-one and many-to-many links between entities, several supervised, weakly supervised, and self-supervised techniques have been proposed. In such work, various lexical, semantic, syntactic, and morphological features are extracted, and learning-based approaches are then used to identify relationships between entities.
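As an illustration of the feature-based approach, the sketch below builds a few lexical features for one candidate entity pair. The feature names and the token-span convention (start inclusive, end exclusive) are assumptions; in a supervised system, dictionaries like this would be vectorised and passed to a classifier that predicts the relation label.

```python
def pair_features(tokens, subject_span, object_span):
    """Lexical features for a candidate entity pair given (start, end) token spans."""
    s_start, s_end = subject_span
    o_start, o_end = object_span
    # Take the tokens lying between the two entities, whichever comes first.
    if s_start < o_start:
        between = tokens[s_end:o_start]
    else:
        between = tokens[o_end:s_start]
    return {
        "subject": " ".join(tokens[s_start:s_end]),
        "object": " ".join(tokens[o_start:o_end]),
        "words_between": [t.lower() for t in between],
        "token_distance": len(between),
        "subject_first": s_start < o_start,
    }


tokens = "Acme Corp issued the invoice to Globex Ltd".split()
print(pair_features(tokens, (0, 2), (6, 8)))
```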

    3. Visual Relation Extraction

      Visual relationship detection extracts information about how objects in an image interact. These semantic representations of the relationships between objects are expressed as triples (Subject, Predicate, Object). Extracting semantic triples from images benefits various real-world applications such as content-based information retrieval, visual question answering, sentence-to-image retrieval, and fine-grained recognition. The main tasks of visual relationship detection in image understanding are object classification and detection, and context or interaction recognition.
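The triple format can be illustrated with a deliberately simple geometric heuristic: given detected objects with bounding boxes, choose a coarse spatial predicate from the offset between box centres. This is a simplified stand-in, not a learned visual relationship model, and the `Box` type, labels, and predicate names are hypothetical.

```python
from collections import namedtuple

# Image coordinates: x grows to the right, y grows downward.
Box = namedtuple("Box", "label left top right bottom")


def spatial_predicate(subject, obj):
    """Pick a coarse spatial predicate from the offset between box centres."""
    dx = (obj.left + obj.right) / 2 - (subject.left + subject.right) / 2
    dy = (obj.top + obj.bottom) / 2 - (subject.top + subject.bottom) / 2
    if dx == 0 and dy == 0:
        return "co-located with"
    if abs(dx) >= abs(dy):
        return "left of" if dx > 0 else "right of"
    return "above" if dy > 0 else "below"


def relation_triples(boxes):
    """Return a (subject, predicate, object) triple for every ordered pair."""
    return [(a.label, spatial_predicate(a, b), b.label)
            for a in boxes for b in boxes if a is not b]


header = Box("header", 0, 0, 100, 20)
table = Box("table", 0, 30, 100, 90)
print(relation_triples([header, table]))
# -> [('header', 'above', 'table'), ('table', 'below', 'header')]
```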

    4. Text recognition

    The text content in photographs may be used to extract a wealth of information. Text within photos and videos provides more meaningful information about the visual material while also improving keyword-based searching, indexing, information retrieval, and automatic image captioning efficiency. Text information extraction (TIE) systems recognise, detect, and locate text in visual data such as photos and videos. Perceptual content and semantic content are two types of visual content. Colour, form, texture, and temporal qualities are examples of perceptual content, whereas semantic content is concerned with the identification and recognition of things, entities, and events.

    OCR (Optical Character Recognition), the technique of recognising characters in photographs or scanned documents, is closely tied to the text recognition task. In one example of extracting information from document images using OCR, a segmentation approach with multiple stages (image pre-processing, feature extraction, character recognition, and digital text conversion) was used to recognise Tamil text in ancient documents and palm manuscripts.
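Staged OCR pipelines usually separate the page into text lines before recognising characters. One classic way to do this is a horizontal projection profile, sketched below for a binarized image in which 1 marks ink. This is a generic illustration under that assumption, not necessarily the segmentation used in the work on Tamil manuscripts.

```python
def segment_text_lines(binary_image):
    """Return (first_row, last_row) for each run of rows that contains ink."""
    # Horizontal projection profile: amount of ink in every pixel row.
    profile = [sum(row) for row in binary_image]
    lines, start = [], None
    for r, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = r                     # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, r - 1))  # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
print(segment_text_lines(page))
# -> [(1, 2), (4, 4)]
```

Applying the same idea to the columns of each line gives a rough character or word segmentation.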


    A. A2I

    In essence, Augmented Intelligence is Artificial Intelligence with a twist. It is also known as intelligence amplification (IA), cognitive augmentation, decision assistance, machine augmented intelligence, and improved intelligence. While Artificial Intelligence is the construction of machines that operate and behave like people, Augmented Intelligence is the application of those same machines to improve the human worker. Augmented Intelligence entails people and machines cooperating to maximise corporate value by using their respective skills. In other words, the major goal of IA is to enable humans to operate more effectively and efficiently.

    Augmented intelligence platforms can collect various sorts of data (structured and unstructured) from a variety of sources, including disparate and siloed systems, and present it in a way that offers human employees a comprehensive 360-degree view of each customer. The information drawn from such data and given to the user is richer and more comprehensive than before. Consequently, workers are better informed about what is going on in their business, how it may affect their clients, and the potential opportunities and risks. The technology's strength comes from its ability to combine a large amount of data with a human touch.


    C. AWS and GCV services

    Amazon Textract is an ML service that extracts text, handwriting, and data from scanned documents automatically. To recognise, analyse, and extract data from forms and tables, it goes beyond simple optical character recognition (OCR). Many businesses now manually extract data from scanned documents like PDFs, pictures, tables, and forms, or use basic OCR software that requires human configuration (which often must be updated when the form changes). Textract utilises machine learning to read and analyse any form of document, reliably extracting text, handwriting, tables, and other data without the need for user intervention. Whether you're automating loan processing or extracting information from invoices and receipts, you can swiftly automate document processing and act on the information gathered.
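As a rough sketch of how such a service is consumed, the code below calls Textract's text detection through boto3 and collects the detected lines with their confidence scores. It assumes boto3 is installed and AWS credentials are configured; the helper names are hypothetical, and the parsing helper works on any dictionary shaped like a Textract response, so it can be tried without an AWS account.

```python
def call_textract(image_bytes):
    """Send one page image to Textract; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the parsing helper works without it
    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": image_bytes})


def lines_with_confidence(response):
    """Collect (text, confidence) for every LINE block in a Textract response."""
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written sample in the shape of a Textract response (values are made up).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice No: 1042", "Confidence": 99.2},
        {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.5},
        {"BlockType": "LINE", "Text": "Total: $1,250.00", "Confidence": 87.4},
    ]
}
print(lines_with_confidence(sample_response))
```

Tables and form fields require the document analysis operation instead of plain text detection, and its response links blocks through relationships that need additional parsing.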



    B. NLP services

    Information extraction software allows you to extract data from text documents, databases, webpages, and other sources. IE can extract data from machine-readable text that is unstructured, semi-structured, or structured. It is, however, most often used in natural language processing (NLP) to extract structured content from unstructured material.

    In language modelling and contextualization, IE, in conjunction with NLP, plays a critical role. Languages are analysed morphologically, syntactically, phonetically, and semantically. The IE process is aided by morphologically rich languages such as Russian and English. Due to the lack of a comprehensive lexicon, IE is difficult for morphologically deficient languages, which necessitates extra work for morphological rules to extract nouns.

    AWS Architecture [ image by AWS ]

    Textract can extract the data in minutes instead of hours or days. Amazon Augmented AI can also be used to add human review to your models, providing oversight and a second check on sensitive data.
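The human-review idea can be sketched as a confidence gate: fields extracted with high confidence are accepted automatically, and the rest are queued for a person to check. The threshold of 90 and the field names are hypothetical; in an actual Amazon Augmented AI deployment the flagged items would be handed to a configured human review workflow rather than a Python list.

```python
def route_for_review(fields, threshold=90.0):
    """Split (name, value, confidence) fields into accepted and needs-review lists."""
    accepted, needs_review = [], []
    for name, value, confidence in fields:
        target = accepted if confidence >= threshold else needs_review
        target.append((name, value))
    return accepted, needs_review


# Made-up extraction results with confidence scores on a 0-100 scale.
fields = [
    ("invoice_no", "1042", 99.2),
    ("total", "$1,250.00", 87.4),
]
accepted, needs_review = route_for_review(fields)
print(accepted)      # -> [('invoice_no', '1042')]
print(needs_review)  # -> [('total', '$1,250.00')]
```

Lowering the threshold reduces the human workload at the cost of letting more uncertain values through, so it is typically tuned per field.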


    The goal of this systematic literature review is to investigate state-of-the-art strategies for IE from unstructured big data formats including text, image, audio, and video, as well as their limitations. In addition, the difficulties of using IE in a big data context have been identified. With the massive growth of unstructured big data, data analysis and mining are becoming more complex. Deep learning, with its generalizability, adaptability, and reduced need for human involvement, is a key player in this regard. To cope with the dynamicity and sparsity of unstructured data, more flexible and scalable strategies are needed to process exponentially expanding data. Overall, present IE approaches outperform traditional techniques for bigger datasets, but they are unable to adequately cope with the fast growth of unstructured big data, particularly streaming data. When these IE approaches are incorporated into a big data environment, scalability, accuracy, and latency are all key considerations. In big data IE, Apache MapReduce is also experiencing scalability issues. MapReduce-based deep learning methods are the future of big data IE systems for overcoming these issues. Healthcare analytics, surveillance, e-Government systems, social media analytics, and corporate analytics will all benefit from these technologies.

    IE systems in an unstructured big data environment need enhanced data preparation before extraction, semantically and contextually rich processing, attention to pragmatics, and sophisticated IE techniques. To solve the problems of multidimensional unstructured big data, scalable, computationally efficient, and consolidated IE systems are necessary.


The review's main goal was to look at the difficulties that IE systems face when dealing with multidimensional unstructured big data. A detailed discussion of IE techniques across a variety of data types shows that data preparation is equally important to the efficiency of IE systems. Advanced data improvement techniques will also help IE systems run more efficiently. Building on the review's conclusions, a usability enhancement model for unstructured big data will be developed in order to extract the most usable information from this data.


