LOK RAKSHA : INDIAN Personally Identifiable Information System

DOI : 10.17577/IJERTV15IS040366

Mihir Gosavi

Artificial Intelligence and Data Science

A.C.Patil College of Engineering Kharghar, Navi Mumbai

Vipulkumar Gupta

Artificial Intelligence and Data Science

A.C.Patil College of Engineering Kharghar, Navi Mumbai

Nisarga Lande

Artificial Intelligence and Data Science

A.C.Patil College of Engineering Kharghar, Navi Mumbai

Purva Khandagale

Artificial Intelligence and Data Science

A.C.Patil College of Engineering Kharghar, Navi Mumbai

Shipali Bansu

Artificial Intelligence and Data Science

A.C.Patil College of Engineering Kharghar, Navi Mumbai

Abstract: The increasing adoption of digital platforms across India has led to a significant rise in the collection and processing of sensitive personal information. Data such as Aadhaar numbers, PAN details, and contact information is frequently handled across sectors, raising concerns about privacy, security, and regulatory compliance. Existing solutions for Personally Identifiable Information (PII) detection are largely designed for Western datasets and often fail to accommodate India's multilingual and structurally diverse documents. To address these challenges, this paper introduces LOK RAKSHA, a unified system for identifying and protecting PII in Indian documents. The proposed framework combines Optical Character Recognition (OCR), rule-based methods, and machine learning models to enable accurate and context-aware detection. It is specifically tailored to support regional languages and align with the Digital Personal Data Protection (DPDP) Act, 2023. The system also incorporates explainability features to improve transparency and usability in real-world applications.

Index Terms: Personally Identifiable Information, Data Privacy, Optical Character Recognition, Natural Language Processing, Indian Data Protection, Compliance.

  1. Introduction

    The rapid digitization of services in India, particularly across government, banking, and healthcare sectors, has significantly increased the volume of personal data being generated and stored. Systems such as Aadhaar and PAN have become central to identity verification, so sensitive information is processed more frequently in digital environments. While this shift has improved accessibility and efficiency, it has also introduced new risks related to data exposure and misuse. Similar concerns regarding large-scale handling of Personally Identifiable Information (PII) have been highlighted in recent studies on privacy and data protection systems [2], [4].

    During our initial observations, we found that many existing tools struggle when applied to Indian documents, especially those containing multiple languages or non-standard formats. For example, documents often include a mix of English and regional scripts, which reduces the effectiveness of traditional detection systems. Research on multilingual datasets further emphasizes the importance of region-specific models for accurate PII detection [5].

    Another limitation is the lack of alignment with India's Digital Personal Data Protection (DPDP) Act, 2023. Most available solutions are designed around Western regulations and fail to address local compliance requirements. Existing frameworks for automated PII detection and redaction often focus on global standards, such as GDPR, without considering region-specific legal constraints [4], [6].

    To address these challenges, we propose LOK RAKSHA, a system designed specifically for Indian use cases. Instead of relying on a single technique, the system combines OCR, rule-based detection, and a hybrid AI model to improve accuracy across different document types. Prior work has shown that hybrid approaches integrating OCR and AI models can enhance detection performance in complex documents [1], [3]. The goal is not only to detect sensitive information but also to ensure that it is handled in a compliant and interpretable manner.

  2. Related Work

    Earlier approaches to PII detection largely depended on regular expressions and predefined rules. These methods work well for clearly structured data such as phone numbers or identification codes, but their performance drops when the input becomes unstructured or multilingual. In several studies, rule-based systems were shown to be effective only in controlled environments, with limited adaptability to real-world data variations [2].

    To improve on this, researchers have incorporated Optical Character Recognition (OCR) techniques to process scanned documents and images. For instance, OCR combined with hybrid AI models has been used to detect sensitive information in visual data, including identity cards and signatures [1]. Similarly, hybrid systems integrating OCR with regex-based detection have demonstrated improved performance in extracting and masking PII from document images [3]. However, these approaches are often trained on limited datasets and may not generalize well to region-specific formats.

    Hybrid models that combine rule-based techniques with machine learning have been proposed to balance precision and flexibility. These systems can handle both structured and semi-structured data, but their effectiveness depends heavily on the availability of diverse and high-quality training data [2]. In multilingual environments like India, this remains a significant challenge.

    More recent work has explored the use of Named Entity Recognition (NER) to detect context-dependent personal information. These approaches improve the identification of implicit PII in unstructured text and are increasingly used in modern privacy-preserving systems [4]. Additionally, large language models have been applied to context-aware redaction tasks, enhancing semantic understanding and masking accuracy [6].

    Despite these advancements, most existing systems are designed with a focus on Western datasets and regulatory frameworks. Datasets developed for specific languages, such as Korean or other non-English corpora, highlight the importance of localized training for improving detection accuracy [5]. However, solutions tailored specifically for Indian datasets, which are characterized by multiple languages, mixed scripts, and unique identification formats, remain limited. This gap directly motivates the development of LOK RAKSHA.

  3. Proposed Framework

    1. System Overview

      LOK RAKSHA is designed as a modular and scalable framework aimed at protecting PII in Indian digital documents. The system consists of multiple interconnected components that process input files and generate redacted outputs along with risk assessments and explanatory insights.

    2. Text Extraction with OCR

      In real-world scenarios, documents are often available in formats such as scanned images, PDFs, and handwritten forms. To process such data, the framework incorporates an OCR module optimized for Indian languages. This module converts visual content into structured text, enabling further analysis in subsequent stages.
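
      OCR output from regional-language documents can contain native-script numerals, which digit-based rules written for ASCII would miss. The paper does not describe its post-OCR processing, so the following is only a minimal sketch of one possible normalization step; the function name `normalize_digits` is hypothetical.

```python
import unicodedata


def normalize_digits(text: str) -> str:
    """Map any Unicode decimal digit (for example Devanagari) to ASCII 0-9.

    Assumption: this helper is illustrative and not part of LOK RAKSHA as
    described in the paper. All non-digit characters are left unchanged.
    """
    out = []
    for ch in text:
        # unicodedata.decimal returns the digit value for decimal-digit
        # characters and the supplied default (None) for everything else.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


if __name__ == "__main__":
    print(normalize_digits("आधार १२३४ ५६७८ ९०१२"))
```

      Running the normalization before pattern matching lets a single set of ASCII-digit rules cover documents that mix scripts.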

    3. Hybrid PII Detection

      The detection process combines multiple techniques to achieve high accuracy:

      • Rule-based regular expressions are used to identify structured identifiers such as Aadhaar numbers, PAN details, passports, and phone numbers.

      • NER models trained on Indian datasets are employed to detect entities such as names, email addresses, and locations.

      This layered approach minimizes false positives while increasing recall across different document types and formats.
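
      The rule-based layer described above can be sketched with a few regular expressions. The patterns are assumptions based on the publicly known formats (12-digit Aadhaar numbers that do not start with 0 or 1, ten-character PANs, ten-digit mobile numbers starting with 6 to 9); they are not the authors' actual rules, and the names `PATTERNS`, `Finding`, and `find_structured_pii` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    kind: str    # label of the matched identifier type
    value: str   # matched text
    start: int   # start offset in the input
    end: int     # end offset in the input


# Illustrative patterns only (assumed, not taken from the paper).
PATTERNS = {
    # 12 digits, optionally grouped 4-4-4 with spaces or hyphens.
    "AADHAAR": re.compile(r"(?<!\d)[2-9]\d{3}[ -]?\d{4}[ -]?\d{4}(?!\d)"),
    # Five letters, four digits, one letter.
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    # Ten digits starting with 6-9, with an optional +91 prefix.
    "PHONE": re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)"),
}


def find_structured_pii(text: str) -> List[Finding]:
    """Return all pattern matches, ordered by position in the text."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append(Finding(kind, m.group(), m.start(), m.end()))
    return sorted(findings, key=lambda f: f.start)


if __name__ == "__main__":
    sample = "Aadhaar 2345 6789 0123, PAN ABCDE1234F, mobile +91 9876543210."
    for finding in find_structured_pii(sample):
        print(finding.kind, finding.value)
```

      The digit lookarounds keep a pattern from matching inside a longer run of digits. Ambiguous strings, such as a 12-digit number written without separators, would still need the contextual (NER) layer or a checksum to resolve.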

    4. Compliance Engine

      To ensure adherence to the DPDP Act, 2023 and UIDAI masking standards, the compliance module enforces legal rules on how PII should be masked or stored. It validates detection results and guides the redaction process to ensure lawful handling of sensitive data.
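
      As an illustration of the masking step, the sketch below follows the masked-Aadhaar convention of hiding the first eight digits and showing only the last four. The generic `mask_keep_last` rule applied to other identifiers is an assumption made for this example, not a format taken from the DPDP Act or from the paper, and both function names are hypothetical.

```python
import re


def mask_aadhaar(value: str) -> str:
    """Hide the first eight digits of an Aadhaar number, keep the last four."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 12:
        raise ValueError("expected a 12-digit Aadhaar number")
    return f"XXXX XXXX {digits[-4:]}"


def mask_keep_last(value: str, keep: int = 4, fill: str = "X") -> str:
    """Mask every letter or digit except the last `keep`; keep separators.

    Assumption: a generic policy chosen for illustration only.
    """
    total = sum(ch.isalnum() for ch in value)
    to_hide = max(total - keep, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_hide > 0:
            out.append(fill)
            to_hide -= 1
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(mask_aadhaar("2345 6789 0123"))
    print(mask_keep_last("ABCDE1234F"))
```

      Keeping the masking rules in small, separate functions makes it straightforward for a compliance module to swap in a different policy per identifier type.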

    5. Risk Scoring and Explainability

      Detected PII is evaluated by a risk scoring module that categorizes the sensitivity and potential exposure level. A Generative AI layer generates human-readable explanations and summarizations of detection results, improving transparency and user interpretability.
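
      The paper does not specify how the risk score is computed. The sketch below shows one simple possibility, a weighted sum over detected PII types; every weight, threshold, and name in it (`SENSITIVITY`, `risk_score`, `risk_level`) is a hypothetical choice made for illustration.

```python
from collections import Counter
from typing import Iterable

# Hypothetical sensitivity weights per PII type (not from the paper).
SENSITIVITY = {
    "AADHAAR": 10,
    "PAN": 8,
    "PASSPORT": 8,
    "PHONE": 4,
    "EMAIL": 3,
    "NAME": 2,
    "LOCATION": 2,
}


def risk_score(kinds: Iterable[str]) -> int:
    """Sum one weight per distinct PII type, plus a capped bonus for repeats."""
    score = 0
    for kind, count in Counter(kinds).items():
        weight = SENSITIVITY.get(kind, 1)  # unknown types get a minimal weight
        score += weight + min(count - 1, 3)
    return score


def risk_level(score: int) -> str:
    """Bucket a numeric score into a coarse label (thresholds are assumed)."""
    if score >= 15:
        return "HIGH"
    if score >= 6:
        return "MEDIUM"
    return "LOW"


if __name__ == "__main__":
    detected = ["AADHAAR", "PAN", "PHONE"]
    total = risk_score(detected)
    print(total, risk_level(total))
```

      A transparent additive score like this is easy to explain to users, which fits the framework's emphasis on interpretability; a production system would need weights calibrated against its own policy.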

  4. Results and Discussion

    The LOK RAKSHA framework was tested on a diverse collection of Indian digital documents, including scanned identity proofs, structured records, and multilingual datasets containing both structured and unstructured content. The evaluation focused on the system's ability to accurately detect and redact Indian-specific PII such as Aadhaar numbers, PAN details, phone numbers, names, and address-related information.

    Fig. 1. Overall Architecture of the Proposed LOK RAKSHA System

    The hybrid detection approach demonstrated strong performance in identifying structured data using regex-based rules, while NER and Tesseract models significantly improved detection in unstructured and context-dependent scenarios. Compared to purely rule-based systems, the proposed approach showed a noticeable reduction in false positives, especially in documents containing unrelated numerical data.
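
    The paper does not state how unrelated numbers are filtered out. One well-known check that can help, shown here only as an illustration, is the Verhoeff checksum carried by the last digit of an Aadhaar number: a 12-digit string that fails it can be discarded. The function names are hypothetical, and whether LOK RAKSHA uses this check is an assumption.

```python
import re

# Standard Verhoeff tables: dihedral-group multiplication, permutation, inverse.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(payload: str) -> int:
    """Compute the check digit to append to a string of digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def verhoeff_is_valid(number: str) -> bool:
    """Return True if the digit string (check digit last) passes Verhoeff."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def is_plausible_aadhaar(candidate: str) -> bool:
    """Format check plus checksum; it does not prove the number was issued."""
    digits = re.sub(r"[ -]", "", candidate)
    if not re.fullmatch(r"[2-9]\d{11}", digits):
        return False
    return verhoeff_is_valid(digits)
```

    Because the checksum catches every single-digit error, most arbitrary 12-digit strings such as invoice or account numbers fail it, which is one way a regex-only detector's false positives can be reduced.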

    The OCR component successfully extracted text from image-based and multilingual documents, enabling the system to handle real-world data commonly found in government, banking, and healthcare sectors. The compliance module ensured that all detected PII was masked according to DPDP Act guidelines, strengthening data protection and regulatory alignment.

    From a usability perspective, the inclusion of a risk scoring system allowed prioritization based on sensitivity levels, while the Generative AI module provided clear and understandable explanations. These features enhance the practical usability of the system for organizations managing large volumes of sensitive data.

    Overall, the results indicate that LOK RAKSHA is an effective and scalable solution for PII protection in India, addressing the limitations of existing systems by incorporating localization, contextual understanding, and explainability.

    TABLE I
    Comparison of PII Detection Techniques Used in LOK RAKSHA

    | Technique | PII Type Detected | Key Advantage |
    |---|---|---|
    | Regex-Based Detection | Aadhaar, PAN, Passport, Phone Numbers | High precision for structured and format-based identifiers |
    | Named Entity Recognition (NER) | Names, Addresses, Email IDs, Locations | Context-aware detection in unstructured text |
    | OCR Layer | Scanned documents and images | Enables text extraction from multilingual and image-based records |
    | Generative AI Module | Explanation and compliance reasoning | Enhances transparency and user interpretability |

    Table I highlights the hybrid detection strategy adopted in LOK RAKSHA, where rule-based, statistical, and hybrid AI approaches are combined to achieve accurate and context-aware PII detection across diverse Indian document formats.

    TABLE II
    Feature Comparison with Existing PII Protection Systems

  5. Conclusion

    LOK RAKSHA provides a strong foundation for addressing India's growing data privacy challenges. Unlike conventional systems that rely solely on pattern matching, the proposed framework integrates regex-based methods, NER models, and advanced language models to effectively process multilingual and complex document structures. By embedding compliance with the DPDP Act, 2023 directly into the system, it demonstrates how technological solutions can align with regulatory requirements. This approach offers a reliable and scalable solution for organizations aiming to secure sensitive personal data in an evolving digital landscape.

  6. Future Work

    Future enhancements will focus on incorporating multimodal AI techniques capable of processing text, images, and audio simultaneously to improve detection accuracy. Additional efforts will be directed toward expanding support for more Indian languages and dialects. Furthermore, the integration of privacy-preserving techniques such as federated learning will be explored to enhance model performance while ensuring data confidentiality.

References

  1. O. Shaikh et al., "Detection and Classification of Personally Identifiable Information in Images Using Artificial Intelligence," TechRxiv, May 2024.

  2. J. Jaikumar, Mohana, and P. Suresh, "Privacy-Preserving Personal Identifiable Information (PII) Label Detection Using Machine Learning," in Proc. Int. Conf. Computing, Communication and Networking Technologies (ICCCNT), 2023, pp. 1-5, doi: 10.1109/ICCCNT56998.2023.10307924.

  3. D. K. Tunwal et al., "A Hybrid OCR and Regex-based PII Detection and Masking Tool with Deepfake and Forgery Detection Capabilities," Int. J. Advance Research and Innovative Ideas in Education, vol. 11, no. 3, pp. 1460-1466, 2025.

  4. S. Asthana et al., "Adaptive PII Mitigation Framework for Large Language Models," arXiv preprint arXiv:2501.12465, 2025.

  5. L. Fei et al., "KDPII: A New Korean Dialogic Dataset for the De-identification of Personally Identifiable Information," IEEE Access, vol. 12, pp. 135626-135641, 2024.

  6. P. Thetbanthad, B. Sathanarugsawait, and P. Praneetpolgrang, "Automated Redaction of Personally Identifiable Information on Drug Labels Using Optical Character Recognition and Large Language Models for Compliance with Thailand's Personal Data Protection Act," Applied Sciences, vol. 15, no. 9, p. 4923, 2025.

  7. G. Gambarelli, A. Gangemi, and R. Tripodi, "SPeDaC: A New Resource and Benchmark for Training Sensitive Personal Data Classifiers," IEEE Access, vol. 11, pp. 10864-10880, 2023.

| Feature | Existing Systems | LOK RAKSHA |
|---|---|---|
| Indian Government ID Support | Partial | Yes |
| Multilingual Document Handling | Limited | Yes |
| Context-Aware Detection | No | Yes |
| DPDP Act Compliance | No | Yes |
| Explainability via GenAI | No | Yes |
| Risk Scoring Mechanism | No | Yes |