Design and Implementation of a Chatbot-Based System for Legal Document Retrieval

Neha Dinakar Ail; Murari B K; Pragathi Jayakar; Sunith Kumar T

doi:10.17577/IJERTCONV14IS010042

Techprints 9.0 - 2026 (Volume 14 - Issue 01)

Open Access
Authors : Neha Dinakar Ail, Murari B K, Pragathi Jayakar, Sunith Kumar T
Paper ID : IJERTCONV14IS010042
Volume & Issue : Volume 14, Issue 01, Techprints 9.0
Published (First Online) : 01-03-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Neha Dinakar Ail

Department of Computer Applications St Joseph Engineering College, Mangaluru

Pragathi Jayakar

Department of Computer Applications St Joseph Engineering College, Mangaluru

Murari B K

Assistant Professor Department of Computer Applications

St Joseph Engineering College, Mangaluru

Sunith Kumar T

Assistant Professor Department of Computer Applications

St Joseph Engineering College, Mangaluru

Abstract – We propose a role-based legal document management system designed to help lawyers and their assistants organize and retrieve case files more efficiently. The platform allows uploading and accessing legal documents under civil and criminal classifications, simplifying the handling of large legal datasets. At the core of the system is an AI-driven chatbot that uses TF-IDF and cosine similarity to locate relevant PDF segments. These segments are then processed by Gemini AI to generate clear, contextual responses. Built with Flask, Streamlit, and SQLite, the system is tailored for small and mid-level law firms aiming to enhance their workflow and file organization.

Index Terms – Legal chatbot, document retrieval, file management, Flask, Gemini AI, role-based access, TF-IDF, cosine similarity, Streamlit.

INTRODUCTION

The legal industry is rapidly evolving, and with it comes an explosion in the volume of case-related documentation. From petitions and affidavits to court orders and evidence files, legal professionals today are required to manage vast quantities of digital and physical records. This growing information load can result in delayed access, misplacement of critical data, and a lack of standardization in document handling, particularly in small to mid-sized law offices that often lack robust IT infrastructure.

In such environments, the reliance on manual processes, like sifting through paper files or browsing scattered PDFs, leads to inefficiencies and a significant drain on time and resources. Moreover, the need to locate specific information from these documents under time constraints, such as during client consultations or court proceedings, further exacerbates the problem.

To address these challenges, we propose an AI-powered legal document management platform that streamlines the process of storing, retrieving, and interacting with case files. The solution is built around a role-based system where lawyers and assistants have clearly defined privileges. Users can upload case-related PDF documents under predefined categories (civil or criminal) and access them through an intelligent chatbot interface. The system architecture is lightweight yet functional, combining Python Flask for the backend, SQLite for database management, and Streamlit for the chatbot interface. PDF files are processed using PyPDF2, vectorized using the TF-IDF technique, and matched against the user's query using cosine similarity. The final response generation is handled by Gemini AI, which delivers clear, human-readable answers based on the context of the user's query. The goal is to simplify access to legal information, reduce dependency on manual lookup, and improve overall productivity within legal offices.
PROBLEM STATEMENT

Despite the ongoing digitization in many industries, legal professionals continue to face obstacles in managing and retrieving documents effectively. Most law offices still depend on traditional file systems or basic folder structures on shared drives. These methods are not only outdated but also fail to offer intelligent search, classification, or access control features.

Manual retrieval of specific information, such as dates, case numbers, or involved parties, requires lawyers to spend excessive time scrolling through large files or multiple folders. This approach is not scalable and increases the likelihood of human error, particularly when urgent access to a case detail is required.

While some commercial solutions offer document management features, they are either cost-prohibitive or overloaded with generic tools not tailored for the legal domain. Moreover, most do not offer features like chatbot-based search, role-based access, or intelligent response generation from file contents. These shortcomings make them unsuitable for small legal practices that require simplicity, speed, and accuracy.

The absence of a smart, scalable system that supports file upload, structured classification, role-based access, and AI-powered querying has created a significant gap in legal workflow optimization. There is a clear need for a focused solution that combines ease of use with intelligent search and secure access control, enabling legal professionals to manage their case files more efficiently and interact with them using natural language.
RELATED WORK

The use of artificial intelligence in legal tools has gained momentum. In [1], researchers proposed a chatbot legal assistant with natural language processing capabilities, though without file-level search.

TF-IDF and cosine similarity have been applied in legal document classification as explored in [2], offering effective techniques for comparing legal texts.

A lightweight legal document management approach was explored in [4], focusing on low-resource deployment, which aligns closely with our goals. In addition, role-based access mechanisms for secure user management are thoroughly discussed in [3].

Several works, such as [7] and [8], highlight how AI can support both legal research and client-facing services in small and medium-sized practices. Meanwhile, [9] explores the importance of explainability in AI outputs, which is especially critical in sensitive domains like law.

Google's Gemini AI model, used in our chatbot, was introduced in [5] and forms the foundation of our response generation layer. Prior research also supports the viability of chatbot-based document retrieval, as seen in [6] and [10], where similar architectures were tested for information retrieval tasks.

Many existing systems target large-scale judiciary databases and are often poorly suited to individual or small-office needs. Our system addresses this limitation by being lightweight and specifically optimized for smaller teams.
METHODOLOGY

The proposed system follows a modular, role-based architecture that separates users into two categories: lawyers and assistants. Lawyers have full administrative access, including the ability to manage assistants, upload case files, and query documents via the chatbot. Assistants are limited to uploading and viewing case files.
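The role split described above can be pictured as a small permission table. The following is a minimal sketch, not the system's actual code: the `Role` enum, the permission strings, and the `is_allowed` helper are all hypothetical names chosen for illustration, and only the capabilities themselves come from the description.

```python
from enum import Enum


class Role(Enum):
    LAWYER = "lawyer"
    ASSISTANT = "assistant"


# Permission names are illustrative assumptions; the capabilities per role
# follow the description in the text.
PERMISSIONS = {
    Role.LAWYER: {"upload_file", "view_file", "query_chatbot", "manage_assistants"},
    Role.ASSISTANT: {"upload_file", "view_file"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role may perform the named action."""
    return action in PERMISSIONS.get(role, set())
```

A route handler or dashboard could call `is_allowed(current_role, "query_chatbot")` before exposing the chatbot, so that the rule lives in one place rather than being repeated per page.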

PDF documents are uploaded by case type (civil or criminal), along with metadata such as a custom file name and case number. These files are processed using the PyPDF2 library, which extracts the raw text content. The text is divided into chunks, which are vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. Cosine similarity is applied to compare the user's query with the document chunks, and the most relevant match is selected.
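To make the retrieval step concrete, the sketch below chunks text, builds TF-IDF vectors, and picks the chunk with the highest cosine similarity to a query. It is a pure-Python illustration under stated assumptions: the paper does not say which TF-IDF implementation, chunk size, tokenizer, or IDF smoothing is used, so the 50-word chunk size, the regex tokenizer, the smoothed IDF formula, and all function names here are hypothetical.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 50) -> list[str]:
    """Split text into chunks of at most `size` words (size is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs: list[list[str]]) -> dict[str, float]:
    """Compute a smoothed inverse document frequency for every term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Build a sparse TF-IDF vector, ignoring terms unseen when fitting the IDF."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_chunk(query: str, chunks: list[str]) -> tuple[int, float]:
    """Return the index and score of the chunk most similar to the query."""
    if not chunks:
        raise ValueError("no chunks to search")
    tokenized = [tokenize(c) for c in chunks]
    idf = fit_idf(tokenized)
    vectors = [tfidf(tokens, idf) for tokens in tokenized]
    query_vec = tfidf(tokenize(query), idf)
    scores = [cosine(query_vec, v) for v in vectors]
    best = max(range(len(chunks)), key=scores.__getitem__)
    return best, scores[best]
```

Because TF-IDF matches on shared vocabulary, a query only scores well against chunks that contain some of its words; a score of 0.0 means no term overlap, which a caller could treat as "no relevant passage found".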

The selected chunk is passed to the Gemini AI model through Google's API, which generates a concise, human-readable response in plain language. This workflow ensures context-aware retrieval and efficient access to case details.
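One way to picture this hand-off is as a prompt that wraps the retrieved chunk and the user's question. The sketch below is an assumption-laden illustration: the prompt wording, `build_prompt`, and `answer` are hypothetical, and the `generate` parameter is a stand-in for whatever client call sends text to the hosted model, which the paper does not specify.

```python
from typing import Callable

# The prompt wording is an illustrative assumption, not the paper's prompt.
PROMPT_TEMPLATE = (
    "You are assisting a legal professional. Answer the question using only "
    "the excerpt below. If the excerpt does not contain the answer, say so.\n\n"
    "Excerpt:\n{chunk}\n\nQuestion: {question}\nAnswer:"
)


def build_prompt(chunk: str, question: str) -> str:
    """Combine the retrieved chunk and the user's question into one prompt."""
    return PROMPT_TEMPLATE.format(chunk=chunk.strip(), question=question.strip())


def answer(question: str, chunk: str, generate: Callable[[str], str]) -> str:
    """Send the prompt through `generate`, a stand-in for the model call."""
    return generate(build_prompt(chunk, question)).strip()
```

Passing the model call in as a function keeps the retrieval logic testable without network access or an API key; in tests, `generate` can simply be a lambda that returns a fixed string.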

The system architecture follows a three-layer model: View Layer, Business Logic Layer, and Access Layer.

Fig. 1. Three-Layer Architecture of the Legal Document System
1. View Layer: This handles all front-end functionality. It includes:
  - Lawyer and Assistant login pages.
  - Upload interface for civil/criminal case documents.
  - Add Assistant module (only for Lawyers).
  - Chatbot query interface.
2. Business Logic Layer: This is the engine of the application:
  - Validates user roles and manages logins.
  - Parses PDF files using PyPDF2.
  - Vectorizes extracted content using TF-IDF.
  - Selects best-matching content using cosine similarity.
  - Calls Gemini API for generating chatbot answers.
3. Access Layer: It maintains all data:
  - File metadata stored in SQLite.
  - Case files stored securely on disk.
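As an illustration of the Access Layer, the following sketch stores file metadata in SQLite while the PDF itself stays on disk. The table name, column names, and helper functions are assumptions made for this example; the paper only states that metadata (such as file name, case number, and case type) lives in SQLite.

```python
import sqlite3

# Hypothetical schema: the column set is an assumption based on the metadata
# mentioned in the text (file name, case number, civil/criminal type).
SCHEMA = """
CREATE TABLE IF NOT EXISTS case_files (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    file_name   TEXT NOT NULL,
    case_number TEXT NOT NULL,
    case_type   TEXT NOT NULL CHECK (case_type IN ('civil', 'criminal')),
    stored_path TEXT NOT NULL,
    uploaded_by TEXT NOT NULL
);
"""


def init_db(conn: sqlite3.Connection) -> None:
    """Create the metadata table if it does not exist yet."""
    conn.execute(SCHEMA)
    conn.commit()


def add_file(conn, file_name, case_number, case_type, stored_path, uploaded_by) -> int:
    """Insert one metadata row and return its generated id."""
    cur = conn.execute(
        "INSERT INTO case_files "
        "(file_name, case_number, case_type, stored_path, uploaded_by) "
        "VALUES (?, ?, ?, ?, ?)",
        (file_name, case_number, case_type, stored_path, uploaded_by),
    )
    conn.commit()
    return cur.lastrowid


def files_by_type(conn, case_type):
    """List (file_name, case_number) pairs for one case type, oldest first."""
    rows = conn.execute(
        "SELECT file_name, case_number FROM case_files "
        "WHERE case_type = ? ORDER BY id",
        (case_type,),
    )
    return rows.fetchall()
```

The `CHECK` constraint rejects any case type other than the two categories the system supports, and the parameterized queries avoid building SQL from user input.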
SYSTEM IMPLEMENTATION

The system uses Flask for the backend and Streamlit for chatbot interaction. The assistant dashboard is intentionally minimal for ease of use [4]. Document indexing and AI-assisted search approaches have been validated in prior systems like [6], showing improved efficiency for users unfamiliar with traditional search commands.

Role-based dashboards control access:
- Lawyers: Upload/view files, query chatbot, manage assistants.
- Assistants: Upload and view only.
The upload handler checks file size (max 10 MB) and type (PDF only). After upload, content is parsed, chunked, and indexed. Queries are matched against the indexed chunks using cosine similarity, and Gemini AI returns answers [3].
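The upload checks can be sketched as a small validation function. This is a hedged example rather than the handler itself: the function name and messages are hypothetical, treating 10 MB as 10 × 1024 × 1024 bytes is an assumption, and the `%PDF-` header check is an extra safeguard added here that the paper does not mention.

```python
from __future__ import annotations

# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024
# PDF files begin with this header; checking it is an addition in this sketch.
PDF_MAGIC = b"%PDF-"


def validate_upload(filename: str, data: bytes) -> list[str]:
    """Return a list of problems; an empty list means the upload is acceptable."""
    problems = []
    if not filename.lower().endswith(".pdf"):
        problems.append("file name must end in .pdf")
    if not data.startswith(PDF_MAGIC):
        problems.append("content does not look like a PDF")
    if len(data) > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 10 MB limit")
    return problems
```

Returning all problems at once, instead of stopping at the first, lets the interface show the user everything that needs fixing in a single message.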
LIMITATIONS AND FUTURE SCOPE

While the proposed system works effectively in small-scale legal environments, it currently has several limitations:
RESULTS AND DISCUSSIONS

The system was tested in a simulated legal office setup using sample case files representing both civil and criminal matters. Users with different roles – lawyers and assistants – logged in and performed their respective tasks, such as uploading documents and interacting with the system.

Once authenticated, users were redirected to role-based dashboards. Lawyers had access to all features, including assistant management, document uploads, and the chatbot interface. Assistants were limited to uploading and viewing case documents, following the access control rules.

To assess chatbot performance, legal professionals submitted natural-language questions related to their uploaded case files, for example, queries about FIR details, party names, and filing dates. The system vectorized the document sections and each query using TF-IDF, and cosine similarity was then applied to choose the best-matching text segments.

These identified segments were passed to the Gemini AI model, which generated concise and meaningful responses. The chatbot was able to understand the context of the questions and provide accurate answers based on the uploaded PDFs. Users found it helpful, especially when trying to locate specific case information without reading through lengthy documents.

This accessibility factor makes the system particularly valuable in firms with limited digital literacy resources or minimal technical staff [8]. It also aligns with findings in [10], where AI-based chat systems improved productivity across different document-heavy domains.

Overall, the system proved effective in reducing manual workload and improving legal document management. The chatbot feature offered a convenient and time-saving way to extract insights, making it a valuable tool for legal professionals. The access control mechanisms also ensured data privacy and proper role separation throughout the process.

Fig. 2. Chatbot Interface for Querying Legal Documents

| User Query | Chatbot Response |
|---|---|
| What is the case about? | Land dispute between two parties. |
| FIR number? | Registered as 147/C/2022. |

TABLE I – SAMPLE QUERIES AND CHATBOT RESPONSES

In addition to its speed and clarity, the chatbot maintained consistency in tone and format across responses, making it suitable for daily use in a legal workspace. The experiment confirmed that even users unfamiliar with legal jargon could obtain relevant information without technical difficulties.
CONCLUSION

The proposed chatbot-based legal document retrieval system presents a promising solution for enhancing efficiency in document access and management within small to mid-sized law firms. By combining the robustness of natural language processing with an intuitive user interface, the system effectively streamlines legal workflows.

Its modular design supports clean role-based separation between lawyers and assistants, minimizing risk and promoting better control over sensitive case data. Lawyers can rely on the chatbot for instant answers to case-related questions, reducing the need to manually browse lengthy documents and improving their productivity during court preparation or client consultations.

The technical backbone (TF-IDF, cosine similarity, and Gemini AI) ensures that answers are both relevant and contextually accurate. Moreover, the integration of Flask and Streamlit enables lightweight deployment, making the system feasible even for firms with limited IT infrastructure. This work demonstrates the value of combining AI and legal tech in a user-friendly, practical format. As digital transformation accelerates across industries [7], such systems will play a critical role in improving legal service delivery.

Future improvements, such as adding multilingual support, speech-based queries, and OCR for image-based files, will make the system even more inclusive and scalable. With ongoing development, this platform has the potential to become a comprehensive digital assistant for law professionals operating in both rural and urban legal environments.

REFERENCES

[1] S. Singh and A. Gupta, "AI-Based Legal Assistant Chatbot," IJCA, vol. 182, no. 15, pp. 25–30, 2019.
[2] Y. L. Liu and Y. Zhang, "TF-IDF and Cosine Similarity for Legal Documents," IEEE Access, vol. 7, pp. 185647–185657, 2019.
[3] M. Aggarwal and S. Shukla, "Role-Based Access Control in Web Applications," IJCA, vol. 95, no. 24, pp. 8–12, 2014.
[4] P. Gupta et al., "A Lightweight Document Management System," IJERT, vol. 11, no. 4, pp. 102–107, 2023.
[5] Google AI Blog, "Introducing Gemini," 2023. [Online]. Available: https://ai.googleblog.com
[6] A. K. Jain and P. Kumar, "Document Retrieval with AI," IJCTT, vol. 68, no. 4, pp. 12–18, 2020.
[7] K. Bhatt and S. Ramesh, "AI in Legal Tech," IJLT, vol. 6, no. 2, pp. 101–109, 2022.
[8] A. Sharma, "Legal Data Management using Python," IRJET, vol. 8, no. 6, pp. 210–213, 2021.
[9] T. Miller, "Explanation in AI," Artificial Intelligence, vol. 267, pp. 1–38, 2019.
[10] S. Lee and M. Kang, "AI-Based Chatbots for IR Tasks," Journal of Web Intelligence, vol. 21, no. 3, pp. 154–162, 2022.
