DOI : https://doi.org/10.5281/zenodo.20038693
- Open Access

- Authors : Shreyas Inamdar, Shreyan Patil, Siddesh Shinde, Sartahk Jadhav, Namrata Naikwad
- Paper ID : IJERTV15IS042765
- Volume & Issue : Volume 15, Issue 04 , April – 2026
- Published (First Online): 05-05-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Design and Development of an Offline AI Study Assistant using RAG and Local Language Models
Shreyas Inamdar
Department of Computer Science and Engineering MIT Art, Design and Technology University Pune, India
Sartahk Jadhav
Department of Computer Science and Engineering MIT Art, Design and Technology University Pune, India
Shreyan Patil
Department of Computer Science and Engineering MIT Art, Design and Technology University Pune, India
Namrata Naikwad
Department of Computer Science and Engineering MIT Art, Design and Technology University Pune, India
Siddesh Shinde
Department of Computer Science and Engineering MIT Art, Design and Technology University Pune, India
Artificial Intelligence is shaking up how we learn. It's making education feel more personal and much easier to reach. In this paper, we introduce an offline study assistant powered by Retrieval-Augmented Generation (RAG) and local language models. The useful part: this assistant actually learns from your real university notes, textbooks, and whatever study materials you feed it. When students have questions, they get solid, relevant answers, no internet required.
Privacy is a big deal with this project. Since everything runs offline, students' data stays safe and private. The assistant is simple to use: a chatbot where you can ask anything, or even upload your own notes and files. FAISS works behind the scenes, storing and indexing all that information for quick searches. This setup lets the assistant pull up what you need fast, then generate clear, helpful answers right there on your device.
In the end, this project shows that local AI tools can genuinely help students learn on their own and make the whole studying experience smoother.
I. INTRODUCTION
AI and NLP have turned studying upside down for students. Let's be real: everyone's drowning in scattered notes, untouched textbooks, research papers gathering dust somewhere in your downloads, and digital handouts hiding in random folders. It piles up fast. Try finding that one page you need, or making sense of a complicated chapter? It's a headache every time. Sure, old-school studying still gets the job done, but it's slow and never really fits the way most of us actually learn.
That's why having an offline AI study assistant changes everything. It lives right on your laptop. No cloud, no lag, and you won't freak out if the Wi-Fi drops during a late-night study marathon. It uses Retrieval-Augmented Generation (RAG) and local language models, so your textbooks, slides, and notes stay on your device. Private means actually private. You're in charge. Here's where it gets good: it cuts through the chaos. Toss in new files, and it sorts, organizes, and finds whatever you're after in seconds. Because it uses FAISS vector embeddings, searching your material is fast, and the model explains things in plain language. Need a summary? A straightforward answer? A nudge in the right direction? It's got it covered.
Studying starts to feel possible again. When you're stuck, the AI helps you break through, digs up answers, and honestly feels like someone's got your back. No more drowning under mountains of loose papers and random files. Instead, you can finally handle what used to trip you up. And the privacy part? Rock solid. Nothing leaves your laptop, and there's no hassle, just a smart tool that actually fits your life and turns studying from an uphill battle into something you can manage.
Figure 1: Basic block diagram of the Offline AI Study Assistant.
II. RELATED WORK
Machine Learning and Natural Language Processing are everywhere in automated recruitment now, especially when it comes to resume screening. The old way, manually reading and judging every resume, is slow, inconsistent, and honestly just introduces all sorts of human bias. So people started building smarter systems that look at candidates' skills, experience, and fit for the job using specific criteria.
At first, these systems just matched keywords: think scanning resumes for "Java" or "project management". That worked for simple searches, but it missed a lot. If someone wrote "team leader" instead of "manager", or used a synonym, the system might just overlook them. NLP evolved, and suddenly embedding-based similarity and transformer models made it possible for machines to actually understand what candidates meant, not just the words they used. That has made matching resumes to job descriptions much more accurate.
Now, we've got machine learning models predicting who's a good fit and ranking applicants. But these tools run into trouble if the training data is too small or skewed; overfitting and underfitting can mess up predictions. Researchers are countering this by using ensemble methods, active learning, and semi-supervised approaches to make models more flexible and reliable, especially when it comes to real-world hiring.
Adding Optical Character Recognition (OCR) has taken automated screening even further. Now, systems can read scanned or image-based resumes, though accuracy isn't perfect yet; OCR still struggles with messy layouts and noisy text. Architecture-wise, modern resume screening platforms are pretty modular and API-focused, so they scale easily and link right up with HR systems.
Usually, you've got a frontend where users upload resumes, see feedback, and check results; the backend handles data cleaning and runs predictions, plus manages the model training behind the scenes.
Recent studies are also pushing deep learning optimization for text analysis. Traditional gradient descent methods like SGD are still the backbone, but they get stuck in local minima or converge slowly when the loss landscape is tricky. So now, momentum-based optimizers and adaptive learning rate algorithms such as Adam and RMSProp are in play. Some newer approaches combine these strategies, dynamically tweaking learning parameters based on how the models are performing. The result? Models learn faster and training is a lot more robust.
III. SYSTEM ARCHITECTURE AND METHODOLOGY
This section explains the architecture, data flow, and algorithms used in the Offline AI Study Assistant, which combines Retrieval-Augmented Generation (RAG) with locally deployed language models (LLMs). The system is optimized for educational use, enabling accurate, private, and context-grounded learning assistance without depending on an Internet connection.
3.1 Overall System Design
The design of the system has four functional layers:
User Interface Layer: handles text input and displays the generated responses through a local frontend webpage.
Preprocessing Layer: converts raw documents uploaded by the user into structured text chunks and then generates their embeddings.
Retrieval Layer: retrieves the most relevant content using similarity search over the stored embeddings.
Generation Layer: passes the retrieved context together with the user's query to the local LLM, which generates the final response for the user.[7]
The data flow is represented as:

User Input → Preprocessing (chunking and embedding) → Retrieval (similarity search) → Generation (local LLM) → Response
This modular design allows flexible deployment, scalability, and integration with additional local services such as document upload or speech input.
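To make the division of responsibilities concrete, here is a minimal, self-contained Python sketch of that flow. It is an illustration only: the real embedding model and local LLM are replaced by trivial stand-ins, and the function names (split_into_chunks, retrieve, generate_answer, answer) are placeholders rather than the project's actual code.

# Toy sketch of the four-layer flow; embeddings and the local LLM are stubbed
# out with trivial stand-ins so only the control flow is shown.

def split_into_chunks(text, size=50, overlap=10):
    # Preprocessing layer: fixed-size, overlapping word windows.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

def toy_similarity(a, b):
    # Stand-in for cosine similarity over embeddings: word-overlap ratio.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / (len(wa | wb) or 1)

def retrieve(query, chunks, k=3):
    # Retrieval layer: rank chunks by similarity to the query, keep the top k.
    return sorted(chunks, key=lambda c: toy_similarity(query, c), reverse=True)[:k]

def generate_answer(query, context_chunks):
    # Generation layer: fuse retrieved context and query into one prompt for the local LLM.
    prompt = "\n\n".join(context_chunks) + "\n\nQuestion: " + query
    return f"[local LLM response to a {len(prompt)}-character prompt]"

def answer(query, raw_docs):
    # The user interface layer calls this end to end.
    chunks = [c for doc in raw_docs for c in split_into_chunks(doc)]
    return generate_answer(query, retrieve(query, chunks))

print(answer("What is RAG?", ["Retrieval-Augmented Generation combines a retriever with a generator ..."]))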
3.2 Data Preprocessing and Embedding
Before retrieval, all academic documents undergo preprocessing:
Text Extraction: Conversion of PDFs, Word files, or handwritten notes into plain text.
Chunking: Texts are divided into overlapping chunks of fixed length (e.g., 300 tokens) using:[8]

C_i = T[s_i : s_i + w]

where C_i represents a text chunk, T the full text, w the window size, and s_i the starting index.
Embedding Generation: Each chunk is converted into an embedding vector v_i using an embedding model E:

v_i = E(C_i)

All generated embeddings are stored locally in a FAISS or Chroma DB vector store, as sketched below.
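A minimal sketch of this stage, assuming the sentence-transformers and faiss-cpu packages; the model name, example chunks, and file name are illustrative rather than the project's exact configuration.

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

chunks = [
    "Gradient descent updates parameters along the negative gradient ...",
    "A FAISS index stores dense vectors for fast similarity search ...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")              # 384-dimensional embeddings
embeddings = model.encode(chunks, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(embeddings.shape[1])               # inner product equals cosine similarity on normalized vectors
index.add(embeddings)
faiss.write_index(index, "study_notes.index")                # persisted locally; nothing leaves the machine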
3.3 Retrieval Mechanism
When the user submits a query q, the system generates its embedding v_q = E(q) using the same model. The retriever identifies the top-k document embeddings with the highest cosine similarity to v_q:

sim(v_q, v_i) = (v_q · v_i) / (||v_q|| ||v_i||)

The top-k results then form the context knowledge base for the query:

D_k = {C_1, C_2, ..., C_k}

A context fusion step then combines the retrieved chunks and the query into a single prompt for the language model:[9]

c = Concat(q, D_k)

This combined context makes sure that the generated answer stays grounded in factual, domain-specific information.
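Continuing directly from the variables defined in the preprocessing sketch above (model, index, chunks), retrieval and context fusion can be sketched as follows; the query text and the prompt template are illustrative.

query = "How does gradient descent update the parameters?"
q_vec = model.encode([query], normalize_embeddings=True).astype(np.float32)

k = 3
scores, ids = index.search(q_vec, k)                         # top-k by cosine similarity
top_chunks = [chunks[i] for i in ids[0] if i != -1]          # -1 marks empty slots when k exceeds stored vectors

# Context fusion: concatenate the retrieved chunks with the query into one prompt.
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n---\n".join(top_chunks) +
    "\n\nQuestion: " + query + "\nAnswer:"
)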
3.4 Local Model Setup
Response generation is done by the local LLM (e.g., LLaMA 3). A quantization technique (e.g., GGUF 4-bit) reduces the model's size while maintaining the quality of its output.
The generative model then produces the final response using:

r = f_θ(c)

where f_θ denotes the model with parameters θ, and c is the context-enriched input.
The model runs locally through lightweight frameworks such as llama.cpp, which enables inference without any internet connectivity while maintaining low latency and the highest accuracy possible.[10]
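A minimal sketch of local generation with the llama-cpp-python bindings; the model path, context size, sampling settings, and the stand-in prompt are assumptions for illustration, not the paper's exact configuration.

from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",     # 4-bit GGUF quantized weights
    n_ctx=4096,                                              # room for the query plus retrieved chunks
    n_threads=8,                                             # CPU-only inference
)

prompt = "Context:\nGradient descent updates parameters ...\n\nQuestion: What is gradient descent?\nAnswer:"
output = llm(prompt, max_tokens=512, temperature=0.2, stop=["Question:"])
print(output["choices"][0]["text"].strip())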
3.5 Algorithm
BEGIN
    # Step 1: Query Embedding
    v_q ← Embed(q)

    # Step 2: Retrieve Relevant Documents
    for each document chunk C_i in D:
        compute similarity score s_i = sim(v_q, v_i)
    sort D by s_i in descending order
    D_k ← top-k chunks of D

    # Step 3: Context Construction
    c ← Concat(q, D_k)

    # Step 4: Response Generation
    r ← LLM(c)

    # Step 5: Output Response
    Display(r)
END
Figure 2: Implementation Details.
3.6 Language Model
The system employs a locally hosted Large Language Model (LLM) to generate coherent and contextually relevant responses based on the retrieved information. Models such as LLaMA 3, Mistral 7B, and Phi-2 can be deployed using frameworks like Ollama, ensuring offline functionality and complete data privacy.[11]
Figure 3: Comparison of locally hosted LLMs across four parameters: Response Coherence, Contextual Relevance, Offline Functionality, and Inference Speed. LLaMA 3 consistently outperforms the others, making it the best choice for offline deployment.
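For Ollama-hosted models, the call looks roughly like the sketch below, assuming the ollama Python client and a locally pulled llama3 model; depending on the client version the response is a dict or a typed object, so the access style may differ, and the prompt here is a stand-in.

import ollama

prompt = "Context:\nGradient descent updates parameters ...\n\nQuestion: What is gradient descent?\nAnswer:"

response = ollama.chat(
    model="llama3",
    messages=[
        {"role": "system", "content": "Answer only from the provided study notes."},
        {"role": "user", "content": prompt},
    ],
)
print(response["message"]["content"])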
IV. IMPLEMENTATION DETAILS
The Retrieval-Augmented Generation (RAG) system keeps things simple and puts you in the driver's seat. It's all modular, easy to set up, and works entirely on your local machine: no cloud dependency, no worrying about constant internet. The whole pipeline is there: process your docs, create embeddings, retrieve vectors, and run the language models, all straight from your computer.
Development Environment
Most of this runs on Python 3.10, plus a handful of libraries that keep everything humming. Here's the tech stack:
- LangChain pulls everything together (document loading, retrieval, and the language models) so your setup stays tidy.
- llama.cpp and Ollama run compact, efficient models like Mistral, LLaMA 3, or Phi-2 right from your CPU or GPU.
- ChromaDB and FAISS handle embedding storage and search, making retrieval fast and reliable.
- SentenceTransformers converts your text into dense vectors, using models like all-MiniLM-L6-v2 or Instructor-XL.
- Gradio provides a clean, browser-based chat: you just open it up, upload your docs, and start asking away.
With this stack, you get full control over your models and all your data. It'll run on just about anything; no need for a super-powered machine.
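As one example of how these pieces fit together, the sketch below wires LangChain to a local FAISS store. The class and module names follow the langchain-community package layout and may differ across LangChain versions, and the PDF file name is illustrative.

# Assumes langchain, langchain-community, langchain-text-splitters, pypdf,
# sentence-transformers, and faiss-cpu are installed.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

docs = PyPDFLoader("lecture_notes.pdf").load()
splits = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100).split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
store = FAISS.from_documents(splits, embeddings)
retriever = store.as_retriever(search_kwargs={"k": 4})       # feeds the top-4 chunks to the local LLM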
Dataset Description
The content comes from classic academic sources: textbook excerpts, research papers, and technical documentation. Before going into the pipeline, each document gets a quick cleanup:
- Clean up the text: scrub out stray symbols, fix broken formatting, and dump useless metadata (a minimal sketch follows below).
- Split the text: break docs into chunks of 500-1000 tokens, so things stay manageable.
- Make embeddings: each piece turns into a vector using pre-trained models.
These steps prep the data so the retriever can actually find the good stuff fast when you have a question.
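The exact cleanup rules a real pipeline needs depend on the source documents; the snippet below is a small, assumed example of the kind of normalization applied before chunking.

import re

def clean_text(raw: str) -> str:
    text = raw.replace("\u00ad", "")                  # drop soft hyphens left by PDF extraction
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)       # keep printable ASCII and newlines only (simplistic)
    text = re.sub(r"-\n(\w)", r"\1", text)            # re-join words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)               # collapse repeated spaces and tabs
    return text.strip()

print(clean_text("Retrie-\nval  Augmented\tGeneration\u00ad"))
# -> Retrieval Augmented Generation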
Local Chat Interface Integration
The chat uses Gradio, so everything happens right in your browser. Upload your docs, ask your question, and get a reply in seconds, background context included. The pipeline does its thing behind the scenes: your question turns into an embedding, retrieval pulls the best chunks and sends them to the LLM, and Gradio displays the reply. Want to swap out a model or try a new component? Go for it. The system's modular, so you don't have to rebuild anything from scratch.
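A minimal Gradio wiring sketch; the answer function below is a placeholder standing in for the full RAG pipeline (query in, grounded response out), and the title text is illustrative.

import gradio as gr

def answer(message: str) -> str:
    # Placeholder for the RAG pipeline: embed the query, retrieve chunks, call the local LLM.
    return f"(local LLM answer to: {message})"

def chat_fn(message, history):
    # history holds previous turns; this sketch answers each question independently.
    return answer(message)

gr.ChatInterface(
    fn=chat_fn,
    title="Offline AI Study Assistant",
    description="Ask questions about your uploaded notes. Everything runs locally.",
).launch(server_name="127.0.0.1", share=False)               # bind locally; nothing leaves the machine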
Storage Optimization and Memory Footprint
Efficiency is especially important here:
- Models use GGUF 4-bit quantization, shrinking files from 13 GB to around 4 GB.
- Embeddings live in float16 to save even more memory.
- Only the chunks you actually need get loaded from the vector store, so your RAM isn't bogged down.
- Caching keeps recent embeddings and answers handy, speeding up repeat queries (both tricks are sketched below).
- The whole thing runs in less than 8 GB of RAM; most decent laptops can handle it without breaking a sweat. No expensive GPU required.
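A small, self-contained illustration of the float16 and caching ideas; the array sizes are illustrative and the embed function is a stand-in for the real embedding model.

from functools import lru_cache
import numpy as np

embeddings = np.random.rand(10_000, 384).astype(np.float32)  # roughly 14.6 MB
embeddings_fp16 = embeddings.astype(np.float16)              # roughly 7.3 MB, half the footprint

def embed(query: str) -> np.ndarray:
    # Stand-in for the real sentence-transformer; returns a repeatable fake vector within one run.
    rng = np.random.default_rng(abs(hash(query)) % (2 ** 32))
    return rng.random(384, dtype=np.float32)

@lru_cache(maxsize=256)
def cached_query_embedding(query: str) -> bytes:
    # Cache embeddings of recently seen queries, stored as compact, immutable bytes.
    return embed(query).astype(np.float16).tobytes()

cached_query_embedding("What is backpropagation?")            # computed once
cached_query_embedding("What is backpropagation?")            # served from the cache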
V. RESULTS AND EVALUATION
So, how does the Offline AI Study Assistant actually perform? We measured answer accuracy, retrieval quality, response speed, and memory footprint, then stacked the system up against other local models as well as assistants like ChatGPT and Gemini Nano. You can see exactly what you gain, or trade off, by running the system offline.
Evaluation Metrics
We checked both the hard numbers and what actual users thought:
- Response Relevance (BLEU): Are the answers on target?
- Response Coherence (ROUGE-L): Do the replies read naturally and make sense?
- Retrieval Precision@K: Does it fetch the right supporting info?
- Latency: How quickly do you get an answer?
- Memory Footprint: What's the maximum RAM it ever uses during a session?
- User Satisfaction: Do people actually like the answers? Real user ratings tell the story.
These metrics help establish both the technical efficiency and the practical usability of the system; the snippet below shows how the automatic scores can be computed.
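A minimal example of computing the automatic metrics, assuming the nltk and rouge-score packages; the reference and candidate sentences are illustrative.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Gradient descent updates parameters in the direction of the negative gradient."
candidate = "Gradient descent moves the parameters along the negative gradient direction."

bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,           # avoids zero scores on short texts
)
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
print(f"BLEU = {bleu:.3f}, ROUGE-L = {rouge_l:.3f}")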
TABLE I: Evaluation Metrics

Metric      Value
TP          85
TN          50
FP          10
FN          5
Accuracy    0.900
Precision   0.895
Recall      0.944
F1-Score    0.919

Calculated values (TP = 85, TN = 50, FP = 10, FN = 5):
Accuracy  = (85 + 50) / (85 + 50 + 10 + 5) = 0.900
Precision = 85 / (85 + 10) = 0.895
Recall    = 85 / (85 + 5) = 0.944
F1-Score  = 2 × (0.895 × 0.944) / (0.895 + 0.944) = 0.919
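These values follow directly from the standard confusion-matrix formulas, as the short check below confirms.

tp, tn, fp, fn = 85, 50, 10, 5

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.900 precision=0.895 recall=0.944 f1=0.919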
Experimental Setup
Testing was conducted on a local machine with the following configuration: Intel i7 CPU, 32 GB RAM, and an RTX 3060 GPU.
Three open-source models were tested: Phi-2 (2.7B), Mistral (7B), and LLaMA 3 (8B).
The test dataset contained academic questions and study material from computer science, electronics, and mathematics.
Each query was processed through the RAG pipeline, where the relevant chunks were retrieved using FAISS similarity search and then passed to the language model for generation.
The responses were then compared with human-written answers to calculate BLEU and ROUGE scores.
Response Accuracy vs Model Size
In general, larger models do produce more contextually accurate and detailed responses. Mistral 7B gave the best overall accuracy with a BLEU score of 0.79, while Phi-2, being lightweight, achieved 0.68, showing that even smaller models can perform reasonably well when tuned for retrieval. LLaMA 3 (8B) came close to Mistral but required more memory and slightly longer generation time.[15]
Figure 4: Model Size vs Response Accuracy.
- Phi-2: BLEU score = 0.68 (lightweight, decent performance).
- Mistral 7B: BLEU score = 0.79 (good balance of size and accuracy).
- LLaMA 3 (8B): BLEU score = 0.77 (lower than Mistral despite its larger size).
Latency vs Number of Retrieved Documents
Latency increased as more documents were retrieved and processed by the model. For example, with k = 3 the system responded in about 950 milliseconds, while k = 10 resulted in an average delay of 2.8 seconds.
Figure 5: Number of Retrieved Documents vs Latency.
- k = 3: 0.95 s (fastest response)
- k = 4-6: 1.2-1.8 s (optimal for balancing precision and latency)
- k = 10: 2.8 s (highest latency due to larger retrieval and context processing)
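Latency figures like these can be collected with a simple timing loop; the sketch below is a self-contained stand-in where rag_answer only simulates the retrieve-and-generate cost rather than running the real pipeline.

import time
import statistics

def rag_answer(query: str, k: int) -> str:
    # Placeholder for retrieving the top-k chunks and running the local LLM.
    time.sleep(0.05 * k)                                      # simulated cost so the sketch runs end to end
    return "answer"

def mean_latency(queries, k, repeats=3):
    samples = []
    for q in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            rag_answer(q, k)
            samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

for k in (3, 5, 10):
    print(f"k={k}: {mean_latency(['What is gradient descent?'], k):.2f}s")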
Qualitative Findings
Ten university students tested the assistant across different subjects and rated it on clarity, helpfulness, and speed (1-5 scale).[16]
- Mistral 7B averaged 4.3; students liked that answers were well-structured and easy to follow.
- Phi-2 was the fastest of the three, though a few students felt the answers were a bit thin on detail.
- LLaMA 3 gave the most accurate responses overall, but started feeling sluggish in longer study sessions.
Most students said they could get a solid topic summary or concept explanation without touching a search engine, which was kind of the whole point.
Limitations
A few real issues came up during testing:
- Response Delay: Anything above 7B parameters slowed down noticeably on lengthy or multi-part questions.
- Hardware Dependence: Machines with under 8 GB RAM had trouble loading the larger quantized models without hiccups.
- Limited Knowledge Scope: The system only knows what's in the uploaded documents; it can't reach out for newer or external information.
- Context Overflow: Pack in too many retrieved chunks and the model starts cutting off important context at the edges of its input window.
Context compression, knowledge distillation, and hybrid local-plus-cached retrieval are the most practical paths forward for these issues.[17]
VI. SCOPE AND FUTURE WORK
The system was built for students, but the architecture isn't limited to that. The core design (local RAG, offline inference, private document indexing) translates well to any context where people need to query documents without sending data to a cloud server.
Current Scope
The system runs fully offline. It ingests lecture notes, textbooks, and PDFs, then answers student questions and generates summaries without an internet connection. Users can upload their own files at any point, and those get indexed and searchable immediately.
It’s a practical fit for:
- Universities and colleges that want an internal AI study tool without routing student data through external servers.
- Schools in remote or low-connectivity areas where cloud-based tools simply aren't reliable.
- Students who need study support in private, offline environments.
Extended Corporate Applications
The same setup works in enterprise contexts where keeping data on-premises isn’t optional. Organizations can load internal reports, policy documents, and manuals into the system and get a private knowledge assistant that never phones home.[18]
Useful corporate applications include:
- Internal assistants that can answer questions about company policies, procedures, or live project documentation.
- Onboarding tools that help new hires find training materials without digging through shared drives.
- Decision support for teams handling sensitive or regulated data, running entirely on local infrastructure.
- A private interface for document retrieval and policy clarification, with no external APIs involved.
Because everything runs locally, data stays within the organization’s own infrastructure. That makes it genuinely viable for healthcare, defense, and finance fields where “we don’t send your data anywhere” isn’t a selling point, it’s a hard requirement.
Future Enhancements
A few directions worth building toward:
- Voice input for hands-free, more accessible queries.
- Multimodal support so the system can handle diagrams, images, and handwritten notes alongside text.
- Incremental learning so new documents get absorbed without a full reindex.
- Multi-language support to expand the assistant's reach beyond English-language materials.
- Further model compression and quantization for deployment on lower-end hardware.
- LMS and enterprise tool integration for larger institutional rollouts.
- Collaborative features that let more than one person ask questions and learn from shared datasets.
- User analytics dashboards to keep an eye on learning progress and give feedback on performance.
- Support for third-party plugin extensions, such as quiz makers or tools that help you make better study materials.
- Improved security and encryption methods for keeping locally processed data safe.
This research lays the groundwork for creating AI assistants that work offline, are specific to a certain field, and protect users' privacy. As it continues to improve, the system can become a strong educational and business tool that makes information management more efficient, accessible, and secure in a fully localized setting.[19]
VII. CONCLUSION
This paper described the design and build of an offline study assistant that runs RAG and local language models entirely on-device. The system takes lecture notes, textbooks, and uploaded PDFs, retrieves the most relevant chunks, and generates answers without touching the internet, meaning student data stays on the machine, full stop. Tests across Phi-2, Mistral 7B, and LLaMA 3 showed the approach is practical: reasonable accuracy, acceptable latency, and no cloud dependency. That combination makes it genuinely useful for students in low-connectivity areas, not just a proof of concept.
The same architecture extends beyond classrooms. Any organization that needs to query private documents (internal manuals, compliance records, project data) can adapt this pipeline and get a local knowledge assistant that never sends data out. Healthcare, defense, and finance, fields where data residency is non-negotiable, are natural fits. The broader takeaway is straightforward: generative AI doesn't have to mean cloud AI. Local deployment is already good enough to be useful, and it comes with privacy guarantees that hosted systems structurally cannot offer.
VIII. REFERENCES
Raschka, S., Liu, Y. H., & Mirjalili, V. (2022). Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd.
-
Neupane, S., Hossain, E., Keith, J., Tripathi, H., Ghiasi, F., Golilarz, N. A., … & Rahimi, S. (2024, October). From questions to insightful answers: Building an informed chatbot for university resources. In 2024 IEEE Frontiers in Education Conference (FIE) (pp. 1-9). IEEE.
-
Singh, N. T., Kaur, H., Dhiman, J., Aryan, A., Rani, J., & Wadhwa, M. (2025, June). AI-Driven Document Analysis: Employing Streamlit, Faiss, Nvidia Nemo. In 2025 3rd International Conference on Inventive Computing and Informatics (ICICI) (pp. 314-322). IEEE.
-
Singh, P. N., Talasila, S., & Banakar, S. V. (2023, December). Analyzing embedding models for embedding vectors in vector databases. In 2023 IEEE International Conference on ICT in Business Industry & Government (ICTBIG) (pp. 1-7). IEEE.
-
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., … & Rush, A. M. (2019). Hugging Face’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
-
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., … & Synnaeve, G. (2023). Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
-
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744.
-
Johnson, J., Douze, M., & Jégou, H. (2024). The Faiss Library. arXiv preprint arXiv:2401.08281.
-
Jin, B., Yoon, J., Han, J., & Arik, S. O. (2024). Long-context LLMs meet RAG: Overcoming challenges for long inputs in RAG. arXiv preprint arXiv:2410.05983.
-
Ashish Tarun, R., Priyadarshini, B., Sneha, M., & Akila, K. (2024, May). Leveraging LangChain Framework and Large Language Models for Conversational Chatbot Development. In International Research Conference on Computing Technologies for Sustainable Development (pp. 244-255). Cham: Springer Nature Switzerland.
-
Xu, J., Li, J., Liu, Z., Suryanarayanan, N. A. V., Zhou, G., Guo, J., … & Tei, K. (2024). Large language models synergize with automated machine learning. arXiv preprint arXiv:2405.03727.
-
LangChain Team. (2024). LangChain documentation. URL: https://docs.langchain.com (accessed: 10.05.2025).
-
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in neural information processing systems, 33, 9459-9474.
-
Wang, H., Gao, C., Dantona, C., Hull, B., & Sun, J. (2024). DRG-LLaMA: Tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ Digital Medicine, 7(1), 16.
-
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., … & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
-
Hugging Face. (2022). Hugging Face: The AI community building the future. [Accessed 13 April 2025]. Available: https://huggingface.co.
-
Géron, A. (2022). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, Inc.
-
Raschka, S., Liu, Y. H., & Mirjalili, V. (2022). Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd.
