DOI : 10.17577/IJERTCONV14IS010043- Open Access

- Authors : Praneeth Kumar Gowda, Nishmitha J
- Paper ID : IJERTCONV14IS010043
- Volume & Issue : Volume 14, Issue 01, Techprints 9.0
- Published (First Online) : 01-03-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Comparative Evaluation and Implementation of a Fine-Tuned TinyLLaMA Model for Automated Exam Email Communication
Praneeth Kumar Gowda
Student, St Joseph Engineering College, Mangalore
Nishmitha J
Assistant Professor, St Joseph Engineering College, Mangalore
Abstract – Models that strike a balance among hardware efficiency, privacy, and performance are necessary for automating academic communications, such as exam duty notifications. The effectiveness of four compact language models, TinyLLaMA, DeepSeek-R1 (1.5B), Gemma (2B), and Phi, in producing official exam duty emails is assessed in this work. Following evaluation, TinyLLaMA's effectiveness and organized output led to its selection for practical implementation. We present how a refined TinyLLaMA model, tailored for offline academic use, is implemented within a Node.js backend.
Index Terms – TinyLLaMA, compact language models, exam duty automation, academic communication, low-resource deployment, on-device LLMs, privacy-preserving AI, Node.js backend.
-
INTRODUCTION
The escalating demand for automation in administrative tasks within educational institutions finds a powerful ally in generative AI [1]. However, the unique challenges of privacy[4], resource constraints, and the imperative for reliability necessitate the adoption of lightweight, locally deployable models [2].
In contrast to cloud-based Large Language Models (LLMs), locally hosted models, particularly those leveraging platforms like Ollama, present compelling advantages in terms of cost-effectiveness, data privacy [4], and lower hardware requirements. This on-premise approach eliminates recurring cloud service fees and significantly reduces dependency on external Application Programming Interfaces (APIs). Crucially, it grants institutions complete control over their sensitive academic data, mitigating risks associated with third-party data handling. Furthermore, the optimized nature of many local LLMs means they can run efficiently on more modest computing infrastructure, reducing the need for the expensive, high-end GPUs often associated with larger, cloud-based models. This makes local deployments an exceptionally suitable and sustainable solution for educational institutions operating under stringent budget limitations, strict data protection policies, and potentially constrained IT resources.
-
RELATED WORK
The terrain of generative AI [1] has witnessed a proliferation of model architectures, with a sharp bifurcation between powerful cloud-based models and their smaller, more efficient counterparts. Earlier studies strongly point towards the escalating role of compact models [3], e.g., TinyLLaMA [5], and their natural fit to low-resource settings. Such smaller models, typically optimized via methods such as quantization and pruning [6], offer a compelling balance of performance and computational footprint that makes them the go-to choice for situations where high-performance hardware or large cloud resources are impractical. Their capability to execute on devices with minimal memory and processing power, such as regular laptops or edge devices, greatly enlarges the scope of advanced AI capabilities that are within reach.
In addition, the use of AI systems, particularly in sensitive areas such as education, requires effective privacy protection [4]. Federated learning (FL) research [7] provides an influential paradigm for privacy-preserving model training and deployment. FL supports multi-party model building across decentralized data sources without moving raw sensitive data to a centralized location, thereby preventing the privacy risks of centralized data collection [4]. This method is especially applicable to institutions handling student records and other private data.
Complementing FL, Low-Rank Adaptation (LoRA) [9] has emerged as a highly effective and parameter-efficient fine-tuning technique [8]. LoRA allows large pre-trained models to be adapted to specific tasks or domains by introducing only a small number of trainable parameters, significantly reducing computational overhead and memory requirements during fine-tuning [8]. This makes LoRA an invaluable tool for tailoring compact LLMs [3] like TinyLLaMA [5] to the unique needs of educational administrative tasks without necessitating extensive retraining or large datasets. Research on privacy-preserving model deployment further explores mechanisms such as differential privacy and homomorphic encryption, which can be integrated with FL and LoRA to enhance data security during both training and inference in sensitive environments. These frameworks collectively provide a robust methodological foundation for leveraging generative AI in educational settings while adhering to strict data protection policies and resource constraints.
-
METHODOLOGY
To investigate the practical applicability and efficiency of lightweight, locally deployable generative AI models [1] for administrative tasks in educational institutions, we conducted a controlled experimental evaluation.
-
Model Selection
We selected four distinct compact Large Language Models (LLMs) for evaluation: TinyLLaMA, DeepSeek-R1 (1.5B), Gemma (2B), and Phi (a representative small model, e.g., Phi-2). These models were chosen due to their design principles focusing on efficiency and suitability for resource-constrained environments, making them strong candidates for local deployment. Their varying architectures and parameter counts allowed for a comparative analysis of performance characteristics across different lightweight model families.
-
Deployment Environment
All models were deployed locally using Ollama, an open-source framework designed to simplify the execution and management of LLMs on personal hardware (a short connectivity check against the local Ollama server is sketched after the specifications below). This approach ensured that the evaluation truly reflected a local deployment scenario, allowing for direct measurement of resource utilization without the complexities and variability of cloud-based inference services. The experiments were conducted on a standardized computing environment to ensure comparability of results, with the following specifications:
Hardware Model: HP 245 G6 Notebook PC
Processor: AMD PRO A4-4350B R4, 5 COMPUTE CORES 2C+3G × 2
Memory (RAM): 12.0 GiB
Graphics: AMD Radeon R4 Graphics
Operating System: Fedora Linux 41 (Workstation Edition), 64-bit
Kernel Version: Linux 6.11.7-300.fc41.x86_64
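As a quick sanity check of this local deployment, the reachability of the Ollama server and the set of locally pulled models can be verified from Node.js. The sketch below is illustrative only; it assumes Ollama is listening on its default port 11434 and uses Ollama's standard /api/tags model-listing route.

```javascript
// check-ollama.js - verify the local Ollama server is up and list pulled models.
// Assumes Ollama runs on its default port 11434 (an assumption, not part of the paper's setup notes).
const OLLAMA_URL = "http://localhost:11434";

async function listLocalModels() {
  // GET /api/tags returns the models available to the local Ollama instance.
  const res = await fetch(`${OLLAMA_URL}/api/tags`);
  if (!res.ok) throw new Error(`Ollama not reachable: HTTP ${res.status}`);
  const data = await res.json();
  return data.models.map((m) => m.name);
}

listLocalModels()
  .then((names) => console.log("Models available locally:", names.join(", ")))
  .catch((err) => console.error(err.message));
```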
-
Task Definition and Data Collection
Each selected model was tasked with generating formal exam duty emails. This specific administrative task was chosen for its commonality in educational settings, requiring a balance of formality, clarity, and adherence to specific instructions, thereby serving as a relevant benchmark for real-world utility. A standardized prompt was designed to elicit consistent outputs from all models, focusing on elements typically found in such communications (e.g., date, time, venue, student details, supervisor instructions).
During the inference process for each model and task execution, we systematically recorded key performance and resource consumption metrics:
CPU Usage: Measured as the percentage of CPU utilized by the model's inference process.
RAM Consumption: Recorded the peak Random Access Memory (RAM) consumed by the model during generation.
Inference Time: Measured the total time taken from initiating the prompt to receiving the complete generated response. This metric is crucial for assessing real-time applicability.
These metrics were captured using the Node.js systeminformation and pidusage packages, which provide robust capabilities for monitoring system-level and process-specific resource utilization. All collected logs, including timestamped CPU, RAM, and inference time data for each model's generation, were meticulously saved in CSV (Comma Separated Values) format for subsequent analysis.
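A minimal sketch of how such per-process samples can be captured and appended to a CSV log is shown below. The output file name, the sampling interval, and the lookup of the ollama process by name are illustrative assumptions rather than the exact script used in the experiments.

```javascript
// log-metrics.js - sample CPU/RAM of the ollama process and append a CSV row.
const fs = require("fs");
const si = require("systeminformation");
const pidusage = require("pidusage");

const CSV_FILE = "metrics.csv"; // illustrative output file name

async function findOllamaPid() {
  // systeminformation lists running processes; we look for the ollama server by name.
  const procs = await si.processes();
  const ollama = procs.list.find((p) => p.name.toLowerCase().includes("ollama"));
  if (!ollama) throw new Error("ollama process not found");
  return ollama.pid;
}

async function logSample(modelName) {
  const pid = await findOllamaPid();
  const stats = await pidusage(pid); // cpu in % of one core, memory in bytes
  const row = [
    new Date().toISOString(),
    modelName,
    stats.cpu.toFixed(2),                      // CPU (%)
    (stats.memory / (1024 * 1024)).toFixed(1), // RAM (MB)
  ].join(",");
  fs.appendFileSync(CSV_FILE, row + "\n");
}

// Example: sample once per second while a generation request is in flight.
// setInterval(() => logSample("tinyllama"), 1000);
```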
-
RESULTS
Table I summarizes the average generation time, CPU utilization, and RAM consumption for each evaluated model when tasked with generating formal exam duty emails, using the data recorded by the JavaScript script. Qualitative notes on the output characteristics are also provided.
Model                 | Time (s) | CPU (%) | RAM (MB) | Notes
TinyLLaMA (Base)      | 53.29    | 0.5     | 244.4    | Fastest, slightly verbose
TinyLLaMA (TheBloke)  | 77.39    | 0.4     | 293.4    | High quality, good structure
Gemma 2B              | 85.69    | 1.2     | 241.9    | Consistent formatting
Phi                   | 97.49    | 0.6     | 240.0    | Inconsistent, off-topic
DeepSeek              | 145.79   | 0.6     | 247.9    | Most structured
Table I: Model Evaluation Metrics
B. Performance Analysis
Generation Time:
Figure 1. Response time by each model
The "TinyLLaMA (Base)"[5] model demonstrated the fastest inference time, completing the email generation task in an average of 53.29 seconds. This positions it as the most responsive model for real-time administrative applications among those tested. "TinyLLaMA (TheBloke)" followed, completing the task in 77.39 seconds, indicating strong efficiency for a fine-tuned variant. Gemma 2B and Phi also showed reasonable speeds at 85.69 seconds and 97.49 seconds, respectively. DeepSeek exhibited the longest generation time at 145.79 seconds, which could impact user experience in scenarios requiring quick turnaround.
Resource Utilization (CPU & RAM):
Figure 2. CPU usage by each model
All models, when measured via the ollama process using pidusage, maintained remarkably low CPU utilization, ranging from 0.4% to 1.2%. TinyLLaMA (TheBloke) registered the lowest at 0.4%, followed by TinyLLaMA (Base) at 0.5%, with Phi and DeepSeek both at 0.6%. Gemma 2B showed a slightly higher, though still minimal, CPU usage of 1.2%. These figures represent the CPU consumed by the specific Ollama process relative to one CPU core's capacity.
However, it is important to note that observations from system-wide monitoring tools like btop may show significantly higher overall CPU utilization (e.g., 95-99%) during the same period. This discrepancy arises because:
System-Wide vs. Process-Specific Measurement: btop reports the aggregate CPU usage across all cores and all processes on the system (including the operating system kernel, background services, and other applications), whereas the CPU (%) in Table I reflects only the isolated ollama process's consumption relative to a single core.
Ollama's Optimized Backend: Ollama's underlying engine is highly optimized, often written in C/C++ and leveraging multi-threading and specialized CPU instructions (like SIMD) to distribute computations efficiently across multiple cores. This parallel processing, while making the LLM inference fast, can collectively drive the total system CPU utilization to high levels, even if the individual ollama process isn't saturating a single core. The CPU is actively engaged in memory operations and heavy computations spread across its available cores.
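This distinction can be made concrete by sampling both views at the same instant. The sketch below is illustrative: pidusage reports the ollama process as a percentage of a single core, while systeminformation's currentLoad reports aggregate utilization across all cores; the PID is assumed to have been located as in the logging sketch above.

```javascript
// cpu-contrast.js - compare process-level and system-wide CPU readings.
const si = require("systeminformation");
const pidusage = require("pidusage");

async function compareCpuViews(ollamaPid) {
  const [proc, load] = await Promise.all([
    pidusage(ollamaPid), // ollama process, % of a single core
    si.currentLoad(),    // aggregate load across all cores
  ]);
  console.log(`ollama process CPU : ${proc.cpu.toFixed(1)} % of one core`);
  console.log(`system-wide CPU    : ${load.currentLoad.toFixed(1)} % across all cores`);
}
```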
Regarding RAM consumption, Phi recorded the smallest memory footprint at 240.0 MB. Gemma 2B and TinyLLaMA (Base) [5] followed closely with 241.9 MB and 244.4 MB, respectively, while DeepSeek and TinyLLaMA (TheBloke) had slightly higher RAM usage at 247.9 MB and 293.4 MB, respectively. All these figures remain highly acceptable for local deployment on the specified 12 GiB RAM system, affirming the "lightweight" nature of these models. The low resource demands across the board validate the suitability of locally hosted LLMs for environments with constrained IT infrastructure, such as educational institutions.
-
DISCUSSION
The empirical evaluation of lightweight Large Language Models (LLMs) for local deployment in administrative tasks within educational institutions reveals a nuanced landscape of performance, resource efficiency, and output quality. Our findings underscore the practical feasibility of on-device generative AI, while also highlighting critical trade-offs that dictate optimal model selection.
Overall, the results demonstrate that contemporary lightweight LLMs can operate with remarkably low resource footprints on a standard notebook PC (12 GiB RAM, AMD PRO A4-4350B processor). The measured process-specific CPU utilization for the ollama server, which performs the actual model inference, consistently remained below 1.5%. Similarly, RAM consumption for all tested models was well within acceptable limits for local machines, ranging from 240 MB to 293.4 MB. These low resource demands, particularly for RAM, validate the suitability of locally hosted LLMs for environments with constrained IT infrastructure, affirming their "lightweight" nature for on-device generation in privacy-constrained settings.
It is important to contextualize these process-specific CPU readings. While the ollama process itself, as measured, showed low percentages of a single core's capacity, system-wide monitoring (e.g., via btop) often indicated much higher aggregate CPU utilization (up to 95-99%) during active
generation. This disparity arises because Ollama's highly optimized backend efficiently leverages multiple CPU cores and low-level system resources (like memory bandwidth) for parallel computation, which contributes significantly to the total system load, even if the individual process's single-core percentage appears low. This demonstrates that while the models are resource-efficient per single thread, their aggregate impact on the CPU can still be substantial as they strive for rapid inference.
Examining individual model performance, distinct strengths and weaknesses emerge:
TinyLLaMA (Base): As the fastest model tested (53.29 seconds), TinyLLaMA (Base)[5] offers an ideal balance for real-time, on-device generation. Its speed is a significant advantage for administrative workflows requiring immediate responses, aligning well with the need for agile solutions in educational settings. While noted for being "slightly verbose," minor post-generation editing can often rectify this, making its speed a compelling factor for high-volume tasks.
TinyLLaMA (TheBloke): This fine-tuned version, though slightly slower than its base counterpart (77.39 seconds), showed better output quality with "high quality, good structure." This indicates that fine-tuning with a targeted approach [8] can greatly improve a lightweight model's adherence to the standards of formal communication, providing a better fit for formal administrative outputs where accuracy and formatting are of critical importance. Its resource utilization remained impressively low.
Gemma 2B: With "consistent formatting" and well-balanced generation time (85.69 seconds) and moderate resource utilization, Gemma 2B is a stable and reliable choice. Its consistency is an extremely useful feature in automating mundane work, decreasing the amount of manual checking required for generated content.
DeepSeek: This model was characterized by generating the "most structured" answers, reflecting a good understanding of formal email conventions. This, however, came at the expense of having the highest generation time among the models tested (145.79 seconds). Although providing rich answers, its increased computational latency may restrict its use in situations where very fast content generation is required.
Phi: While it had a minimal RAM footprint of 240.0 MB, Phi performed poorly in contextual reliability, generating outputs that tended to be "inconsistent" and "off-topic." This shortfall in output quality makes Phi less useful for administrative automation, since the correction effort would likely offset any gains from its low resource usage.
In summary, the research emphasizes the essential trade-offs among generation speed, resource efficiency, and output quality in lightweight LLMs. The optimal balance largely depends on the particular administrative task and on the priorities of the institution. For those valuing speed and rapid response, TinyLLaMA (Base) [5] is a preferred choice. For those requiring higher output quality and adherence to strict formats, TinyLLaMA (TheBloke) and Gemma 2B provide strong compromises. DeepSeek offers strong structure but at a significant time cost, whereas Phi reminds us that resource frugality alone is insufficient to ensure practical usefulness if output quality is sacrificed. These results are highly informative for educational institutions that want to responsibly and efficiently incorporate local generative AI applications into their administrative processes while maintaining data privacy [4] and cost savings.
-
IMPLEMENTATION
The automated system for generating and dispatching exam duty emails to faculty members was meticulously implemented with a strong emphasis on local operation, efficiency, and data privacy[4]. This section details the system's architecture, backend workflow, resource monitoring mechanisms, and compliance with offline and privacy requirements.
-
System Overview
The system is engineered to seamlessly integrate into an educational institution's administrative workflow, providing automated generation and sending of exam duty emails to designated faculty. It operates entirely offline, making it a self-contained solution deployed on a Fedora 41 system. This choice of operating environment provides a stable and secure foundation for running the application. The core artificial intelligence component is a fine-tuned TinyLLaMA model, specifically adapted for formal communication, which is hosted locally via Ollama. This strategic decision ensures that all sensitive data processing occurs on-premises, eliminating reliance on external cloud services. The backend is powered by Node.js, combined with the lightweight Express.js framework, together forming an efficient, non-blocking, and highly scalable server-side pipeline designed for rapid development and high performance in I/O-bound tasks.
Figure 3. Architecture
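A minimal sketch of how these components can be wired together in the Node.js backend is shown below. The module name examDutyJob, the /health endpoint, the port, and the daily 07:00 schedule are illustrative assumptions rather than the production configuration.

```javascript
// server.js - minimal Express bootstrap with a scheduled exam-duty job.
const express = require("express");
const cron = require("node-cron");
const { sendExamDutyEmails } = require("./examDutyJob"); // hypothetical pipeline module

const app = express();
app.use(express.json());

// Simple health endpoint so administrators can confirm the service is running.
app.get("/health", (req, res) => res.json({ status: "ok" }));

// Trigger the exam-duty pipeline every day at 07:00 (illustrative schedule).
cron.schedule("0 7 * * *", () => {
  sendExamDutyEmails().catch((err) => console.error("Exam duty job failed:", err));
});

app.listen(3000, () => console.log("Exam duty service listening on port 3000"));
```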
-
Backend Workflow
The system's operational workflow is managed by a scheduled cron job, which triggers the email creation process at predetermined intervals (e.g., daily or weekly, based on administrative requirements). This automation reduces manual labor and guarantees timely communication. The workflow moves through a number of distinct stages:
Data Retrieval: The process starts with the system connecting to a local academic database (e.g., a PostgreSQL or SQLite instance) to securely retrieve the latest upcoming exam schedules and assigned faculty details. This includes faculty names, dates, times, and room assignments, ensuring that the information used for email generation is always up-to-date and correct.
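A hedged sketch of this retrieval step is shown below, assuming a SQLite database accessed with the better-sqlite3 package; the database file name and the exam_duties table and column names are illustrative, not the institution's actual schema.

```javascript
// dataRetrieval.js - fetch upcoming exam duty assignments from a local SQLite database.
const Database = require("better-sqlite3");

// 'academic.db' and the exam_duties schema below are illustrative assumptions.
const db = new Database("academic.db", { readonly: true });

function getUpcomingDuties() {
  return db
    .prepare(
      `SELECT faculty_name, faculty_email, exam_date, start_time, end_time, room
         FROM exam_duties
        WHERE exam_date >= date('now')
        ORDER BY exam_date`
    )
    .all();
}

module.exports = { getUpcomingDuties };
```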
Prompt Generation: For every faculty entry that needs an exam duty notice, a tightly structured and specific prompt is generated dynamically. The prompt follows a particular format to effectively direct the LLM, for example: "Create a formal exam duty notice email for Prof. [Name], on [Date] from [Start Time] to [End Time] at Room [Number]." Using a templated model input ensures uniformity, which results in more predictable and relevant responses.
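A minimal sketch of assembling this prompt from a duty record follows; the field names match the illustrative schema above, and the request for a JSON object with subject and body fields is an assumption about how the model was instructed.

```javascript
// promptBuilder.js - build the standardized exam duty prompt for one faculty record.
function buildPrompt(duty) {
  return (
    `Create a formal exam duty notice email for Prof. ${duty.faculty_name}, ` +
    `on ${duty.exam_date} from ${duty.start_time} to ${duty.end_time} ` +
    `at Room ${duty.room}. ` +
    `Return a JSON object with "subject" and "body" fields written in formal academic language.`
  );
}

module.exports = { buildPrompt };
```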
Model Inference: The dynamically generated prompt is then sent through an HTTP POST request to the local TinyLLaMA endpoint provided by Ollama, POST http://localhost:11434/api/generate. Ollama, as the local LLM inference server, runs the prompt through the fine-tuned TinyLLaMA model. Upon successful inference, the model produces a JSON output containing the generated email content, typically with separate fields for the subject and the body, both composed in formal scholarly language appropriate for institutional communication.
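A hedged sketch of this inference call is given below. The model tag "tinyllama-examduty" is an assumption about how the fine-tuned model was named locally; the /api/generate route with stream: false is Ollama's standard non-streaming generation endpoint.

```javascript
// inference.js - send the prompt to the fine-tuned TinyLLaMA via the local Ollama server.
const OLLAMA_GENERATE = "http://localhost:11434/api/generate";

async function generateEmail(prompt) {
  const res = await fetch(OLLAMA_GENERATE, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "tinyllama-examduty", // assumed local tag of the fine-tuned model
      prompt,
      stream: false,               // return one complete response instead of a stream
    }),
  });
  if (!res.ok) throw new Error(`Ollama generate failed: HTTP ${res.status}`);
  const data = await res.json();
  // data.response holds the model's text; the prompt asks for a JSON object
  // with "subject" and "body" fields, so that text is parsed here.
  return JSON.parse(data.response);
}

module.exports = { generateEmail };
```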
Email Sending: The JSON response returned by Ollama is parsed by the Node.js server. The extracted email subject and body are then dispatched to the targeted faculty member via Nodemailer, a feature-rich Node.js library for sending emails. Nodemailer is configured with the institution's secure SMTP server settings so that the generated notifications are delivered reliably over an authenticated connection. By using the LLM's output directly for both subject and body, the system maintains full contextual consistency and eliminates manual entry errors.
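A minimal sketch of this sending step with Nodemailer follows; the SMTP host, sender address, and environment-variable credentials are placeholders for the institution's actual configuration.

```javascript
// mailer.js - deliver the generated email via the institution's SMTP server.
const nodemailer = require("nodemailer");

// SMTP settings below are placeholders; real values come from institutional configuration.
const transporter = nodemailer.createTransport({
  host: "smtp.institution.example",
  port: 587,
  secure: false, // STARTTLS is negotiated on port 587
  auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
});

async function sendDutyEmail(toAddress, email) {
  await transporter.sendMail({
    from: '"Exam Cell" <examcell@institution.example>',
    to: toAddress,
    subject: email.subject, // subject and body come directly from the LLM output
    text: email.body,
  });
}

module.exports = { sendDutyEmail };
```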
-
CONCLUSION
This analysis of lightweight Large Language Models (LLMs) for on-premises use in educational institutions provides critical insights into the real-world challenges and trade-offs involved. Of the models evaluated, TinyLLaMA, in both its base and fine-tuned "TheBloke" variants, was found most appropriate for offline, privacy-conscious academic environments. Its high-speed generation capability and low, acceptable CPU and RAM utilization make it highly efficient for on-device, real-time text generation. Conversely, models such as Gemma 2B and DeepSeek provided reliable formatting and organized outputs but fell short in speed and overall effectiveness. Phi, despite its efficiency, could not provide contextual reliability, illustrating that performance must be balanced against output quality. The success of this on-premises deployment points to its fundamental advantages for institutions: data privacy and operational autonomy. Processing data solely on-premises keeps strict privacy policies satisfied, eliminates dependence on cloud APIs, minimizes security threats, and ensures seamless operation even without internet connectivity. This configuration offers a solid foundation for applying generative AI to administrative processes where data governance and system uptime are critical.
In the future, some areas can be improved upon:
LoRA-based fine-tuning can further improve TinyLLaMA's adaptability to academic applications at low computational expense, since only a limited subset of parameters is trained.
Differential privacy can protect individual data points in institutional settings, so that generated outputs do not expose sensitive patterns or personal information.
Scaling the system to automate more general administrative functions, such as report generation or FAQs, together with optimization for ultra-low-power edge devices, may make it more accessible and reduce hardware costs.
In summary, this deployment represents an important milestone toward secure, efficient, and feasible AI integration in education, with plenty of room for future improvement.
-
REFERENCES
[1] Banh, L., Strobel, G. Generative artificial intelligence. Electronic Markets 33, 63 (2023). https://doi.org/10.1007/s12525-023-00680-1
[2] Kumar, B.V.P., Ahmed, M.D.S. Beyond Clouds: Locally Runnable LLMs as a Secure Solution for AI Applications. DISO 3, 49 (2024). https://doi.org/10.1007/s44206-024-00141-y
[3] Tarasov, D., Shridhar, K. Distilling LLMs' Decomposition Abilities into Compact Language Models. arXiv preprint. https://doi.org/10.48550/arXiv.2402.01812
[4] Das, B.C., Amini, M.H., Wu, Y. Security and Privacy Challenges of Large Language Models: A Survey. ACM Computing Surveys 57, 152 (2024).
[5] Zhang, P., Zeng, G., Wang, T., Lu, W. TinyLlama: An Open-Source Small Language Model. arXiv preprint arXiv:2401.02385.
[6] Han, S., Mao, H., Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.
[7] Yang, Q., Liu, Y., Chen, T., Tong, Y. Federated Machine Learning: Concept and Applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10(2), 1-19 (2019).
[8] Kuang, W., Qian, B., Li, Z., Chen, D., Gao, D., Pan, X., Xie, Y., Li, Y., Ding, B., Zhou, J. FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning. arXiv preprint arXiv:2309.00363.
[9] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685, 2021.
