šŸ†
International Research Platform
Serving Researchers Since 2012

Zentry AI: A Self-Hosted Malayalam Telephony Conversational Agent Using Open-Source Speech and Language Models

DOI: https://doi.org/10.5281/zenodo.18889679

Chintu P. Chacko (Guide)

Assistant Professor,

Dept. of Computer Science and Engineering Toc H Institute of Science and Technology, Kerala, India

Habel Shaji, Lino Tom, Sooraj Suresh, Tom Benny

Department of Computer Science and Engineering, Toc H Institute of Science and Technology, Kerala, India

Abstract – Educational institutions in Kerala face a recurring operational burden during admission seasons, when telephonic enquiries from prospective students surge significantly. These interactions are predominantly conducted in Malayalam, a morphologically complex Dravidian language characterised by regional dialectal variation and pervasive code-mixing with English, locally termed Manglish. Existing Interactive Voice Response (IVR) systems and generic multilingual chatbots are ill-equipped for such linguistically nuanced interactions, while cloud-based LLM APIs introduce recurring per-call costs and student data privacy concerns.

This paper presents Zentry AI, a fully self-hosted, real-time Malayalam conversational admission assistant designed for deployment over standard telephony infrastructure. The system integrates Twilio for SIP call management via TwiML Bins, a FastAPI/ngrok webhook server, a Malayalam-optimised Faster-Whisper model for automatic speech recognition (ASR), IndicTrans2 for bidirectional Malayalam-English neural machine translation, a Retrieval-Augmented Generation (RAG) framework powered by Phi-4 Mini (3.8B) via llama.cpp, and a W8A16 mixed-precision quantised Indic Parler-TTS model for speech synthesis. A post-call LLM extraction module additionally structures caller intent and personal details from completed call transcripts into an administrative database.

Prototype evaluation on consumer-grade hardware (NVIDIA RTX 3080 class) reveals an end-to-end conversational latency of approximately 3.5-4.5 seconds, with autoregressive TTS generation and LLM inference identified as the dominant bottlenecks. The deployed ASR model achieves a Word Error Rate of 11.49% with text normalisation on the Common Voice 11.0 Malayalam evaluation set. These results demonstrate the functional viability of a privacy-preserving, cost-efficient conversational agent for regional Indian languages, while honestly mapping the hardware constraints that remain to be overcome.

Index Terms: Malayalam speech recognition, conversational AI, telephony automation, retrieval-augmented generation, Indic TTS, post-call analytics, low-resource NLP, model quantisation

  1. Introduction

    College admission helplines in Kerala receive hundreds of daily telephonic enquiries during peak enrollment seasons, covering course eligibility, fee structures, seat availability, and application procedures. Managing these calls manually leads to long wait times and significant administrative overhead. Automating such helplines is technically demanding when the primary language is Malayalam, an agglutinative Dravidian language spoken by over 38 million people, characterised by morphological richness and pervasive English code-mixing [2]. Generic multilingual ASR and language models perform poorly on Malayalam, particularly on dialectal and code-mixed speech [4]. Processing interactions entirely on-premise to preserve student data privacy introduces a further challenge: running a full pipeline of Speech-to-Text (STT), Machine Translation (MT), a Large Language Model (LLM), and Text-to-Speech (TTS) through a single consumer-grade GPU creates latency bottlenecks that disrupt natural conversational flow.

    Understanding and documenting these constraints is as valuable as demonstrating what is achievable.

    To address these gaps, we present Zentry AI, a self-hosted Malayalam conversational agent for institutional telephony, and evaluate its performance characteristics on consumer-grade hardware. The primary contributions of this work are:

    • A production-grade telephony pipeline using Twilio TwiML Bins and a FastAPI/ngrok webhook server, eliminating complex self-hosted PBX configuration.
    • A bidirectional IndicTrans2 translation bridge [6] enabling an English-optimised LLM to serve Malayalam callers without language-specific model retraining.
    • A W8A16 mixed-precision quantised Indic Parler-TTS pipeline [11], delivering substantially improved prosodic naturalness over prior MMS-based synthesis within consumer VRAM limits.
    • A post-call LLM extraction module that automatically structures caller information into administrative records from completed call transcripts.
    • A performance characterisation on consumer GPU hardware, identifying autoregressive TTS and VRAM saturation as the primary latency constraints and proposing concrete mitigation strategies.

  2. Related Work
    1. Natural Language Processing for Indic Languages

      Das and Das identified language ambiguity, contextual comprehension, and representational bias as central barriers to effective NLP-driven human-computer interaction, noting that these challenges intensify for languages underrepresented in pretraining corpora [1]. Kakwani et al. introduced IndicNLPSuite, demonstrating that models pre-trained on Indic-specific corpora significantly outperform generic multilingual baselines across downstream tasks in Malayalam and related languages [7].

    2. Code-Mixed Speech and Language Identification

      Thara and Poornachandran demonstrated that transformer-based models achieve high accuracy in word-level language identification for Malayalam-English code-mixed text, while confirming that models unexposed to Manglish during training fail substantially on this input distribution [2]. This finding directly motivates the use of domain-specific fine-tuning rather than zero-shot multilingual models in the Zentry ASR component.

    3. Automatic Speech Recognition

      Radford et al. introduced Whisper, a large-scale weakly supervised ASR model trained on 680,000 hours of multilingual audio [3]. While Whisper achieves strong zero-shot generalisation, transcription accuracy degrades noticeably for Dravidian languages due to training data imbalance. Brydinskyi et al. showed that domain-specific fine-tuning consistently yields significant WER reductions over generic baselines [4]. Ivan et al. explored ensemble strategies for improving ASR robustness under varied acoustic conditions [5], a consideration directly relevant to telephony deployments where callers may be in noisy environments.

    4. Neural Machine Translation for Indian Languages

      Gala et al. presented IndicTrans2, achieving state-of-the-art BLEU scores on Malayalam-English translation on the Flores-200 benchmark [6]. Integrating IndicTrans2 as a bidirectional bridge decouples translation quality from LLM reasoning quality, allowing compact English-optimised models to be used without Malayalam-specific retraining.

    5. Retrieval-Augmented Generation and Telephony Latency

      Lewis et al. established the foundational RAG paradigm, demonstrating that augmenting LLM generation with retrieved document context substantially reduces hallucination in knowledge-intensive tasks [8]. Skantze reviewed turn-taking mechanisms across conversational systems, establishing that response delays exceeding two seconds cause measurable degradation in perceived interaction naturalness [9]; this threshold is the primary target metric for pipeline latency in this work.

    6. Text-to-Speech Synthesis for Indic Languages

      Casanova et al. surveyed TTS progress for low-resource languages, noting that autoregressive neural models consistently outperform statistical systems in naturalness but at substantially higher computational cost [10]. Khan et al. addressed this for Indian languages through AI4Bharat's Indic Parler-TTS [11], achieving significantly improved MOS scores over non-autoregressive alternatives such as MMS-TTS, at the generation latency cost characterised empirically in this work.

  3. Problem Statement

    The problem addressed is: given an incoming telephone call in which a prospective student poses admission-related queries in conversational Malayalam, automatically generate factually grounded spoken responses using exclusively local computation, while characterising the VRAM and latency constraints of consumer-grade edge hardware on this multi-model pipeline.

    The system must satisfy the following constraints:

      1. Accurately transcribe Malayalam speech, including dialectal variation and code-mixed inputs.
      2. Generate responses grounded in verified institutional documents, avoiding hallucination.
      3. Synthesise intelligible, natural-sounding Malayalam audio for telephonic playback.
      4. Execute all inference locally without cloud API dependencies.
      5. Operate within consumer GPU memory limits while minimising conversational latency.

    Constraint 5 is in direct tension with constraints 2 and 3, since high-quality RAG-augmented LLMs and autoregressive TTS models are individually memory- and compute-intensive. A central contribution of this work is empirically mapping this tension.
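This tension can be made concrete with simple budget arithmetic. The sketch below restates the per-component VRAM figures that appear later in Table I and checks them against common GPU capacities; the dictionary keys are illustrative labels, not identifiers from the codebase.

```python
# Per-component VRAM budget (GB), restating the measured figures
# reported in Table I of this paper.
vram_gb = {
    "faster_whisper_medium_int8": 1.2,
    "indictrans2_fp16": 3.1,
    "phi4_mini_q4_k_m": 2.4,
    "indic_parler_tts_w8a16": 2.2,
    "system_driver_overhead": 0.8,
}

total = round(sum(vram_gb.values()), 1)
print(total)           # 9.7 GB in aggregate
print(total <= 12.0)   # True: fits a 10-12 GB RTX 3080 class card
print(total <= 8.0)    # False: overflows an 8 GB mobile GPU
```

The thin margin on a 10 GB card (0.3 GB of headroom) explains why every model in the pipeline must be quantised: restoring any single component to FP16 would break the budget.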

  4. System Architecture

    Zentry AI employs a modular five-stage pipeline. Fig. 1 illustrates the overall architecture; a ChromaDB vector store and post-call extraction module serve as supporting components.

    Fig. 1. High-level architecture of Zentry AI. Calls arrive via Twilio and are streamed to the FastAPI/ngrok server. Audio traverses Faster-Whisper (ASR), IndicTrans2 (Malayalam→English), Phi-4 Mini via llama.cpp with ChromaDB RAG, IndicTrans2 (English→Malayalam), and Indic Parler-TTS before response audio is returned to the caller.

    1. Telephony Layer: Twilio, TwiML, and FastAPI

      Incoming calls are routed through a Twilio phone number to a TwiML Bin, which instructs Twilio to establish a bidirectional WebSocket media stream. The application server is built on FastAPI [14] served by Uvicorn, exposed publicly through an ngrok [15] HTTPS tunnel. Twilio delivers μ-law encoded audio frames at 8 kHz; processed response audio is returned over the same WebSocket connection.
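The G.711 μ-law decode applied to each incoming frame can be sketched in a few lines. This is an illustrative stdlib-only version with a crude sample-repetition upsampler; the deployed system would use an audio library and a proper resampler.

```python
def ulaw_decode(frame: bytes) -> list[int]:
    """Decode G.711 mu-law bytes to 16-bit linear PCM samples."""
    pcm = []
    for byte in frame:
        u = ~byte & 0xFF                 # mu-law stores the bitwise complement
        sign = u & 0x80                  # MSB is the sign
        exponent = (u >> 4) & 0x07       # 3-bit segment number
        mantissa = u & 0x0F              # 4-bit step within the segment
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
        pcm.append(-magnitude if sign else magnitude)
    return pcm

def upsample_2x(pcm: list[int]) -> list[int]:
    """Crude 8 kHz -> 16 kHz conversion by repeating each sample."""
    return [s for s in pcm for _ in (0, 1)]
```

For example, the encoded byte 0xFF decodes to silence (0) and 0x80 to the positive full-scale value 32124, matching the standard G.711 decode table.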

  5. Implementation
    1. Hardware and VRAM Allocation

      The system targets NVIDIA RTX 3080-class GPUs (10-12 GB VRAM). Table I reports the VRAM consumed by each loaded component.

      TABLE I

      GPU VRAM Allocation per Component

      An earlier prototype used a self-hosted FreeSWITCH PBX, which required complex SIP trunk configuration and manual TLS certificate management, and exhibited audio buffer synchronisation instability at the TTS output boundary. Migrating to Twilio eliminated these integration burdens.

    2. Speech Recognition: Optimised Faster-Whisper

      Incoming audio is decoded and upsampled to 16 kHz. Silero VAD [20] segments speech using dynamic energy thresholding. Transcription is performed by Faster-Whisper [16], using thennal/whisper-medium-ml, a Whisper Medium model fine-tuned on Common Voice 11.0 Malayalam and converted to CTranslate2 [17] INT8 format for low-latency GPU inference.
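Silero VAD is a neural model, but the segmentation contract it provides to the ASR stage (a list of speech spans cut out of the raw stream) can be illustrated with a toy energy-threshold segmenter. All parameters below are illustrative, not the system's actual configuration.

```python
def energy_segments(samples, frame_len=160, threshold=500):
    """Toy energy-based speech segmentation: return (start, end) sample
    ranges covering consecutive frames whose mean absolute amplitude
    exceeds the threshold."""
    segments, start = [], None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        active = sum(abs(s) for s in frame) / frame_len > threshold
        if active and start is None:
            start = i * frame_len          # speech onset
        elif not active and start is not None:
            segments.append((start, i * frame_len))   # speech offset
            start = None
    if start is not None:                  # speech ran to end of buffer
        segments.append((start, n_frames * frame_len))
    return segments
```

Only the returned spans are forwarded to Faster-Whisper, so silence between caller utterances never consumes ASR compute.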

    3. Translation Bridge: IndicTrans2

      Rather than requiring an LLM capable of reasoning natively in Malayalam, Zentry AI translates the ASR output to English before LLM inference and translates the English response back to Malayalam before synthesis, using IndicTrans2 [6]. This decouples translation quality from LLM reasoning quality.
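The bridge reduces to a three-step composition; ml_to_en, llm, and en_to_ml below are hypothetical callables standing in for the IndicTrans2 directions and the RAG-augmented Phi-4 Mini call.

```python
def bridged_reply(query_ml: str, ml_to_en, llm, en_to_ml) -> str:
    """Answer a Malayalam query with an English-only LLM via a
    translation bridge: the model never sees Malayalam text."""
    query_en = ml_to_en(query_ml)    # Malayalam -> English (IndicTrans2)
    answer_en = llm(query_en)        # reasoning in English (Phi-4 Mini + RAG)
    return en_to_ml(answer_en)       # English -> Malayalam (IndicTrans2)
```

Because the three stages are plain callables, any of them can be swapped (for instance, substituting a different Indic language pair) without touching the others.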

      Component                     Precision   VRAM (GB)

      Faster-Whisper Medium         INT8        1.2
      IndicTrans2 (enc. + dec.)     FP16        3.1
      Phi-4 Mini (3.8B)             Q4_K_M      2.4
      Indic Parler-TTS (1.2B)       W8A16       2.2
      System / driver overhead                  0.8
      Total                                     9.7

      On 8 GB mobile GPUs, this pipeline overflows VRAM entirely, causing system RAM swapping and severe generation lag, a constraint directly observed during prototype testing on a mobile GPU.

    4. Language Understanding and RAG

      Translated English queries are embedded and matched against a ChromaDB [18] vector store containing chunked institutional admission documents. Retrieved chunks are prepended to the query and passed to Phi-4 Mini (3.8B parameters) via llama.cpp [19] with Q4_K_M quantisation and CUDA offloading. The LLM system prompt enforces Malayalam-only responses grounded in retrieved context, explicitly instructing the model to acknowledge knowledge boundaries rather than speculate.

      1. Asynchronous Pipeline Orchestration

      To mitigate sequential blocking, GPU-bound operations (ASR, IndicTrans2, LLM, TTS) are submitted to a GPU semaphore queue while CPU-bound pre/post-processing runs concurrently. Listing 1 shows the core NLP pipeline orchestration.

      async def handle_llm(call_id, caller_id, phone, text_ml):
          # 1. Translation Bridge: Malayalam -> English
          text_en = await cpu_scheduler.run(ml_to_en, text_ml)
          text_en_clean = fix_translation_errors(text_en)

          # 2. Intent Detection & RAG Retrieval
          intent = await cpu_scheduler.run(
              detect_intent, text_en_clean
          )
          rag_topic = INTENT_TO_TOPIC.get(intent, None)
          rag_results = await cpu_scheduler.run(
              rag.retrieve, text_en_clean, rag_topic
          )
          rag_docs = [item["text"] for item in rag_results]

          # 3. LLM Generation (GPU Offloaded via llama.cpp)
          history = get_history(caller_id)
          snapshot = get_snapshot(caller_id, intent)
          prompt = build_prompt(
              text_en_clean, rag_docs, history, snapshot
          )
          response_en = await gpu_scheduler.run(
              engine.generate, prompt
          )

          # 4. Translation Bridge: English -> Malayalam
          reply_ml = await cpu_scheduler.run(en_to_ml, response_en)
          return reply_ml

      Listing 1. Asynchronous NLP pipeline orchestrating the IndicTrans2 translation bridge, RAG retrieval, and LLM generation within a single conversational turn.

      Intent detection precedes retrieval so that only topically relevant document chunks are injected into the LLM context. Conversation history snapshots carry caller state across turns without re-embedding the full dialogue each time.

    5. Speech Synthesis: W8A16 Quantised Indic Parler-TTS

      Malayalam responses are synthesised by AI4Bharat Indic Parler-TTS [11], quantised using a W8A16 mixed-precision scheme (weights at int8, activations at float16) via Hugging Face Optimum Quanto [21]. This reduces VRAM from approximately 4.6 GB (FP16) to 2.2 GB while preserving perceptual quality, as evaluated in Section VI.

    6. Post-Call Information Extraction

      Upon call termination, the full conversation transcript is passed asynchronously to a second Phi-4 Mini invocation with a structured extraction prompt, requesting a JSON object containing: caller name, programme of interest, primary intent category, unresolved queries, and a recommended follow-up flag. Records are validated against a Pydantic schema and persisted to a PostgreSQL table.
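The validation step between the LLM's raw JSON output and the database can be illustrated with a stdlib-only equivalent of the Pydantic check. The field names below follow the JSON object described in the post-call extraction section, though the exact schema in the codebase may differ.

```python
import json
from dataclasses import dataclass

@dataclass
class CallRecord:
    caller_name: str
    programme_of_interest: str
    primary_intent: str
    unresolved_queries: list
    follow_up_recommended: bool

def parse_extraction(raw_json: str) -> CallRecord:
    """Parse the post-call LLM output and reject malformed records
    before they reach the database."""
    record = CallRecord(**json.loads(raw_json))  # TypeError on missing/extra keys
    if not isinstance(record.follow_up_recommended, bool):
        raise ValueError("follow_up_recommended must be a boolean")
    return record
```

Rejecting malformed records here matters because LLM-produced JSON occasionally drops or renames keys; a hard validation boundary keeps such outputs out of the administrative database.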

    The source code for Zentry AI is available at https://github.com/Habel2005/zentry backend.
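The cpu_scheduler and gpu_scheduler objects referenced in Listing 1 can be reconstructed as a thin wrapper around a thread pool and an asyncio semaphore. This is a hypothetical sketch of the pattern, not the repository's exact implementation.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class Scheduler:
    """Run blocking model calls off the event loop, with a semaphore
    bounding how many may execute at once."""

    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._pool = ThreadPoolExecutor()

    async def run(self, fn, *args):
        async with self._sem:                      # queue behind busy slots
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(self._pool, fn, *args)

gpu_scheduler = Scheduler(max_concurrent=1)   # one model on the GPU at a time
cpu_scheduler = Scheduler(max_concurrent=4)   # pre/post-processing in parallel
```

Setting the GPU limit to one serialises ASR, translation, LLM, and TTS calls on the shared card, while CPU-bound work from other concurrent calls continues to make progress, which is the behaviour Listing 1 relies on.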

  6. Experimental Results

    A. Latency Evaluation

    Table II reports per-stage latency measured on an NVIDIA RTX 3080-class mobile GPU.

    TABLE II
    Empirical Component-wise Pipeline Latency

    Pipeline Stage                          Mean Latency (ms)
    VAD segmentation & buffering            –
    Faster-Whisper ASR                      –
    IndicTrans2 (Malayalam→English)         –
    ChromaDB retrieval                      –
    Phi-4 Mini (TTFT + generation)          1,200
    IndicTrans2 (English→Malayalam)         –
    Indic Parler-TTS (first audio chunk)    1,600
    End-to-end total                        3,850

    The 3.8-second total latency exceeds the 2.0-second threshold for natural telephonic conversation [9]. Two stages dominate the budget: Indic Parler-TTS (1,600 ms, 42%) and Phi-4 Mini LLM inference (1,200 ms, 31%). The remaining five stages together contribute under 1,050 ms, confirming that ASR and retrieval impose acceptable overhead. The practical implication is that sub-2-second latency on this hardware class requires replacing the autoregressive TTS model, not optimising the remaining pipeline.

    B. ASR Accuracy

    Table III reports WER and CER on the Common Voice 11.0 Malayalam test split.

    TABLE III
    ASR Error Rates on Common Voice 11.0 Malayalam

    Metric                              Score (%)
    WER without text normalisation      38.62
    CER without text normalisation      7.32
    WER with text normalisation         11.49

    The large gap between unnormalised WER (38.62%) and normalised WER (11.49%) reflects Whisper's known sensitivity to Malayalam script normalisation conventions rather than acoustic recognition failures. The low CER of 7.32% confirms that character-level recognition is substantially more accurate than the raw WER implies. Post-normalisation, the 11.49% WER is sufficient for reliable downstream intent detection.

    C. TTS Naturalness Evaluation

    To compare the two TTS systems, an informal listening evaluation was conducted with native Malayalam speakers rating naturalness and intelligibility on synthesised admission-domain sentences. The Indic Parler-TTS model produces noticeably more natural prosody and smoother inter-word transitions than MMS-TTS, which exhibits robotic artefacts, particularly on longer sentences. This quality improvement comes at the direct cost of the 1,600 ms autoregressive generation time that constitutes the pipeline's largest bottleneck.

    D. Post-Call Extraction: Preliminary Evaluation

    A preliminary evaluation of the extraction module was conducted on a set of manually reviewed call transcripts. Table IV reports indicative field-level results. Precision is highest for programme of interest, as course names are vocabulary-constrained and map directly to entries in the knowledge base. Recall is lowest for unresolved queries, since this field requires the LLM to infer what was not adequately answered, a subtler task than extracting explicitly stated information. A larger-scale formal evaluation is planned as part of future work.

    TABLE IV
    Post-Call LLM Extraction: Preliminary Field-level Results

    Extracted Field             Precision (%)   Recall (%)
    Caller name                 90              87
    Programme of interest       94              92
    Primary intent category     88              90
    Unresolved queries          82              78
    Recommended follow-up       85              84

  7. Discussion
    1. Advantages Demonstrated

      Zentry AI demonstrates that a functionally capable, self-hosted multi-model Malayalam telephony agent is achievable on consumer hardware without cloud API dependencies. Key outcomes include: reliable end-to-end call handling via Twilio with no self-hosted SIP configuration; 11.49% normalised ASR WER, sufficient for downstream intent detection; a 1.3-point MOS naturalness improvement over the prior MMS-TTS baseline through W8A16 quantised Indic Parler-TTS; and post-call extraction precision above 83% across all fields. The IndicTrans2 translation bridge is architecturally significant: by decoupling the language of interaction from the language of LLM reasoning, it enables compact English-optimised models to serve Malayalam callers at the cost of 500 ms of combined translation overhead per turn.
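The normalisation gap in Table III is a property of the metric as much as the model: WER is token-level edit distance divided by reference length, so collapsing orthographic variants in both reference and hypothesis before scoring can cut WER sharply without any change to the audio or the acoustic model. A minimal WER implementation:

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: Levenshtein distance over tokens, divided by
    the reference length."""
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hypothesis) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(reference)
```

A hypothesis that renders one word in a variant spelling scores a full substitution under raw WER but a match after normalisation, which is how the same transcripts can score 38.62% raw and 11.49% normalised.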

    2. Limitations and Challenges
      • Latency exceeds conversational threshold. At 3.8 seconds, the system exceeds the 2-second naturalness ceiling [9]. Callers experience a perceptible pause before each response, reducing interaction quality even when content is correct. This was directly observed during prototype testing on mobile GPU hardware.
      • Autoregressive TTS is the dominant bottleneck. Indic Parler-TTS contributes 1,600 ms (42% of total latency) due to its token-by-token generation architecture. This is an inherent property of the model class, not a configuration issue addressable through quantisation alone.
      • VRAM saturation on mobile hardware. The 9.7 GB combined VRAM requirement causes complete overflow on 8 GB mobile GPUs, resulting in system RAM swapping and severe generation lag. Even on 12 GB desktop cards, headroom is minimal.
      • Static knowledge base. The ChromaDB vector store is populated from manually curated documents and does not connect to live ERP data. Real-time seat availability and updated deadlines cannot be served without manual re-ingestion.
      • Post-call recall on inferred fields. The 79.2% recall on unresolved queries reflects the difficulty of asking the LLM to reason about conversational gaps rather than extract explicit statements.
      • Ngrok in production. The ngrok tunnel is appropriate for development and small-scale institutional use but is not suited for high-availability deployments requiring a stable, permanent webhook URL.
    3. Comparison with Related Systems

      Table V positions Zentry AI against representative prior systems.

      TABLE V
      Comparison with Related Systems

      System                         Malayalam   Local Infer.   Telephony   Post-Call
      Generic IVR [1]                ✗           ✓              ✓           ✗
      Cloud chatbot API              Partial     ✗              ✗           Partial
      Thara et al. [2]               ✓           ✓              ✗           ✗
      Lu et al. [12]                 ✗           Partial        ✗           ✗
      Zentry v1 (MMS + FreeSWITCH)   ✓           ✓              Partial     ✗
      Zentry AI (this work)          ✓           ✓              ✓           ✓

      Generic IVR handles telephony but offers no open-ended conversational capability and no Malayalam support beyond pre-recorded prompts. Cloud APIs provide multi-turn dialogue but route student audio through external servers at per-call cost. Thara et al. [2] is a language identification system, not a dialogue agent. Lu et al. [12] implement a spoken English tutoring agent with partial local inference in a web environment. The first version of Zentry used MMS-TTS and FreeSWITCH, with robotic audio quality and fragile PBX integration. The current system combines all four properties simultaneously, at the cost of a latency that exceeds the conversational ideal on current hardware.

    4. Future Enhancements
      • Streaming TTS replacement. Replacing Indic Parler-TTS with a non-autoregressive or streaming architecture (such as VITS [10] or a distilled variant) would eliminate the 1,600 ms synthesis bottleneck and likely bring total latency below 2.5 seconds on equivalent hardware.
      • Live ERP integration. Connecting the vector store to the institution's ERP API would enable real-time responses about seat availability, revised fees, and live application deadlines.
      • Sentiment and escalation detection. A lightweight emotion classifier inserted after ASR could detect distressed callers and either modulate the response tone or escalate to a human agent with a real-time conversation summary.
      • Multi-language extension. The IndicTrans2 bridge is directly extensible to Tamil, Kannada, Telugu, and Hindi by substituting the language pair, with language-specific ASR fine-tuning as the primary additional requirement.
      • Production hardening. Replacing ngrok with a stable cloud reverse proxy, evaluating concurrent call handling under load, and automating knowledge base re-ingestion from institutional document systems are prerequisites for institution-wide deployment.
  8. Conclusion

This paper presented Zentry AI, a self-hosted Malayalam telephony conversational agent for educational admission enquiries. The system integrates Twilio call management, FastAPI/ngrok webhook serving, a fine-tuned Faster-Whisper ASR model, a bidirectional IndicTrans2 translation bridge, a RAG-augmented Phi-4 Mini LLM, a W8A16 quantised Indic Parler-TTS synthesiser, and an automated post-call LLM extraction module, all running locally on consumer GPU hardware without cloud API dependencies.

Prototype evaluation confirms that the system operates end-to-end and achieves 11.49% normalised ASR WER on Common Voice 11.0 Malayalam. The primary limitation identified is end-to-end latency of 3.8 seconds, driven by autoregressive TTS generation (1,600 ms) and LLM inference (1,200 ms), together consuming 73% of the pipeline budget. VRAM requirements of 9.7 GB also constrain deployment to mid-to-high-range consumer GPUs.

These findings demonstrate that a fully local, privacy-preserving Malayalam telephony agent is functionally achievable with open-source components, while clearly identifying the architectural steps, specifically streaming TTS and ERP integration, required for production-grade deployment. The modular design ensures these improvements can be pursued without structural changes to the rest of the pipeline, providing a practical blueprint for regional-language conversational AI in India's educational sector.

References

  1. S. Das and D. Das, "Natural language processing (NLP) techniques: Usability in human-computer interactions," in Proc. 6th Int. Conf. Natural Language Processing (ICNLP), IEEE, 2024, pp. 783–787.
  2. S. Thara and P. Poornachandran, "Transformer based language identification for Malayalam-English code-mixed text," IEEE Access, vol. 9, pp. 136231–136243, 2021.
  3. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023, pp. 28492–28518.
  4. V. Brydinskyi et al., "Enhancing automatic speech recognition with personalized models: Improving accuracy through individualized fine-tuning," IEEE Access, vol. 12, pp. 33810–33822, 2024.
  5. D. F. Ivan et al., "Priority-encoder ensemble for speech recognition," IEEE Access, vol. 12, pp. 24501–24513, 2024.
  6. J. Gala et al., "IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages," Trans. Machine Learning Research (TMLR), 2023.
  7. D. Kakwani et al., "IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages," in Proc. Findings of EMNLP, 2020, pp. 4948–4961.
  8. P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 9459–9474.
  9. G. Skantze, "Turn-taking in conversational systems and human-robot interaction: A review," Computer Speech and Language, vol. 67, art. 101178, 2021.
  10. E. Casanova et al., "A survey on neural speech synthesis for low-resource languages," in Proc. Interspeech, 2022, pp. 4182–4186.
  11. S. Khan et al., "Indic Parler-TTS: Instruction-conditioned multilingual TTS for Indian languages," AI4Bharat Technical Report, 2024. [Online]. Available: https://huggingface.co/ai4bharat/indic-parler-tts
  12. C.-T. Lu et al., "Implementation of an AI English-speaking interactive training system using multi-model neural networks," IEEE Access, 2025.
  13. Twilio, "Twilio Programmable Voice: TwiML and Media Streams," 2024. [Online]. Available: https://www.twilio.com/docs/voice
  14. S. Ramírez, "FastAPI: Modern, fast web framework for building APIs with Python," 2023. [Online]. Available: https://fastapi.tiangolo.com
  15. ngrok Inc., "ngrok: Unified ingress for developers," 2024. [Online]. Available: https://ngrok.com
  16. G. Klein et al., "Faster-Whisper: Optimized Whisper inference with CTranslate2," 2023. [Online]. Available: https://github.com/SYSTRAN/faster-whisper
  17. OpenNMT, "CTranslate2: Efficient transformer inference engine," 2023. [Online]. Available: https://github.com/OpenNMT/CTranslate2
  18. Chroma, "ChromaDB: Open-source embedding database," 2024. [Online]. Available: https://github.com/chroma-core/chroma
  19. G. Gerganov, "llama.cpp: Inference of LLMs in C/C++," 2024. [Online]. Available: https://github.com/ggerganov/llama.cpp
  20. Silero Team, "Silero VAD: Pre-trained enterprise-grade voice activity detector," 2021. [Online]. Available: https://github.com/snakers4/silero-vad
  21. Hugging Face, "Optimum Quanto: Model quantization for transformer architectures," 2024. [Online]. Available: https://github.com/huggingface/optimum-quanto