
A Cross-Lingual Retrieval-Augmented Generation Based Intelligent Teaching Assistant for Educational Video Content: Implementation and Evaluation

DOI : https://doi.org/10.5281/zenodo.20026047

Lakshya Singh

Department of Computer Science and Engineering Raja Balwant Singh Engineering Technical Campus Bichpuri, Agra, Uttar Pradesh, India

Affiliated to Dr. A. P. J. Abdul Kalam Technical University, Lucknow

Er. Alok Singh Jadaun

Department of Computer Science and Engineering Raja Balwant Singh Engineering Technical Campus Bichpuri, Agra, Uttar Pradesh, India

Prof. Brajesh Kumar Singh

Raja Balwant Singh Engineering Technical Campus, Bichpuri, Agra, Uttar Pradesh, India

Abstract: This paper presents the implementation and evaluation of a Cross-Lingual Retrieval-Augmented Generation (RAG) system developed as an intelligent teaching assistant for Database Management Systems (DBMS) education. The system processes 20 Hindi DBMS lecture videos sourced from YouTube, transcribes and translates them to English using the Faster-Whisper base model [6], and creates timestamp-aware semantic chunks of 300 words with 50-word overlap, resulting in 131 knowledge chunks. Dense vector embeddings are generated using the all-MiniLM-L6-v2 sentence transformer model [20] and stored in the ChromaDB vector database for efficient cosine similarity-based retrieval [13]. At inference time, student queries are semantically matched against the 131-chunk knowledge base, and the top-5 retrieved chunks are passed to the LLaMA 3.3 70B language model via the Groq API to generate coherent, source-grounded answers with precise video timestamp citations. The complete system is deployed as a full-stack web application with a FastAPI REST backend and a React-based frontend featuring an integrated evaluation dashboard. Custom evaluation on 20 domain-specific DBMS questions demonstrates an overall score of 88.0%, outperforming baseline keyword retrieval by 15.9 percentage points (22.2% relative improvement), with 100% timestamp coverage across all responses.

Index Terms: Retrieval-Augmented Generation, Cross-Lingual NLP, Educational AI, ChromaDB, LLaMA 3.3 70B, Faster-Whisper, Timestamp-Aware Retrieval, Vector Embeddings, FastAPI, DBMS Education

  1. Introduction

    The rapid proliferation of online educational video content has created an unprecedented, globally accessible repository of domain knowledge. YouTube alone hosts millions of educational lectures spanning virtually every academic discipline [25]. Despite this abundance, students face a significant practical challenge: locating specific conceptual explanations within lengthy video lectures is time-consuming and cognitively demanding. A student seeking to understand a particular DBMS concept must manually browse through multiple videos without any guarantee of finding a precise, accurate explanation at a specific timestamp.

    This problem is further compounded in multilingual educational ecosystems. A substantial volume of high-quality educational content is produced in regional languages, particularly in countries like India, where Hindi serves as the primary medium of instruction for a large student population. Existing intelligent educational assistants address this general problem by building question-answering systems over English text documents such as textbooks and lecture slides [7], [9]. However, such systems cannot process video content and are designed exclusively for English-language resources, leaving the vast majority of regional-language video content inaccessible.

    Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for knowledge-intensive question-answering [1]. RAG systems retrieve relevant passages from an external knowledge base and condition a large language model (LLM) on the retrieved context to generate accurate, source-grounded responses [2], [23]. Surveys of RAG architectures categorize approaches into three levels: naive RAG, advanced RAG with query rewriting and re-ranking, and modular RAG with flexible component composition [3]. While RAG has been successfully applied to text documents, its application to educational video content through a cross-lingual pipeline remains largely unexplored.

    The present paper addresses this gap by presenting the complete implementation and evaluation of a Cross-Lingual RAG system for DBMS education. The system processes a curated dataset of 20 Hindi DBMS lecture videos from a playlist of 140 available lectures, transcribes and translates them to English using Faster-Whisper [21], creates timestamp-aware semantic chunks, and enables intelligent English-language question-answering over the video content.

    The key contributions of this work are:

    • A complete cross-lingual pipeline that converts Hindi audio lectures to a searchable English knowledge base using Faster-Whisper's simultaneous transcription and translation capability [6], [21], eliminating the need for a separate translation model.

    • A timestamp-aware chunking mechanism that preserves temporal metadata throughout the RAG pipeline, enabling precise video navigation for students, a capability absent in existing educational RAG systems.

    • A production-ready full-stack web application comprising a FastAPI backend and React frontend with an integrated evaluation dashboard, demonstrating practical deployability on CPU-only hardware.

    • A custom domain-specific four-metric evaluation framework assessing keyword coverage, answer completeness, source relevance, and timestamp coverage [8], [26], demonstrating an 88.0% overall score and a 15.9 percentage point (22.2% relative) improvement over the baseline.

    The remainder of this paper is organized as follows. Section II reviews related work. Section III describes the proposed system architecture. Section IV details the methodology. Section V covers the implementation. Section VI presents the experimental evaluation. Section VII concludes the paper.

  2. Related Work

    1. Retrieval-Augmented Generation

      Retrieval-Augmented Generation was formally introduced by Lewis et al. [1] as a framework combining the parametric knowledge of large language models with non-parametric retrieval from external knowledge bases. The seminal work demonstrated significant improvements over standalone language models on knowledge-intensive tasks by grounding responses in retrieved documents. Subsequent comprehensive surveys [2], [3] systematically categorized RAG approaches into three levels of architectural sophistication: naive RAG, which directly appends retrieved documents to the prompt; advanced RAG, which incorporates query rewriting and re-ranking; and modular RAG, which enables flexible composition of specialized components.

      Karpukhin et al. [4] established that dense vector representations substantially outperform sparse BM25 retrieval for open-domain question answering, motivating the use of semantic embeddings in modern RAG pipelines. More recent work by Cheng et al. [23] surveys knowledge-oriented RAG extensions that integrate structured knowledge graphs with retrieval pipelines. Lin et al. [5] investigate efficient RAG inference strategies using lookahead retrieval to reduce latency, while Zhang et al. [15] propose hybrid dense-sparse vector approaches for improved retrieval precision. Chunking strategy has also been identified as a critical design choice; Gomez-Cabello et al. [28] demonstrate that advanced semantic chunking methods outperform fixed-size chunking in clinical RAG applications, a finding that informs the overlap-based chunking adopted in this work.

    2. RAG in Educational Systems

      The application of RAG to intelligent educational assistants has gained considerable momentum. Khan et al. [7] developed an educational virtual assistant built on a RAG framework capable of delivering curriculum-aligned responses grounded in verified instructional materials including textbooks, lecture slides, and course-specific content. A systematic survey [2] of RAG for educational applications identified key advantages including improved factual accuracy, source transparency, and the ability to deploy compact local language models at performance levels comparable to large proprietary systems.

      Németh et al. [9] conducted a pilot study in statistics education where RAG-enhanced language models serving as virtual teaching assistants achieved high expert assessment scores while maintaining source traceability. Yadav et al. [10] employed LLMs to contextualize educational problems to student interests in intelligent tutoring systems. Sain et al. [19] survey the emerging landscape of AI chatbots in higher education, highlighting the importance of grounded, verifiable responses. These works collectively demonstrate the viability of RAG for educational question-answering, yet none address regional-language video content or timestamp-aware retrieval.

    3. Speech Recognition and Cross-Lingual Processing

      Automatic speech recognition has been transformed by large-scale weakly supervised models. Radford et al. [6] introduced Whisper, which demonstrated robust multilingual speech recognition and cross-lingual translation capabilities by training on 680,000 hours of multilingual audio data. The model's ability to simultaneously transcribe and translate audio in a single inference pass makes it particularly suitable for cross-lingual pipeline construction. Piñeiro-Martín et al. [14] investigate weighted cross-entropy losses for low-resource language ASR, underscoring the challenges of multilingual recognition that Whisper addresses through scale. Wang [22] provides a comprehensive treatment of cross-lingual transfer learning for low-resource NLP tasks, establishing the theoretical foundation for the cross-lingual approach adopted in this work.

      Faster-Whisper, an optimized implementation of Whisper using CTranslate2 [21], achieves up to 4× faster inference than the original implementation while maintaining comparable accuracy, making it practical for CPU-based deployment. Shah et al. [21] validated Faster-Whisper's suitability for integration into downstream deep learning pipelines, confirming its robustness across diverse audio conditions.

    4. Vector Embeddings and Semantic Search

      Reimers and Gurevych [20] introduced Sentence-BERT, employing siamese BERT-based networks for generating semantically meaningful sentence embeddings suitable for similarity search. The all-MiniLM-L6-v2 model, a distilled variant producing 384-dimensional embeddings, provides an optimal balance between computational efficiency and semantic quality for retrieval applications [18]. Devlin et al. [16] established the BERT pretraining paradigm underlying these embedding models, while Vaswani et al. [17] introduced the Transformer architecture upon which all components in this system are based.

      Vector databases such as ChromaDB enable persistent storage and efficient approximate nearest neighbor search over large embedding collections [13]. Zhang et al. [15] demonstrate that graph-based approximate nearest neighbor search achieves sub-millisecond query latency for moderate-scale corpora, confirming the practical efficiency of the vector retrieval approach adopted in this work. Zhang et al. [27] further establish best practices for training multilingual dense retrieval models, motivating the use of a pre-trained sentence transformer for cross-lingual retrieval.

    5. Large Language Models

      Touvron et al. [11] introduced the LLaMA family of open foundation models, demonstrating that open-weights models can achieve competitive performance with large proprietary systems. Meta subsequently released the LLaMA 3 model family [12], with LLaMA 3.3 70B representing the current state of the series and the model used in this work. Farea et al. [26] provide a comprehensive evaluation of question answering systems, establishing evaluation criteria relevant to the assessment methodology employed in this work. Es et al. [8] introduced RAGAS, an automated evaluation framework for RAG systems, which informed the design of the custom evaluation metrics proposed in this paper.

    6. Research Gap

    Despite significant advances in both RAG systems and educational AI, a critical gap exists in the literature. Existing educational RAG systems operate exclusively on English text documents, leaving regional-language video content inaccessible. Furthermore, no existing system incorporates timestamp-aware retrieval that enables students to navigate directly to the relevant moment in a source video. Knowledge graph-based approaches [24] offer structured knowledge representation but cannot directly process audio content. The present work addresses both gaps through a cross-lingual video processing pipeline with timestamp preservation throughout the RAG pipeline.

  3. System Architecture

    The proposed system follows a ten-stage, two-phase pipeline architecture as illustrated in Fig. 1. The pipeline is divided into two phases: an offline indexing phase executed once during system setup, and an online inference phase executed at runtime for each student query. This two-phase design is consistent with the standard RAG architectural pattern [1], [2], adapted here for cross-lingual video content with timestamp preservation.

    1. Offline Indexing Phase

      The offline phase comprises four sequential stages executed once at system setup, producing a persistent knowledge base that remains on disk for all subsequent queries.

      [Fig. 1 depicts the ten-stage pipeline as two stacked flow diagrams.]

      Offline indexing phase: YouTube Lecture Videos (20 Hindi videos) -> Audio Extraction (yt-dlp, MP3, 192 kbps) -> Cross-Lingual Transcription (Faster-Whisper Base, task=translate) -> Timestamp-Aware Chunking (300 words, 50-word overlap, 131 chunks) -> Vector Embeddings (all-MiniLM-L6-v2, 384-dim, batch=32) -> ChromaDB Vector Database (Cosine Similarity, HNSW Index).

      Online inference phase: Student Query via FastAPI REST Endpoint -> Semantic Search (Top-5 Chunks Retrieved) -> LLaMA 3.3 70B via Groq API (temp=0.3, max_tokens=1000) -> Answer + Timestamps displayed in React UI.

      Fig. 1. Complete system architecture of the proposed Cross-Lingual RAG Teaching Assistant. Stages 1-6 form the offline indexing phase; Stages 7-10 constitute the online inference phase.

      Stage 1 (Audio Extraction): Lecture videos are downloaded from YouTube as MP3 audio files using yt-dlp at 192 kbps, balancing transcription quality with storage efficiency.

      Stage 2 (Cross-Lingual Transcription): Audio is transcribed and translated to English using Faster-Whisper [6], [21] with cross-lingual translation (task=translate), producing timestamped text segments in a single inference pass without a separate translation model.

      Stage 3 (Timestamp-Aware Chunking): Timestamped segments are grouped into overlapping chunks of 300 words with 50-word overlap. Each chunk retains four metadata fields: video title, video index, start timestamp, and end timestamp. This metadata preservation is the key architectural novelty of the chunking design [28].

      Stage 4 (Vector Indexing): Chunk embeddings are generated using the all-MiniLM-L6-v2 sentence transformer [18], [20] and stored in ChromaDB [13] configured with cosine similarity and HNSW indexing. This phase is executed once; the resulting index persists on disk.

    2. Online Inference Phase

      The online phase handles student queries at runtime. Upon receiving a question through the FastAPI endpoint, the system encodes the query using all-MiniLM-L6-v2 and performs cosine similarity search in ChromaDB to retrieve the top-5 most relevant chunks [4]. The retrieved chunks, along with their video titles and timestamps, are formatted into a structured prompt following the retrieval-then-read paradigm [1] and passed to the LLaMA 3.3 70B model [12] via the Groq API. The generated answer, source citations, and video timestamps are returned to the React frontend.

    3. System Components

    The system comprises three main software components. The data pipeline consists of five Python scripts handling video download, transcription, chunking, embedding generation, and ChromaDB setup. The backend is a FastAPI application exposing four REST endpoints: /health, /stats, /ask, and /evaluation. The frontend is a React application built with Vite, featuring a chat interface, a sidebar with system statistics, and a dedicated evaluation dashboard with three sub-tabs: performance metrics, RAG versus baseline comparison, and system configuration.

  4. Methodology

    1. Data Collection and Curation

      The data collection module employs yt-dlp to automatically extract audio content from a curated selection of 20 DBMS lecture videos from a Hindi educational YouTube playlist of 140 available lectures. The 20 videos were selected to ensure comprehensive coverage of core DBMS topics including entity-relationship modeling, relational keys, normalization, SQL commands, transaction control, concurrency, and indexing. Each video has a duration of approximately 10-15 minutes. Audio is extracted in MP3 format at 192 kbps; the complete download pipeline is automated through a single Python script requiring only the playlist URL as input.

      The selection criterion prioritized conceptual depth and lecture clarity over encyclopedic coverage, ensuring that the knowledge base contains high-quality instructional content for the target DBMS curriculum. The Hindi medium of instruction presents the cross-lingual challenge that motivates the Faster-Whisper translation pipeline [6].
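      The download stage described above can be sketched with yt-dlp's Python API. This is a minimal sketch, not the paper's actual script: the output directory and filename template are our illustrative choices; the 192 kbps MP3 settings come from the paper.

```python
def build_ydl_options(out_dir: str = "audio") -> dict:
    """yt-dlp options: extract best audio and re-encode to 192 kbps MP3."""
    return {
        "format": "bestaudio/best",
        # Illustrative filename template: playlist index + title
        "outtmpl": f"{out_dir}/%(playlist_index)s-%(title)s.%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",   # 192 kbps, as used in the pipeline
        }],
    }

def download_playlist_audio(playlist_url: str, out_dir: str = "audio") -> None:
    """Download every lecture in the playlist as an MP3 file."""
    import yt_dlp  # third-party: pip install yt-dlp
    with yt_dlp.YoutubeDL(build_ydl_options(out_dir)) as ydl:
        ydl.download([playlist_url])
```

      Only the playlist URL is required as input, matching the single-script automation described above.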

    2. Cross-Lingual Transcription

      The transcription module employs Faster-Whisper [21], an optimized CTranslate2-based implementation of OpenAI's Whisper model [6], configured with the base model size and INT8 quantization for CPU-efficient inference. The critical design choice is the use of Whisper's cross-lingual capability by setting task=translate, which performs simultaneous Hindi speech recognition and English translation in a single inference pass.

      This approach eliminates the need for a separate machine translation model, reducing pipeline complexity and preserving segment-level timestamps. Each transcription output contains timestamped segments in the format [MM:SS]. The base model was selected after empirical comparison with the tiny and small variants, offering the optimal balance between transcription speed (approximately 7-8 minutes per 10-15 minute video on CPU) and translation accuracy. The complete transcription of all 20 videos was completed in 54.3 minutes on CPU hardware, demonstrating practical deployability in resource-constrained environments [21]. This aligns with findings from Piñeiro-Martín et al. [14] that multilingual ASR requires careful balancing of model size against language coverage.
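      The transcription stage can be sketched with the faster-whisper package as below. The configuration (base model, INT8, CPU, task=translate) follows the paper; the helper names and the returned dictionary layout are our assumptions, not the authors' code.

```python
def fmt_timestamp(seconds: float) -> str:
    """Format a segment boundary as the [MM:SS] tag used in transcripts."""
    m, s = divmod(int(seconds), 60)
    return f"[{m:02d}:{s:02d}]"

def transcribe_and_translate(audio_path: str) -> list[dict]:
    """Hindi speech -> timestamped English segments in one inference pass."""
    from faster_whisper import WhisperModel  # third-party: pip install faster-whisper
    model = WhisperModel("base", device="cpu", compute_type="int8")
    # task="translate" transcribes and translates to English simultaneously
    segments, _info = model.transcribe(audio_path, task="translate")
    return [
        {"start": fmt_timestamp(seg.start),
         "end": fmt_timestamp(seg.end),
         "text": seg.text.strip()}
        for seg in segments
    ]
```

      Each returned segment carries the [MM:SS] boundaries later consumed by the chunking stage.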

    3. Timestamp-Aware Chunking

      The chunking engine processes transcript segments using a sliding window approach with a chunk size of 300 words and an overlap of 50 words. The overlap ensures that information spanning chunk boundaries is preserved, maintaining contextual coherence across adjacent chunks [28]. Each chunk retains four metadata fields: video title, video index, start timestamp, and end timestamp. This metadata preservation is the key novelty of the chunking design; it enables the RAG system to cite precise video locations in generated answers, directly supporting the educational use case of targeted video navigation.

      Processing the 20 lecture transcripts produces 131 chunks, an average of 6.55 chunks per video. The chunk size of 300 words was selected to balance semantic completeness (ensuring each chunk contains a self-contained conceptual explanation) with retrieval precision, as excessively large chunks reduce the specificity of retrieved context [1], [28].
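      The sliding-window chunking with timestamp carry-over can be sketched as follows; the exact word-to-timestamp bookkeeping in the authors' script may differ, but the 300-word window and 50-word overlap are as specified.

```python
def make_chunks(segments: list[dict], chunk_size: int = 300,
                overlap: int = 50) -> list[dict]:
    """Slide a chunk_size-word window (overlap shared words) over the
    transcript, keeping each chunk's start/end segment timestamps."""
    # Flatten segments into (word, segment_start, segment_end) triples
    words = [(w, seg["start"], seg["end"])
             for seg in segments for w in seg["text"].split()]
    step = chunk_size - overlap
    chunks = []
    for i in range(0, max(len(words) - overlap, 1), step):
        window = words[i:i + chunk_size]
        if not window:
            break
        chunks.append({
            "text": " ".join(w for w, _, _ in window),
            "start_time": window[0][1],  # timestamp of first word's segment
            "end_time": window[-1][2],   # timestamp of last word's segment
        })
    return chunks
```

      With a 250-word step, consecutive chunks share exactly 50 words, so an explanation straddling a boundary appears intact in at least one chunk.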

    4. Vector Embedding Generation

      Dense vector embeddings are generated for each chunk using the all-MiniLM-L6-v2 sentence transformer model [20], producing 384-dimensional vectors. This model was selected for its favorable trade-off between semantic quality and computational efficiency [18], requiring no GPU for inference. Embeddings for all 131 chunks are generated in 16.6 seconds on CPU hardware, demonstrating the practical feasibility of the approach for resource-constrained environments. Embeddings are generated in batches of 32 chunks to optimize memory utilization.

      The use of a pre-trained English sentence transformer for cross-lingual retrieval is justified by the upstream translation step: since all chunks are translated to English by Faster-Whisper, retrieval operates entirely in English semantic space, avoiding the complexity of multilingual embedding models [22], [27].
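      A sketch of the embedding stage with the sentence-transformers library. The model name and batch size of 32 are from the paper; the standalone batching helper is illustrative (SentenceTransformer.encode also batches internally via its batch_size parameter).

```python
def batches(items: list, size: int = 32):
    """Yield successive fixed-size batches (the pipeline uses size 32)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_chunks(texts: list[str]):
    """Encode chunk texts into 384-dim vectors with all-MiniLM-L6-v2."""
    from sentence_transformers import SentenceTransformer  # third-party
    model = SentenceTransformer("all-MiniLM-L6-v2")  # CPU-friendly, 384-dim
    return model.encode(texts, batch_size=32, show_progress_bar=False)
```

      For the 131-chunk corpus this yields five batches (four of 32 and one of 3).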

    5. ChromaDB Vector Indexing

      The generated embeddings are stored in ChromaDB [13], a persistent vector database configured with cosine similarity as the distance metric. ChromaDB employs Hierarchical Navigable Small World (HNSW) indexing for efficient approximate nearest neighbor search [15]. Each database entry stores the chunk text, its 384-dimensional embedding, and the associated metadata. The database persists on disk, eliminating the need to regenerate embeddings on subsequent system startups. The complete indexing of 131 chunks completes in under 60 seconds.
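      Indexing into ChromaDB might look like the sketch below. The cosine distance configuration and the four metadata fields are from the paper; the collection name, database path, and id scheme are our assumptions.

```python
def chunk_metadata(chunk: dict) -> dict:
    """The four metadata fields preserved with every chunk (Section IV)."""
    return {k: chunk[k] for k in
            ("video_title", "video_index", "start_time", "end_time")}

def build_index(chunks: list[dict], embeddings, db_path: str = "chroma_db"):
    """Store text, 384-dim embedding, and metadata in a persistent
    ChromaDB collection configured for cosine similarity (HNSW)."""
    import chromadb  # third-party: pip install chromadb
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(
        name="dbms_lectures",               # assumed collection name
        metadata={"hnsw:space": "cosine"},  # cosine distance metric
    )
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        documents=[c["text"] for c in chunks],
        embeddings=[list(map(float, e)) for e in embeddings],
        metadatas=[chunk_metadata(c) for c in chunks],
    )
    return collection
```

      Because the client is persistent, the index survives restarts and the embedding stage never needs to be re-run.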

    6. RAG-Based Question Answering

    At inference time, a student query is encoded using the same all-MiniLM-L6-v2 model and the top-5 most similar chunks are retrieved from ChromaDB via cosine similarity search [4]. The retrieved chunks, along with their video titles and timestamps, are formatted into a structured prompt that explicitly instructs the LLaMA 3.3 70B model [12] to answer based only on the provided lecture content and to cite specic video timestamps.

    The prompt design follows the retrieval-then-read paradigm [1], conditioning the model on retrieved context rather than relying on parametric knowledge. The LLaMA 3.3 70B model is accessed via the Groq API with temperature set to 0.3 and maximum tokens set to 1000, balancing response coherence with factual grounding. The system prompt explicitly instructs the model to reference video titles and timestamps when synthesizing answers, ensuring 100% timestamp coverage in generated responses.
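    The retrieval-then-read step can be sketched as below. The temperature (0.3), max tokens (1000), and top-5 retrieval match the paper; the prompt wording and the Groq model identifier ("llama-3.3-70b-versatile") are our assumptions, since the paper specifies only the model family and decoding parameters.

```python
def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Retrieval-then-read prompt: answer only from the retrieved
    lecture chunks and cite video titles with [MM:SS] timestamps."""
    context = "\n\n".join(
        f"[{r['video_title']} {r['start_time']}-{r['end_time']}]\n{r['text']}"
        for r in retrieved
    )
    return ("Answer the question using ONLY the lecture excerpts below. "
            "Cite the video title and timestamp for every claim.\n\n"
            f"Lecture excerpts:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

def answer_question(question, collection, encoder, top_k=5):
    """Encode the query, retrieve top-k chunks, and generate an answer."""
    query_emb = encoder.encode([question])[0]
    hits = collection.query(query_embeddings=[query_emb.tolist()],
                            n_results=top_k)
    retrieved = [dict(meta, text=doc) for meta, doc in
                 zip(hits["metadatas"][0], hits["documents"][0])]
    from groq import Groq  # third-party; reads GROQ_API_KEY from env
    client = Groq()
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Groq model id
        messages=[{"role": "user",
                   "content": build_prompt(question, retrieved)}],
        temperature=0.3,
        max_tokens=1000,
    )
    return resp.choices[0].message.content, retrieved
```

    Passing the titles and timestamps inside the context is what lets the model cite precise video locations in its answer.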

  5. Implementation

    1. Dataset Description

      The knowledge base comprises 20 DBMS lecture videos selected from a Hindi educational YouTube playlist containing 140 lectures. The selected videos cover six core DBMS topic areas: (1) Introduction and File Systems, (2) Entity-Relationship Model and Attributes, (3) Relational Keys including Primary, Candidate, Super, and Foreign Keys, (4) Normalization including 1NF, 2NF, 3NF, and BCNF, (5) SQL Commands including DDL, DML, and Aggregate Functions, and (6) Relationships including One-to-One, One-to-Many, and Many-to-Many. Each video has a duration of approximately 10-15 minutes. The complete transcription of all 20 videos was completed in 54.3 minutes using the Faster-Whisper base model [21] on CPU hardware, producing 20 JSON transcript files with a total of 131 semantic chunks after the chunking stage.

    2. Technology Stack

      Table I summarizes the complete technology stack employed in the implementation. All components are open-source or available via public APIs, ensuring reproducibility and accessibility for resource-constrained educational institutions.

    3. Backend Implementation

      The backend is implemented as a FastAPI application exposing four REST API endpoints. The /health endpoint returns system status and chunk count. The /stats

      TABLE I
      System Technology Stack

      Component     | Technology                | Config.
      Speech Recog. | Faster-Whisper [21]       | Base, INT8, CPU
      Translation   | Whisper Cross-Lingual [6] | task=translate
      Embeddings    | all-MiniLM-L6-v2 [20]     | 384-dim, b=32
      Vector DB     | ChromaDB [13]             | Cosine, HNSW
      LLM           | LLaMA 3.3 70B [12]        | temp=0.3
      Backend       | FastAPI + Uvicorn         | Port 8000
      Frontend      | React + Vite              | Port 5173
      Downloader    | yt-dlp                    | MP3, 192 kbps
      Environment   | Python 3.11               | CPU only

      endpoint returns database statistics including total videos, chunks, subject, and model information. The /ask endpoint accepts a POST request containing the student question and a top_k parameter, executes the complete RAG pipeline, and returns the generated answer, source citations with timestamps, and the processing time. The /evaluation endpoint serves pre-computed evaluation results from disk. Cross-Origin Resource Sharing (CORS) middleware is configured to permit communication with the React frontend. The RAG pipeline components, including the ChromaDB client and Groq API client, are initialized once at server startup to minimize per-request latency.

    4. Frontend Implementation

      The frontend is a single-page React application built with Vite, implementing a chat-style interface for student interaction. The sidebar displays real-time system statistics (20 lectures, 131 chunks, LLaMA 3.3 70B model information) fetched from the /stats endpoint, along with suggested questions for quick access. The main chat panel renders user questions and AI-generated answers with markdown formatting, source citation cards showing video title and timestamp range, relevance scores, and response time. The evaluation dashboard, accessible via a dedicated tab in the top navigation, presents system performance through three sub-tabs: a Metrics tab with four performance indicators displayed as progress bars, a Comparison tab with side-by-side RAG versus baseline bar charts, and a System Info tab listing complete configuration details. The frontend communicates with the backend exclusively through the Axios HTTP client library.

    5. Evaluation Framework

    A custom evaluation framework was implemented to assess system performance on 20 domain-specific DBMS questions [8], [26]. Each question is associated with a set of expected domain keywords. Four evaluation metrics are computed: Keyword Coverage measures the proportion of expected keywords present in the generated answer; Answer Completeness assigns a score based on answer word count thresholds; Source Relevance computes the average cosine similarity score of retrieved chunks; and Timestamp Coverage is a binary metric indicating whether the generated answer contains at least one video timestamp citation. The overall score is the arithmetic mean of all four metrics. Baseline evaluation employs direct chunk retrieval without LLM generation, serving as the comparative reference for assessing RAG improvement.
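    Using the metric definitions of Eqs. (1)-(5), the four scores can be computed as in this sketch; the word-count thresholds follow Eq. (2), the [MM:SS] regex reflects the transcript timestamp format, and the helper names are ours.

```python
import re

def keyword_coverage(answer: str, expected_keywords: list[str]) -> float:
    """KC: fraction of expected keywords appearing in the answer, Eq. (1)."""
    found = sum(1 for kw in expected_keywords
                if kw.lower() in answer.lower())
    return found / len(expected_keywords)

def answer_completeness(answer: str) -> float:
    """AC: word-count thresholds, Eq. (2)."""
    n = len(answer.split())
    if n < 50:
        return 0.3
    if n < 100:
        return 0.6
    if n < 200:
        return 0.8
    return 1.0

def source_relevance(distances: list[float]) -> float:
    """SR: mean of (1 - cosine distance) over the top-k chunks, Eq. (3)."""
    return sum(1 - d for d in distances) / len(distances)

def timestamp_coverage(answer: str) -> float:
    """TC: 1.0 if at least one [MM:SS] citation appears, Eq. (4)."""
    return 1.0 if re.search(r"\[\d{2}:\d{2}\]", answer) else 0.0

def overall_score(kc: float, ac: float, sr: float, tc: float) -> float:
    """OS: arithmetic mean of the four metrics, Eq. (5)."""
    return (kc + ac + sr + tc) / 4
```

    As a sanity check, the reported RAG metric values (0.792, 0.99, 0.74, 1.0) average to 0.8805, matching the 88.0% overall score.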

  6. Experimental Evaluation

    1. Evaluation Setup

      Evaluation was conducted on a curated set of 20 domain-specific DBMS questions covering all major topic areas present in the lecture corpus. The questions span six categories: relational keys (primary key, candidate key, super key, foreign key), entity-relationship modeling (ER model, attribute types, relationship types), normalization (1NF, 2NF, 3NF, BCNF, functional dependency), SQL (DDL commands, DML commands, aggregate functions, joins, nested queries), transaction control (ACID properties, conflict serializability, two-phase locking), and indexing (B+ tree, single-level indexing). All experiments were conducted on CPU-only hardware running Python 3.11 on Windows, without any GPU acceleration, demonstrating the practical deployability of the system in resource-constrained environments.

    2. Evaluation Metrics

      Following established evaluation practices for RAG systems [8], [26], four metrics were used to evaluate system performance.

      Keyword Coverage (KC): The proportion of domain-specific expected keywords present in the generated answer:

          KC = (keywords found in answer) / (total expected keywords)    (1)

      Answer Completeness (AC): A score based on answer word count, reflecting response detail:

          AC = 0.3 if words < 50
               0.6 if 50 <= words < 100
               0.8 if 100 <= words < 200
               1.0 if words >= 200    (2)

      Source Relevance (SR): The average cosine similarity score of the top-k retrieved chunks with respect to the query [4]:

          SR = (1/k) * sum_{i=1..k} (1 - d_i)    (3)

      where d_i is the cosine distance of the i-th retrieved chunk.

      Timestamp Coverage (TC): A binary metric indicating whether at least one video timestamp is cited in the generated answer:

          TC = 1.0 if a timestamp is present in the answer
               0.0 otherwise    (4)

      The Overall Score (OS) is the arithmetic mean of all four metrics:

          OS = (KC + AC + SR + TC) / 4    (5)

    3. Quantitative Results

      Table II presents the comparative evaluation results of the proposed RAG system against the baseline keyword retrieval approach across all four metrics. The RAG system achieves an overall score of 88.0%, a 15.9 percentage point gain (+22.2% relative improvement) over the baseline score of 72.1%. The most notable gains are Keyword Coverage (+32.0 pp) and Timestamp Coverage (+100.0 pp), consistent with the evaluation dashboard shown in the deployed application.

      TABLE II
      Evaluation Results: RAG System vs. Baseline

      Metric              | Baseline | RAG System | Gain (pp)
      Keyword Coverage    | 47.2%    | 79.2%      | +32.0
      Answer Completeness | 95.0%    | 99.0%      | +4.0
      Source Relevance    | 74.0%    | 74.0%      | 0.0
      Timestamp Coverage  | 0.0%     | 100.0%     | +100.0
      Overall Score       | 72.1%    | 88.0%      | +15.9
      Avg. Response Time  | 1.10 s   | 6.62 s     | -

      pp = percentage points absolute gain; relative improvement = +22.2%.

    4. Per-Category Analysis

      Table III presents per-category performance scores for the RAG system across the six topic areas evaluated. Scores are reported on a 0-1 scale (1.0 = 100%).

      TABLE III
      Per-Category RAG System Performance (0-1 Scale)

      Topic Category               | Score | Observation
      Relational Keys (PK, CK, FK) | 0.955 | Highest coverage
      SQL Commands (DDL, DML)      | 0.910 | Strong performance
      Normalization (1NF-BCNF)     | 0.875 | Good coverage
      Transaction Control / ACID   | 0.842 | Adequate
      ER Model / Relationships     | 0.820 | Moderate
      Attribute Types (ER Model)   | 0.653 | Weakest

    5. System Configuration Summary

      Table IV records the key system configuration parameters as reported in the deployed application's System Info tab, providing full reproducibility context.

      TABLE IV
      System Configuration Parameters

      Parameter              | Value
      RAG Response Time      | 6.62 s
      Baseline Response Time | 1.10 s
      Test Questions         | 20
      Knowledge Chunks       | 131
      Lecture Videos         | 20
      LLM Model              | LLaMA 3.3 70B
      Embedding Model        | MiniLM-L6-v2
      Vector Database        | ChromaDB
      Speech Recognition     | Whisper Base
      Framework              | FastAPI + React

    6. Discussion

      The proposed RAG system demonstrates substantial improvements over the baseline across three of four evaluation metrics. The most significant gain is observed in Keyword Coverage (+32.0 percentage points), confirming that LLM-generated answers [12] incorporate domain-appropriate terminology more comprehensively than direct chunk retrieval.

    This nding aligns with the broader observation in the RAG literature [1], [2] that LLM generation substantially enriches retrieval-only approaches through linguistic generalization.
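The four evaluation metrics above (Eqs. (2)–(5)) are simple enough to sketch directly in code. The following is a minimal illustration, not the paper's actual implementation: it assumes plain whitespace word counting and a hypothetical `MM:SS`-style timestamp pattern for the TC check, and it takes the Keyword Coverage (KC) score as a precomputed input.

```python
import re

def answer_completeness(answer: str) -> float:
    """AC, Eq. (2): banded score from the answer's word count."""
    w = len(answer.split())
    if w < 50:
        return 0.3
    if w < 100:
        return 0.6
    if w < 200:
        return 0.8
    return 1.0

def source_relevance(distances: list[float]) -> float:
    """SR, Eq. (3): mean cosine similarity (1 - distance) of the top-k chunks."""
    return sum(1.0 - d for d in distances) / len(distances)

def timestamp_coverage(answer: str) -> float:
    """TC, Eq. (4): 1.0 if any MM:SS or HH:MM:SS citation appears (assumed format)."""
    return 1.0 if re.search(r"\d{1,2}:\d{2}(:\d{2})?", answer) else 0.0

def overall_score(kc: float, answer: str, distances: list[float]) -> float:
    """OS, Eq. (5): arithmetic mean of the four metrics."""
    return (kc
            + answer_completeness(answer)
            + source_relevance(distances)
            + timestamp_coverage(answer)) / 4.0
```

Because TC is binary and AC is banded, only SR (and KC) contribute fine-grained variation to the overall score.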

    Timestamp Coverage of 100% represents the most distinctive contribution of the proposed system. The baseline system, which returns raw chunk text without LLM generation, cannot produce timestamp citations by design, resulting in a score of 0%. The proposed system cites precise video timestamps in every generated response, directly addressing the core educational use case of enabling students to locate specific content within the source lectures. This capability is entirely absent in existing educational RAG systems [7], [9].

    Source Relevance scores are identical for the two systems (74.0%), as both employ the same ChromaDB semantic retrieval mechanism [13]. This confirms that the retrieval component performs consistently regardless of the downstream generation approach [4], validating the modular design of the pipeline.
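The retrieval step shared by both systems can be illustrated with a plain-Python cosine-similarity top-k search. This is a simplified stand-in for the ChromaDB query, with illustrative function names and toy vectors rather than real MiniLM embeddings:

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (vectors assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, chunk_vecs, k=5):
    """Rank every chunk vector by distance to the query and keep the k nearest.

    Returns (chunk_index, distance) pairs, nearest first.
    """
    scored = sorted((cosine_distance(query_vec, v), i)
                    for i, v in enumerate(chunk_vecs))
    return [(i, d) for d, i in scored[:k]]
```

The exhaustive sort here stands in for ChromaDB's indexed search; the distances it returns are exactly what Eq. (3) averages into the SR metric.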

    The higher average response time of the RAG system (6.62 s versus 1.10 s for the baseline) is attributable to the additional LLM inference step via the Groq API. This is an acceptable trade-off given the substantial quality improvements across the other metrics. Response times in the range of 6–7 seconds are well within acceptable thresholds for educational question-answering applications [19], where users expect thoughtful, detailed responses rather than instantaneous replies.

    Per-category analysis reveals performance variation across topic areas. Questions on well-covered topics such as foreign keys (0.948), candidate keys (0.955), and primary keys (0.950) achieve the highest scores, reflecting strong lecture coverage in the knowledge base. Questions on more abstract topics such as attribute types in the ER model (0.653) show lower scores, suggesting that cross-lingual transcription may introduce noise for nuanced technical terminology [14], [22]. This finding motivates future work on domain-adapted embedding models [27].

  7. Conclusion and Future Work

This paper presented the complete implementation and evaluation of a Cross-Lingual Retrieval-Augmented Generation system for intelligent DBMS education. The proposed system successfully addresses two critical gaps in the existing educational AI literature [7], [9]: the inability to process regional-language video content, and the absence of timestamp-aware retrieval enabling precise video navigation for students.

The system processes 20 Hindi DBMS lecture videos through a ten-stage, two-phase pipeline comprising audio extraction, cross-lingual transcription via Faster-Whisper [6], [21], timestamp-aware chunking [28], dense vector embedding using all-MiniLM-L6-v2 [20], and ChromaDB indexing [13], resulting in a knowledge base of 131 semantic chunks. At inference time, the RAG pipeline [1] retrieves the top-5 most relevant chunks and generates coherent, source-grounded answers using LLaMA 3.3 70B [12] via the Groq API. The complete system is deployed as a full-stack web application with a FastAPI backend and React frontend featuring an integrated evaluation dashboard.
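The timestamp-aware chunking stage described above can be sketched as a sliding word window of 300 words with 50-word overlap, where each chunk keeps the timestamp of its first word so generated answers can cite a video position. The word/timestamp pairing and field names below are illustrative assumptions, not the paper's exact code:

```python
def chunk_transcript(words, chunk_size=300, overlap=50):
    """Split a transcript into overlapping word chunks tagged with start times.

    `words` is a list of (word, start_seconds) pairs, as a segment-level
    ASR transcript would provide.
    """
    chunks = []
    step = chunk_size - overlap  # advance 250 words per chunk
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if not window:
            break
        chunks.append({
            "text": " ".join(w for w, _ in window),
            "start_time": window[0][1],  # timestamp carried into retrieval
        })
        if start + chunk_size >= len(words):
            break  # final window already covers the transcript tail
    return chunks
```

The 50-word overlap ensures that a sentence straddling a chunk boundary remains fully contained in at least one chunk, while the preserved `start_time` is what ultimately surfaces as a timestamp citation in the generated answer.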

Experimental evaluation on 20 domain-specific DBMS questions demonstrates that the proposed system achieves an overall score of 88.0%, outperforming baseline keyword retrieval by 15.9 percentage points (a 22.2% relative improvement). Notably, the system achieves 100% timestamp coverage across all responses, a capability entirely absent in the baseline, enabling students to navigate directly to the relevant moment in the source lecture videos. The system operates entirely on CPU hardware, requiring no GPU infrastructure, validating its practical deployability. The 6.62 s RAG response time and the lightweight MiniLM-L6-v2 embedding model confirm that the system is suitable for deployment on standard academic hardware.

The key contributions are: a practical cross-lingual RAG pipeline for educational video content; a timestamp-aware chunking mechanism preserving temporal metadata throughout the retrieval pipeline; a custom four-metric evaluation framework for educational RAG systems [8], [26]; and an end-to-end deployable web application demonstrating viability on CPU-only hardware.

Future work will explore several directions. First, extension of the knowledge base to the complete playlist of 140 lectures to assess scalability and retrieval performance at larger corpus sizes, consistent with the scalability considerations identified in [2], [23]. Second, investigation of domain-adapted embedding models [27] trained on technical educational content to improve retrieval precision for abstract concepts. Third, integration of re-ranking mechanisms [3] to further improve retrieved chunk relevance. Fourth, extension to additional subjects and regional languages beyond Hindi [14], [22] to validate the generalizability of the cross-lingual pipeline. Fifth, incorporation of learner modeling [10] to personalize responses based on individual student knowledge levels and learning history. Finally, integration with knowledge graph representations [24] could enhance structured concept navigation within the DBMS domain.

References

  1. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-T. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, in Advances in Neural Information Processing Systems (NeurIPS 2020), vol. 33, pp. 9459–9474, 2020.

  2. Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, Retrieval-Augmented Generation for Large Language Models: A Survey, arXiv:2312.10997, 2023.

  3. W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models, in Proc. 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA: ACM, Aug. 2024, pp. 6491–6501. doi: 10.1145/3637528.3671470.

  4. V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-T. Yih, Dense Passage Retrieval for Open-Domain Question Answering, in Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 6769–6781, 2020. doi: 10.18653/v1/2020.emnlp-main.550.

  5. C.-Y. Lin, K. Kamahori, Y. Liu, X. Shi, M. Kashyap, Y. Gu, R. Shao, Z. Ye, K. Zhu, R. Kadekodi, S. Wang, A. Krishnamurthy, L. Ceze, and B. Kasikci, TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval, arXiv:2502.20969, 2025.

  6. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, Robust Speech Recognition via Large-Scale Weak Supervision, in Proc. 40th International Conference on Machine Learning (ICML 2023), PMLR, vol. 202, pp. 28492–28518, 2023.

  7. U. H. Khan, M. H. Khan, and R. Ali, Large Language Model Based Educational Virtual Assistant Using RAG Framework, Procedia Computer Science, vol. 252, pp. 905–911, 2025. doi: 10.1016/j.procs.2025.01.051.

  8. S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, RAGAS: Automated Evaluation of Retrieval Augmented Generation, arXiv:2309.15217, 2023.

  9. R. Németh, A. Tátrai, M. Szabó, and Á. Tamási, Using a RAG-Enhanced Large Language Model in a Virtual Teaching Assistant Role: Experiences from a Pilot Project in Statistics Education, Hungarian Statistical Review, vol. 7, no. 2, pp. 3–27, 2024. doi: 10.35618/hsr2024.02.en003.

  10. G. Yadav, Y.-J. Tseng, and X. Ni, Contextualizing Problems to Student Interests at Scale in Intelligent Tutoring System Using Large Language Models, arXiv:2306.00190, May 2023.

  11. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open Foundation and Fine-Tuned Chat Models, arXiv:2307.09288, Jul. 2023.

  12. A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, et al., The Llama 3 Herd of Models, arXiv:2407.21783, Jul. 2024.

  13. J. J. Pan, J. Wang, and G. Li, Survey of Vector Database Management Systems, arXiv:2310.14021, Oct. 2023.

  14. A. Piñeiro-Martín, C. García-Mateo, L. Docio-Fernández, M. del C. López-Pérez, and G. Rehm, Weighted Cross-Entropy for Low-Resource Languages in Multilingual Speech Recognition, in Proc. Interspeech 2024, pp. 1235–1239. doi: 10.21437/Interspeech.2024-734.

  15. H. Zhang, J. Liu, Z. Zhu, S. Zeng, M. Sheng, T. Yang, G. Dai, and Y. Wang, Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors Using Graph-Based Approximate Nearest Neighbor Search, arXiv:2410.20381, Oct. 2024.

  16. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding, in Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), Minneapolis, MN, Jun. 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423.

  17. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention Is All You Need, in Proc. Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, 2017, pp. 5998–6008.

  18. C. Galli, N. Donos, and E. Calciolari, Performance of 4 Pre-Trained Sentence Transformer Models in the Semantic Query of a Systematic Review Dataset on Peri-Implantitis, Information, vol. 15, no. 2, p. 68, Jan. 2024. doi: 10.3390/info15020068.

  19. Z. H. Sain, A. Vasudevan, and A. V. Lama, The Emerging Future of AI Chatbots in Higher Education, Jurnal Ilmiah Didaktika, vol. 25, no. 1, pp. 93–107, Aug. 2024.

  20. N. Reimers and I. Gurevych, Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks, in Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, Nov. 2019, pp. 3982–3992. doi: 10.18653/v1/D19-1410.

  21. D. Shah, R. Saboo, A. Dwivedi, and M. Gajjar, Integrating Faster Whisper with Deep Learning Speaker Recognition, International Journal of Computer Science and Mobile Computing, vol. 13, no. 9, pp. 1–8, Sep. 2024. doi: 10.47760/ijcsmc.2024.v13i09.001.

  22. J. Wang, Cross-Lingual Transfer Learning for Low-Resource Natural Language Processing Tasks, M.S. thesis, Inst. Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany, Feb. 2021.

  23. M. Cheng, Y. Luo, J. Ouyang, Q. Liu, H. Liu, L. Li, S. Yu, B. Zhang, J. Cao, J. Ma, D. Wang, and E. Chen, A Survey on Knowledge-Oriented Retrieval-Augmented Generation, arXiv:2503.10677, 2025.

  24. L. Zhong, J. Wu, Q. Li, H. Peng, and X. Wu, A Comprehensive Survey on Automatic Knowledge Graph Construction, ACM Computing Surveys, vol. 56, no. 4, Article 94, Nov. 2023. doi: 10.1145/3618295.

  25. M. F. ben Hassen, M. R. Bougherira, N. S. Alrayes, A. M. Alqahtani, N. Frih, and K. Ke, Enhancing Student Performance and Retention Through Synergistic Teacher and Peer-Recorded Videos in College Algebra, SAGE Open, Oct.–Dec. 2025. doi: 10.1177/21582440251400560.

  26. A. Farea, Z. Yang, K. Duong, N. Perera, and F. Emmert-Streib, Evaluation of Question Answering Systems: Complexity of Judging a Natural Language, ACM Computing Surveys, vol. 58, no. 1, Article 1, Sep. 2025. doi: 10.1145/3744663.

  27. X. Zhang, K. Ogueji, X. Ma, and J. Lin, Toward Best Practices for Training Multilingual Dense Retrieval Models, ACM Transactions on Information Systems, vol. 42, no. 2, Article 39, Sep. 2023. doi: 10.1145/3613447.

  28. C. A. Gomez-Cabello, S. Borna, S. M. Pressman, S. A. Haider, A. Genovese, B. G. Collaco, N. G. Wood, and A. J. Forte, Comparative Evaluation of Advanced Chunking for RAG in Large Language Models for Clinical Decision Support, Bioengineering, vol. 12, no. 11, p. 1194, 2025. doi: 10.3390/bioengineering12111194.