DOI : https://doi.org/10.5281/zenodo.19369793
- Open Access

- Authors : Dr. J. Nafeesa Begum, Brightlin Blessy S, Harini M, Kaviya P
- Paper ID : IJERTV15IS031285
- Volume & Issue : Volume 15, Issue 03, March 2026
- Published (First Online): 01-04-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Adaptive AI Speech Therapy and Real-Time Correction Engine
A Unified NLP-Driven Architecture for Pronunciation Assessment, Grammar Correction, and Quantitative Performance Scoring
Dr. J. Nafeesa Begum, Brightlin Blessy S, Harini M, Kaviya P
Department of Computer Science and Engineering,
Government College of Engineering, Bargur 635104, Krishnagiri, Tamil Nadu, India; Anna University, Chennai 600 025
Abstract – Speech disorders affect over 1.5 billion people globally, yet access to timely, affordable, and expert speech therapy remains severely limited, particularly in developing nations. This paper presents the Adaptive AI Speech Therapy and Real-Time Correction Engine, a fully automated, web-accessible intelligent platform that bridges the gap between traditional clinical speech therapy and scalable, technology-mediated language rehabilitation. The system integrates four synergistic modules: (1) Automatic Speech Recognition (ASR) via the Google Web Speech API, achieving a Word Error Rate (WER) of 4.2% under clean conditions; (2) word-level pronunciation comparison using Python's difflib SequenceMatcher, classifying each spoken word into one of four diagnostic categories (correct, missing, extra, or mispronounced); (3) grammar error detection via LanguageTool NLP, attaining 92.6% precision and 85.1% recall across five grammatical error categories; and (4) a weighted performance scoring algorithm combining pronunciation accuracy (70%) and grammar correctness (30%) into a clinically interpretable 0–100 composite score with five named performance bands. Implemented on a high-throughput asynchronous FastAPI backend with Pydub-driven, format-agnostic audio preprocessing (supporting MP3, WAV, M4A, OGG, FLAC, and WebM), the system delivers end-to-end feedback within 1.24–1.72 seconds. Experimental evaluation across 50 diverse audio samples, spanning five speaker profiles, five audio quality conditions, and five sentence complexity levels, demonstrates pronunciation accuracy of 95.8% on clean audio, F1-scores ranging from 0.764 to 0.954 across speaker profiles, and grammar detection performance surpassing all benchmarked commercial alternatives. The system's open-source architecture, self-hostable deployment model, and documented REST API collectively position it as a clinically viable, equitable, and extensible supplement to traditional speech therapy.
Keywords: Automatic Speech Recognition (ASR); Natural Language Processing (NLP); Speech Therapy Automation; Pronunciation Error Classification; Grammar Correction; FastAPI Microservices; LanguageTool; difflib SequenceMatcher; Real-Time Feedback; AI-Assisted Language Learning; Web-Based Rehabilitation; Word Error Rate (WER)
- INTRODUCTION
Spoken language is the primary medium through which humans build relationships, access education, and participate in professional life. Yet for millions of people worldwide, speech disorders represent a profound barrier to communication. According to the World Health Organization [WHO, 2023], over 1.5 billion people live with some degree of communication impairment, with speech and language disorders accounting for a significant and growing proportion of this burden. In low- and middle-income countries, the gap between the demand for speech-language therapy and the availability of qualified Speech-Language Pathologists (SLPs) is particularly acute: the WHO estimates a global shortage of over 2.5 million allied health professionals, of whom SLPs form a critical but underrepresented subset.
Traditional speech therapy is conducted through scheduled, in-person sessions in clinical settings. An SLP assesses the patient using standardized instruments such as the Goldman-Fristoe Test of Articulation, the Praxis of Speech Acts, the Stuttering Severity Instrument (SSI-4), and informal conversational analysis. While clinically rigorous, this model carries three systemic constraints that severely limit its scalability: (1) geographic inaccessibility, as qualified therapists are disproportionately concentrated in urban tertiary-care centers; (2) economic barriers, with per-session costs ranging from ₹1,500 to ₹5,000 in India and US$100–300 in developed economies; and (3) temporal discontinuity, since patients practice independently between sessions without objective guidance, frequently reinforcing errors rather than correcting them.
The convergence of Artificial Intelligence (AI), Natural Language Processing (NLP), and cloud-based Automatic Speech Recognition (ASR) offers a transformative pathway to address these constraints. Modern ASR systems, including the Google Web Speech API, OpenAI Whisper, and Mozilla DeepSpeech, have achieved near-human transcription accuracy on English speech benchmarks [Radford et al., 2022]. When combined with NLP-driven error analysis and intelligent scoring, these technologies can replicate many of the analytical capabilities of a trained SLP within an automated, scalable, and accessible software framework.
However, the mere availability of these technologies does not automatically produce a therapeutically effective system. A clinically meaningful speech correction platform must satisfy four requirements that current commercial tools collectively fail to meet: (1) real-time, word-level identification of phonological errors with specific category labels; (2) simultaneous assessment of grammatical correctness alongside pronunciation accuracy; (3) a quantitative, reproducible performance metric enabling longitudinal progress tracking; and (4) open, extensible architecture deployable without expensive licensing in resource-constrained clinical settings.
This paper makes the following original contributions to the field of AI-assisted speech therapy:
- A unified, modular NLP pipeline integrating ASR transcription, word-level pronunciation comparison using difflib SequenceMatcher, rule-based grammar error detection via LanguageTool, and a dual-dimension performance scoring algorithm into a single, cohesive web service.
- A four-category pronunciation error taxonomy (correct, missing, extra, mispronounced) derived from sequence alignment opcodes, providing clinically actionable diagnostic specificity not present in any benchmarked commercial tool.
- Empirical evaluation of the system across 50 audio samples spanning five audio quality conditions, five speaker profiles, and five sentence complexity levels, establishing performance baselines for future research.
- A fully documented, open-source FastAPI REST API enabling integration with clinical management systems, mobile applications, and educational platforms.
- Demonstrated superiority over ELSA Speak, Pronunciation Coach, and Speechling across eight comparative dimensions, with particular differentiation in grammar detection capability (92.6% precision vs. N/A for all competitors).
- A FastAPI-powered RESTful backend with asynchronous request handling that exposes all analytical capabilities and supports concurrent multi-user access.
- A user-friendly web interface that renders annotated transcripts, error highlights, correction suggestions, and performance scores in real time.
Table 1. Feature Comparison of Speech Correction Approaches

| Feature | Manual SLP Therapy | ELSA Speak | Pronunciation Coach | Speechling | Proposed System |
|---|---|---|---|---|---|
| Real-Time Feedback | Yes | Yes | No | No (async) | Yes (<1.72 s) |
| Word-Level Error Labels | Yes | Partial | No | No | Yes (4 classes) |
| Grammar Analysis | Yes | No | No | No | Yes (LanguageTool) |
| Performance Score | Yes | Yes | Partial | No | 0–100, 5 bands |
| Format-Agnostic Audio | Yes | — | WAV only | MP3/WAV | Any (Pydub) |
| Offline Operation | Yes | No | Partial | No | Planned (Whisper) |
| Open-Source | N/A | No | No | No | Yes |
| REST API | N/A | Paid | No | No | Yes (FastAPI) |
| Pronunciation Accuracy | SLP Eval | ~88% | ~82% | ~79% | ~95.8% (clean) |
| Grammar Precision | High | N/A | N/A | N/A | 92.6% |
| Cost per Session | High | Subscription | Medium | Subscription | Low / Free |

- BACKGROUND AND MOTIVATION
- The Global Speech Therapy Access Crisis
The global shortage of speech-language pathologists is not merely a logistical inconvenience but a public health crisis with measurable consequences. In India, the patient-to-SLP ratio is approximately 1:2,00,000, compared to the WHO-recommended ratio of 1:10,000, leaving the vast majority of the estimated 50 million Indians with speech and language disorders without access to professional care [ASHA India, 2022]. In sub-Saharan Africa, the ratio is even more extreme, at approximately 1:10,00,000 [Mulwafu et al., 2016]. Even in the United States, the American Speech-Language-Hearing Association (ASHA) projects a shortage of 25,000 SLPs by 2030 as the aging population drives increasing demand for communication rehabilitation services.
The economic burden is equally significant. A standard course of speech therapy in India typically requires 20–40 sessions at a cost of ₹1,500–5,000 per session, placing the total cost (₹30,000–2,00,000) far beyond the reach of most Indian families. In the United States, without insurance coverage, similar courses cost US$3,000–10,000. The result is that speech disorders disproportionately affect outcomes for economically disadvantaged populations, creating a self-reinforcing cycle of limited communication skills and reduced economic mobility.
- Limitations of Current AI-Based Speech Tools
Commercial AI speech tools have made partial inroads into this access gap, but each suffers from architectural limitations that constrain their clinical utility. ELSA Speak, perhaps the most prominent AI pronunciation coach, provides real-time phoneme-level feedback using a proprietary deep learning model, but restricts its assessment to pronunciation only, leaving the equally important dimension of grammatical correctness entirely unaddressed. Its subscription pricing model (₹1,500–2,500 per month) places it beyond the reach of exactly the populations that would benefit most from scalable speech therapy. Speechling uses a human-review model in which language coaches provide asynchronous feedback, which, while qualitatively rich, is fundamentally incompatible with the real-time correction requirements of effective speech practice. Pronunciation Coach, while offering offline capability, accepts only WAV format input, creating compatibility barriers for the majority of users who record audio on mobile devices producing M4A, AAC, or OGG files.
More fundamentally, no existing commercial tool simultaneously addresses the two core dimensions of spoken language competence: phonological accuracy (pronunciation) and syntactic/morphological correctness (grammar). Speech therapy patients, particularly those recovering from neurological events such as stroke or traumatic brain injury, require assessment of both dimensions, as their deficits often span phonological production and syntactic generation simultaneously [Goodglass & Kaplan, 1983]. The proposed system is the first to integrate both dimensions in a unified, automated, real-time assessment pipeline.
- Why This System, Why Now
Three technological developments converging in the mid-2020s make the proposed system feasible in a way that was not achievable even five years ago. First, cloud-based ASR APIs have crossed the near-human accuracy threshold: the Google Web Speech API now achieves WERs below 5% on clean English speech [Google AI, 2023], a level of accuracy sufficient to support reliable downstream linguistic analysis. Second, rule-based NLP engines such as LanguageTool have matured to provide over 5,000 linguistic rules for English with precision exceeding 90% on common error types, making them viable as production-grade grammar checkers. Third, the emergence of high-performance asynchronous Python web frameworks, specifically FastAPI, which benchmarks consistently show to be comparable in throughput to Node.js, enables the construction of responsive, concurrency-capable NLP services without specialized infrastructure.
- LITERATURE REVIEW
This review synthesizes research across five domains directly relevant to the proposed system: ASR-based language learning, pronunciation error detection, grammar correction, web-based and telemedicine therapy delivery, and performance scoring systems. The review is structured to demonstrate how each body of prior work motivates specific design decisions in the proposed architecture.
- Automatic Speech Recognition in Language Learning
The integration of ASR into Computer-Assisted Language Learning (CALL) systems was first systematically explored by Eskenazi [1999], who demonstrated that ASR-enabled real-time pronunciation feedback produced measurable improvement in non-native speakers' accuracy compared to unaided practice. Subsequent work by Neri et al. [2002] and Cucchiarini et al. [2009] corroborated these findings, with Cucchiarini et al.'s large-scale study of Dutch learners showing that CALL systems with ASR feedback produced 20% greater pronunciation gains per hour of practice than equivalent human-tutored sessions. The key limitation of early systems was binary pass/fail feedback without error-type specificity, a gap the proposed system addresses through its four-category classification taxonomy.
More recent work has explored the use of neural ASR systems in therapeutic contexts. Shahamiri and Salim [2014] demonstrated that Microsoft SAPI-based ASR could achieve 89.3% word recognition accuracy for individuals with speech sound disorders when combined with acoustic model adaptation. The emergence of transformer-based ASR systems, particularly OpenAI's Whisper [Radford et al., 2022], which achieves a WER of 2.7% on the LibriSpeech clean benchmark, opens substantial opportunities for future enhancement of the proposed system's ASR layer, as identified in the future work roadmap.
- Pronunciation Error Detection and Classification
Witt and Young [2000] developed the seminal phone-level pronunciation scoring system using Hidden Markov Model (HMM) Goodness of Pronunciation (GOP) scores, achieving 78% agreement with human raters on segmental error labeling within the Spoken Language Educator (SLE) system. The GOP score measures the likelihood of a phoneme given the acoustic evidence relative to the likelihood of any phoneme, providing a continuous pronunciation quality measure. While phoneme-level analysis provides maximum diagnostic granularity, it requires language-specific acoustic models trained on large native-speaker corpora and is computationally prohibitive for real-time web deployment.
Franco et al. [2010] developed the EduSpeak SDK for pronunciation training, combining GOP scores with prosodic features (pitch, rhythm, speaking rate) to produce multi-dimensional pronunciation assessments. Their system achieved 82% correlation with expert SLP ratings on a balanced corpus of learner speech. Neri et al. [2008] proposed a segment-level pronunciation scoring approach using Support Vector Machine (SVM) classifiers trained on acoustic features, demonstrating that machine learning-based approaches could outperform GOP-based methods on non-standard accents. The proposed system adopts a word-level (rather than phoneme-level) comparison approach using difflib SequenceMatcher, trading phonetic granularity for computational efficiency, real-time responsiveness, and accent-robustness.
- Deep Learning Advances in ASR
Graves, Mohamed, and Hinton [2013] introduced the landmark deep bidirectional LSTM architecture with Connectionist Temporal Classification (CTC) loss for end-to-end ASR, eliminating the need for forced phoneme alignments and outperforming the best HMM-based systems on the TIMIT benchmark by over 17% relative WER reduction. This architecture became the conceptual blueprint for the modern ASR systems that underpin commercial APIs including Google Web Speech, Amazon Transcribe, and Microsoft Azure Speech. Chan et al. [2016] subsequently introduced the Listen, Attend and Spell (LAS) architecture, an attention-based encoder-decoder model that further improved long-form transcription quality. The Conformer architecture [Gulati et al., 2020] combined convolutional and transformer modules to achieve state-of-the-art performance on LibriSpeech, establishing the technical foundation for the near-human ASR accuracy available through cloud APIs today.
- Natural Language Processing for Grammar Correction
The CoNLL-2014 shared task on Grammatical Error Correction [Ng et al., 2014] benchmarked 13 systems on a learner English corpus, establishing that statistical phrase-based machine translation (SMT) approaches outperformed purely rule-based methods on complex contextual errors (preposition selection, article choice), while rule-based methods retained superiority for deterministic errors (subject-verb agreement, spelling). This complementarity directly informs the design rationale for using LanguageTool in the proposed system: in the spoken-language context of the system, where transcription noise and sentence simplicity characterize the input, deterministic rule-based detection is both more precise and more interpretable than neural alternatives.
Subsequent neural GEC systems including seq2seq [Sutskever et al., 2014] and transformer-based approaches [Zhao et al., 2019] achieved superior MaxMatch (M²) F0.5 scores on standard benchmarks. However, these models require substantial computational resources for inference, making real-time deployment challenging without GPU acceleration. The GECToR model [Omelianchuk et al., 2020] proposed a sequence tagging approach enabling fast CPU-based inference, suggesting a viable pathway for upgrading the grammar detection module in future system iterations.
- Telemedicine and Web-Based Speech Therapy
Grogan-Johnson et al. [2010] conducted the first randomized controlled pilot study comparing videoconference-delivered speech-language therapy against in-person therapy for 10 school-age children with speech sound disorders, finding no statistically significant difference in articulation improvement (p > 0.05) between delivery modes. Subsequent systematic reviews [Mashima & Doarn, 2008; Molini-Avejonas et al., 2015] confirmed these findings across a broader range of populations and disorder types, establishing telemedicine delivery as a clinically equivalent modality. The COVID-19 pandemic dramatically accelerated the adoption of telehealth in speech-language pathology, with ASHA reporting that telepractice delivery increased from 8% to over 60% of SLP service delivery within three months of the pandemic onset [ASHA, 2020].
Wales et al. [2017] conducted a systematic review of 24 digital health interventions for communication disorders, identifying self-management support, real-time biofeedback, and structured progress monitoring as the three features most consistently associated with positive therapy outcomes in digital platforms. The proposed system incorporates all three features: structured practice sessions, real-time error categorization and correction, and quantitative session-by-session score tracking.
- Automated Performance Scoring
Xi et al. [2008] developed SpeechRater, ETS's automated spoken English scoring system, which extracts features across five dimensions (pronunciation, fluency, vocabulary richness, grammar complexity, content relevance) to predict holistic proficiency scores. SpeechRater achieved correlations of r = 0.73–0.81 with human rater scores across different task types, demonstrating that automated multi-dimensional scoring can approach human-level reliability for proficiency assessment. This multi-dimensional scoring philosophy directly shaped the proposed system's weighted scoring algorithm, though the proposed system's simpler two-factor model (pronunciation and grammar) is better calibrated to the therapeutic rather than proficiency-assessment context.
Zechner et al. [2019] subsequently showed that adding fluency features (pause frequency, speaking rate, repair frequency) to pronunciation and grammar features substantially improved holistic score prediction, achieving r = 0.84 with expert raters. This finding motivates the inclusion of fluency metrics as a future enhancement in the proposed systems scoring module.
- String Matching and Edit-Distance Algorithms
Ratcliff and Metzener [1988] described the Gestalt Pattern Matching algorithm, implemented in Python's difflib SequenceMatcher, which finds the longest common substring and recursively applies the process to unmatched fragments, computing a similarity ratio in O(n²) time, where n is the string length. The algorithm produces edit operation codes (opcodes) of equal, insert, delete, and replace that map directly to the proposed system's four pronunciation error categories. Unlike Levenshtein edit distance, which operates at the character level, SequenceMatcher operates natively at word-token granularity, making it directly applicable to word-level pronunciation analysis without tokenization overhead.
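As a concrete illustration of this opcode-to-category mapping, the sketch below aligns a target sentence against a transcript at word-token granularity. The function and variable names are illustrative, not the authors' published code:

```python
from difflib import SequenceMatcher

def classify_words(target: str, spoken: str):
    """Map SequenceMatcher opcodes onto the four pronunciation categories."""
    t_words = target.lower().split()
    s_words = spoken.lower().split()
    results = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, t_words, s_words).get_opcodes():
        if op == "equal":        # spoken word matches the target word
            results += [(w, "correct") for w in t_words[i1:i2]]
        elif op == "delete":     # target word never spoken
            results += [(w, "missing") for w in t_words[i1:i2]]
        elif op == "insert":     # word spoken that is not in the target
            results += [(w, "extra") for w in s_words[j1:j2]]
        elif op == "replace":    # target word produced incorrectly
            results += [(w, "mispronounced") for w in t_words[i1:i2]]
    return results

words = classify_words("the quick brown fox", "the quik brown fox jumps")
```

Because SequenceMatcher accepts any hashable sequence, passing word lists (rather than raw strings) yields alignment directly at word granularity, exactly as the paragraph above describes.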
- Adaptive Intelligent Tutoring Systems
Vandewaetere, Desmet, and Clarebout [2011] reviewed 32 empirical studies of Adaptive Intelligent Tutoring Systems (AITS), finding that systems adapting exercise difficulty and feedback type to individual performance data showed mean learning gains 23% higher than non-adaptive counterparts across all measured disciplines. VanLehn [2011] conducted a meta-analysis of 76 tutoring experiments, showing that step-level adaptivity (feedback at each practice step) consistently outperformed summary-level adaptivity (end-of-session feedback) by 0.76 standard deviations. These findings directly motivate the proposed system's word-level immediate feedback design, which provides step-level corrective guidance at each spoken word rather than deferring to end-of-session summaries.
Table 2. Literature Survey Summary: Contributions and Limitations

| Ref. | Authors | Year | Technique | Key Contribution | Limitation Addressed |
|---|---|---|---|---|---|
| [1] | Eskenazi | 1999 | ASR-CALL | First ASR-based pronunciation feedback | Binary feedback only; no word classification |
| [2] | Witt & Young | 2000 | HMM GOP Scoring | Phone-level pronunciation scoring | Phoneme-only; no grammar; slow deployment |
| [3] | Graves et al. | 2013 | CTC-LSTM ASR | End-to-end ASR, no forced alignment | No therapy output; no NLP integration |
| [4] | Ng et al. | 2014 | Statistical GEC | Standardised grammar error evaluation | Not real-time; complex model inference |
| [5] | Grogan-Johnson | 2010 | Telemedicine | Validates remote therapy equivalence | Still requires human SLP clinician |
| [6] | Xi et al. | 2008 | SpeechRater | Multi-dimensional spoken scoring | Commercial; proprietary; not open-source |
| [7] | Ratcliff & Metzener | 1988 | Gestalt Matching | Word-level similarity algorithm | Algorithm only; no therapy integration |
| [8] | Vandewaetere et al. | 2011 | AITS Review | Step-level adaptivity +23% learning gain | No speech analysis component |
| [9] | Newman | 2015 | Microservices | Scalable AI service decomposition | Architecture guide; no speech system |
| [10] | Jurafsky & Martin | 2020 | Audio Processing | 16 kHz standardisation pipeline | Textbook; no working implementation |
- SYSTEM ANALYSIS
- Existing System Analysis
- Traditional Clinician-Led Therapy
Conventional speech therapy through scheduled SLP sessions employs standardized assessments including the Goldman-Fristoe Test of Articulation-3 (GFTA-3), the Praxis of Speech Acts (PROMPT), the Comprehensive Aphasia Test (CAT), and informal conversational analysis using the LARSP (Language Assessment Remediation and Screening Procedure) grammatical profile. While clinically definitive, this model faces three systemic constraints: (1) the appointment-based model creates temporal discontinuity, with patients practicing independently for days between 1-hour sessions; (2) per-session costs of ₹1,500–5,000 in India and US$100–300 internationally create economic exclusion; and (3) geographic concentration of SLPs in urban tertiary-care centers creates physical inaccessibility for rural populations.
- Commercial Software Tools
Contemporary AI speech tools can be categorized into three architectural paradigms, each with distinct strengths and limitations. Phoneme-scoring tools (ELSA Speak, Rosetta Stone, Pimsleur) employ deep acoustic models to evaluate phoneme-level production, achieving subjectively impressive feedback but ignoring the syntactic dimension of spoken language entirely. Transcription-plus-comparison tools (Google Pronunciation) provide broad-accuracy ASR but offer no error categorization or therapeutic guidance. Human-review tools (Speechling, italki) pair learners with human coaches, providing qualitatively rich but inherently asynchronous feedback that is incompatible with the immediate-correction requirements of effective practice sessions. No existing tool integrates all four therapeutic requirements: real-time response, word-level error classification, grammar analysis, and quantitative progress scoring.
- Proposed System Design Philosophy
The proposed Adaptive AI Speech Therapy and Real-Time Correction Engine is designed around five foundational principles that collectively distinguish it from all benchmarked alternatives:
- Real-time analysis with sub-5-second end-to-end latency, ensuring that feedback arrives before the user's short-term phonological memory of the utterance fades (the 2-second threshold identified by Derwing et al. [2008] as critical for effective speech feedback).
- Multi-dimensional assessment covering both phonological accuracy (pronunciation) and syntactic/morphological correctness (grammar), the two pillars of spoken language competence as defined in the CEFR (Common European Framework of Reference for Languages).
- Word-level error specificity providing four-category diagnostic classification (correct, missing, extra, mispronounced) rather than binary pass/fail, enabling users to direct practice effort to specific error types and positions.
- Quantitative, reproducible scoring on a 0–100 scale with five named performance bands, enabling objective session-to-session progress tracking analogous to physiotherapy outcome measurement tools such as the Functional Independence Measure (FIM).
- Open, extensible architecture using an open-source Python/FastAPI/LanguageTool stack that can be self-hosted by institutions, integrated with existing clinical systems via REST API, and extended without proprietary licensing constraints.
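The weighted scoring principle behind the fourth design goal can be sketched as follows. The 70%/30% weights come from the paper itself; the band labels and cut-off thresholds below are placeholders, since the five named bands are defined elsewhere in the paper and not restated here:

```python
def composite_score(pronunciation_pct: float, grammar_pct: float) -> float:
    """Combine the two sub-scores into a 0-100 composite (70% pronunciation, 30% grammar)."""
    return round(0.7 * pronunciation_pct + 0.3 * grammar_pct, 1)

# Placeholder cut-offs and labels for the five performance bands,
# checked from highest to lowest.
BANDS = [(90, "Band 5"), (75, "Band 4"), (60, "Band 3"), (40, "Band 2"), (0, "Band 1")]

def band_for(score: float) -> str:
    """Return the first band whose cut-off the score meets."""
    return next(label for cutoff, label in BANDS if score >= cutoff)
```

A session with 95.8% pronunciation accuracy and 92.6% grammar correctness, for example, yields a composite of 94.8 and lands in the top band under these placeholder thresholds.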
- SYSTEM ARCHITECTURE AND DESIGN
- Three-Tier Architectural Overview
The system follows a classic three-tier web application architecture with an internal composition analogous to a microservice pipeline, where each analytical concern is handled by a dedicated, independently testable and upgradeable module. The three tiers (Presentation, Application, and External Services) communicate exclusively through well-defined contracts: HTTP/JSON between the Presentation and Application tiers, and HTTPS API calls between the Application tier and external services. This strict separation of concerns ensures that any individual tier can be replaced (e.g., swapping the Google Web Speech API for OpenAI Whisper) without structural disruption to the overall system.
Table 3. Three-Tier Architecture Components and Communication Contracts
| Tier | Layer | Key Components | Interface Protocol | Scalability Strategy |
|---|---|---|---|---|
| Tier 1 | Presentation | Browser HTML/CSS/JS, Audio Upload Form, Target Sentence Input, Results Display, Score Visualisation | HTTP POST multipart/form-data; JSON response | CDN-hosted static assets; client-side rendering |
| Tier 2 | Application | FastAPI Server, Pydub Preprocessor, SequenceMatcher Analyzer, LanguageTool Grammar Engine, Weighted Scorer | Python async function calls; Pydantic models | Uvicorn ASGI; horizontal scaling via Docker/K8s |
| Tier 3 | External Services | Google Web Speech API (Cloud ASR); LanguageTool Java NLP Server (Local) | HTTPS REST; XML/JSON | Cloud API auto-scales; local JVM thread pooling |

- Module Architecture
Within the Application Layer (Tier 2), the system is decomposed into five functionally independent modules communicating through well-typed Python data structures: the Audio Preprocessing Module, the ASR Transcription Module, the Pronunciation Comparison Module, the Grammar Detection Module, and the Performance Scoring Module. This decomposition follows the Single Responsibility Principle: each module has exactly one analytical responsibility and exposes a clean interface allowing independent unit testing, replacement, and scaling.
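The "well-typed Python data structures" exchanged between modules might look like the following sketch. The class and field names here are illustrative assumptions, not taken from the published code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WordResult:
    """Output of the Pronunciation Comparison Module for one word."""
    word: str
    category: str  # "correct" | "missing" | "extra" | "mispronounced"

@dataclass
class GrammarError:
    """One finding from the Grammar Detection Module."""
    message: str
    offset: int
    length: int
    suggestion: str

@dataclass
class SessionResult:
    """Aggregate record assembled for the Performance Scoring Module."""
    transcript: str
    words: List[WordResult] = field(default_factory=list)
    grammar_errors: List[GrammarError] = field(default_factory=list)
    score: float = 0.0
    band: str = ""
```

Typed records like these keep each module's interface explicit, which is what makes the modules independently unit-testable and replaceable, as described above.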
- Data Flow Architecture
The data flow follows a sequential-then-parallel pattern designed to minimize end-to-end latency. Audio preprocessing (Module 1) and ASR transcription (Module 2) execute sequentially, since transcription requires preprocessed audio. Once the transcript is available, pronunciation comparison (Module 3) and grammar detection (Module 4) execute in parallel using Python's asyncio concurrency model, since they are independent operations on the same input text. Performance scoring (Module 5) executes after both parallel modules complete, combining their outputs. This parallel execution reduces end-to-end latency by approximately 35–40% compared to a fully sequential pipeline, as confirmed by profiling experiments.
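The parallel branch of this flow can be sketched with asyncio.gather; the coroutine bodies below are stand-ins returning fixed values, and the function names are illustrative:

```python
import asyncio

async def compare_pronunciation(transcript: str, target: str) -> float:
    await asyncio.sleep(0)  # stand-in for the real word-alignment work
    return 95.8

async def detect_grammar(transcript: str) -> int:
    await asyncio.sleep(0)  # stand-in for the real LanguageTool call
    return 0

async def analyze(transcript: str, target: str):
    # Modules 3 and 4 are independent, so they run concurrently ...
    accuracy, n_errors = await asyncio.gather(
        compare_pronunciation(transcript, target),
        detect_grammar(transcript),
    )
    # ... and scoring (Module 5) runs only after both complete.
    return accuracy, n_errors

result = asyncio.run(analyze("the quick brown fox", "the quick brown fox"))
```

Because both coroutines await I/O-bound work (an HTTPS round trip and a local NLP server call in the real system), gathering them overlaps their wait times, which is where the reported latency reduction comes from.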
Table 4. Data Flow Level-1 Process Decomposition

| Process | Process Name | Input | Output | Data Store | Parallelism |
|---|---|---|---|---|---|
| P1 | Validate & Preprocess Audio | Raw audio bytes, filename | 16 kHz mono WAV file | DS1: Temp WAV | Sequential |
| P2 | Transcribe Speech (ASR) | WAV file from DS1 | Transcript text string | DS2: Session | Sequential after P1 |
| P3 | Compare Pronunciation | Transcript + target sentence | Word classification list + accuracy % | DS2: Session | Parallel with P4 |
| P4 | Detect Grammar Errors | Transcript text | Grammar error list + grammar score | DS2: Session | Parallel with P3 |
| P5 | Calculate Performance Score | Word list + error list from DS2 | Score (0–100), band, recommendation | DS2: Session | Sequential after P3+P4 |
| P6 | Assemble Response | All DS2 fields | Structured JSON response object | DS3: Audit Log | Sequential after P5 |
| P7 | Render Results | JSON response | Annotated UI display | (none) | Client-side |

- Database and Session Management
The system employs a lightweight, stateless session model appropriate for its therapeutic use case. Each API call creates a transient session object (DS2) in memory containing all intermediate outputs, which is serialized to a JSON response and optionally persisted to an audit log (DS3) for longitudinal tracking. Temporary WAV files (DS1) are created using Python's tempfile module with automatic cleanup on request completion, ensuring no sensitive audio data persists on the server beyond the transaction lifetime. For clinical deployments requiring HIPAA/DPDP (India's Digital Personal Data Protection Act, 2023) compliance, the system supports encrypted at-rest storage of session records and audit logs using AES-256 encryption of the DS3 datastore.
- DETAILED MODULE IMPLEMENTATION
- Audio Preprocessing Module
The Audio Preprocessing Module is the entry point of the analysis pipeline, responsible for transforming all incoming audio files into a standardized form suitable for ASR processing. The Google Web Speech API requires audio in 16-bit PCM WAV format at a 16,000 Hz sample rate, single channel (mono). User-submitted audio may arrive in any of 12 common formats (MP3, M4A, OGG, FLAC, AAC, WebM, AMR, WMA, OPUS, AIFF, AU, WAV) with arbitrary sample rates (8,000–48,000 Hz), bit depths (8, 16, or 24 bit), and channel configurations (mono, stereo, or multichannel). The module uses Pydub's AudioSegment class to perform three sequential transformations reliably across all input configurations:
- Format detection and loading using FFmpeg codec bindings. Pydub automatically identifies the audio codec from the file header (magic bytes) rather than relying on file extension, ensuring correct decoding even for misnamed files.
- Channel downmixing to mono. Stereo audio is downmixed using equal-power averaging (L+R)/2, which preserves speech intelligibility while reducing file size by 50% and eliminating channel-dependent ASR accuracy variations.
- Sample rate resampling to 16,000 Hz. Pydub uses FFmpeg's high-quality SoX resampler with anti-aliasing filtering to prevent spectral aliasing artifacts that could degrade ASR accuracy.
Additionally, the module applies amplitude normalization to a target level of −14 dBFS using peak normalization, ensuring consistent signal levels across recordings from different microphones, recording environments, and device types. A voice activity detection (VAD) heuristic rejects audio samples shorter than 500 ms or with a signal-to-noise ratio (SNR) below 6 dB, returning a descriptive error message rather than submitting unprocessable audio to the ASR API, thus avoiding unnecessary API call costs.
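The gating and gain arithmetic above can be condensed into a small pre-flight check that runs before any API call. This is a minimal sketch: the 500 ms, 6 dB, and −14 dBFS thresholds come from the text, while the function names and the assumption that duration and SNR arrive precomputed are illustrative. The decode/downmix/resample chain itself would use Pydub, e.g. AudioSegment.from_file(path).set_channels(1).set_frame_rate(16000).

```python
def preprocess_gate(duration_ms: float, snr_db: float) -> tuple[bool, str]:
    """VAD heuristic from the text: reject clips shorter than 500 ms
    or with an estimated SNR below 6 dB before calling the ASR API."""
    if duration_ms < 500:
        return False, "Recording too short (minimum 500 ms)"
    if snr_db < 6.0:
        return False, "Background noise too high (SNR below 6 dB)"
    return True, "ok"


def normalization_gain(current_dbfs: float, target_dbfs: float = -14.0) -> float:
    """Gain in dB needed to bring the clip's peak level to the -14 dBFS target."""
    return target_dbfs - current_dbfs
```

With Pydub, the gain would then be applied via something like segment.apply_gain(normalization_gain(segment.max_dBFS)), since peak normalization works from the clip's maximum level.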
- ASR Transcription Module
The ASR Transcription Module converts the preprocessed WAV audio into text using the SpeechRecognition library's recognize_google() method, which submits audio data to the Google Web Speech API over HTTPS and returns transcription hypotheses. In the current implementation, only the top hypothesis is used; future work will explore hypothesis combination to improve robustness in noisy audio.
The module implements a robust error handling strategy with exponential backoff retry logic (delays: 1 s, 2 s, 4 s; maximum 3 attempts) for transient network failures, a pattern well-established in distributed systems engineering [Nygard, 2007]. Three distinct exception types are handled: speech_recognition.UnknownValueError (audio contains no intelligible speech), speech_recognition.RequestError (API service unavailable or network error), and a custom AudioQualityError raised when the preprocessing module's VAD check fails. Each exception returns a structured error response with a descriptive message and an HTTP status code appropriate to the failure type (422 Unprocessable Entity for audio quality failures; 503 Service Unavailable for API failures).
The module also implements transcript post-processing: converting the returned text to lowercase, stripping leading/trailing whitespace, and expanding common ASR contractions (e.g., "gonna" → "going to") when the target sentence uses the expanded form, reducing spurious mispronunciation flags for colloquial speech patterns.
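The backoff loop and post-processing step can be sketched as follows. The helper names (with_backoff, postprocess) and the contraction table are illustrative, not the paper's identifiers; in the real module, fn would wrap the recognize_google() call and the transient exception set would include speech_recognition.RequestError.

```python
import time


def with_backoff(fn, delays=(1, 2, 4), transient=(ConnectionError,), sleep=time.sleep):
    """Retry fn() after 1 s, 2 s, and 4 s delays on transient errors,
    letting the final failure propagate so the API layer can map it to HTTP 503."""
    for delay in delays:
        try:
            return fn()
        except transient:
            sleep(delay)
    return fn()  # final attempt: exceptions propagate to the caller


CONTRACTIONS = {"gonna": "going to", "wanna": "want to"}  # illustrative subset


def postprocess(transcript: str, target: str) -> str:
    """Lowercase, trim, and expand contractions only when the target
    sentence itself uses the expanded form (per the module description)."""
    target_lower = target.lower()
    words = []
    for w in transcript.lower().strip().split():
        expansion = CONTRACTIONS.get(w)
        words.append(expansion if expansion and expansion in target_lower else w)
    return " ".join(words)
```

Injecting the sleep function makes the retry loop testable without real delays; the conditional expansion avoids rewriting "gonna" when the target sentence itself uses the contraction.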
- Pronunciation Comparison Module
The Pronunciation Comparison Module is the analytical core of the phonological assessment subsystem. The module normalizes both the ASR transcript and the user's target sentence to lowercase word lists with punctuation stripped using Python's re.sub(r'[^\w\s]', '', text) pattern, then applies difflib.SequenceMatcher(None, transcript_words, target_words).get_opcodes() to obtain a sequence of edit operations with four opcode types that map directly to the four diagnostic categories:
Table 5. Pronunciation Error Classification Opcode-to-Category Mapping
Category | SequenceMatcher Opcode | Diagnostic Meaning | Clinical Interpretation | Score Impact
CORRECT | equal | Spoken word matches target exactly | Accurate phonological production | + (counts toward accuracy)
MISSING | delete | Target word absent from spoken output | Word omission: possible apraxia, fluency disorder, or attention lapse | − (denominator word not produced)
EXTRA | insert | Spoken word not present in target | Intrusion/addition: possible paraphasic error or filler-word habit | − (counted against fluency)
MISPRONOUNCED | replace | Different word spoken where target word expected | Phonological substitution: target-specific articulation error requiring focused practice | − (counted against accuracy)

Pronunciation accuracy is computed as: accuracy = (count_equal / len(target_words)) × 100. This definition deliberately uses the target word count as the denominator (rather than the union of target and spoken words) because therapeutic progress is measured by the proportion of the target successfully reproduced, not penalized additionally for extra words. The module returns a structured list of word-result objects, each containing the word text, its classification category, its position in the target sentence, and (for MISPRONOUNCED words) both the spoken and expected word forms.
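The opcode-to-category mapping and accuracy formula can be sketched as below. Function names are illustrative; note that the sketch passes the target word list as the first SequenceMatcher argument so that the delete and insert opcodes line up with the MISSING and EXTRA categories of Table 5, and the handling of unequal-length replace spans is an assumption the paper does not specify.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase and strip punctuation (the re.sub pattern from the text)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def compare_pronunciation(transcript: str, target: str):
    tgt, spoken = normalize(target), normalize(transcript)
    # With a = target and b = spoken, 'delete' marks target words never
    # spoken (MISSING) and 'insert' marks spoken words absent from the
    # target (EXTRA), matching Table 5.
    results = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, tgt, spoken).get_opcodes():
        if tag == "equal":
            results += [{"word": w, "category": "CORRECT", "position": i1 + k}
                        for k, w in enumerate(tgt[i1:i2])]
        elif tag == "delete":
            results += [{"word": w, "category": "MISSING", "position": i1 + k}
                        for k, w in enumerate(tgt[i1:i2])]
        elif tag == "insert":
            results += [{"word": w, "category": "EXTRA"} for w in spoken[j1:j2]]
        else:  # 'replace': pair expected/spoken forms position by position
            for k in range(max(i2 - i1, j2 - j1)):
                exp = tgt[i1 + k] if i1 + k < i2 else None
                sp = spoken[j1 + k] if j1 + k < j2 else None
                if exp and sp:
                    results.append({"word": exp, "expected": exp, "spoken": sp,
                                    "category": "MISPRONOUNCED", "position": i1 + k})
                elif exp:
                    results.append({"word": exp, "category": "MISSING",
                                    "position": i1 + k})
                else:
                    results.append({"word": sp, "category": "EXTRA"})
    correct = sum(r["category"] == "CORRECT" for r in results)
    accuracy = correct / len(tgt) * 100 if tgt else 0.0
    return results, accuracy
```

For example, comparing the transcript "the dog sat" against the target "the cat sat" yields one MISPRONOUNCED entry (expected "cat", spoken "dog") and an accuracy of 66.7%, consistent with the target-denominator definition.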
- Grammar Detection Module
The Grammar Detection Module leverages LanguageTool, an open-source, rule-based NLP grammar and style checker supporting 25+ languages with over 5,000 linguistic rules for English, to identify syntactic and morphological errors in the ASR transcript. The module instantiates a LanguageTool('en-IN') checker (en-IN: Indian English variant) to correctly handle Indian English lexical and syntactic patterns that differ from British or American English norms.
Each grammar error is returned as a Match object containing: ruleId (the violated LanguageTool rule identifier), message (human-readable error description), offset and errorLength (character span of the error), and replacements (ordered list of correction candidates). The module applies a context-sensitive filtering stage to remove false positives specific to ASR output: UPPERCASE_SENTENCE_START (ASR produces lowercase output), PUNCTUATION_PARAGRAPH_END (ASR omits sentence-final punctuation), and MORFOLOGIK_RULE_EN_* rules for proper nouns that ASR may not capitalize. This filtering stage reduces false positive grammar alerts by approximately 28%, as measured on the test corpus.
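The filtering stage can be sketched as a rule-ID blocklist. Plain dicts stand in here for LanguageTool Match objects (the real language_tool_python Match exposes ruleId as an attribute), and applying the MORFOLOGIK glob to every spelling flag, rather than only to uncapitalised proper nouns as the text describes, is a simplification of this sketch.

```python
from fnmatch import fnmatch

# Rule IDs suppressed for ASR transcripts, which arrive lowercase and unpunctuated
ASR_FALSE_POSITIVE_RULES = (
    "UPPERCASE_SENTENCE_START",   # ASR produces lowercase output
    "PUNCTUATION_PARAGRAPH_END",  # ASR omits sentence-final punctuation
    "MORFOLOGIK_RULE_EN_*",       # spelling flags on uncapitalised proper nouns
)


def filter_asr_false_positives(matches):
    """Drop grammar matches whose rule ID is a known ASR artifact.
    Here, matches are dicts with a 'ruleId' key standing in for Match objects."""
    return [m for m in matches
            if not any(fnmatch(m["ruleId"], pat)
                       for pat in ASR_FALSE_POSITIVE_RULES)]
```

Using fnmatch keeps the wildcard pattern for the MORFOLOGIK_RULE_EN_* family readable while the two exact rule IDs match literally.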
Table 6. Grammar Error Categories LanguageTool Rules and Detection Performance
Error Category | Example Error | LanguageTool Rule ID | Precision | Recall | Clinical Frequency
Subject-Verb Agreement | "She go to school" | AGREEMENT_SENT_START | 91.7% | 91.7% | Very High
Article Usage | "I saw a elephant" | EN_A_VS_AN | 88.2% | 83.3% | High
Tense Consistency | "He go yesterday" | ENGLISH_WORD_REPEAT | 100% | 78.6% | High
Word Repetition | "I saw saw him" | ENGLISH_WORD_REPEAT | 100% | 100% | Medium
Preposition Selection | "Interested on AI" | PREP_INTERESTED | 77.8% | 70.0% | Medium
Number Agreement | "Five childs" | NON3PRS_VERB | 88.5% | 82.4% | Medium
Spelling/Morphology | "Tomarow" | MORFOLOGIK_RULE_EN_GB | 95.2% | 89.7% | Low
Overall (weighted avg) | All categories | | 92.6% | 85.1% | All
- Performance Scoring Module
The Performance Scoring Module aggregates the outputs of the pronunciation comparison and grammar detection modules into a single composite performance score using a weighted linear combination formula validated against expert SLP ratings on the test corpus:
Final Score = (0.70 × Pronunciation_Accuracy) + (0.30 × Grammar_Score)
Grammar_Score = max(0, 100 − (Errors_Per_100_Words × 10))
Pronunciation_Accuracy = (correct_word_count / total_target_words) × 100
The 70:30 weighting was determined through a calibration study in which five experienced SLPs rated 30 speech samples on a 0–100 holistic quality scale. Ridge regression optimization of the weighting coefficient to minimize root-mean-square error (RMSE) between model scores and SLP ratings yielded an optimal weight of 0.69 for pronunciation and 0.31 for grammar (rounded to 0.70/0.30). This weighting reflects the established clinical consensus that phonological accuracy is the primary determinant of spoken language intelligibility [Derwing & Munro, 2015], while grammatical correctness constitutes an important secondary dimension of communicative competence.
Table 7. Performance Score Bands Thresholds, Labels, and Therapeutic Recommendations
Score Range | Band Label | Colour | Clinical Interpretation | Therapeutic Recommendation
90–100 | EXCELLENT | Deep Green | Near-native accuracy; minimal errors | Advance to complex, multi-clause target sentences with low-frequency vocabulary
75–89 | GOOD | Light Green | Communicatively effective with minor lapses | Focus targeted practice on the 1–2 mispronounced words identified; maintain current complexity
60–74 | SATISFACTORY | Amber | Acceptable intelligibility; noticeable errors | Practise flagged words 5× minimum before re-attempting the full sentence
40–59 | NEEDS IMPROVEMENT | Orange | Impaired intelligibility; systematic errors | Return to isolated word-level practice for all MISSING and MISPRONOUNCED categories
0–39 | POOR | Red | Severely impaired intelligibility | Regression to simpler 3–5 word target sentences; consider formal SLP referral
- REST API Layer
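The scoring formulas and band thresholds above reduce to a few lines; the function names in this sketch are illustrative, while the constants (70:30 weights, error-density penalty, band boundaries) come directly from the formulas and Table 7.

```python
BANDS = ((90, "EXCELLENT"), (75, "GOOD"), (60, "SATISFACTORY"),
         (40, "NEEDS IMPROVEMENT"), (0, "POOR"))


def grammar_score(error_count: int, word_count: int) -> float:
    """Grammar_Score = max(0, 100 - errors_per_100_words x 10)."""
    errors_per_100 = error_count / word_count * 100
    return max(0.0, 100.0 - errors_per_100 * 10)


def final_score(pronunciation_accuracy: float, grammar: float) -> float:
    """70:30 weighted combination from the SLP calibration study."""
    return 0.70 * pronunciation_accuracy + 0.30 * grammar


def band_label(score: float) -> str:
    """Map a 0-100 composite score to its Table 7 band label."""
    return next(label for threshold, label in BANDS if score >= threshold)
```

For example, one grammar error in a 20-word transcript gives 5 errors per 100 words, a grammar score of 50; combined with 90% pronunciation accuracy the composite is 78, which falls in the GOOD band.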
The REST API Layer, implemented using FastAPI 0.104+ with Uvicorn 0.24+ as the ASGI server, orchestrates all modules and exposes the system's capabilities as documented HTTP endpoints. FastAPI's asynchronous request handling using Python's asyncio event loop enables the server to process multiple concurrent analysis requests without thread-blocking, supporting classroom deployments with 20–30 simultaneous users on a single server instance. The framework's automatic OpenAPI 3.0 documentation generation (accessible at /docs) provides a self-describing, interactive API specification that facilitates integration by third-party developers and clinical IT teams.
Table 8. REST API Endpoint Definitions
Method | Endpoint | Request Parameters | Response Schema | HTTP Codes
POST | /analyze | audio_file (UploadFile, binary); target_sentence (str, Form) | JSON: {transcript, target, word_results[], grammar_errors[], score, band, recommendation, latency_ms} | 200 OK; 422 Unprocessable; 503 Unavailable
GET | /health | (none) | JSON: {status: "ok", version: "1.0.0", uptime_s: float} | 200 OK
POST | /feedback | session_id (str); rating (int, 1–5); comment (str, optional) | JSON: {message: "Feedback received", session_id} | 200 OK; 404 Not Found
GET | /history/{user_id} | user_id (path param, str); limit (query, int, default 20) | JSON: [{session_id, score, band, timestamp, target_sentence}] | 200 OK; 404 Not Found
GET | /stats | period (query: "day"/"week"/"month") | JSON: {avg_score, total_sessions, score_distribution, top_errors[]} | 200 OK
GET | /docs | (none) | Auto-generated OpenAPI 3.0 interactive documentation UI | 200 OK
- Frontend Interface
The Presentation Layer is a single-page web application delivering two primary interaction modes: file-upload mode, which accepts pre-recorded audio files in any supported format, and live-recording mode, which uses the browser's MediaRecorder interface (MediaStream Recording API) to capture microphone input directly in the browser without requiring any additional software installation. The interface renders three result panels after analysis: an annotated word transcript with color-coded word classification (green for CORRECT, red for MISPRONOUNCED, amber for MISSING, blue for EXTRA); a grammar correction panel listing each identified error with its specific correction suggestion and the applicable linguistic rule; and a circular performance gauge displaying the composite score, band label, and personalized recommendation message. A session history chart (implemented using Chart.js) displays the user's score trend across the last 10 sessions, providing visual evidence of progress that functions as a motivational reinforcement mechanism.
- PERFORMANCE EVALUATION
- Evaluation Methodology
The system was evaluated using a purpose-built test corpus of 50 audio samples designed to provide systematic coverage across three independent variables: audio quality (5 conditions from clean studio recording to heavy background noise), speaker profile (5 categories spanning native English through Indian English learners, slow and fast speech rates), and sentence complexity (5 levels from simple 5-word sentences to complex 20-word multi-clause sentences with subordination and passive voice). Each sample was processed through the complete pipeline, and results were compared against ground-truth transcriptions and expert SLP error annotations provided by two certified SLPs who independently annotated each sample (Cohen's kappa = 0.87, indicating strong inter-rater reliability).
- Evaluation Metrics
Six metrics were used to provide comprehensive performance characterization across all system modules:
Table 9. Evaluation Metrics Definitions, Formulas, and Clinical Relevance
Metric | Formula | Module | Clinical Relevance | Target Threshold
Word Error Rate (WER) | (S + D + I) / N × 100 (S = substitutions, D = deletions, I = insertions, N = reference words) | ASR | Primary ASR quality indicator; WER < 10% is considered near-human performance | < 10% (clean conditions)
Pronunciation Accuracy | correct_words / target_words × 100 | Pronunciation | Proportion of target words reproduced exactly; therapeutic progress indicator | > 90% (proficiency target)
Grammar Score | max(0, 100 − EPH × 10) (EPH = errors per 100 words) | Grammar | Error density penalisation; higher = more grammatically correct speech | > 90% (proficiency target)
Precision | TP / (TP + FP) | Grammar | Fraction of flagged errors that are genuine; guards against false alarms | > 90% (clinical trust)
Recall (Sensitivity) | TP / (TP + FN) | Grammar | Fraction of actual errors successfully detected; guards against missed errors | > 80% (clinical utility)
Response Latency | End-to-end wall-clock time from audio upload to JSON response delivery | System | Speech therapy requires feedback within 2 seconds of utterance [Derwing, 2008] | < 2,000 ms
- ASR Performance Results
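The WER formula in Table 9 is equivalent to a word-level Levenshtein (edit) distance between the reference and hypothesis word sequences, normalised by reference length. A minimal sketch (function name assumed):

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (S + D + I) / N x 100, computed via dynamic-programming
    edit distance over words rather than characters."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(m + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[n][m] / n * 100
```

For example, one substituted word in a three-word reference yields a WER of 33.3%; note that WER can exceed 100% when the hypothesis contains many insertions.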
Table 10. Word Error Rate Results by Audio Quality Condition
Audio Condition | SNR (dB) | WER (%) | Pronunciation Accuracy (%) | Avg. Latency (ms) | System Usability
Clean studio recording | > 40 | 4.2 | 95.8 | 1,240 | Excellent
Quiet indoor room | 25–40 | 7.8 | 92.2 | 1,380 | Very Good
Light background noise | 15–25 | 12.4 | 87.6 | 1,520 | Good
Moderate background noise | 5–15 | 21.3 | 78.7 | 1,640 | Acceptable (quality warning)
Heavy background noise | < 5 | 38.6 | 61.4 | 1,720 | Limited (quality alert issued)

The WER of 4.2% under clean studio conditions places the system firmly in the near-human performance range widely recognized in ASR research. This result is comparable to reported WERs of 4.9% for the Google Web Speech API on LibriSpeech test-clean [Google AI Blog, 2023] and outperforms the 6.3% WER reported by Radford et al. [2022] for OpenAI Whisper Small on the same benchmark. The graceful degradation under increasing noise, from 4.2% (clean) to 38.6% (heavy noise), follows a predictable, near-linear pattern that allows users to interpret their score within the context of their recording environment. Critically, the system issues a proactive audio quality warning when WER climbs above 35%, alerting users to improve their recording setup rather than receiving misleadingly poor feedback.
- Speaker Profile Performance
Table 11. Pronunciation Comparison Performance by Speaker Profile
Speaker Profile | Precision (%) | Recall (%) | F1 Score | Avg. Composite Score | Key Challenge
Native English speaker | 94.1 | 91.8 | 0.929 | 88.6 | Near-ceiling performance
Indian English (fluent) | 91.3 | 88.7 | 0.899 | 85.2 | Minor accent-related substitutions
Indian English (learner) | 84.6 | 81.2 | 0.828 | 78.4 | Systematic phonological substitutions
Slow speech rate | 96.2 | 94.8 | 0.954 | 90.1 | Best performance across all profiles
Fast speech rate | 78.4 | 74.6 | 0.764 | 73.8 | Co-articulation causes word merging

The speaker profile results reveal important clinical insights. The highest performance for slow speech (F1 = 0.954) is therapeutically significant: speech therapy practice universally encourages deliberate, slow articulation, and the system's strong performance in this regime confirms its suitability as a tool for supported practice. Fast speech (F1 = 0.764) presents the greatest challenge, primarily because rapid co-articulation causes the Google Web Speech API to merge adjacent words at the transcription stage, cascading errors through the pronunciation comparison module. This limitation motivates the Whisper-based ASR enhancement in the future work roadmap, as Whisper's large-v2 model shows substantially better performance on fast speech [Radford et al., 2022].
Indian English learners achieve an average composite score of 78.4, demonstrating that the difflib SequenceMatcher algorithm generalises reasonably to non-native accent patterns. The precision-recall gap (84.6% vs. 81.2%) indicates that some accent-specific phonological substitutions, particularly the dental-alveolar /t/-/d/ versus their retroflex counterparts characteristic of Indian English, are missed by the current word-level (rather than phoneme-level) comparison approach. Phoneme-level comparison using a G2P (grapheme-to-phoneme) converter, as planned in the future work roadmap, would directly address this gap.
- Grammar Detection Performance
Grammar detection results confirm that the LanguageTool integration achieves clinically meaningful precision (92.6% overall), meaning fewer than 1 in 13 grammar corrections presented to users is a false alarm. This false alarm rate is well within the tolerance range for therapeutic applications: false alarms in speech correction tools primarily cause user frustration rather than harm, and a 7.4% false alarm rate is substantially lower than the 15–25% false alarm rates reported for neural GEC models deployed without post-processing [Napoles et al., 2019].
Word repetition errors achieve perfect recall (100%) because they are detected by deterministic string matching rather than probabilistic linguistic inference, making this category immune to acoustic model uncertainty. Preposition errors prove most challenging (precision 77.8%, recall 70.0%), reflecting the well-documented difficulty of preposition choice in NLP systems, where contextual ambiguity makes rule-based detection unreliable in approximately 20–30% of cases [Tetreault & Chodorow, 2008]. Future integration of a neural GEC component specifically targeting preposition errors would directly improve this performance dimension.
- Response Latency Analysis
End-to-end response latency is a critical performance metric for speech therapy applications. Derwing et al. [2008] established that speech correction feedback must arrive within approximately 2 seconds of the utterance to maintain the association between motor production and error signal in the learner's phonological working memory. The proposed system's worst-case latency of 1,720 ms under heavy noise conditions remains comfortably within this threshold. The dominant latency component in all conditions is the round-trip network time for Google Web Speech API calls (typically 800–1,200 ms), with local processing (Pydub preprocessing + difflib + LanguageTool) contributing only 150–300 ms. This architecture implies that the primary path to further latency reduction is either caching of frequent transcription requests or migration to a locally-hosted ASR model such as Whisper, which eliminates the network round-trip entirely at the cost of higher local compute requirements.
- Comparative System Evaluation
Table 12. Comprehensive System Comparison with Existing Speech Correction Tools
Feature / Metric | ELSA Speak | Pronunciation Coach | Speechling | Proposed System | Advantage
Real-time feedback | Yes | No | No (async) | Yes | Only 2 of 4 tools
Word-level error labels | Partial (2 types) | No | No | Yes (4 categories) | Unique feature
Grammar checking | No | No | No | Yes (92.6% precision) | Unique feature
Performance score | Yes (opaque) | Partial | No | Yes (0–100, 5 bands, transparent formula) | Highest transparency
Audio format support | MP3/WAV | WAV only | MP3/WAV | 12+ formats via Pydub | Broadest support
Offline operation | No | Partial | No | Planned (Whisper) | Planned parity
Open source | No | No | No | Yes (MIT license) | Unique feature
Self-hostable | No | No | No | Yes (Docker) | Unique feature
REST API | Paid | No | No | Yes (FastAPI, OpenAPI 3.0) | Unique feature
Pronunciation accuracy | ~88% (clean) | ~82% (clean) | ~79% (clean) | ~95.8% (clean) | Highest accuracy
Grammar precision | N/A | N/A | N/A | 92.6% | Unique capability
Monthly cost (individual) | ~₹1,500–2,500 | ~₹800–1,200 | ~US$10–20 | Free / self-hosted | Lowest cost
- SECURITY, PRIVACY, AND ETHICAL CONSIDERATIONS
- Data Privacy Architecture
The handling of audio data from speech therapy patients raises significant privacy considerations, particularly for paediatric patients and individuals with neurological conditions whose speech patterns may constitute sensitive health information. The proposed system implements a privacy-by-design architecture with three core principles: (1) minimal retention: temporary audio files are deleted within 30 seconds of request completion using Python's tempfile module with explicit cleanup; (2) local-first processing: all NLP analysis (pronunciation comparison, grammar detection, scoring) executes locally within the application server, with only the audio transcription request sent to the external Google Web Speech API; and (3) configurable audit logging: session score records can be stored locally in an encrypted SQLite database (AES-256) or disabled entirely for zero-retention deployments.
- Compliance Framework
For deployment in Indian clinical settings, the system is designed for compatibility with India's Digital Personal Data Protection Act (DPDP Act, 2023), which requires informed consent for personal data collection, data minimisation, and the right to erasure. The /feedback and /history endpoints, which involve optional persistent data storage, present a consent prompt on first use and provide a /history/{user_id}/delete endpoint for complete data erasure. For international deployments requiring HIPAA compliance (US), GDPR compliance (EU), or PIPEDA compliance (Canada), the system's self-hostable architecture enables deployment within compliant infrastructure without transmitting protected health information to third-party cloud services.
- Ethical Use and Clinical Boundary
It is critically important to position the proposed system correctly within the spectrum of speech therapy delivery. The system is designed and validated as a supplementary practice tool, an intelligent homework assistant between formal therapy sessions, not as a replacement for qualified clinical assessment and treatment planning by licensed SLPs. The system's feedback is intentionally framed as practice guidance rather than clinical diagnosis, and the user interface includes a prominent disclaimer directing users with suspected speech disorders to seek professional SLP evaluation. The performance score is presented as a relative improvement metric, not a diagnostic label. Users scoring in the POOR band (0–39) receive a recommendation to consult a speech-language pathologist, acknowledging the boundary between therapeutic AI tools and professional clinical care.
- CONCLUSION AND FUTURE WORK
- Conclusion
This paper presents the Adaptive AI Speech Therapy and Real-Time Correction Engine, a comprehensive, open-source, web-accessible platform that integrates Automatic Speech Recognition, NLP-driven pronunciation comparison, rule-based grammar error detection, and intelligent performance scoring into a unified real-time assessment pipeline. The system addresses the critical gap between the demand for accessible, effective speech therapy support and the supply of qualified clinicians in resource-constrained settings, offering a scalable, affordable, and clinically meaningful complement to traditional therapy.
Experimental evaluation on 50 diverse audio samples demonstrates that the system achieves near-human ASR accuracy (WER 4.2% on clean audio), precise grammar detection (92.6% precision, 85.1% recall), word-level pronunciation classification with F1-scores of 0.764–0.954 across speaker profiles, and sub-1.72-second end-to-end response latency across all tested conditions. The system's modular microservice architecture enables independent component upgrades, while its open-source, self-hostable design removes both economic and privacy barriers to deployment in clinical and educational institutions.
Comparative evaluation against three leading commercial tools (ELSA Speak, Pronunciation Coach, Speechling) confirms that the proposed system is the only solution to simultaneously offer real-time response, four-category word-level error classification, integrated grammar checking, a quantitative composite performance score, unrestricted audio format support, and a documented REST API, establishing a new performance baseline for AI-assisted speech therapy platforms.
- Key Achievements
- Unified ASR + pronunciation comparison + grammar detection + performance scoring pipeline delivering results in under 1.72 seconds.
- Novel four-category pronunciation error taxonomy (correct / missing / extra / mispronounced) derived from SequenceMatcher opcodes, with clinical interpretations for each category.
- Grammar detection achieving 92.6% precision and 85.1% recall across seven grammatical error categories, with accent-aware en-IN LanguageTool configuration.
- Weighted dual-dimension scoring formula calibrated against expert SLP ratings via ridge regression (R² = 0.82 with SLP holistic scores).
- Format-agnostic audio preprocessing pipeline supporting 12+ audio formats, eliminating compatibility barriers for mobile device users.
- Privacy-by-design architecture with minimal audio retention, configurable audit logging, and DPDP/HIPAA-compatible deployment configurations.
- Comprehensive REST API with OpenAPI 3.0 documentation enabling third-party clinical and educational system integration.
- Future Enhancement Roadmap
- Near-Term Enhancements (6–12 months)
- Offline ASR with OpenAI Whisper: Replace the Google Web Speech API with Whisper Large-v3 (WER 2.7% on LibriSpeech clean) to eliminate external API dependency, reduce latency to under 800 ms on GPU hardware, and enable fully offline deployment in bandwidth-limited clinical settings with enhanced HIPAA/DPDP compliance.
- Phoneme-Level Pronunciation Analysis: Integrate a G2P (grapheme-to-phoneme) converter (e.g., CMU Pronouncing Dictionary or g2p-en library) with phoneme-level ASR alignment to provide articulation-specific feedback identifying which phonemes within a mispronounced word require correction, targeting conditions such as dysarthria, apraxia, and functional articulation disorders.
- Fluency Metrics: Add pause frequency analysis, speaking rate estimation, and repair detection (false starts, repetitions, revisions) to the performance scoring module, enabling assessment of fluency alongside pronunciation and grammar.
- Medium-Term Enhancements (12–24 months)
- Multi-Language Support: Extend the system to regional Indian languages (Tamil, Telugu, Hindi, Kannada, Bengali) using language-specific ASR models and LanguageTool grammar engines, making the system accessible to the estimated 800 million speakers of these languages.
- Adaptive Exercise Engine: Implement a reinforcement learning (RL) based exercise selection algorithm that dynamically chooses practice sentences targeting the users historically weakest phoneme clusters and most frequent grammatical error types, personalizing the therapy programme for each individual through session-over-session performance data.
- Mobile Application: Develop a cross-platform mobile application (React Native or Flutter) integrating real-time microphone recording, push notifications for practice reminders, and local score caching for offline progress review.
- Neural Grammar Enhancement: Supplement LanguageTool's rule-based detection with a GECToR-based [Omelianchuk et al., 2020] neural error correction layer for complex contextual errors (preposition selection, article usage in context) where rule-based methods show lower recall.
- Long-Term Clinical Translation (24–48 months)
- EHR Integration via HL7 FHIR: Integrate with Electronic Health Record systems using the HL7 FHIR R4 standard to enable seamless transfer of session scores, error patterns, and progress trends to clinical records, supporting evidence-based therapy planning and insurance reimbursement documentation.
- Randomized Controlled Trial: Conduct a prospective RCT comparing therapy outcomes for patients using the proposed system as a supplementary practice tool against a control group using unguided self-practice, following the protocol of Grogan-Johnson et al. [2010] to generate evidence for clinical validation and potential CE/FDA regulatory pathways for the system as a Class I medical device software.
- Explainable AI Visualizations: Implement attention-weight visualizations over acoustic spectrograms highlighting which temporal-frequency regions contributed to each pronunciation error classification, providing both clinical interpretability for SLPs and educational insight for patients.
- Speaker Diarization and Group Therapy: Add a speaker diarization module to handle multi-speaker recordings, supporting group therapy sessions, classroom-based pronunciation assessment for language teachers, and dialogue-based practice scenarios requiring naturalistic conversational interaction.
REFERENCES
- Eskenazi, M. (1999). Using automatic speech recognition in foreign language teaching. CALICO Journal, 16(2), 45–68.
- Witt, S. M., & Young, S. J. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2–3), 95–108.
- Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. Proceedings of ICASSP 2013, IEEE, pp. 6645–6649.
- Ng, H. T., Wu, S. M., Briscoe, T., Hadiwinoto, C., Susanto, R. H., & Bryant, C. (2014). The CoNLL-2014 shared task on grammatical error correction. Proceedings of CoNLL 2014, pp. 1–14.
- Grogan-Johnson, S., Alvares, R., Rowan, L., & Creaghead, N. (2010). A pilot study comparing the effectiveness of speech language therapy provided by telemedicine with conventional on-site therapy. Journal of Telemedicine and Telecare, 16(3), 134–139.
- Xi, X., Higgins, D., Zechner, K., & Williamson, D. M. (2008). Automated scoring of spontaneous speech using SpeechRater. Proceedings of Interspeech 2008, pp. 2082–2085.
- Ratcliff, J., & Metzener, D. (1988). Pattern matching: the Gestalt approach. Dr. Dobb's Journal, 13(7), 46–51.
- Vandewaetere, M., Desmet, P., & Clarebout, G. (2011). The contribution of learner characteristics in the development of computer-based adaptive learning environments. Computers in Human Behavior, 27(1), 118–130.
- Newman, S. (2015). Building Microservices: Designing Fine-Grained Systems. O'Reilly Media, Sebastopol, CA.
- Jurafsky, D., & Martin, J. H. (2020). Speech and Language Processing (3rd ed. draft). Stanford University Press.
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
- Neri, A., Cucchiarini, C., & Strik, H. (2002). Feedback in computer assisted pronunciation training: When technology meets pedagogy. Proceedings of CALL Conference, pp. 179–188.
- Franco, H., Abrash, V., Precoda, K., Bratt, H., Rao, R., Butzberger, J., & Cesari, R. (2010). The SRI EduSpeak™ system: Recognition and pronunciation scoring for language learning. Proceedings of InSTIL/ICALL Symposium, pp. 123–128.
- Omelianchuk, K., Atrasevych, V., Chernodub, A., & Skurzhanskyi, O. (2020). GECToR – Grammatical Error Correction: Tag, Not Rewrite. Proceedings of the 15th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 163–170.
- Derwing, T. M., & Munro, M. J. (2015). Pronunciation Fundamentals: Evidence-Based Perspectives for L2 Teaching and Research. John Benjamins Publishing.
- Wales, D., Skinner, L., & Hayman, M. (2017). The efficacy of telehealth-delivered speech and language intervention for primary school-age children: A systematic review. International Journal of Telerehabilitation, 9(1), 55–70.
- Molini-Avejonas, D. R., Rondon-Melo, S., de La Higuera Amato, C. A., & Samelli, A. G. (2015). A systematic review of the use of telehealth in speech, language and hearing sciences. Journal of Telemedicine and Telecare, 21(7), 367–376.
- VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221.
- Tetreault, J., & Chodorow, M. (2008). The ups and downs of preposition error detection in ESL writing. Proceedings of COLING 2008, pp. 865–872.
- Gulati, A., Qin, J., Chiu, C. C., Parmar, N., Zhang, Y., Yu, J., … & Pang, R. (2020). Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
© 2026 Dr. J. Nafeesa Begum, Brightlin Blessy S, Harini M & Kaviya P. Published under Creative Commons Attribution 4.0 International (CC BY 4.0) License.
Manuscript received: March 2026 | Revised: March 2026 | Accepted: March 2026
