DOI : https://doi.org/10.5281/zenodo.19369793
- Open Access

- Authors : Dr. J. Nafeesa Begum, Brightlin Blessy S, Harini M, Kaviya P
- Paper ID : IJERTV15IS031285
- Volume & Issue : Volume 15, Issue 03, March 2026
- Published (First Online): 01-04-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Adaptive AI Speech Therapy and Real-Time Correction Engine
A Unified NLP-Driven Architecture for Pronunciation Assessment, Grammar Correction, and Quantitative Performance Scoring
Dr. J. Nafeesa Begum, Brightlin Blessy S, Harini M, Kaviya P
Department of Computer Science and Engineering,
Government College of Engineering, Bargur 635104, Krishnagiri, Tamil Nadu, India; Anna University, Chennai 600 025
Abstract – Speech disorders affect over 1.5 billion people globally, yet access to timely, affordable, and expert speech therapy remains severely limited, particularly in developing nations. This paper presents the Adaptive AI Speech Therapy and Real-Time Correction Engine, a fully automated, web-accessible intelligent platform that bridges the gap between traditional clinical speech therapy and scalable, technology-mediated language rehabilitation. The system integrates four synergistic modules: (1) Automatic Speech Recognition (ASR) via the Google Web Speech API, achieving a Word Error Rate (WER) of 4.2% under clean conditions; (2) word-level pronunciation comparison using Python's difflib SequenceMatcher, classifying each spoken word into one of four diagnostic categories (correct, missing, extra, or mispronounced); (3) grammar error detection via LanguageTool NLP, attaining 92.6% precision and 85.1% recall across five grammatical error categories; and (4) a weighted performance scoring algorithm combining pronunciation accuracy (70%) and grammar correctness (30%) into a clinically interpretable 0–100 composite score with five named performance bands. Implemented on a high-throughput asynchronous FastAPI backend with Pydub-driven, format-agnostic audio preprocessing (supporting MP3, WAV, M4A, OGG, FLAC, and WebM), the system delivers end-to-end feedback within 1.24–1.72 seconds. Experimental evaluation across 50 diverse audio samples, spanning five speaker profiles, five audio quality conditions, and five sentence complexity levels, demonstrates pronunciation accuracy of 95.8% on clean audio, F1-scores ranging from 0.764 to 0.954 across speaker profiles, and grammar detection performance surpassing all benchmarked commercial alternatives. The system's open-source architecture, self-hostable deployment model, and documented REST API collectively position it as a clinically viable, equitable, and extensible supplement to traditional speech therapy.
Keywords: Automatic Speech Recognition (ASR); Natural Language Processing (NLP); Speech Therapy Automation; Pronunciation Error Classification; Grammar Correction; FastAPI Microservices; LanguageTool; difflib SequenceMatcher; Real-Time Feedback; AI-Assisted Language Learning; Web-Based Rehabilitation; Word Error Rate (WER)
- INTRODUCTION
Spoken language is the primary medium through which humans build relationships, access education, and participate in professional life. Yet for millions of people worldwide, speech disorders represent a profound barrier to communication. According to the World Health Organization [WHO, 2023], over 1.5 billion people live with some degree of communication impairment, with speech and language disorders accounting for a significant and growing proportion of this burden. In low- and middle-income countries, the gap between the demand for speech-language therapy and the availability of qualified Speech-Language Pathologists (SLPs) is particularly acute: the WHO estimates a global shortage of over 2.5 million allied health professionals, of whom SLPs form a critical but underrepresented subset.
Traditional speech therapy is conducted through scheduled, in-person sessions in clinical settings. An SLP assesses the patient using standardized instruments such as the Goldman-Fristoe Test of Articulation, the Praxis of Speech Acts, the Stuttering Severity Instrument (SSI-4), and informal conversational analysis. While clinically rigorous, this model carries three systemic constraints that severely limit its scalability: (1) geographic inaccessibility, as qualified therapists are disproportionately concentrated in urban tertiary-care centers; (2) economic barriers, with per-session costs ranging from ₹1,500 to ₹5,000 in India and US$100–300 in developed economies; and (3) temporal discontinuity, since patients practice independently between sessions without objective guidance, frequently reinforcing errors rather than correcting them.
The convergence of Artificial Intelligence (AI), Natural Language Processing (NLP), and cloud-based Automatic Speech Recognition (ASR) offers a transformative pathway to address these constraints. Modern ASR systems, including the Google Web Speech API, OpenAI Whisper, and Mozilla DeepSpeech, have achieved near-human transcription accuracy on English speech benchmarks [Radford et al., 2022]. When combined with NLP-driven error analysis and intelligent scoring, these technologies can replicate many of the analytical capabilities of a trained SLP within an automated, scalable, and accessible software framework.
However, the mere availability of these technologies does not automatically produce a therapeutically effective system. A clinically meaningful speech correction platform must satisfy four requirements that current commercial tools collectively fail to meet: (1) real-time, word-level identification of phonological errors with specific category labels; (2) simultaneous assessment of grammatical correctness alongside pronunciation accuracy; (3) a quantitative, reproducible performance metric enabling longitudinal progress tracking; and (4) open, extensible architecture deployable without expensive licensing in resource-constrained clinical settings.
This paper makes the following original contributions to the field of AI-assisted speech therapy:
- A unified, modular NLP pipeline integrating ASR transcription, word-level pronunciation comparison using difflib SequenceMatcher, rule-based grammar error detection via LanguageTool, and a dual-dimension performance scoring algorithm into a single, cohesive web service.
- A four-category pronunciation error taxonomy (correct, missing, extra, mispronounced) derived from sequence alignment opcodes, providing clinically actionable diagnostic specificity not present in any benchmarked commercial tool.
- Empirical evaluation of the system across 50 audio samples spanning five audio quality conditions, five speaker profiles, and five sentence complexity levels, establishing performance baselines for future research.
- A fully documented, open-source FastAPI REST API enabling integration with clinical management systems, mobile applications, and educational platforms.
- Demonstrated superiority over ELSA Speak, Pronunciation Coach, and Speechling across eight comparative dimensions, with particular differentiation in grammar detection capability (92.6% precision vs. N/A for all competitors).
- A FastAPI-powered RESTful backend with asynchronous request handling that exposes all analytical capabilities and supports concurrent multi-user access.
- A user-friendly web interface that renders annotated transcripts, error highlights, correction suggestions, and performance scores in real time.
Table 1. Feature Comparison of Speech Correction Approaches

| Feature | Manual SLP Therapy | ELSA Speak | Pronunciation Coach | Speechling | Proposed System |
|---|---|---|---|---|---|
| Real-Time Feedback | Yes | Yes | No | No (async) | Yes (<1.72 s) |
| Word-Level Error Labels | Yes | Partial | No | No | Yes (4 classes) |
| Grammar Analysis | Yes | No | No | No | Yes (LanguageTool) |
| Performance Score | Yes | Yes | Partial | No | 0–100, 5 bands |
| Format-Agnostic Audio | Yes | — | WAV only | MP3/WAV | Any (Pydub) |
| Offline Operation | Yes | No | Partial | No | Planned (Whisper) |
| Open-Source | N/A | No | No | No | Yes |
| REST API | N/A | Paid | No | No | Yes (FastAPI) |
| Pronunciation Accuracy | SLP Eval | ~88% | ~82% | ~79% | ~95.8% (clean) |
| Grammar Precision | High | N/A | N/A | N/A | 92.6% |
| Cost per Session | High | Subscription | Medium | Subscription | Low / Free |

- BACKGROUND AND MOTIVATION
- The Global Speech Therapy Access Crisis
The global shortage of speech-language pathologists is not merely a logistical inconvenience but a public health crisis with measurable consequences. In India, the patient-to-SLP ratio is approximately 1:2,00,000, compared to the WHO-recommended ratio of 1:10,000, leaving the vast majority of the estimated 50 million Indians with speech and language disorders without access to professional care [ASHA India, 2022]. In sub-Saharan Africa, the ratio is even more extreme, at approximately 1:10,00,000 [Mulwafu et al., 2016]. Even in the United States, the American Speech-Language-Hearing Association (ASHA) projects a shortage of 25,000 SLPs by 2030 as the aging population drives increasing demand for communication rehabilitation services.
The economic burden is equally significant. A standard course of speech therapy in India typically requires 20–40 sessions at a cost of ₹1,500–5,000 per session, placing the total cost (₹30,000–2,00,000) far beyond the reach of most Indian families. In the United States, without insurance coverage, similar courses cost US$3,000–10,000. The result is that speech disorders disproportionately affect outcomes for economically disadvantaged populations, creating a self-reinforcing cycle of limited communication skills and reduced economic mobility.
- Limitations of Current AI-Based Speech Tools
Commercial AI speech tools have made partial inroads into this access gap, but each suffers from architectural limitations that constrain their clinical utility. ELSA Speak, perhaps the most prominent AI pronunciation coach, provides real-time phoneme-level feedback using a proprietary deep learning model, but restricts its assessment to pronunciation only, leaving the equally important dimension of grammatical correctness entirely unaddressed. Its subscription pricing model (₹1,500–2,500 per month) places it beyond the reach of exactly the populations that would benefit most from scalable speech therapy. Speechling uses a human-review model in which language coaches provide asynchronous feedback, which, while qualitatively rich, is fundamentally incompatible with the real-time correction requirements of effective speech practice. Pronunciation Coach, while offering offline capability, accepts only WAV format input, creating compatibility barriers for the majority of users who record audio on mobile devices producing M4A, AAC, or OGG files.
More fundamentally, no existing commercial tool simultaneously addresses the two core dimensions of spoken language competence: phonological accuracy (pronunciation) and syntactic/morphological correctness (grammar). Speech therapy patients, particularly those recovering from neurological events such as stroke or traumatic brain injury, require assessment of both dimensions, as their deficits often span phonological production and syntactic generation simultaneously [Goodglass & Kaplan, 1983]. The proposed system is the first to integrate both dimensions in a unified, automated, real-time assessment pipeline.
- Why This System, Why Now
Three technological developments converging in the mid-2020s make the proposed system feasible in a way that was not achievable even five years ago. First, cloud-based ASR APIs have crossed the near-human accuracy threshold: the Google Web Speech API now achieves WERs below 5% on clean English speech [Google AI, 2023], a level of accuracy sufficient to support reliable downstream linguistic analysis. Second, rule-based NLP engines such as LanguageTool have matured to provide over 5,000 linguistic rules for English with precision exceeding 90% on common error types, making them viable as production-grade grammar checkers. Third, the emergence of high-performance asynchronous Python web frameworks, specifically FastAPI, which benchmarks consistently show to be comparable in throughput to Node.js, enables the construction of responsive, concurrency-capable NLP services without specialized infrastructure.
- LITERATURE REVIEW
This review synthesizes research across five domains directly relevant to the proposed system: ASR-based language learning, pronunciation error detection, grammar correction, web-based and telemedicine therapy delivery, and performance scoring systems. The review is structured to demonstrate how each body of prior work motivates specific design decisions in the proposed architecture.
- Automatic Speech Recognition in Language Learning
The integration of ASR into Computer-Assisted Language Learning (CALL) systems was first systematically explored by Eskenazi [1999], who demonstrated that ASR-enabled real-time pronunciation feedback produced measurable improvement in non-native speakers' accuracy compared to unaided practice. Subsequent work by Neri et al. [2002] and Cucchiarini et al. [2009] corroborated these findings, with Cucchiarini et al.'s large-scale study of Dutch learners showing that CALL systems with ASR feedback produced 20% greater pronunciation gains per hour of practice than equivalent human-tutored sessions. The key limitation of early systems was binary pass/fail feedback without error-type specificity, a gap the proposed system addresses through its four-category classification taxonomy.
More recent work has explored the use of neural ASR systems in therapeutic contexts. Shahamiri and Salim [2014] demonstrated that Microsoft SAPI-based ASR could achieve 89.3% word recognition accuracy for individuals with speech sound disorders when combined with acoustic model adaptation. The emergence of transformer-based ASR systems, particularly OpenAI's Whisper [Radford et al., 2022], which achieves a WER of 2.7% on the LibriSpeech clean benchmark, opens substantial opportunities for future enhancement of the proposed system's ASR layer, as identified in the future work roadmap.
- Pronunciation Error Detection and Classification
Witt and Young [2000] developed the seminal phone-level pronunciation scoring system using Hidden Markov Model (HMM) Goodness of Pronunciation (GOP) scores, achieving 78% agreement with human raters on segmental error labeling within the Spoken Language Educator (SLE) system. The GOP score measures the likelihood of a phoneme given the acoustic evidence relative to the likelihood of any phoneme, providing a continuous pronunciation quality measure. While phoneme-level analysis provides maximum diagnostic granularity, it requires language-specific acoustic models trained on large native-speaker corpora and is computationally prohibitive for real-time web deployment.
Franco et al. [2010] developed the EduSpeak SDK for pronunciation training, combining GOP scores with prosodic features (pitch, rhythm, speaking rate) to produce multi-dimensional pronunciation assessments. Their system achieved 82% correlation with expert SLP ratings on a balanced corpus of learner speech. Neri et al. [2008] proposed a segment-level pronunciation scoring approach using Support Vector Machine (SVM) classifiers trained on acoustic features, demonstrating that machine learning-based approaches could outperform GOP-based methods on non-standard accents. The proposed system adopts a word-level (rather than phoneme-level) comparison approach using difflib SequenceMatcher, trading phonetic granularity for computational efficiency, real-time responsiveness, and accent-robustness.
- Deep Learning Advances in ASR
Graves, Mohamed, and Hinton [2013] introduced the landmark deep bidirectional LSTM architecture with Connectionist Temporal Classification (CTC) loss for end-to-end ASR, eliminating the need for forced phoneme alignments and outperforming the best HMM-based systems on the TIMIT benchmark by over 17% relative WER reduction. This architecture became the conceptual blueprint for the modern ASR systems that underpin commercial APIs including Google Web Speech, Amazon Transcribe, and Microsoft Azure Speech. Chan et al. [2016] subsequently introduced the Listen, Attend and Spell (LAS) architecture, an attention-based encoder-decoder model that further improved long-form transcription quality. The Conformer architecture [Gulati et al., 2020] combined convolutional and transformer modules to achieve state-of-the-art performance on LibriSpeech, establishing the technical foundation for the near-human ASR accuracy available through cloud APIs today.
- Natural Language Processing for Grammar Correction
The CoNLL-2014 shared task on Grammatical Error Correction [Ng et al., 2014] benchmarked 13 systems on a learner English corpus, establishing that statistical phrase-based machine translation (SMT) approaches outperformed purely rule-based methods on complex contextual errors (preposition selection, article choice), while rule-based methods retained superiority for deterministic errors (subject-verb agreement, spelling). This complementarity directly informs the design rationale for using LanguageTool in the proposed system: in the spoken-language context of the system, where transcription noise and sentence simplicity characterize the input, deterministic rule-based detection is both more precise and more interpretable than neural alternatives.
Subsequent neural GEC systems including seq2seq [Sutskever et al., 2014] and transformer-based approaches [Zhao et al., 2019] achieved superior MaxMatch (M²) F0.5 scores on standard benchmarks. However, these models require substantial computational resources for inference, making real-time deployment challenging without GPU acceleration. The GECToR model [Omelianchuk et al., 2020] proposed a sequence tagging approach enabling fast CPU-based inference, suggesting a viable pathway for upgrading the grammar detection module in future system iterations.
- Telemedicine and Web-Based Speech Therapy
Grogan-Johnson et al. [2010] conducted the first randomized controlled pilot study comparing videoconference-delivered speech-language therapy against in-person therapy for 10 school-age children with speech sound disorders, finding no statistically significant difference in articulation improvement (p > 0.05) between delivery modes. Subsequent systematic reviews [Mashima & Doarn, 2008; Molini-Avejonas et al., 2015] confirmed these findings across a broader range of populations and disorder types, establishing telemedicine delivery as a clinically equivalent modality. The COVID-19 pandemic dramatically accelerated the adoption of telehealth in speech-language pathology, with ASHA reporting that telepractice delivery increased from 8% to over 60% of SLP service delivery within three months of the pandemic onset [ASHA, 2020].
Wales et al. [2017] conducted a systematic review of 24 digital health interventions for communication disorders, identifying self-management support, real-time biofeedback, and structured progress monitoring as the three features most consistently associated with positive therapy outcomes in digital platforms. The proposed system incorporates all three features: structured practice sessions, real-time error categorization and correction, and quantitative session-by-session score tracking.
- Automated Performance Scoring
Xi et al. [2008] developed SpeechRater, ETS's automated spoken English scoring system, which extracts features across five dimensions (pronunciation, fluency, vocabulary richness, grammar complexity, content relevance) to predict holistic proficiency scores. SpeechRater achieved correlations of r = 0.73–0.81 with human rater scores across different task types, demonstrating that automated multi-dimensional scoring can approach human-level reliability for proficiency assessment. This multi-dimensional scoring philosophy directly shaped the proposed system's weighted scoring algorithm, though the proposed system's simpler two-factor model (pronunciation and grammar) is better calibrated to the therapeutic rather than proficiency-assessment context.
Zechner et al. [2019] subsequently showed that adding fluency features (pause frequency, speaking rate, repair frequency) to pronunciation and grammar features substantially improved holistic score prediction, achieving r = 0.84 with expert raters. This finding motivates the inclusion of fluency metrics as a future enhancement in the proposed systems scoring module.
- String Matching and Edit-Distance Algorithms
Ratcliff and Metzener [1988] described the Gestalt Pattern Matching algorithm, implemented in Python's difflib SequenceMatcher, which finds the longest common substring and recursively applies the process to unmatched fragments, computing a similarity ratio in O(n²) time, where n is the string length. The algorithm produces edit operation codes (opcodes) of equal, insert, delete, and replace that map directly to the proposed system's four pronunciation error categories. Unlike Levenshtein edit distance, which operates at the character level, SequenceMatcher operates natively at word-token granularity, making it directly applicable to word-level pronunciation analysis without tokenization overhead.
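As a concrete illustration of this opcode-to-category mapping, the sketch below aligns a target sentence against a transcript at word-token granularity. The function and variable names are illustrative, not the authors' published code:

```python
from difflib import SequenceMatcher

def classify_words(target: str, spoken: str):
    """Map SequenceMatcher opcodes onto the four pronunciation categories."""
    t_words = target.lower().split()
    s_words = spoken.lower().split()
    results = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, t_words, s_words).get_opcodes():
        if op == "equal":        # spoken word matches the target word
            results += [(w, "correct") for w in t_words[i1:i2]]
        elif op == "delete":     # target word never spoken
            results += [(w, "missing") for w in t_words[i1:i2]]
        elif op == "insert":     # word spoken that is not in the target
            results += [(w, "extra") for w in s_words[j1:j2]]
        elif op == "replace":    # target word produced incorrectly
            results += [(w, "mispronounced") for w in t_words[i1:i2]]
    return results

words = classify_words("the quick brown fox", "the quik brown fox jumps")
```

Because SequenceMatcher accepts any hashable sequence, passing word lists (rather than raw strings) yields alignment directly at word granularity, exactly as the paragraph above describes.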
- Adaptive Intelligent Tutoring Systems
Vandewaetere, Desmet, and Clarebout [2011] reviewed 32 empirical studies of Adaptive Intelligent Tutoring Systems (AITS), finding that systems adapting exercise difficulty and feedback type to individual performance data showed mean learning gains 23% higher than non-adaptive counterparts across all measured disciplines. VanLehn [2011] conducted a meta-analysis of 76 tutoring experiments, showing that step-level adaptivity (feedback at each practice step) consistently outperformed summary-level adaptivity (end-of-session feedback) by 0.76 standard deviations. These findings directly motivate the proposed system's word-level immediate feedback design, which provides step-level corrective guidance at each spoken word rather than deferring to end-of-session summaries.
Table 2. Literature Survey Summary: Contributions and Limitations

| Ref. | Authors | Year | Technique | Key Contribution | Limitation Addressed |
|---|---|---|---|---|---|
| [1] | Eskenazi | 1999 | ASR-CALL | First ASR-based pronunciation feedback | Binary feedback only; no word classification |
| [2] | Witt & Young | 2000 | HMM GOP Scoring | Phone-level pronunciation scoring | Phoneme-only; no grammar; slow deployment |
| [3] | Graves et al. | 2013 | CTC-LSTM ASR | End-to-end ASR, no forced alignment | No therapy output; no NLP integration |
| [4] | Ng et al. | 2014 | Statistical GEC | Standardised grammar error evaluation | Not real-time; complex model inference |
| [5] | Grogan-Johnson | 2010 | Telemedicine | Validates remote therapy equivalence | Still requires human SLP clinician |
| [6] | Xi et al. | 2008 | SpeechRater | Multi-dimensional spoken scoring | Commercial; proprietary; not open-source |
| [7] | Ratcliff & Metzener | 1988 | Gestalt Matching | Word-level similarity algorithm | Algorithm only; no therapy integration |
| [8] | Vandewaetere et al. | 2011 | AITS Review | Step-level adaptivity +23% learning gain | No speech analysis component |
| [9] | Newman | 2015 | Microservices | Scalable AI service decomposition | Architecture guide; no speech system |
| [10] | Jurafsky & Martin | 2020 | Audio Processing | 16 kHz standardisation pipeline | Textbook; no working implementation |
- SYSTEM ANALYSIS
- Existing System Analysis
- Traditional Clinician-Led Therapy
Conventional speech therapy through scheduled SLP sessions employs standardized assessments including the Goldman-Fristoe Test of Articulation-3 (GFTA-3), the Praxis of Speech Acts (PROMPT), the Comprehensive Aphasia Test (CAT), and informal conversational analysis using the LARSP (Language Assessment Remediation and Screening Procedure) grammatical profile. While clinically definitive, this model faces three systemic constraints: (1) the appointment-based model creates temporal discontinuity, with patients practicing independently for days between 1-hour sessions; (2) per-session costs of ₹1,500–5,000 in India and US$100–300 internationally create economic exclusion; and (3) geographic concentration of SLPs in urban tertiary-care centers creates physical inaccessibility for rural populations.
- Commercial Software Tools
Contemporary AI speech tools can be categorized into three architectural paradigms, each with distinct strengths and limitations. Phoneme-scoring tools (ELSA Speak, Rosetta Stone, Pimsleur) employ deep acoustic models to evaluate phoneme-level production, achieving subjectively impressive feedback but ignoring the syntactic dimension of spoken language entirely. Transcription-plus-comparison tools (Google Pronunciation) provide broad-accuracy ASR but offer no error categorization or therapeutic guidance. Human-review tools (Speechling, italki) pair learners with human coaches, providing qualitatively rich but inherently asynchronous feedback that is incompatible with the immediate-correction requirements of effective practice sessions. No existing tool integrates all four therapeutic requirements: real-time response, word-level error classification, grammar analysis, and quantitative progress scoring.
- Proposed System Design Philosophy
The proposed Adaptive AI Speech Therapy and Real-Time Correction Engine is designed around five foundational principles that collectively distinguish it from all benchmarked alternatives:
- Real-time analysis with sub-5-second end-to-end latency, ensuring that feedback arrives before the user's short-term phonological memory of the utterance fades (the 2-second threshold identified by Derwing et al. [2008] as critical for effective speech feedback).
- Multi-dimensional assessment covering both phonological accuracy (pronunciation) and syntactic/morphological correctness (grammar), the two pillars of spoken language competence as defined in the CEFR (Common European Framework of Reference for Languages).
- Word-level error specificity providing four-category diagnostic classification (correct, missing, extra, mispronounced) rather than binary pass/fail, enabling users to direct practice effort to specific error types and positions.
- Quantitative, reproducible scoring on a 0–100 scale with five named performance bands, enabling objective session-to-session progress tracking analogous to physiotherapy outcome measurement tools such as the Functional Independence Measure (FIM).
- Open, extensible architecture using an open-source Python/FastAPI/LanguageTool stack that can be self-hosted by institutions, integrated with existing clinical systems via REST API, and extended without proprietary licensing constraints.
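The weighted scoring principle behind the fourth design goal can be sketched as follows. The 70%/30% weights come from the paper itself; the band labels and cut-off thresholds below are placeholders, since the five named bands are defined elsewhere in the paper and not restated here:

```python
def composite_score(pronunciation_pct: float, grammar_pct: float) -> float:
    """Combine the two sub-scores into a 0-100 composite (70% pronunciation, 30% grammar)."""
    return round(0.7 * pronunciation_pct + 0.3 * grammar_pct, 1)

# Placeholder cut-offs and labels for the five performance bands,
# checked from highest to lowest.
BANDS = [(90, "Band 5"), (75, "Band 4"), (60, "Band 3"), (40, "Band 2"), (0, "Band 1")]

def band_for(score: float) -> str:
    """Return the first band whose cut-off the score meets."""
    return next(label for cutoff, label in BANDS if score >= cutoff)
```

A session with 95.8% pronunciation accuracy and 92.6% grammar correctness, for example, yields a composite of 94.8 and lands in the top band under these placeholder thresholds.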
- SYSTEM ARCHITECTURE AND DESIGN
- Three-Tier Architectural Overview
The system follows a classic three-tier web application architecture with an internal composition analogous to a microservice pipeline, where each analytical concern is handled by a dedicated, independently testable and upgradeable module. The three tiers (Presentation, Application, and External Services) communicate exclusively through well-defined contracts: HTTP/JSON between the Presentation and Application tiers, and HTTPS API calls between the Application tier and external services. This strict separation of concerns ensures that any individual tier can be replaced (e.g., swapping the Google Web Speech API for OpenAI Whisper) without structural disruption to the overall system.
Table 3. Three-Tier Architecture Components and Communication Contracts
| Tier | Layer | Key Components | Interface Protocol | Scalability Strategy |
|---|---|---|---|---|
| Tier 1 | Presentation | Browser HTML/CSS/JS, Audio Upload Form, Target Sentence Input, Results Display, Score Visualisation | HTTP POST multipart/form-data; JSON response | CDN-hosted static assets; client-side rendering |
| Tier 2 | Application | FastAPI Server, Pydub Preprocessor, SequenceMatcher Analyzer, LanguageTool Grammar Engine, Weighted Scorer | Python async function calls; Pydantic models | Uvicorn ASGI; horizontal scaling via Docker/K8s |
| Tier 3 | External Services | Google Web Speech API (Cloud ASR); LanguageTool Java NLP Server (Local) | HTTPS REST; XML/JSON | Cloud API auto-scales; local JVM thread pooling |

- Module Architecture
Within the Application Layer (Tier 2), the system is decomposed into five functionally independent modules communicating through well-typed Python data structures: the Audio Preprocessing Module, the ASR Transcription Module, the Pronunciation Comparison Module, the Grammar Detection Module, and the Performance Scoring Module. This decomposition follows the Single Responsibility Principle: each module has exactly one analytical responsibility and exposes a clean interface allowing independent unit testing, replacement, and scaling.
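The "well-typed Python data structures" exchanged between modules might look like the following sketch. The class and field names here are illustrative assumptions, not taken from the published code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WordResult:
    """Output of the Pronunciation Comparison Module for one word."""
    word: str
    category: str  # "correct" | "missing" | "extra" | "mispronounced"

@dataclass
class GrammarError:
    """One finding from the Grammar Detection Module."""
    message: str
    offset: int
    length: int
    suggestion: str

@dataclass
class SessionResult:
    """Aggregate record assembled for the Performance Scoring Module."""
    transcript: str
    words: List[WordResult] = field(default_factory=list)
    grammar_errors: List[GrammarError] = field(default_factory=list)
    score: float = 0.0
    band: str = ""
```

Typed records like these keep each module's interface explicit, which is what makes the modules independently unit-testable and replaceable, as described above.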
- Data Flow Architecture
The data flow follows a sequential-then-parallel pattern designed to minimize end-to-end latency. Audio preprocessing (Module 1) and ASR transcription (Module 2) execute sequentially, since transcription requires preprocessed audio. Once the transcript is available, pronunciation comparison (Module 3) and grammar detection (Module 4) execute in parallel using Python's asyncio concurrency model, since they are independent operations on the same input text. Performance scoring (Module 5) executes after both parallel modules complete, combining their outputs. This parallel execution reduces end-to-end latency by approximately 35–40% compared to a fully sequential pipeline, as confirmed by profiling experiments.
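The parallel branch of this flow can be sketched with asyncio.gather; the coroutine bodies below are stand-ins returning fixed values, and the function names are illustrative:

```python
import asyncio

async def compare_pronunciation(transcript: str, target: str) -> float:
    await asyncio.sleep(0)  # stand-in for the real word-alignment work
    return 95.8

async def detect_grammar(transcript: str) -> int:
    await asyncio.sleep(0)  # stand-in for the real LanguageTool call
    return 0

async def analyze(transcript: str, target: str):
    # Modules 3 and 4 are independent, so they run concurrently ...
    accuracy, n_errors = await asyncio.gather(
        compare_pronunciation(transcript, target),
        detect_grammar(transcript),
    )
    # ... and scoring (Module 5) runs only after both complete.
    return accuracy, n_errors

result = asyncio.run(analyze("the quick brown fox", "the quick brown fox"))
```

Because both coroutines await I/O-bound work (an HTTPS round trip and a local NLP server call in the real system), gathering them overlaps their wait times, which is where the reported latency reduction comes from.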
Table 4. Data Flow Level-1 Process Decomposition

| Process | Process Name | Input | Output | Data Store | Parallelism |
|---|---|---|---|---|---|
| P1 | Validate & Preprocess Audio | Raw audio bytes, filename | 16 kHz mono WAV file | DS1: Temp WAV | Sequential |
| P2 | Transcribe Speech (ASR) | WAV file from DS1 | Transcript text string | DS2: Session | Sequential after P1 |
| P3 | Compare Pronunciation | Transcript + target sentence | Word classification list + accuracy % | DS2: Session | Parallel with P4 |
| P4 | Detect Grammar Errors | Transcript text | Grammar error list + grammar score | DS2: Session | Parallel with P3 |
| P5 | Calculate Performance Score | Word list + error list from DS2 | Score (0–100), band, recommendation | DS2: Session | Sequential after P3+P4 |
| P6 | Assemble Response | All DS2 fields | Structured JSON response object | DS3: Audit Log | Sequential after P5 |
| P7 | Render Results | JSON response | Annotated UI display | (none) | Client-side |

- Database and Session Management
The system employs a lightweight, stateless session model appropriate for its therapeutic use case. Each API call creates a transient session object (DS2) in memory containing all intermediate outputs, which is serialized to a JSON response and optionally persisted to an audit log (DS3) for longitudinal tracking. Temporary WAV files (DS1) are created using Python's tempfile module with automatic cleanup on request completion, ensuring no sensitive audio data persists on the server beyond the transaction lifetime. For clinical deployments requiring HIPAA/DPDP (India's Digital Personal Data Protection Act, 2023) compliance, the system supports encrypted at-rest storage of session records and audit logs using AES-256 encryption of the DS3 datastore.
- DETAILED MODULE IMPLEMENTATION
- Audio Preprocessing Module
The Audio Preprocessing Module is the entry point of the analysis pipeline, responsible for transforming all incoming audio files into a standardized form suitable for ASR processing. The Google Web Speech API requires audio in 16-bit PCM WAV format at a 16,000 Hz sample rate, single channel (mono). User-submitted audio may arrive in any of 12 common formats (MP3, M4A, OGG, FLAC, AAC, WebM, AMR, WMA, OPUS, AIFF, AU, WAV) with arbitrary sample rates (8,000–48,000 Hz), bit depths (8, 16, or 24 bit), and channel configurations (mono, stereo, or multichannel). The module uses Pydub's AudioSegment class to perform three sequential transformations reliably across all input configurations:
- Format detection and loading using FFmpeg codec bindings. Pydub automatically identifies the audio codec from the file header (magic bytes) rather than relying on file extension, ensuring correct decoding even for misnamed files.
- Channel downmixing to mono. Stereo audio is downmixed using equal-power averaging (L+R)/2, which preserves speech intelligibility while reducing file size by 50% and eliminating channel-dependent ASR accuracy variations.
- Sample rate resampling to 16,000 Hz. Pydub uses FFmpeg's high-quality SoX resampler with anti-aliasing filtering to prevent spectral aliasing artifacts that could degrade ASR accuracy.
Additionally, the module applies amplitude normalization to a target level of −14 dBFS using peak normalization, ensuring consistent signal levels across recordings from different microphones, recording environments, and device types. A voice activity detection (VAD) heuristic rejects audio samples shorter than 500 ms or with a signal-to-noise ratio (SNR) below 6 dB, returning a descriptive error message rather than submitting unprocessable audio to the ASR API, thus avoiding unnecessary API call costs.
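The gating and gain arithmetic above can be condensed into a small pre-flight check that runs before any API call. This is a minimal sketch: the 500 ms, 6 dB, and −14 dBFS thresholds come from the text, while the function names and the assumption that duration and SNR arrive precomputed are illustrative. The decode/downmix/resample chain itself would use Pydub, e.g. AudioSegment.from_file(path).set_channels(1).set_frame_rate(16000).

```python
def preprocess_gate(duration_ms: float, snr_db: float) -> tuple[bool, str]:
    """VAD heuristic from the text: reject clips shorter than 500 ms
    or with an estimated SNR below 6 dB before calling the ASR API."""
    if duration_ms < 500:
        return False, "Recording too short (minimum 500 ms)"
    if snr_db < 6.0:
        return False, "Background noise too high (SNR below 6 dB)"
    return True, "ok"


def normalization_gain(current_dbfs: float, target_dbfs: float = -14.0) -> float:
    """Gain in dB needed to bring the clip's peak level to the -14 dBFS target."""
    return target_dbfs - current_dbfs
```

With Pydub, the gain would then be applied via something like segment.apply_gain(normalization_gain(segment.max_dBFS)), since peak normalization works from the clip's maximum level.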
- ASR Transcription Module
The ASR Transcription Module converts the preprocessed WAV audio into text using the SpeechRecognition library's recognize_google() method, which submits audio data to the Google Web Speech API over HTTPS and returns transcription hypotheses. In the current implementation, only the top hypothesis is used; future work will explore hypothesis combination to improve robustness in noisy audio.
The module implements a robust error handling strategy with exponential backoff retry logic (delays: 1 s, 2 s, 4 s; maximum 3 attempts) for transient network failures, a pattern well-established in distributed systems engineering [Nygard, 2007]. Three distinct exception types are handled: speech_recognition.UnknownValueError (audio contains no intelligible speech), speech_recognition.RequestError (API service unavailable or network error), and a custom AudioQualityError raised when the preprocessing module's VAD check fails. Each exception returns a structured error response with a descriptive message and an HTTP status code appropriate to the failure type (422 Unprocessable Entity for audio quality failures; 503 Service Unavailable for API failures).
The module also implements transcript post-processing: converting the returned text to lowercase, stripping leading/trailing whitespace, and expanding common ASR contractions (e.g., "gonna" → "going to") when the target sentence uses the expanded form, reducing spurious mispronunciation flags for colloquial speech patterns.
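The backoff loop and post-processing step can be sketched as follows. The helper names (with_backoff, postprocess) and the contraction table are illustrative, not the paper's identifiers; in the real module, fn would wrap the recognize_google() call and the transient exception set would include speech_recognition.RequestError.

```python
import time


def with_backoff(fn, delays=(1, 2, 4), transient=(ConnectionError,), sleep=time.sleep):
    """Retry fn() after 1 s, 2 s, and 4 s delays on transient errors,
    letting the final failure propagate so the API layer can map it to HTTP 503."""
    for delay in delays:
        try:
            return fn()
        except transient:
            sleep(delay)
    return fn()  # final attempt: exceptions propagate to the caller


CONTRACTIONS = {"gonna": "going to", "wanna": "want to"}  # illustrative subset


def postprocess(transcript: str, target: str) -> str:
    """Lowercase, trim, and expand contractions only when the target
    sentence itself uses the expanded form (per the module description)."""
    target_lower = target.lower()
    words = []
    for w in transcript.lower().strip().split():
        expansion = CONTRACTIONS.get(w)
        words.append(expansion if expansion and expansion in target_lower else w)
    return " ".join(words)
```

Injecting the sleep function makes the retry loop testable without real delays; the conditional expansion avoids rewriting "gonna" when the target sentence itself uses the contraction.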
- Pronunciation Comparison Module
The Pronunciation Comparison Module is the analytical core of the phonological assessment subsystem. The module normalizes both the ASR transcript and the user's target sentence to lowercase word lists with punctuation stripped using Python's re.sub(r'[^\w\s]', '', text) pattern, then applies difflib.SequenceMatcher(None, transcript_words, target_words).get_opcodes() to obtain a sequence of edit operations with four opcode types that map directly to the four diagnostic categories:
Table 5. Pronunciation Error Classification Opcode-to-Category Mapping
Category | SequenceMatcher Opcode | Diagnostic Meaning | Clinical Interpretation | Score Impact
CORRECT | equal | Spoken word matches target exactly | Accurate phonological production | + (counts toward accuracy)
MISSING | delete | Target word absent from spoken output | Word omission: possible apraxia, fluency disorder, or attention lapse | − (denominator word not produced)
EXTRA | insert | Spoken word not present in target | Intrusion/addition: possible paraphasic error or filler-word habit | − (counted against fluency)
MISPRONOUNCED | replace | Different word spoken where target word expected | Phonological substitution: target-specific articulation error requiring focused practice | − (counted against accuracy)

Pronunciation accuracy is computed as: accuracy = (count_equal / len(target_words)) × 100. This definition deliberately uses the target word count as the denominator (rather than the union of target and spoken words) because therapeutic progress is measured by the proportion of the target successfully reproduced, not penalized additionally for extra words. The module returns a structured list of word-result objects, each containing the word text, its classification category, its position in the target sentence, and (for MISPRONOUNCED words) both the spoken and expected word forms.
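The opcode-to-category mapping and accuracy formula can be sketched as below. Function names are illustrative; note that the sketch passes the target word list as the first SequenceMatcher argument so that the delete and insert opcodes line up with the MISSING and EXTRA categories of Table 5, and the handling of unequal-length replace spans is an assumption the paper does not specify.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase and strip punctuation (the re.sub pattern from the text)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def compare_pronunciation(transcript: str, target: str):
    tgt, spoken = normalize(target), normalize(transcript)
    # With a = target and b = spoken, 'delete' marks target words never
    # spoken (MISSING) and 'insert' marks spoken words absent from the
    # target (EXTRA), matching Table 5.
    results = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, tgt, spoken).get_opcodes():
        if tag == "equal":
            results += [{"word": w, "category": "CORRECT", "position": i1 + k}
                        for k, w in enumerate(tgt[i1:i2])]
        elif tag == "delete":
            results += [{"word": w, "category": "MISSING", "position": i1 + k}
                        for k, w in enumerate(tgt[i1:i2])]
        elif tag == "insert":
            results += [{"word": w, "category": "EXTRA"} for w in spoken[j1:j2]]
        else:  # 'replace': pair expected/spoken forms position by position
            for k in range(max(i2 - i1, j2 - j1)):
                exp = tgt[i1 + k] if i1 + k < i2 else None
                sp = spoken[j1 + k] if j1 + k < j2 else None
                if exp and sp:
                    results.append({"word": exp, "expected": exp, "spoken": sp,
                                    "category": "MISPRONOUNCED", "position": i1 + k})
                elif exp:
                    results.append({"word": exp, "category": "MISSING",
                                    "position": i1 + k})
                else:
                    results.append({"word": sp, "category": "EXTRA"})
    correct = sum(r["category"] == "CORRECT" for r in results)
    accuracy = correct / len(tgt) * 100 if tgt else 0.0
    return results, accuracy
```

For example, comparing the transcript "the dog sat" against the target "the cat sat" yields one MISPRONOUNCED entry (expected "cat", spoken "dog") and an accuracy of 66.7%, consistent with the target-denominator definition.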
- Grammar Detection Module
The Grammar Detection Module leverages LanguageTool, an open-source, rule-based NLP grammar and style checker supporting 25+ languages with over 5,000 linguistic rules for English, to identify syntactic and morphological errors in the ASR transcript. The module instantiates a LanguageTool('en-IN') checker (en-IN: Indian English variant) to correctly handle Indian English lexical and syntactic patterns that differ from British or American English norms.
Each grammar error is returned as a Match object containing: ruleId (the violated LanguageTool rule identifier), message (human-readable error description), offset and errorLength (character span of the error), and replacements (ordered list of correction candidates). The module applies a context-sensitive filtering stage to remove false positives specific to ASR output: UPPERCASE_SENTENCE_START (ASR produces lowercase output), PUNCTUATION_PARAGRAPH_END (ASR omits sentence-final punctuation), and MORFOLOGIK_RULE_EN_* rules for proper nouns that ASR may not capitalize. This filtering stage reduces false positive grammar alerts by approximately 28%, as measured on the test corpus.
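The filtering stage can be sketched as a rule-ID blocklist. Plain dicts stand in here for LanguageTool Match objects (the real language_tool_python Match exposes ruleId as an attribute), and applying the MORFOLOGIK glob to every spelling flag, rather than only to uncapitalised proper nouns as the text describes, is a simplification of this sketch.

```python
from fnmatch import fnmatch

# Rule IDs suppressed for ASR transcripts, which arrive lowercase and unpunctuated
ASR_FALSE_POSITIVE_RULES = (
    "UPPERCASE_SENTENCE_START",   # ASR produces lowercase output
    "PUNCTUATION_PARAGRAPH_END",  # ASR omits sentence-final punctuation
    "MORFOLOGIK_RULE_EN_*",       # spelling flags on uncapitalised proper nouns
)


def filter_asr_false_positives(matches):
    """Drop grammar matches whose rule ID is a known ASR artifact.
    Here, matches are dicts with a 'ruleId' key standing in for Match objects."""
    return [m for m in matches
            if not any(fnmatch(m["ruleId"], pat)
                       for pat in ASR_FALSE_POSITIVE_RULES)]
```

Using fnmatch keeps the wildcard pattern for the MORFOLOGIK_RULE_EN_* family readable while the two exact rule IDs match literally.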
Table 6. Grammar Error Categories LanguageTool Rules and Detection Performance
Error Category | Example Error | LanguageTool Rule ID | Precision | Recall | Clinical Frequency
Subject-Verb Agreement | "She go to school" | AGREEMENT_SENT_START | 91.7% | 91.7% | Very High
Article Usage | "I saw a elephant" | EN_A_VS_AN | 88.2% | 83.3% | High
Tense Consistency | "He go yesterday" | ENGLISH_WORD_REPEAT | 100% | 78.6% | High
Word Repetition | "I saw saw him" | ENGLISH_WORD_REPEAT | 100% | 100% | Medium
Preposition Selection | "Interested on AI" | PREP_INTERESTED | 77.8% | 70.0% | Medium
Number Agreement | "Five childs" | NON3PRS_VERB | 88.5% | 82.4% | Medium
Spelling/Morphology | "Tomarow" | MORFOLOGIK_RULE_EN_GB | 95.2% | 89.7% | Low
Overall (weighted avg) | All categories | | 92.6% | 85.1% | All
- Performance Scoring Module
The Performance Scoring Module aggregates the outputs of the pronunciation comparison and grammar detection modules into a single composite performance score using a weighted linear combination formula validated against expert SLP ratings on the test corpus:
Final Score = (0.70 × Pronunciation_Accuracy) + (0.30 × Grammar_Score)
Grammar_Score = max(0, 100 − (Errors_Per_100_Words × 10))
Pronunciation_Accuracy = (correct_word_count / total_target_words) × 100
The 70:30 weighting was determined through a calibration study in which five experienced SLPs rated 30 speech samples on a 0–100 holistic quality scale. Ridge regression optimization of the weighting coefficient to minimize root-mean-square error (RMSE) between model scores and SLP ratings yielded an optimal weight of 0.69 for pronunciation and 0.31 for grammar (rounded to 0.70/0.30). This weighting reflects the established clinical consensus that phonological accuracy is the primary determinant of spoken language intelligibility [Derwing & Munro, 2015], while grammatical correctness constitutes an important secondary dimension of communicative competence.
Table 7. Performance Score Bands Thresholds, Labels, and Therapeutic Recommendations
Score Range | Band Label | Colour | Clinical Interpretation | Therapeutic Recommendation
90–100 | EXCELLENT | Deep Green | Near-native accuracy; minimal errors | Advance to complex, multi-clause target sentences with low-frequency vocabulary
75–89 | GOOD | Light Green | Communicatively effective with minor lapses | Focus targeted practice on the 1–2 mispronounced words identified; maintain current complexity
60–74 | SATISFACTORY | Amber | Acceptable intelligibility; noticeable errors | Practise flagged words 5× minimum before re-attempting the full sentence
40–59 | NEEDS IMPROVEMENT | Orange | Impaired intelligibility; systematic errors | Return to isolated word-level practice for all MISSING and MISPRONOUNCED categories
0–39 | POOR | Red | Severely impaired intelligibility | Regression to simpler 3–5 word target sentences; consider formal SLP referral
- REST API Layer
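The scoring formulas and band thresholds above reduce to a few lines; the function names in this sketch are illustrative, while the constants (70:30 weights, error-density penalty, band boundaries) come directly from the formulas and Table 7.

```python
BANDS = ((90, "EXCELLENT"), (75, "GOOD"), (60, "SATISFACTORY"),
         (40, "NEEDS IMPROVEMENT"), (0, "POOR"))


def grammar_score(error_count: int, word_count: int) -> float:
    """Grammar_Score = max(0, 100 - errors_per_100_words x 10)."""
    errors_per_100 = error_count / word_count * 100
    return max(0.0, 100.0 - errors_per_100 * 10)


def final_score(pronunciation_accuracy: float, grammar: float) -> float:
    """70:30 weighted combination from the SLP calibration study."""
    return 0.70 * pronunciation_accuracy + 0.30 * grammar


def band_label(score: float) -> str:
    """Map a 0-100 composite score to its Table 7 band label."""
    return next(label for threshold, label in BANDS if score >= threshold)
```

For example, one grammar error in a 20-word transcript gives 5 errors per 100 words, a grammar score of 50; combined with 90% pronunciation accuracy the composite is 78, which falls in the GOOD band.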
The REST API Layer, implemented using FastAPI 0.104+ with Uvicorn 0.24+ as the ASGI server, orchestrates all modules and exposes the system's capabilities as documented HTTP endpoints. FastAPI's asynchronous request handling using Python's asyncio event loop enables the server to process multiple concurrent analysis requests without thread-blocking, supporting classroom deployments with 20–30 simultaneous users on a single server instance. The framework's automatic OpenAPI 3.0 documentation generation (accessible at /docs) provides a self-describing, interactive API specification that facilitates integration by third-party developers and clinical IT teams.
Table 8. REST API Endpoint Definitions
Method | Endpoint | Request Parameters | Response Schema | HTTP Codes
POST | /analyze | audio_file (UploadFile, binary); target_sentence (str, Form) | JSON: {transcript, target, word_results[], grammar_errors[], score, band, recommendation, latency_ms} | 200 OK; 422 Unprocessable; 503 Unavailable
GET | /health | (none) | JSON: {status: "ok", version: "1.0.0", uptime_s: float} | 200 OK
POST | /feedback | session_id (str); rating (int, 1–5); comment (str, optional) | JSON: {message: "Feedback received", session_id} | 200 OK; 404 Not Found
GET | /history/{user_id} | user_id (path param, str); limit (query, int, default 20) | JSON: [{session_id, score, band, timestamp, target_sentence}] | 200 OK; 404 Not Found
GET | /stats | period (query: "day"/"week"/"month") | JSON: {avg_score, total_sessions, score_distribution, top_errors[]} | 200 OK
GET | /docs | (none) | Auto-generated OpenAPI 3.0 interactive documentation UI | 200 OK
- Frontend Interface
The Presentation Layer is a single-page web application delivering two primary interaction modes: file-upload mode, which accepts pre-recorded audio files in any supported format, and live-recording mode, which uses the browser's MediaRecorder interface (MediaStream Recording API) to capture microphone input directly in the browser without requiring any additional software installation. The interface renders three result panels after analysis: an annotated word transcript with color-coded word classification (green for CORRECT, red for MISPRONOUNCED, amber for MISSING, blue for EXTRA); a grammar correction panel listing each identified error with its specific correction suggestion and the applicable linguistic rule; and a circular performance gauge displaying the composite score, band label, and personalized recommendation message. A session history chart (implemented using Chart.js) displays the user's score trend across the last 10 sessions, providing visual evidence of progress that functions as a motivational reinforcement mechanism.
- PERFORMANCE EVALUATION
- Evaluation Methodology
The system was evaluated using a purpose-built test corpus of 50 audio samples designed to provide systematic coverage across three independent variables: audio quality (5 conditions from clean studio recording to heavy background noise), speaker profile (5 categories spanning native English through Indian English learners, slow and fast speech rates), and sentence complexity (5 levels from simple 5-word sentences to complex 20-word multi-clause sentences with subordination and passive voice). Each sample was processed through the complete pipeline, and results were compared against ground-truth transcriptions and expert SLP error annotations provided by two certified SLPs who independently annotated each sample (Cohen's kappa = 0.87, indicating strong inter-rater reliability).
- Evaluation Metrics
Six metrics were used to provide comprehensive performance characterization across all system modules:
Table 9. Evaluation Metrics Definitions, Formulas, and Clinical Relevance
Metric | Formula | Module | Clinical Relevance | Target Threshold
Word Error Rate (WER) | (S + D + I) / N × 100 (S = substitutions, D = deletions, I = insertions, N = reference words) | ASR | Primary ASR quality indicator; WER < 10% is considered near-human performance | < 10% (clean conditions)
Pronunciation Accuracy | correct_words / target_words × 100 | Pronunciation | Proportion of target words reproduced exactly; therapeutic progress indicator | > 90% (proficiency target)
Grammar Score | max(0, 100 − EPH × 10) (EPH = errors per 100 words) | Grammar | Error density penalisation; higher = more grammatically correct speech | > 90% (proficiency target)
Precision | TP / (TP + FP) | Grammar | Fraction of flagged errors that are genuine; guards against false alarms | > 90% (clinical trust)
Recall (Sensitivity) | TP / (TP + FN) | Grammar | Fraction of actual errors successfully detected; guards against missed errors | > 80% (clinical utility)
Response Latency | End-to-end wall-clock time from audio upload to JSON response delivery | System | Speech therapy requires feedback within 2 seconds of utterance [Derwing, 2008] | < 2,000 ms
- ASR Performance Results
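The WER formula in Table 9 is equivalent to a word-level Levenshtein (edit) distance between the reference and hypothesis word sequences, normalised by reference length. A minimal sketch (function name assumed):

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (S + D + I) / N x 100, computed via dynamic-programming
    edit distance over words rather than characters."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(m + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[n][m] / n * 100
```

For example, one substituted word in a three-word reference yields a WER of 33.3%; note that WER can exceed 100% when the hypothesis contains many insertions.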
Table 10. Word Error Rate Results by Audio Quality Condition
Audio Condition | SNR (dB) | WER (%) | Pronunciation Accuracy (%) | Avg. Latency (ms) | System Usability
Clean studio recording | > 40 | 4.2 | 95.8 | 1,240 | Excellent
Quiet indoor room | 25–40 | 7.8 | 92.2 | 1,380 | Very Good
Light background noise | 15–25 | 12.4 | 87.6 | 1,520 | Good
Moderate background noise | 5–15 | 21.3 | 78.7 | 1,640 | Acceptable (quality warning)
Heavy background noise | < 5 | 38.6 | 61.4 | 1,720 | Limited (quality alert issued)

The WER of 4.2% under clean studio conditions places the system firmly in the near-human performance range widely recognized in ASR research. This result is comparable to reported WERs of 4.9% for the Google Web Speech API on LibriSpeech test-clean [Google AI Blog, 2023] and outperforms the 6.3% WER reported by Radford et al. [2022] for OpenAI Whisper Small on the same benchmark. The graceful degradation under increasing noise, from 4.2% (clean) to 38.6% (heavy noise), follows a predictable, near-linear pattern that allows users to interpret their score within the context of their recording environment. Critically, the system issues a proactive audio quality warning when WER climbs above 35%, alerting users to improve their recording setup rather than receiving misleadingly poor feedback.
- Speaker Profile Performance
Table 11. Pronunciation Comparison Performance by Speaker Profile
Speaker Profile | Precision (%) | Recall (%) | F1 Score | Avg. Composite Score | Key Challenge
Native English speaker | 94.1 | 91.8 | 0.929 | 88.6 | Near-ceiling performance
Indian English (fluent) | 91.3 | 88.7 | 0.899 | 85.2 | Minor accent-related substitutions
Indian English (learner) | 84.6 | 81.2 | 0.828 | 78.4 | Systematic phonological substitutions
Slow speech rate | 96.2 | 94.8 | 0.954 | 90.1 | Best performance across all profiles
Fast speech rate | 78.4 | 74.6 | 0.764 | 73.8 | Co-articulation causes word merging

The speaker profile results reveal important clinical insights. The highest performance for slow speech (F1 = 0.954) is therapeutically significant: speech therapy practice universally encourages deliberate, slow articulation, and the system's strong performance in this regime confirms its suitability as a tool for supported practice. Fast speech (F1 = 0.764) presents the greatest challenge, primarily because rapid co-articulation causes the Google Web Speech API to merge adjacent words at the transcription stage, cascading errors through the pronunciation comparison module. This limitation motivates the Whisper-based ASR enhancement in the future work roadmap, as Whisper's large-v2 model shows substantially better performance on fast speech [Radford et al., 2022].
Indian English learners achieve an average composite score of 78.4, demonstrating that the difflib SequenceMatcher algorithm generalises reasonably to non-native accent patterns. The precision-recall gap (84.6% vs. 81.2%) indicates that some accent-specific phonological substitutions, particularly the dental-alveolar /t/-/d/ versus their retroflex counterparts characteristic of Indian English, are missed by the current word-level (rather than phoneme-level) comparison approach. Phoneme-level comparison using a G2P (grapheme-to-phoneme) converter, as planned in the future work roadmap, would directly address this gap.
- Grammar Detection Performance
Grammar detection results confirm that the LanguageTool integration achieves clinically meaningful precision (92.6% overall), meaning fewer than 1 in 13 grammar corrections presented to users is a false alarm. This false alarm rate is well within the tolerance range for therapeutic applications: false alarms in speech correction tools primarily cause user frustration rather than harm, and a 7.4% false alarm rate is substantially lower than the 15–25% false alarm rates reported for neural GEC models deployed without post-processing [Napoles et al., 2019].
Word repetition errors achieve perfect recall (100%) because they are detected by deterministic string matching rather than probabilistic linguistic inference, making this category immune to acoustic model uncertainty. Preposition errors prove most challenging (precision 77.8%, recall 70.0%), reflecting the well-documented difficulty of preposition choice in NLP systems, where contextual ambiguity makes rule-based detection unreliable in approximately 20–30% of cases [Tetreault & Chodorow, 2008]. Future integration of a neural GEC component specifically targeting preposition errors would directly improve this performance dimension.
- Response Latency Analysis
End-to-end response latency is a critical performance metric for speech therapy applications. Derwing et al. [2008] established that speech correction feedback must arrive within approximately 2 seconds of the utterance to maintain the association between motor production and error signal in the learner's phonological working memory. The proposed system's worst-case latency of 1,720 ms under heavy noise conditions remains comfortably within this threshold. The dominant latency component in all conditions is the round-trip network time for Google Web Speech API calls (typically 800–1,200 ms), with local processing (Pydub preprocessing + difflib + LanguageTool) contributing only 150–300 ms. This architecture implies that the primary path to further latency reduction is either caching of frequent transcription requests or migration to a locally-hosted ASR model such as Whisper, which eliminates the network round-trip entirely at the cost of higher local compute requirements.
- Comparative System Evaluation
Table 12. Comprehensive System Comparison with Existing Speech Correction Tools
Feature / Metric | ELSA Speak | Pronunciation Coach | Speechling | Proposed System | Advantage
Real-time feedback | Yes | No | No (async) | Yes | Only 2 of 4 tools
Word-level error labels | Partial (2 types) | No | No | Yes (4 categories) | Unique feature
Grammar checking | No | No | No | Yes (92.6% precision) | Unique feature
Performance score | Yes (opaque) | Partial | No | Yes (0–100, 5 bands, transparent formula) | Highest transparency
Audio format support | MP3/WAV | WAV only | MP3/WAV | 12+ formats via Pydub | Broadest support
Offline operation | No | Partial | No | Planned (Whisper) | Planned parity
Open source | No | No | No | Yes (MIT license) | Unique feature
Self-hostable | No | No | No | Yes (Docker) | Unique feature
REST API | Paid | No | No | Yes (FastAPI, OpenAPI 3.0) | Unique feature
Pronunciation accuracy | ~88% (clean) | ~82% (clean) | ~79% (clean) | ~95.8% (clean) | Highest accuracy
Grammar precision | N/A | N/A | N/A | 92.6% | Unique capability
Monthly cost (individual) | ~₹1,500–2,500 | ~₹800–1,200 | ~US$10–20 | Free / self-hosted | Lowest cost
- SECURITY, PRIVACY, AND ETHICAL CONSIDERATIONS
- Data Privacy Architecture
The handling of audio data from speech therapy patients raises significant privacy considerations, particularly for paediatric patients and individuals with neurological conditions whose speech patterns may constitute sensitive health information. The proposed system implements a privacy-by-design architecture with three core principles: (1) minimal retention: temporary audio files are deleted within 30 seconds of request completion using Python's tempfile module with explicit cleanup; (2) local-first processing: all NLP analysis (pronunciation comparison, grammar detection, scoring) executes locally within the application server, with only the audio transcription request sent to the external Google Web Speech API; and (3) configurable audit logging: session score records can be stored locally in an encrypted SQLite database (AES-256) or disabled entirely for zero-retention deployments.
- Compliance Framework
For deployment in Indian clinical settings, the system is designed for compatibility with India's Digital Personal Data Protection Act (DPDP Act, 2023), which requires informed consent for personal data collection, data minimisation, and the right to erasure. The /feedback and /history endpoints, which involve optional persistent data storage, present a consent prompt on first use and provide a /history/{user_id}/delete endpoint for complete data erasure. For international deployments requiring HIPAA compliance (US), GDPR compliance (EU), or PIPEDA compliance (Canada), the system's self-hostable architecture enables deployment within compliant infrastructure without transmitting protected health information to third-party cloud services.
- Ethical Use and Clinical Boundary
It is critically important to position the proposed system correctly within the spectrum of speech therapy delivery. The system is designed and validated as a supplementary practice tool, an intelligent homework assistant between formal therapy sessions, not as a replacement for qualified clinical assessment and treatment planning by licensed SLPs. The system's feedback is intentionally framed as practice guidance rather than clinical diagnosis, and the user interface includes a prominent disclaimer directing users with suspected speech disorders to seek professional SLP evaluation. The performance score is presented as a relative improvement metric, not a diagnostic label. Users scoring in the POOR band (0–39) receive a recommendation to consult a speech-language pathologist, acknowledging the boundary between therapeutic AI tools and professional clinical care.
- CONCLUSION AND FUTURE WORK
- Conclusion
This paper presents the Adaptive AI Speech Therapy and Real-Time Correction Engine, a comprehensive, open-source, web-accessible platform that integrates Automatic Speech Recognition, NLP-driven pronunciation comparison, rule-based grammar error detection, and intelligent performance scoring into a unified real-time assessment pipeline. The system addresses the critical gap between the demand for accessible, effective speech therapy support and the supply of qualified clinicians in resource-constrained settings, offering a scalable, affordable, and clinically meaningful complement to traditional therapy.
Experimental evaluation on 50 diverse audio samples demonstrates that the system achieves near-human ASR accuracy (WER 4.2% on clean audio), precise grammar detection (92.6% precision, 85.1% recall), word-level pronunciation classification with F1-scores of 0.764–0.954 across speaker profiles, and sub-1.72-second end-to-end response latency across all tested conditions. The system's modular microservice architecture enables independent component upgrades, while its open-source, self-hostable design removes both economic and privacy barriers to deployment in clinical and educational institutions.
Comparative evaluation against three leading commercial tools (ELSA Speak, Pronunciation Coach, Speechling) confirms that the proposed system is the only solution to simultaneously offer real-time response, four-category word-level error classification, integrated grammar checking, a quantitative composite performance score, unrestricted audio format support, and a documented REST API, establishing a new performance baseline for AI-assisted speech therapy platforms.
- Key Achievements
- Unified ASR + pronunciation comparison + grammar detection + performance scoring pipeline delivering results in under 1.72 seconds.
- Novel four-category pronunciation error taxonomy (correct / missing / extra / mispronounced) derived from SequenceMatcher opcodes, with clinical interpretations for each category.
- Grammar detection achieving 92.6% precision and 85.1% recall across seven grammatical error categories, with accent-aware en-IN LanguageTool configuration.
- Weighted dual-dimension scoring formula calibrated against expert SLP ratings via ridge regression (R² = 0.82 with SLP holistic scores).
- Format-agnostic audio preprocessing pipeline supporting 12+ audio formats, eliminating compatibility barriers for mobile device users.
- Privacy-by-design architecture with minimal audio retention, configurable audit logging, and DPDP/HIPAA-compatible deployment configurations.
- Comprehensive REST API with OpenAPI 3.0 documentation enabling third-party clinical and educational system integration.
- Future Enhancement Roadmap
- Near-Term Enhancements (6–12 months)
- Offline ASR with OpenAI Whisper: Replace the Google Web Speech API with Whisper Large-v3 (WER 2.7% on LibriSpeech clean) to eliminate external API dependency, reduce latency to under 800 ms on GPU hardware, and enable fully offline deployment in bandwidth-limited clinical settings with enhanced HIPAA/DPDP compliance.
- Phoneme-Level Pronunciation Analysis: Integrate a G2P (grapheme-to-phoneme) converter (e.g., CMU Pronouncing Dictionary or g2p-en library) with phoneme-level ASR alignment to provide articulation-specific feedback identifying which phonemes within a mispronounced word require correction, targeting conditions such as dysarthria, apraxia, and functional articulation disorders.
- Fluency Metrics: Add pause frequency analysis, speaking rate estimation, and repair detection (false starts, repetitions, revisions) to the performance scoring module, enabling assessment of fluency alongside pronunciation and grammar.
- Medium-Term Enhancements (12–24 months)
- Multi-Language Support: Extend the system to regional Indian languages (Tamil, Telugu, Hindi, Kannada, Bengali) using language-specific ASR models and LanguageTool grammar engines, making the system accessible to the estimated 800 million speakers of these languages.
- Adaptive Exercise Engine: Implement a reinforcement learning (RL) based exercise selection algorithm that dynamically chooses practice sentences targeting the users historically weakest phoneme clusters and most frequent grammatical error types, personalizing the therapy programme for each individual through session-over-session performance data.
- Mobile Application: Develop a cross-platform mobile application (React Native or Flutter) integrating real-time microphone recording, push notifications for practice reminders, and local score caching for offline progress review.
- Neural Grammar Enhancement: Supplement LanguageTool's rule-based detection with a GECToR-based [Omelianchuk et al., 2020] neural error correction layer for complex contextual errors (preposition selection, article usage in context) where rule-based methods show lower recall.
- Long-Term Clinical Translation (24–48 months)
- EHR Integration via HL7 FHIR: Integrate with Electronic Health Record systems using the HL7 FHIR R4 standard to enable seamless transfer of session scores, error patterns, and progress trends to clinical records, supporting evidence-based therapy planning and insurance reimbursement documentation.
- Randomized Controlled Trial: Conduct a prospective RCT comparing therapy outcomes for patients using the proposed system as a supplementary practice tool against a control group using unguided self-practice, following the protocol of Grogan-Johnson et al. [2010] to generate evidence for clinical validation and potential CE/FDA regulatory pathways for the system as a Class I medical device software.
- Explainable AI Visualizations: Implement attention-weight visualizations over acoustic spectrograms highlighting which temporal-frequency regions contributed to each pronunciation error classification, providing both clinical interpretability for SLPs and educational insight for patients.
- Speaker Diarization and Group Therapy: Add a speaker diarization module to handle multi-speaker recordings, supporting group therapy sessions, classroom-based pronunciation assessment for language teachers, and dialogue-based practice scenarios requiring naturalistic conversational interaction.
REFERENCES
- Eskenazi, M. (1999). Using automatic speech recognition in foreign language teaching. CALICO Journal, 16(2), 45–68.
- Witt, S. M., & Young, S. J. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2–3), 95–108.
- Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. Proceedings of ICASSP 2013, IEEE, pp. 6645–6649.
- Ng, H. T., Wu, S. M., Briscoe, T., Hadiwinoto, C., Susanto, R. H., & Bryant, C. (2014). The CoNLL-2014 shared task on grammatical error correction. Proceedings of CoNLL 2014, pp. 1–14.
- Grogan-Johnson, S., Alvares, R., Rowan, L., & Creaghead, N. (2010). A pilot study comparing the effectiveness of speech language therapy provided by telemedicine with conventional on-site therapy. Journal of Telemedicine and Telecare, 16(3), 134–139.
- Xi, X., Higgins, D., Zechner, K., & Williamson, D. M. (2008). Automated scoring of spontaneous speech using SpeechRater. Proceedings of Interspeech 2008, pp. 2082–2085.
- Ratcliff, J., & Metzener, D. (1988). Pattern matching: the Gestalt approach. Dr. Dobb's Journal, 13(7), 46–51.
- Vandewaetere, M., Desmet, P., & Clarebout, G. (2011). The contribution of learner characteristics in the development of computer-based adaptive learning environments. Computers in Human Behavior, 27(1), 118–130.
- Newman, S. (2015). Building Microservices: Designing Fine-Grained Systems. O'Reilly Media, Sebastopol, CA.
- Jurafsky, D., & Martin, J. H. (2020). Speech and Language Processing (3rd ed. draft). Stanford University Press.
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
- Neri, A., Cucchiarini, C., & Strik, H. (2002). Feedback in computer assisted pronunciation training: When technology meets pedagogy. Proceedings of CALL Conference, pp. 179–188.
- Franco, H., Abrash, V., Precoda, K., Bratt, H., Rao, R., Butzberger, J., & Cesari, R. (2010). The SRI EduSpeak™ system: Recognition and pronunciation scoring for language learning. Proceedings of InSTIL/ICALL Symposium, pp. 123–128.
- Omelianchuk, K., Atrasevych, V., Chernodub, A., & Skurzhanskyi, O. (2020). GECToR – Grammatical Error Correction: Tag, Not Rewrite. Proceedings of the 15th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 163–170.
- Derwing, T. M., & Munro, M. J. (2015). Pronunciation Fundamentals: Evidence-Based Perspectives for L2 Teaching and Research. John Benjamins Publishing.
- Wales, D., Skinner, L., & Hayman, M. (2017). The efficacy of telehealth-delivered speech and language intervention for primary school-age children: A systematic review. International Journal of Telerehabilitation, 9(1), 55–70.
- Molini-Avejonas, D. R., Rondon-Melo, S., de La Higuera Amato, C. A., & Samelli, A. G. (2015). A systematic review of the use of telehealth in speech, language and hearing sciences. Journal of Telemedicine and Telecare, 21(7), 367–376.
- VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221.
- Tetreault, J., & Chodorow, M. (2008). The ups and downs of preposition error detection in ESL writing. Proceedings of COLING 2008, pp. 865–872.
- Gulati, A., Qin, J., Chiu, C. C., Parmar, N., Zhang, Y., Yu, J., … & Pang, R. (2020). Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
© 2026 Dr. J. Nafeesa Begum, Brightlin Blessy S, Harini M & Kaviya P. Published under Creative Commons Attribution 4.0 International (CC BY 4.0) License.
Manuscript received: March 2026 | Revised: March 2026 | Accepted: March 2026
