Echo Learn: An AAC-Based Assistive App for Autism and Down Syndrome

DOI : 10.17577/IJERTCONV14IS060083

1st Rakshith P, Department of Computer Science and Engineering, JSS Science and Technology University, Mysuru, India. Email: rakshith@jssstuniv.in

2nd Deepika M, Department of Computer Science and Engineering, JSS Science and Technology University, Mysuru, India. Email: murthydeepika611@gmail.com

3rd Suhas M, Department of Computer Science and Engineering, JSS Science and Technology University, Mysuru, India. Email: suhasmanishshetty@gmail.com

4th Nagaveni L S, Department of Computer Science and Engineering, JSS Science and Technology University, Mysuru, India. Email: nagavenils59@gmail.com

5th Sanjana M C, Department of Computer Science and Engineering, JSS Science and Technology University, Mysuru, India. Email: sanjanamcOI@gmail.com

Abstract: Some children find it difficult to speak clearly and communicate their needs. They may use AAC tools to form sentences using symbols, but these tools do not help them improve how they pronounce words. Because of this, children may continue making the same mistakes in speech without understanding how to correct them. This project presents Echo Learn, a web-based system designed to support both communication and basic speech practice. The application allows children to select symbols to form sentences and also try speaking simple words or sentences. The spoken input is checked using a speech analysis service, which identifies where mistakes occur in pronunciation. To help the child improve, the system provides simple visual hints that show how to move the lips and tongue while speaking a particular sound. The system also saves results from each attempt and shows progress in an easy format. Overall, the proposed system provides a simple and practical way to support speech improvement along with communication in a single platform.

Keywords: AAC, Speech Practice, Pronunciation Error, Assistive System, Speech Support, Web-Based Tool, Learning Aid.

  1. INTRODUCTION

    Clear speech communication plays a very important role in everyday human interaction and social development. For children with neurodevelopmental conditions, articulation disorders, or other speech impairments, developing functional communication skills is often a long-term and ongoing challenge. These children frequently depend on Augmentative and Alternative Communication (AAC) systems to express basic needs and to interact with their environment.

    Although AAC platforms have been helpful in supporting expressive communication, they mainly focus on enabling output through symbol selection and text-to-speech mechanisms and do not fully address the underlying pronunciation difficulties that many users experience. Traditional speech therapy usually requires regular sessions with a therapist, fixed schedules, and subjective evaluation of progress. These methods are time-consuming, difficult to extend into home settings, and give parents limited objective data about how their child is actually improving.

    A major gap in current assistive technology is the absence of visual articulation guidance. Children with speech difficulties cannot easily correct sounds they cannot see or understand physically. Simply knowing that a sound is wrong is not enough; they also need to know where the tongue should go, how the lips should be shaped, and how the breath should flow. This visual aspect of articulation learning is almost entirely missing from existing AAC and pronunciation-training platforms.

    Recent advances in cloud-based speech recognition have made it possible to perform phoneme-level pronunciation assessment within standard web architectures, allowing real-time detection of specific articulation errors without the need for specialized hardware. This paper presents Echo Learn, a web-based assistive application that combines AAC communication support with an integrated pronunciation learning module. The system offers symbol-based sentence construction, phoneme-level error detection using the Microsoft Azure Pronunciation Assessment API, and a dedicated articulation visualization system that shows lip configuration, tongue position, and airflow direction for each target phoneme. Session-based progress tracking and graphical analytics provide clear, objective visibility into pronunciation development over time.

    The main contributions of this work are:

    1. a unified platform that integrates AAC communication and structured pronunciation practice;

    2. real-time phoneme-level error detection and single-error-first corrective feedback;

    3. an articulation visualization system that demonstrates lip shape, tongue position, and airflow guidance for each phoneme; and

    4. session-based analytics that enable objective monitoring of pronunciation improvement trends.

  2. MOTIVATION AND PROBLEM DEFINITION

    Children with speech and communication difficulties form a diverse group with complex and varied needs. While AAC platforms successfully meet immediate communication demands, they do not actively support the deeper goal of improving a child's ability to produce sounds accurately. This gap between communication assistance and pronunciation development represents a significant unmet need in current assistive technology. Between therapy sessions, children and caregivers often lack practical tools that can provide clear, real-time feedback on pronunciation attempts. Without this kind of ongoing support, incorrect articulation patterns can persist for long periods without correction. A particularly important limitation is the absence of visual articulation guidance in existing systems. Pronunciation correction is not just about noticing that a sound is wrong; it is about understanding the exact physical mechanism of correct sound production. Lip position, tongue placement, and airflow direction form the physical basis of articulation, yet no current AAC platform offers this kind of visual feedback.

    The key problems addressed in this work are:

    • existing AAC systems provide no structured pronunciation practice or phoneme-level correction;

    • there is no visual articulation guidance in current AAC platforms that shows lip shape, tongue position, and airflow for specific sounds;

    • children do not have access to objective, real-time feedback on specific articulation errors between therapy sessions;

    • caregivers lack data-driven insight into pronunciation progress or recurring weak sound patterns over time; and

    • communication support and pronunciation development remain separate functions with no integrated platform that addresses both together.

    Echo Learn addresses these limitations by providing a single, accessible platform where a child can communicate using AAC symbols and at the same time receive structured, phoneme-specific pronunciation guidance with visual articulation support.

  3. LITERATURE REVIEW

    Augmentative and Alternative Communication systems have been extensively studied and deployed to support individuals with speech and communication impairments [8]. Established platforms such as Tobii Dynavox and Proloquo2Go enable expressive communication through image-based symbol selection and text-to-speech output [9]. These systems significantly improve communication ability but do not incorporate structured pronunciation assessment or articulation improvement mechanisms [10].

    Research in automatic pronunciation evaluation has progressed considerably [3]. Early approaches relied on rule-based phoneme matching and word-level scoring, lacking the granularity required for precise identification of individual articulation errors [3]. Subsequent work introduced phoneme-level evaluation using hidden Markov models and acoustic feature analysis, enabling more detailed mispronunciation detection [5]. These advances improved feedback specificity but remained largely confined to research environments [5].

    The emergence of deep learning-based speech recognition has significantly enhanced pronunciation assessment capabilities [6]. Cloud-based services now expose phoneme-level accuracy scoring, error classification, and prosody evaluation through accessible APIs [1]. Microsoft Azure Pronunciation Assessment provides per-phoneme accuracy scores, error type classifications including omission, insertion, and mispronunciation, and word-level metadata through a standard API interface [1].

    Children's speech presents additional complexity due to higher fundamental frequency, variable articulation patterns, and developing phonological systems that differ substantially from adult speech [2]. Pronunciation assessment accuracy for children remains an active area of development with limited practical deployment in assistive technology contexts [11].

    Visual articulation guidance has been demonstrated as an effective complement to auditory feedback in pronunciation learning [10]. Research indicates that combining visual mouth position demonstrations with auditory correction improves articulation outcomes, particularly for learners who struggle to internalize sound production from audio alone [10]. Despite this evidence, integration of lip shape, tongue position, and airflow visualization within AAC platforms remains almost entirely absent from existing implementations [7].

    A clear research gap exists in combining AAC communication support, phoneme-level pronunciation assessment, visual articulation guidance showing lip configuration, tongue position and airflow, and session-based analytics within a unified web-based platform. Echo Learn addresses this gap directly.

    A comparative analysis of existing AAC systems and Echo Learn is presented in Table I.

    TABLE I

    COMPARISON OF ECHO LEARN WITH EXISTING AAC SYSTEMS

    | Feature                     | Tobii Dynavox | Proloquo2Go | Traditional AAC | Echo Learn |
    |-----------------------------|---------------|-------------|-----------------|------------|
    | Communication Support       | Yes           | Yes         | Yes             | Yes        |
    | Phoneme-Level Feedback      | No            | No          | No              | Yes        |
    | Articulation Visualization  | No            | No          | No              | Yes        |
    | Tongue and Airflow Guidance | No            | No          | No              | Yes        |
    | Progress Analytics          | No            | No          | No              | Yes        |
    | Web-Based Architecture      | No            | No          | No              | Yes        |

  4. PROPOSED SYSTEM OVERVIEW

    Echo Learn is designed as a unified web-based assistive application that combines two core functional modules, an AAC communication module and a pronunciation practice module, within a single child-facing interface accessed through a dedicated child login.

    1. AAC Communication Module

      The AAC module provides a symbol-based communication board organized into categories such as needs, emotions, actions, objects, and social expressions. Each tile, when selected, speaks the corresponding word aloud using the Web Speech API and adds it to a sentence construction bar. The child creates sentences by adding words one at a time and then triggers the full sentence to be spoken using a dedicated speak button. Some commonly used response tiles work as direct-speak buttons that skip the category navigation completely. The board also allows tiles to be added or removed, so that the layout can be adjusted to match each child's specific communication needs.

    2. Pronunciation Practice Module

      The pronunciation practice module provides structured speech practice through word- and sentence-based exercises that are grouped by difficulty level. The child attempts to reproduce a target word or sentence by speaking into the microphone. The Microsoft Azure Pronunciation Assessment API performs phoneme-level analysis, and the system identifies the most problematic phoneme in that attempt.

      A central contribution of Echo Learn is its articulation visualization system. When a phoneme error is detected, the system displays a focused visual showing the correct lip configuration, tongue position, and airflow direction for that specific sound. This helps children understand not only that a sound is incorrect but also exactly how to produce it physically: where to place the tongue, how to shape the lips, and how to direct the breath. The visual is shown alongside a simple, child-friendly correction hint in encouraging language. The child retries the full word or sentence, with up to five attempts allowed before the system advances with a positive message and no negative reinforcement.

    3. Design Principles

    Echo Learn is built around three guiding principles: integration (AAC and pronunciation practice in one platform), specificity (single-phoneme correction per attempt with precise visual articulation guidance), and accessibility (child-appropriate language, visual feedback, and simple interaction throughout).

  5. SYSTEM ARCHITECTURE

    Echo Learn follows a modular client-server architecture built on the MERN stack, comprising React.js on the frontend, Node.js and Express.js on the backend, and MongoDB as the database, integrated with the Microsoft Azure Pronunciation Assessment API. The overall system architecture of Echo Learn is illustrated in Fig. 1.

    Fig. 1: System Architecture Workflow

    1. Authentication Module

      The authentication module manages child login verification. Credentials are checked on the Express.js backend by comparing them with the data stored in MongoDB, which connects each child's practice history and progress records to their individual profile across different sessions.

    2. AAC Interaction Module

      The AAC module handles the symbol-based communication board, tile selection, sentence buffer construction, and speech synthesis via the Web Speech API using a locally defined pronunciation mapping table that corrects Indian-English TTS inconsistencies.

    3. Pronunciation Assessment Module

      The pronunciation assessment module captures microphone input via the MediaRecorder API and transmits the audio to the Node.js backend as a binary blob along with the reference text. The backend forwards both to the Microsoft Azure Pronunciation Assessment API, configured at phoneme-level granularity. The returned JSON response is parsed to identify the phoneme with the lowest accuracy score, which is mapped to the articulation visualization system to return lip configuration, tongue position, airflow guidance, and a correction hint for display. The detailed pronunciation assessment workflow is shown in Fig. 2.

      Fig. 2: Pronunciation Assessment Workflow

    4. Articulation Visualization System

      A structured phoneme-to-articulation mapping system associates each Azure-returned phoneme identifier with a dedicated visual that demonstrates the correct lip shape, tongue position, and airflow direction for that sound. This system constitutes a core technical contribution of Echo Learn, providing children with precise physical guidance for sound correction that text hints alone cannot convey.

    5. Analytics Module

    The analytics module processes stored attempt history from MongoDB to compute average phoneme accuracy across sessions and per-item score progression over sequential attempts. Results are rendered as a phoneme accuracy bar chart and a progress line chart using the Recharts library. The core system components and technologies used in Echo Learn are summarized in Table II.

    TABLE II

    ECHO LEARN SYSTEM COMPONENTS

    | Component                  | Technology                     | Description                               |
    |----------------------------|--------------------------------|-------------------------------------------|
    | Frontend                   | React.js                       | AAC board, practice UI, analytics         |
    | Backend                    | Node.js, Express.js            | REST API, speech processing               |
    | Database                   | MongoDB                        | Profiles, attempt history                 |
    | Speech Assessment          | Azure Pronunciation Assessment | Phoneme scoring and error detection       |
    | Articulation Visualization | Local mapping system           | Lip, tongue, airflow guidance per phoneme |
    | Speech Synthesis           | Web Speech API                 | Reference pronunciation playback          |
    | Analytics                  | Recharts                       | Bar chart and line chart rendering        |

  6. METHODOLOGY

    The Echo Learn system operates through two integrated workflows, AAC communication and pronunciation practice, supported by a session-based analytics layer.

    1. AAC Communication Workflow

      Communication tiles are organized into predefined categories and rendered as a symbol-based grid. Each tile selection triggers speech synthesis via the Web Speech API using a locally defined pronunciation mapping table constructed to produce accurate phonetic output under Indian-English TTS conditions. Selected words are appended sequentially to a sentence buffer. Full-sentence playback is initiated through a dedicated speak control. High-frequency response tiles are configured as direct-speak elements that bypass category navigation to minimize interaction steps. The AAC communication process flow is illustrated in Fig. 3.
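      The tile-to-speech flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the names `ttsOverrides`, `toSpokenText`, and `SentenceBar` are hypothetical, the single override entry is invented for illustration, and speech output is injected as a callback so the logic stays independent of the browser.

```typescript
// Hypothetical override table: tile label -> respelling passed to the TTS engine.
// The entry is illustrative; the paper does not list its actual mappings.
const ttsOverrides: Record<string, string> = {
  ok: "okay",
};

// Text to hand to speech synthesis for a given tile label.
function toSpokenText(label: string): string {
  const key = label.toLowerCase();
  const override = Object.prototype.hasOwnProperty.call(ttsOverrides, key)
    ? ttsOverrides[key]
    : undefined;
  return override ?? label;
}

type SpeakFn = (text: string) => void;

class SentenceBar {
  private words: string[] = [];
  private speakFn: SpeakFn;

  constructor(speakFn: SpeakFn) {
    this.speakFn = speakFn;
  }

  // Category tile: speak the word immediately and append it to the buffer.
  selectTile(label: string): void {
    this.speakFn(toSpokenText(label));
    this.words.push(label);
  }

  // Direct-speak tile: speak at once without touching the buffer.
  directSpeak(label: string): void {
    this.speakFn(toSpokenText(label));
  }

  // Speak control: play back the whole sentence and return what was spoken.
  speakSentence(): string {
    const sentence = this.words.map(toSpokenText).join(" ");
    if (sentence.length > 0) {
      this.speakFn(sentence);
    }
    return sentence;
  }

  // Text shown in the sentence construction bar.
  text(): string {
    return this.words.join(" ");
  }

  clear(): void {
    this.words = [];
  }
}
```

      In a browser, the callback could wrap the Web Speech API, for example `(t) => window.speechSynthesis.speak(new SpeechSynthesisUtterance(t))`. Keeping the buffer separate from the speech call also makes the direct-speak path trivial, since it simply skips the buffer.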

      Fig. 3: AAC Communication Workflow

    2. Dataset Structure

      The word dataset is categorized by communication domain and difficulty level, determined by syllable count and phoneme complexity. The sentence dataset consists of a fixed set of essential daily-life communication phrases selected for practical relevance to the target user group, organized by category into needs, health, feelings, and social expressions.

    3. Pronunciation Assessment

      Audio is captured via the MediaRecorder API and transmitted to the Node.js backend as a binary blob along with the reference text. The backend forwards both to the Microsoft Azure Pronunciation Assessment API, configured at phoneme-level granularity with miscue detection enabled. The API returns per-phoneme accuracy scores, along with error-type classifications (mispronunciation, omission, insertion) and word-level metadata.
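      As a rough sketch of the configuration step, the assessment parameters might be assembled as below. The field names follow the parameter JSON that Azure's speech-to-text REST interface is documented to accept for pronunciation assessment, as best understood here; the exact value encoding (for instance, how `EnableMiscue` is expressed) and the choice of `Dimension` are assumptions that should be checked against the current Azure documentation. The function name `buildAssessmentParams` is hypothetical.

```typescript
// Assumed shape of the pronunciation assessment parameters.
// Field names mirror Azure's documented REST parameters; verify before use.
interface AssessmentParams {
  ReferenceText: string;
  GradingSystem: "HundredMark" | "FivePoint";
  Granularity: "Phoneme" | "Word" | "FullText";
  Dimension: "Basic" | "Comprehensive";
  EnableMiscue: boolean; // assumption: some interfaces may expect "True"/"False" strings
}

// Build the parameters for one attempt: phoneme-level granularity with
// miscue detection enabled, as the methodology describes.
function buildAssessmentParams(referenceText: string): AssessmentParams {
  const text = referenceText.trim();
  if (text.length === 0) {
    throw new Error("Reference text must not be empty");
  }
  return {
    ReferenceText: text,
    GradingSystem: "HundredMark",
    Granularity: "Phoneme",
    Dimension: "Comprehensive", // assumption: the paper does not state this setting
    EnableMiscue: true,
  };
}

// The backend would serialize this object and attach it to the request
// together with the recorded audio.
const exampleParamsJson = JSON.stringify(buildAssessmentParams("I need help"));
```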

    4. Phoneme Error Identification and Articulation Guidance

      The system reads the returned response and picks out the phoneme that has the lowest accuracy score in that attempt. This phoneme is treated as the sole correction target for the current feedback cycle. The identified phoneme is matched with the articulation visualization system, which then returns a visual that shows the correct lip shape, tongue position, and airflow direction for that particular sound. This visual is shown together with a simple, child-friendly correction hint. By giving only one correction per attempt, along with clear physical articulation guidance, the system keeps the child's mental load low and focuses directly on the specific way the sound is being produced, instead of offering general or vague feedback.
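      The single-target selection can be illustrated with a short sketch. The types below are a simplified, assumed view of one assessed attempt; the actual Azure response nests these fields more deeply and carries extra metadata. The name `findWeakestPhoneme` is hypothetical.

```typescript
// Simplified, assumed view of the assessment result for one attempt.
interface PhonemeScore {
  phoneme: string;
  accuracy: number;
}

interface WordScore {
  word: string;
  phonemes: PhonemeScore[];
}

interface CorrectionTarget {
  word: string;
  phoneme: string;
  accuracy: number;
}

// Scan every phoneme of every word and keep the one with the lowest accuracy.
// On a tie, the phoneme that occurs first in the utterance is kept.
function findWeakestPhoneme(words: WordScore[]): CorrectionTarget | null {
  let weakest: CorrectionTarget | null = null;
  for (const w of words) {
    for (const p of w.phonemes) {
      if (weakest === null || p.accuracy < weakest.accuracy) {
        weakest = { word: w.word, phoneme: p.phoneme, accuracy: p.accuracy };
      }
    }
  }
  return weakest; // null when the response contains no phonemes
}
```

      Returning a single target rather than a ranked list mirrors the one-correction-per-attempt design: the feedback layer never has more than one sound to present.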

    5. Retry Evaluation

      Each attempt is evaluated with a new call to the Azure API, independent of earlier attempts. This way, the feedback is based only on the child's current speech and does not carry forward any earlier errors or history. A maximum of five attempts is permitted per practice item. If all attempts are exhausted, the system advances automatically with an encouraging message and no negative reinforcement.
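      The retry policy amounts to a small decision rule, sketched below. The five-attempt cap comes from the text; the pass threshold `PASS_SCORE` is a hypothetical value introduced only for illustration, since the paper does not state what score counts as success.

```typescript
const MAX_ATTEMPTS = 5; // stated in the paper
const PASS_SCORE = 80;  // hypothetical: the paper gives no pass threshold

type NextStep = "advance-success" | "retry" | "advance-encourage";

// Decide what happens after one scored attempt. Each decision uses only the
// current attempt's score, matching the independent-evaluation design.
function decideNextStep(
  attemptNumber: number,
  score: number,
  passScore: number = PASS_SCORE
): NextStep {
  if (score >= passScore) {
    return "advance-success";
  }
  if (attemptNumber >= MAX_ATTEMPTS) {
    // Attempts exhausted: move on with an encouraging message, no penalty.
    return "advance-encourage";
  }
  return "retry";
}
```

      With this illustrative threshold, an attempt scoring 70.6 on the first try yields `"retry"`, while a fifth attempt below the threshold yields `"advance-encourage"` rather than a failure state.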

    6. Progress Tracking and Analytics

    Each attempt is persisted in MongoDB, including the child identifier, target item, attempt number, overall accuracy score, per-phoneme scores, error classifications, articulation guidance delivered, and a timestamp. The average phoneme accuracy across sessions is computed as:

    \[
    \mathrm{AvgPhonemeScore}_p = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Score}_{p,i}
    \]

    where Score_{p,i} is the accuracy score of phoneme p in attempt i, and N is the total evaluated attempts for that phoneme. The system generates two visualizations, a phoneme-accuracy bar chart and a progress line chart, rendered using the Recharts library within the React frontend.
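    The averaging step can be sketched as a straightforward group-and-divide over stored attempts. This is an illustrative sketch; the type `AttemptPhonemes` and the function `averagePhonemeScores` are hypothetical names, and each phoneme's N counts only the attempts in which that phoneme was actually scored.

```typescript
interface AttemptPhonemes {
  phonemes: { phoneme: string; accuracy: number }[];
}

// Compute the mean accuracy per phoneme across all attempts:
// sum of Score_{p,i} over the attempts that scored phoneme p, divided by N.
function averagePhonemeScores(attempts: AttemptPhonemes[]): Map<string, number> {
  const totals = new Map<string, { sum: number; n: number }>();
  for (const attempt of attempts) {
    for (const p of attempt.phonemes) {
      const entry = totals.get(p.phoneme) ?? { sum: 0, n: 0 };
      entry.sum += p.accuracy;
      entry.n += 1;
      totals.set(p.phoneme, entry);
    }
  }
  const averages = new Map<string, number>();
  totals.forEach((entry, phoneme) => {
    averages.set(phoneme, entry.sum / entry.n);
  });
  return averages;
}
```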

  7. IMPLEMENTATION DETAILS

    The system is built as a web-based client-server application using the MERN stack. The overall design is kept modular so that the code stays organized, easy to understand, and simple to extend when new features are added. The frontend is developed with React.js, the backend runs on Node.js with Express.js, and all data is stored in MongoDB using Mongoose as the object-document mapper. The system also connects to the Microsoft Azure Pronunciation Assessment API to evaluate speech and give feedback to the user.

    The frontend handles the user interface, including the AAC communication board, the pronunciation practice controls, the articulation visualization, and the analytics dashboards. React components are grouped around main features, so there is one section for the symbol board, another for building sentences, one for audio recording, one for showing feedback, and one for displaying progress charts. Basic React hooks such as useState, useEffect, and useContext are used to manage the application state across different parts of the interface, which helps keep the screen updated as the child selects tiles, records speech, or checks progress. This makes it possible to match the UI with the child's current activity: for example, updating the sentence bar when tiles are picked, showing feedback right after each pronunciation attempt, and refreshing the analytics charts when the progress page is opened. The interface is kept simple and child-friendly, with large tiles, clear icons, and very little text, so that children can use the system with little or no adult help.

    The backend runs on Node.js with the Express.js framework and provides API endpoints for different tasks. When a child logs in, the backend checks the entered credentials against the records stored in MongoDB and links the session to that child's profile. Separate routes handle AAC-related actions such as loading tiles and saving custom layouts, pronunciation-practice operations such as sending audio for assessment and storing attempt results, and analytics requests such as retrieving score history for a particular phoneme or word. The modular design of these routes helps separate one part of the system from another, so that changes in one module (for example, adding a new visualization or changing how scores are computed) do not affect the rest of the system. The backend also includes basic error handling and simple logging to help spot and fix problems during testing and while preparing the system for deployment.

    Speech pronunciation evaluation is handled using the Microsoft Azure Cognitive Services Pronunciation Assessment API. When the child speaks into the microphone, the browser records the audio using the MediaRecorder API and sends the captured file as a binary blob to the Node.js backend, together with the matching reference text for that word or sentence. The backend forwards both the audio and the reference text to Azure, which returns a JSON response with overall accuracy, per-phoneme scores, error types (such as mispronunciation, omission, or insertion), and word-level information. The backend processes this response to find the phoneme with the lowest accuracy score, which is treated as the main sound that needs correction. It then prepares a clear and simple response that includes the overall pronunciation score, highlights the problematic phoneme, and adds any extra details the frontend needs to display feedback and update the analytics.

    In the React frontend, the articulation visualization system is built as a basic JavaScript mapping object. Each phoneme identifier received from Azure is linked to a specific visual element that shows how the lips and tongue should move and how the air should flow to produce that sound correctly. For example, the "th" sound is linked to a picture where the tongue is placed between the teeth, the lips are slightly open, and the airflow goes through the small gap. When the backend returns the lowest-scoring phoneme, the frontend looks it up in this mapping and retrieves the matching visual and a short, child-friendly hint (such as "bite your tongue" or "lips together, then release"). This visual-feedback pair is shown in the feedback area, giving the child a clear physical idea of how to correct the sound instead of just hearing the right audio.
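    A mapping object of this kind might look like the sketch below. The descriptions for "th" and "b" paraphrase the examples given in this paper; the pairing of each hint with a phoneme, the image paths, the fallback entry, and the names `articulationGuides` and `lookupGuidance` are all assumptions made for illustration.

```typescript
interface ArticulationGuide {
  image: string;       // hypothetical asset path
  description: string; // lip, tongue, and airflow summary
  hint: string;        // short child-friendly cue
}

// Illustrative entries only; the full table used by Echo Learn is not published.
const articulationGuides: Record<string, ArticulationGuide> = {
  th: {
    image: "/visuals/th.png",
    description: "Tongue between the teeth, lips slightly open, air through the small gap",
    hint: "Bite your tongue",
  },
  b: {
    image: "/visuals/b.png",
    description: "Lip closure and release",
    hint: "Lips together, then release",
  },
};

// Hypothetical fallback for phoneme identifiers without a dedicated visual.
const fallbackGuide: ArticulationGuide = {
  image: "/visuals/generic.png",
  description: "Listen to the word again and watch the mouth shape",
  hint: "Let's try that sound once more",
};

// Look up guidance for the phoneme identifier returned by the backend.
function lookupGuidance(phonemeId: string): ArticulationGuide {
  const key = phonemeId.toLowerCase();
  const guide = Object.prototype.hasOwnProperty.call(articulationGuides, key)
    ? articulationGuides[key]
    : undefined;
  return guide ?? fallbackGuide;
}
```

    A fallback entry is one simple way to keep the feedback area populated if the assessment service returns an identifier that has no visual yet.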

    The database layer uses MongoDB, with Mongoose to define schemas and manage data operations. Child profiles are stored as individual documents that include basic details such as the child's name, age, and login information. Practice sessions and pronunciation attempts are stored as nested documents under each child's profile, so that related data stays grouped together. Each attempt record includes the child identifier, the target item (word or sentence), the attempt number, the overall score, an array of per-phoneme scores, error classifications, the articulation guidance shown, and a timestamp. This structure keeps the data rich and detailed, while still simple to query and work with. For instance, the analytics engine can quickly retrieve all pronunciation attempts made by a specific child or for a particular word, then calculate averages, track progress over time, or extract other useful metrics from that dataset.
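    The document shape described above could be modeled along the following lines. This is a plain TypeScript sketch rather than the actual Mongoose schema; every field name is hypothetical and chosen only to mirror the listed contents of an attempt record.

```typescript
interface AttemptRecord {
  childId: string;
  targetItem: string;            // the practiced word or sentence
  itemType: "word" | "sentence";
  attemptNumber: number;
  overallScore: number;
  phonemeScores: { phoneme: string; accuracy: number }[];
  errorTypes: string[];          // e.g. mispronunciation, omission, insertion
  guidanceShown: string;
  timestamp: string;             // ISO 8601, so string order equals time order
}

interface PracticeSession {
  startedAt: string;
  attempts: AttemptRecord[];     // nested under the session
}

interface ChildProfile {
  childId: string;
  name: string;
  age: number;
  sessions: PracticeSession[];   // nested under the profile
}

// Collect every attempt at one item across all sessions, oldest first.
function attemptsForItem(profile: ChildProfile, targetItem: string): AttemptRecord[] {
  const matches: AttemptRecord[] = [];
  for (const session of profile.sessions) {
    for (const attempt of session.attempts) {
      if (attempt.targetItem === targetItem) {
        matches.push(attempt);
      }
    }
  }
  return matches.sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0
  );
}
```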

    The analytics engine processes the stored attempt history and turns it into meaningful summaries. On the backend, the system reads the attempt records from MongoDB, groups the scores by phoneme and by item, and calculates averages across sessions. For each phoneme, it finds the average accuracy across all attempts and marks any phonemes that fall below a set threshold as weak sounds that need extra practice. The engine also watches how the scores for individual words or sentences change over time, which helps identify gradual improvement or ongoing difficulties. The computed results are sent to the frontend as a simple JSON object that can be easily displayed using the Recharts library. The frontend shows the results either as a bar chart of phoneme-level accuracy or as a line chart of progress across attempts, depending on which view the user picks.

    Through this mix of interactive frontend design, backend processing, cloud-based speech assessment, and a flexible database structure, Echo Learn delivers a practical and usable platform for pronunciation practice and communication support. The implementation is modular enough to allow future additions, such as new visualizations, offline models, or support for different languages. At the same time, the current design keeps the interface simple and focused so that children can follow the guidance without getting confused by technical details.
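    One way to shape these summaries for charting is sketched below. Chart libraries such as Recharts take an array of plain objects and read fields by key, so the sketch produces such arrays. The threshold value `WEAK_THRESHOLD` is hypothetical, because the paper mentions a set threshold without stating it, and the function names are illustrative.

```typescript
const WEAK_THRESHOLD = 60; // hypothetical: the actual threshold is not stated

interface PhonemeBar {
  phoneme: string;
  average: number;
  weak: boolean; // true when the average falls below the threshold
}

// Turn per-phoneme averages into bar-chart rows, weakest sounds first.
function toPhonemeBars(
  averages: Map<string, number>,
  threshold: number = WEAK_THRESHOLD
): PhonemeBar[] {
  const bars: PhonemeBar[] = [];
  averages.forEach((average, phoneme) => {
    bars.push({ phoneme, average, weak: average < threshold });
  });
  return bars.sort((a, b) => a.average - b.average);
}

interface ProgressPoint {
  attempt: number;
  score: number;
}

// Turn an ordered list of overall scores into line-chart points.
function toProgressSeries(scores: number[]): ProgressPoint[] {
  return scores.map((score, index) => ({ attempt: index + 1, score }));
}
```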

  8. RESULTS AND PERFORMANCE EVALUATION

    The system was tested through multiple practice sessions under conditions that loosely mimic real user interaction. The focus was on checking whether the platform could correctly record pronunciation scores, identify weak words, compute analytics efficiently, and update the visualizations in a timely way. The user interface and system outputs are shown in Figs. 4-8.

    Fig. 4: Child Login Interface

    Fig. 5: AAC Communication Board

    Fig. 6: Category Tile Selection and Sentence Construction showing "I have pain in the heel"

    Fig. 7: Pronunciation Practice with Articulation Visualization

    Fig. 8: Progress Tracking Dashboard

    1. AAC Module Evaluation

      The AAC communication board worked correctly in all predefined categories. Tile selection, sentence building, and playback operated without noticeable delay. The locally defined pronunciation mapping table successfully corrected several Indian-English TTS quirks, so that even commonly mispronounced words sounded closer to standard speech. Tiles that were set as high-frequency response elements functioned as direct-speak buttons and did not require navigating through category menus, which sped up the communication process for common phrases. Adding and removing tiles also worked reliably, allowing basic customization of the board for different children.

    2. Pronunciation Assessment Results

      Table III summarizes pronunciation assessment results obtained from practice sessions in both word and sentence modes.

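      TABLE III

      PRONUNCIATION ASSESSMENT RESULTS

      | Item                 | Type     | Difficulty | Score (%) | Weak Phonemes  |
      |----------------------|----------|------------|-----------|----------------|
      | food                 | Word     | Easy       | 82.8      |                |
      | help                 | Word     | Easy       | 82.7      |                |
      | bathroom             | Word     | Hard       | 70.6      | b, ae, th      |
      | breakfast            | Word     | Hard       | 63.4      | b, r, eh, k, f |
      | I need help          | Sentence | Medium     | 52.5      | d, h, eh, p    |
      | Please call for help | Sentence | Hard       | 58.1      | p, l, iy, k, aa |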

      Easy-level words generally scored above 80%, which suggests that simpler vocabulary was often produced with relatively clear articulation, especially when visual guidance was given. Hard-level words and multi-word sentences scored lower, reflecting the increased difficulty of longer or more complex utterances. The articulation visualization system delivered the correct lip configuration, tongue position, and airflow direction for each identified weak phoneme, giving children specific physical cues instead of only general feedback.

    3. Per-Phoneme Accuracy Analysis

      The analytics module aggregated phoneme scores across all sessions and produced a phoneme-accuracy bar chart. The chart showed that certain phonemes such as "uw", "ao", "r", and "f" tended to stay above 75% accuracy, indicating that children were usually able to produce them correctly. Other phonemes such as "b", "ae", "l", "aa", and "k" often scored below 40%, marking them as clear targets for focused articulation-guided practice. This breakdown at the phoneme level helps pinpoint exactly which sounds are causing the most trouble, which a simple word-level score would not reveal.

    4. Progress Tracking Evaluation

      Table IV shows how the score changed for the word "bathroom" over five successive attempts.

      TABLE IV

      PROGRESSIVE ATTEMPT SCORES: BATHROOM

      | Attempt | Score (%) | Weak Phoneme | Articulation Guidance Delivered |
      |---------|-----------|--------------|---------------------------------|
      | 1       | 70.6      | b            | Lip closure and release         |
      | 2       | 73.2      | ae           | Wide mouth open position        |
      | 3       | 76.8      | th           | Tongue between teeth airflow    |
      | 4       | 79.1      | th           | Tongue between teeth airflow    |
      | 5       | 82.4      | r            | Tongue curl back position       |

      In each attempt, the system identified the weakest phoneme and showed the corresponding visual articulation guidance. This allowed the child to gradually improve the production of the word as the focus shifted from one sound to another. By the fifth attempt, the overall score had increased from 70.6% to 82.4%, which suggests that repeated, guided practice can lead to measurable improvement.
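      For reference, the gains implied by the reported attempt scores work out as follows. This is descriptive arithmetic on a single practice item, not a statistical estimate.

```latex
\begin{align*}
\Delta_{1 \to 2} &= 73.2 - 70.6 = 2.6, &
\Delta_{2 \to 3} &= 76.8 - 73.2 = 3.6, \\
\Delta_{3 \to 4} &= 79.1 - 76.8 = 2.3, &
\Delta_{4 \to 5} &= 82.4 - 79.1 = 3.3, \\
\Delta_{\text{total}} &= 82.4 - 70.6 = 11.8, &
\bar{\Delta} &= \frac{11.8}{4} = 2.95 .
\end{align*}
```

      The total gain is therefore 11.8 percentage points, or an average of about 2.95 points per retry, with every retry scoring higher than the one before it.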

    5. System Performance

    The system remained stable during testing. Azure API calls returned responses quickly, so feedback appeared almost in real time without obvious lag. AAC interactions, sentence construction, and speech playback were smooth, with no noticeable freezing or delay. The analytics dashboard loaded correctly and updated charts as soon as the user switched to the progress view. Overall, the platform demonstrated that it can reliably record and process pronunciation data, deliver targeted articulation feedback, and show progress in a clear, visual way.

  9. ADVANTAGES AND LIMITATIONS

    1. Advantages

      Echo Learn offers several important improvements over existing AAC platforms. By combining AAC communication and structured pronunciation practice in a single web-based application, it reduces the need for separate tools and keeps everything in one place for the child, parents, and therapists. The use of Microsoft Azure Pronunciation Assessment at the phoneme level allows very specific identification of which sounds are being misarticulated, something that word-level scoring cannot do.

      The articulation visualization system is one of the main strengths of the platform. By showing how the lips move, where the tongue should be placed, and how the air should flow for each phoneme, it helps children understand the physical process of sound production, not just the sound itself. The single-error-first feedback strategy keeps instructions simple and avoids overwhelming the child with too many corrections at once. Each attempt is checked independently, so the feedback always reflects the child's current articulation. The session-based analytics also give caregivers a clear, visual record of how pronunciation is changing over time, which can be useful for planning further practice. Finally, the web-based MERN stack allows the system to run on multiple devices without installing special software, making it easier to use at home or in a classroom.
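The single-error-first strategy mentioned above can be illustrated with a minimal sketch. It assumes each attempt yields a list of per-phoneme accuracy scores. The `PhonemeResult` shape, the `singleErrorFeedback` function, and the `passMark` threshold of 80 are hypothetical; the guidance strings are the ones listed in Table IV.

```typescript
// Hypothetical per-phoneme result for one attempt.
interface PhonemeResult {
  phoneme: string;
  accuracy: number; // 0-100
}

interface Feedback {
  phoneme: string;
  accuracy: number;
  guidance: string | null; // null when no visual guidance is defined
}

// Guidance text copied from Table IV; the lookup table itself is an assumption.
const GUIDANCE: Record<string, string> = {
  b: "Lip closure and release",
  ae: "Wide mouth open position",
  th: "Tongue between teeth airflow",
  r: "Tongue curl back position",
};

// Returns feedback for the single weakest phoneme only, or null when the
// attempt is empty or every phoneme already meets the (assumed) pass mark.
function singleErrorFeedback(results: PhonemeResult[], passMark = 80): Feedback | null {
  if (results.length === 0) return null;
  // On ties the earliest phoneme in the word is kept.
  const weakest = results.reduce((min, r) => (r.accuracy < min.accuracy ? r : min));
  if (weakest.accuracy >= passMark) return null;
  const hasGuidance = Object.prototype.hasOwnProperty.call(GUIDANCE, weakest.phoneme);
  return {
    phoneme: weakest.phoneme,
    accuracy: weakest.accuracy,
    guidance: hasGuidance ? GUIDANCE[weakest.phoneme] : null,
  };
}

// Illustrative scores only, not measured data.
console.log(
  singleErrorFeedback([
    { phoneme: "b", accuracy: 35 },
    { phoneme: "ae", accuracy: 62 },
    { phoneme: "th", accuracy: 90 },
  ])
);
// → { phoneme: 'b', accuracy: 35, guidance: 'Lip closure and release' }
```

Because the function is stateless, each attempt is evaluated on its own, so the feedback always follows the child's most recent production.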

    2. Limitations

    Some limitations come from the system's dependence on the Microsoft Azure cloud service. If the network connection is weak or unstable, there can be noticeable delays in feedback or even occasional failures to process the audio. The Pronunciation Assessment API is mainly trained on adult speech, so its scoring for children may not always match the way a speech therapist would judge the same sounds.

    At present, the system only supports English-language assessment, so it is not suitable for children who primarily use other languages. The articulation visuals are currently based on static images, which give a good idea of how the mouth should look but do not show the full movement over time. Finally, the feedback is fully automated, so it cannot replace the nuanced judgment and emotional support that a human therapist can provide in real sessions. These factors mean that Echo Learn works best as a supplementary tool between therapy visits, not as a full replacement for professional speech support.

  10. FUTURE SCOPE

    The current design of Echo Learn can be extended in several directions. One possibility is to integrate offline speech-recognition models so that at least basic phoneme-level assessment can continue even when the internet connection is poor or absent. This would also reduce the time between speaking and feedback, making the practice feel more immediate.

    The pronunciation models themselves could be fine-tuned using children's speech data, which would likely improve their accuracy for the target user group. The articulation visualization system could be expanded to include simple animations or 3D-style models that show how the tongue moves and how the lips change shape over time, rather than just a single still image. This kind of dynamic feedback would give children a more realistic picture of how sounds are produced.

    Supporting more than one language would broaden the platform's usefulness across different communities. The system could also be made more adaptive by using the stored attempt history to select words and sentences that best match the child's current weak sounds, instead of following a fixed list. Machine-learning-based analytics could try to predict upcoming difficulties and suggest extra practice before errors become strongly established.
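One way the adaptive selection idea could work is sketched below. This is a hypothetical illustration of a possible future extension, not a feature of the current system: the `PracticeWord` shape, the `rankWordsByWeakPhonemes` function, the example word bank with its phoneme transcriptions, and the tie-breaking rule (shorter words first) are all assumptions.

```typescript
// Hypothetical word-bank entry with its phoneme transcription.
interface PracticeWord {
  word: string;
  phonemes: string[];
}

// Counts how many of the child's weak phonemes occur in the word.
// Assumes the weak list contains no duplicates.
function countWeakHits(word: PracticeWord, weak: string[]): number {
  return weak.filter((p) => word.phonemes.indexOf(p) !== -1).length;
}

// Ranks words by how many weak phonemes they exercise (most first);
// among equal counts, shorter words come first to keep practice simple.
function rankWordsByWeakPhonemes(
  bank: PracticeWord[],
  weak: string[],
  limit = 3
): PracticeWord[] {
  return bank
    .map((w) => ({ w, hits: countWeakHits(w, weak) }))
    .filter((x) => x.hits > 0)
    .sort((a, b) => b.hits - a.hits || a.w.phonemes.length - b.w.phonemes.length)
    .slice(0, limit)
    .map((x) => x.w);
}

// Illustrative word bank; the transcriptions are simplified for the example.
const wordBank: PracticeWord[] = [
  { word: "bathroom", phonemes: ["b", "ae", "th", "r", "uw", "m"] },
  { word: "ball", phonemes: ["b", "ao", "l"] },
  { word: "cat", phonemes: ["k", "ae", "t"] },
  { word: "food", phonemes: ["f", "uw", "d"] },
];

// Suppose the stored history marks "b" and "l" as the current weak sounds.
console.log(rankWordsByWeakPhonemes(wordBank, ["b", "l"]).map((w) => w.word));
// → [ 'ball', 'bathroom' ]
```

A real implementation would derive the weak list from the stored attempt history and might weight phonemes by how far below target they score.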

    Adding small gamification elements, such as points, badges, or simple rewards for completing practice sessions, might help keep children motivated during longer or repeated sessions. Finally, deploying the application to the cloud would allow usage across multiple devices and make it easier to share progress data between home, school, and therapy rooms.

  11. CONCLUSION

This paper introduced Echo Learn, a web-based assistive application that combines Augmentative and Alternative Communication (AAC) support with structured pronunciation practice in one platform. The system fills a clear gap in current AAC tools, which often focus only on expression without offering phoneme-level feedback or visual guidance on how the lips, tongue, and airflow should move. Echo Learn gives children a simple symbol-based board to build sentences and express basic needs, supported by clear speech synthesis tuned for Indian-English conditions, while the pronunciation module uses Microsoft Azure Pronunciation Assessment to detect mispronounced phonemes and guide correction.

The articulation visualization system helps children see how each sound should be physically produced, going beyond what they can learn by listening alone. Testing showed that the system can reliably capture speech, compute phoneme-level scores, and provide targeted guidance that leads to measurable improvements over repeated sessions.

REFERENCES

  1. Microsoft Corporation, "Pronunciation Assessment – Azure AI Services," Microsoft Learn, 2024.

  2. S. Dudy, S. Bedrick, M. Asgari, and A. Kain, "Automatic Analysis of Pronunciations for Children with Speech Sound Disorders," Computer Speech & Language, vol. 50, pp. 62-84, Jul. 2018.

  3. S. M. Witt and S. J. Young, "Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning," Speech Communication, vol. 30, no. 2-3, pp. 95-108, Feb. 2000.

  4. Y. Zhang et al., "speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment," in Proc. Interspeech, 2021, pp. 456-460.

  5. D. Povey et al., "The Kaldi Speech Recognition Toolkit," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2011, pp. 1-4.

  6. A. Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision," in Proc. ICML, 2023, pp. 1-15.

  7. American Speech-Language-Hearing Association (ASHA), "Speech Sound Disorders: Articulation and Phonology," ASHA.org, 2024.

  8. D. R. Beukelman and P. Mirenda, Augmentative and Alternative Communication: Supporting Children and Adults with Complex Communication Needs, 4th ed. Baltimore, MD: Paul H. Brookes Publishing, 2013.

  9. J. Light and D. McNaughton, "Communicative Competence for Individuals Who Require Augmentative and Alternative Communication," Augmentative and Alternative Communication, vol. 30, no. 1, pp. 1-18, Mar. 2014.

  10. J. A. Gierut, "Treatment Efficacy: Functional Phonological Disorders in Children," Journal of Speech, Language, and Hearing Research, vol. 41, pp. S85-S100, Feb. 1998.

  11. H. Franco et al., "The SRI TOPIX System: Automatic Pronunciation Assessment for Children," in Proc. Interspeech, 2010, pp. 1234-1237.