DOI: 10.17577/IJERTCONV14IS010071 (Open Access)

- Authors : Thushara, Priyadarshini P
- Paper ID : IJERTCONV14IS010071
- Volume & Issue : Volume 14, Issue 01, Techprints 9.0
- Published (First Online) : 01-03-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
GestureSpeak: Converting Voice Inputs to Hand Gestures for Accessibility
Thushara
Department of Computer Applications, St Joseph Engineering College, Vamanjoor, Mangalore, Karnataka
Priyadarshini P
Assistant Professor, Department of Computer Applications,
St Joseph Engineering College, Vamanjoor, Mangalore, Karnataka
Abstract- Communication challenges continue to affect individuals with hearing impairments, primarily because most people are unfamiliar with sign language. To overcome this issue, we introduce GestureSpeak, a smart assistive technology designed to convert spoken words into matching visual hand gestures. This system processes both live and recorded audio inputs by leveraging Google's Speech-to-Text API in combination with Python's SpeechRecognition toolkit, ensuring reliable and precise voice transcription.
GestureSpeak employs a custom-built gesture dataset to map transcribed text to culturally accurate sign language gestures, which are rendered visually through an intuitive user interface. To facilitate continuous improvement, the platform integrates Lexicon-Based sentiment analysis to evaluate user feedback, enabling adaptive refinements to recognition accuracy and gesture mapping.
Empirical evaluations demonstrate that GestureSpeak significantly improves comprehension for hearing-impaired users across diverse application domains such as education and healthcare. By enhancing inclusivity and communication accessibility, GestureSpeak offers a scalable solution aimed at bridging the communicative divide between hearing and hearing-impaired communities.
-
INTRODUCTION
In a world where connectivity and communication are fundamental to societal engagement, individuals with hearing or speech impairments often encounter notable obstacles, especially in settings dominated by spoken interaction. While sign language is a crucial tool within the deaf community, its lack of widespread comprehension among the general public significantly hinders accessibility and inclusive participation.
To bridge this communication divide, we introduce GestureSpeak, a cutting-edge assistive system designed to transform spoken language into equivalent visual hand gestures. The platform accommodates both real-time voice input and pre-recorded audio by combining advanced speech recognition, gesture translation, and natural language
processing methods, all tailored to improve interaction for individuals with hearing impairments.
What differentiates GestureSpeak from existing solutions is its adaptive design that incorporates real user feedback and supports multiple input modalities. The system harnesses Google's Speech-to-Text API alongside Python's SpeechRecognition library for robust audio transcription, maps recognized text to a custom gesture dataset, and utilizes sentiment analysis through natural language processing to analyze user feedback. This feedback-driven, iterative framework facilitates continuous refinement and personalization of the system.
GestureSpeak is built using a modular and scalable architecture, with React powering the frontend to ensure a responsive experience across web and mobile platforms. By translating spoken input into visual sign language gestures, the system aims to foster inclusive communication, particularly in key areas such as education, healthcare, and public services.
-
LITERATURE REVIEW
A variety of systems have been designed to support communication for individuals with hearing and speech impairments by converting hand gestures into either spoken language or written text. Common strategies employed in these systems include:
Sensor-Based Gloves: Abraham et al. (2018) used flex sensor gloves connected to Arduino, combined with a Backpropagation ANN trained in MATLAB, enabling adaptive gesture recognition transmitted via GSM to mobile devices.
Real-Time Gesture Recognition: Mopidevi et al. (2019) leveraged MediaPipe, TensorFlow, and OpenCV with hand landmark detection to achieve 95.7% accuracy on ten common gestures, optimized for mobile devices.
CNN-Based Gesture Classification: Dixit et al. (2020) implemented a MobileNetV2 CNN on Raspberry Pi to classify gestures from webcam input, supporting both American and Indian Sign Languages with Google Text-to-Speech output.
Low-Cost Assistive Devices: Ambavane et al. (2017) developed a portable glove using flex sensors and Bluetooth to convert analog signals into speech via an Android app, emphasizing affordability.
Deep Learning with Transfer Learning: Thakar et al. (2021) applied CNNs with VGG16 and transfer learning for real-time ASL-to-text conversion, achieving 98.7% accuracy using camera input and Django REST APIs.
Indian Sign Language Recognition: Rao et al. (2020) trained CNNs for ISL character-level gesture recognition, reaching 99% accuracy with standard webcams.
Feature-Based Gesture Recognition: Tripathi et al. (2018) combined SURF and Bag-of-Visual-Words with classifiers like SVM and CNN, yielding 79–92% accuracy.
Neural Machine Translation: Zhu et al. (2021) employed transformer-based NMT models enhanced with transfer and semi-supervised learning to convert spoken language into sign language glosses, improving translation quality for low-resource settings.
Animated Sign Language Avatars: Recent research integrates NLP and 3D animated avatars for ISL, using databases of sign videos and Blender animations for real-time, visual sign language generation on platforms like Raspberry Pi.
Comprehensive Analyses: According to Kumar and Das (2019), machine translation models for sign language face challenges such as grammatical structure and word order differences. Their review emphasized that convolutional neural networks (CNNs), when applied in real-time gesture recognition, offer a highly effective method for translating American Sign Language (ASL) into both text and speech.
-
METHODOLOGY
-
Data Sources
For the development of the GestureSpeak system, two primary categories of data were essential: one comprising audio recordings for speech-to-text conversion, and the other consisting of visual samples for gesture identification and rendering. To build the gesture dataset, we integrated multiple open-access sources, including contributions from Kaggle and academic repositories. The resulting dataset contains visual representations of hand signs corresponding to the English alphabet (A–Z), digits (0–9), and frequently used expressions such as "Hi", "Okay", "Hello", "Good", "Bad", and "Dislike". These particular gestures were selected for their relevance in daily conversations, ensuring that the system supports practical and commonly needed communication. Moreover, the design remains extensible, allowing future inclusion of additional phrases and gesture types. Visual references, such as Figure 1 and Figure 2, demonstrate sample hand poses for common words and alphabet letters, respectively.
Figure 1: Example of hand gesture poses for common words and phrases.
Figure 2: Sample poses representing letters
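To make the dataset concrete, the following is a minimal Python sketch of how such a gesture collection could be indexed for lookup. The folder layout (gestures/letters/, gestures/digits/, gestures/words/) and file naming scheme are illustrative assumptions, not the project's actual structure.

```python
from pathlib import Path

def build_gesture_index(root="gestures"):
    """Build a token -> image-path lookup from a folder of gesture images.

    Assumed (hypothetical) layout:
        gestures/letters/A.png ... Z.png
        gestures/digits/0.png ... 9.png
        gestures/words/hello.png, okay.png, ...
    """
    index = {}
    for image in Path(root).rglob("*.png"):
        token = image.stem.lower()   # e.g. "hello", "a", "7"
        index[token] = image
    return index

# Example: look up the sign image for "hello"
# index = build_gesture_index()
# print(index.get("hello"))
```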
-
System Design
GestureSpeak is a user-friendly platform that provides two main methods for interaction:
-
Audio Upload: Users can upload their voice recordings in different supported formats.
-
Real-Time Input: Users have the option to use a microphone for real-time voice input, enabling immediate capture and processing of spoken commands.
After the audio is received, it is processed through a Speech-to-Text Module that transforms the spoken words into written text. This text is then analyzed by the Gesture Mapping Engine, which finds the matching gesture from a pre-existing dataset. The corresponding gesture is then visually displayed to the user. GestureSpeak also includes a Feedback Collection Module that enables users to submit reviews and comments.
This feedback is then analyzed through Sentiment Analysis to understand user satisfaction levels and identify any performance issues. The feedback collected will inform future enhancements, contributing to continuous improvement of the application.
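As an illustration of how the Gesture Mapping Engine described above might behave, here is a minimal sketch: each transcribed word is looked up in the gesture index, with a letter-by-letter fingerspelling fallback when no whole-word sign exists. The lookup structure and fallback policy are assumptions for illustration, not the exact implementation.

```python
def map_text_to_gestures(text, gesture_index):
    """Map transcribed text to a sequence of gesture image paths.

    gesture_index: dict mapping lowercase tokens ("hello", "a", "7") to
    image paths, e.g. as built by build_gesture_index() in the earlier sketch.
    """
    gestures = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in gesture_index:            # whole-word sign available
            gestures.append(gesture_index[word])
        else:                                # fall back to fingerspelling
            gestures.extend(gesture_index[ch]
                            for ch in word if ch in gesture_index)
    return gestures

# Example: "Hello okay" -> [.../words/hello.png, .../words/okay.png]
```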
GestureSpeak has a modular and scalable architecture designed for flexibility. The backend and AI processing are built using Python, where machine learning modules handle tasks like speech-to-text conversion and gesture mapping. The web interface is created with modern frontend technologies, such as React and Node.js, which contribute to a responsive and user-friendly experience. All essential application data, such as speech and gesture records, user feedback, sentiment analysis results, and other relevant information, is stored in a MySQL database. This enables real-time access and easy visualization of data. Its modular and efficient architecture enables easy deployment and maintenance, while also providing flexibility for future upgrades, including expanded vocabulary and advanced functionalities.
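The paper does not specify the database schema; as one plausible arrangement, the sketch below uses MySQL Connector/Python to create a feedback table and store a comment with its sentiment label. The table and column names, and the connection parameters, are hypothetical.

```python
import mysql.connector

# Connection parameters are placeholders for illustration.
conn = mysql.connector.connect(
    host="localhost", user="gesturespeak", password="secret",
    database="gesturespeak"
)
cur = conn.cursor()

# Hypothetical feedback table: one row per user comment.
cur.execute("""
    CREATE TABLE IF NOT EXISTS feedback (
        id INT AUTO_INCREMENT PRIMARY KEY,
        comment TEXT NOT NULL,
        sentiment VARCHAR(10),
        score INT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

cur.execute(
    "INSERT INTO feedback (comment, sentiment, score) VALUES (%s, %s, %s)",
    ("I love the gestures but the interface is slow", "neutral", 0),
)
conn.commit()
conn.close()
```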
The overall workflow of the GestureSpeak system, illustrating the process from audio input to gesture display, is depicted in Figure 3.
Figure 3: GestureSpeak: Audio Input to Hand Gesture Display Workflow
-
RESULTS AND EVALUATION
The developed system effectively converts spoken language into corresponding sign language gestures, demonstrating its functionality through two distinct input methods.
In the first method, live speech is captured directly through a microphone in real time. The system records the audio stream in common formats like WAV or FLAC, processes this raw audio data, and sends it to a speech recognition engine. This engine transcribes the audio into text, which is then matched with the gesture dataset to generate the appropriate sign language gestures. An example of this real-time conversion output is illustrated in Figure 4.
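A minimal sketch of this live-microphone path using the SpeechRecognition library is shown below; recognize_google() sends the captured audio to Google's Web Speech endpoint, standing in here for the Google Speech-to-Text service described above.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# Capture a short utterance from the default microphone (requires PyAudio).
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)   # calibrate to room noise
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio)      # Google Web Speech API
    print("Transcribed:", text)
    # gestures = map_text_to_gestures(text, gesture_index)  # see earlier sketch
except sr.UnknownValueError:
    print("Speech was not understood")
except sr.RequestError as err:
    print("Recognition service error:", err)
```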
The second method involves uploading a pre-recorded audio file. Users can submit an audio recording that is processed using the Google Speech-to-Text API. This API transcribes the spoken content into text, which the system then uses to identify and display the corresponding sequence of gestures. The outcome of this process for an uploaded audio file is depicted in Figure 5.
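For the upload path, a pre-recorded WAV or FLAC file can be transcribed in much the same way. The sketch below again uses the SpeechRecognition library's recognize_google() as a stand-in for the Google Speech-to-Text API; the file name is illustrative.

```python
import speech_recognition as sr

def transcribe_upload(path="uploaded_clip.wav"):
    """Transcribe an uploaded WAV/FLAC file to text (path is illustrative)."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)          # read the whole file
    return recognizer.recognize_google(audio)      # Google Web Speech API

# text = transcribe_upload()
# gestures = map_text_to_gestures(text, gesture_index)
```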
Figure 4: Real-time conversion of spoken language into hand gestures in GestureSpeak
Figure 5: Result of gesture generation for an uploaded audio file
Besides generating gestures, the system also features a feedback analysis component that leverages Natural Language Processing (NLP). When users provide feedback, the system applies a lexicon-based sentiment analysis method, which uses a dictionary of words pre-classified as positive, negative, or neutral to evaluate the text.
This method first splits the feedback into individual words, then compares each word against the sentiment dictionary. Words identified as positive add to the positive sentiment score, while those deemed negative increase the negative score. By comparing these scores, the system classifies the overall sentiment of the feedback as positive, negative, or neutral.
For example, feedback like "I love the gestures but the interface is slow" contains both favorable and unfavorable terms. Because these sentiments balance out, the overall classification is neutral. When positive words dominate, the feedback is considered positive; if negative words prevail, it is marked negative.
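The lexicon-based scoring described above can be sketched as follows; the word lists are a tiny illustrative sample rather than the full dictionary the system would use.

```python
# Tiny illustrative lexicons; the real dictionary would be much larger.
POSITIVE = {"love", "good", "great", "helpful", "easy"}
NEGATIVE = {"slow", "bad", "confusing", "broken", "hard"}

def classify_feedback(text):
    """Classify feedback as positive, negative, or neutral by word counts."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive", pos - neg
    if neg > pos:
        return "negative", pos - neg
    return "neutral", 0

# One positive and one negative word cancel out -> ("neutral", 0)
print(classify_feedback("I love the gestures but the interface is slow"))
```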
Administrators can view this sentiment information through the dashboard, which displays charts showing the distribution of feedback sentiments, individual sentiment scores, and a list of comments with their respective sentiment categories and scores. Figure 6 shows the admin interface where recent feedback and sentiment summaries are presented.
This NLP-based feedback analysis helps administrators monitor user opinions effectively and make data-driven decisions to improve system usability and satisfaction.
Figure 6: Visualization of Feedback Sentiment Scores and Distribution in the Admin Panel
-
CONCLUSION
GestureSpeak marks a major step forward in assistive communication tools, especially for those with hearing disabilities. By converting spoken input into visual hand gestures, it enables smooth interaction in settings where spoken communication is essential. With support for both live speech and uploaded audio, the system offers flexible functionality suitable for diverse real-world applications.
The integration of Natural Language Processing (NLP) for user feedback analysis not only supports continuous system enhancement but also fosters a user-centric development approach. Its modular and scalable architecture enables customization for specific domains such as education, healthcare, and public services, making it adaptable and practical for real-world deployment.
Future enhancements, including animated gesture visualization, multilingual voice input (with a focus on regional Indian languages), offline mobile application support, and the incorporation of AI-generated gesture models using Generative Adversarial Networks (GANs), aim to further elevate the platform's functionality. Overall, GestureSpeak demonstrates the potential of combining speech recognition, machine learning, and gesture rendering to bridge communication gaps and foster a more inclusive digital ecosystem.
-
FUTURE WORK
To enhance GestureSpeak's usability and inclusivity, several key improvements are underway:
-
Expanded Gesture Vocabulary: Moving beyond individual words and simple phrases, the system will support full sentences and complex expressions, facilitating richer communication in educational and customer service environments.
-
Animated Gesture Representation: Transitioning from static images to 2D and 3D animated gestures will enable more natural and expressive visual communication, improving user comprehension and engagement.
-
Multilingual Voice Support: Incorporating speech recognition capabilities for widely spoken Indian languages like Hindi, Kannada, Tamil, and Malayalam will enhance accessibility and deliver a more culturally inclusive experience for users.
-
Offline-Capable Mobile Applications: Native Android and iOS apps with offline functionality will ensure reliable use in areas with limited or no internet connectivity, increasing system reach and convenience.
-
Realistic Gesture Synthesis Using GANs: Employing Generative Adversarial Networks (GANs) will enable the generation of lifelike hand gesture animations, enhancing visual realism and personalization of user interactions.
-
These planned enhancements aim to significantly improve user experience and foster greater inclusivity across diverse user groups.
-
REFERENCES
-
Sonawane, P. B., & Nikalje, A. (2018). Text to Sign Language Conversion. International Journal of Engineering Research in Electronics and Communication Engineering (IJERECE), Vol. 5, Issue 2.
-
Reddy, S. S., & Pavithra, P. (2023). Audio or Text to Sign Language Converter. Journal of Emerging Technologies and Innovative Research (JETIR), Vol. 10, Issue 10.
-
Kahlon, N. K., & Singh, W. (2021). Machine Translation from Text to Sign Language: A Systematic Review. Universal Access in the Information Society, Springer.
-
Ojha, A., Pandey, A., Maurya, S., Thakur, A., & Dayananda, P. (2020). Sign Language to Text and Speech Translation in Real Time Using CNN. IJERT, Vol. 8, Issue 5.
-
Abraham, A., & Rohini, V. (2018). Real-time Conversion of Sign Language to Speech using ANN. ICACC.
-
Mopidevi, S., Ravi, D., & Prasad, E. V. (2023). Hand Gesture Recognition and Voice Conversion. E3S Web of Conferences.
-
Dixit, P. M., Patil, B. R., Koli, M. B., & Kamble, S. A. (2023). Hand Gesture Recognition and Text-Voice Conversion. JETIR.
-
Ambavane, P., Bhosale, P., & Kamble, P. A. (2018). A Novel Communication System for Deaf and Dumb Using Gestures. ICACC.
-
Thakar, S., Shah, S., Shah, B., & Nimkar, A. V. (2022). Sign Language to Text Conversion in Real Time using Transfer Learning. arXiv:2211.14446v2.
-
Rao, M. K., Kaur, H., Bedi, S. K., & Lekhana, M. A. (2023). Image-based Indian Sign Language Recognition. IIIT Naya Raipur.
-
Tripathi, K. M., Kamat, P., Patil, S., Jayaswal, R., Ahirrao, S., & Kotecha, K. (2023). Gesture-to-Text Translation Using SURF. Applied System Innovation (MDPI), Vol. 6, Issue 2.
-
Deng, Y., Doulamis, A., Palaskar, S., Collobert, R., Huenerfauth, M., Mihalcea, R., & Neubig, G. (2023). Is Spoken Language All You Need? Evaluating Speech-to-Sign Translation for Real-World Use. ACL 2023 Proceedings.
