SignBridge AI: Real-Time Sign Language to Speech Conversion System

Abhijith S; Ashish Jacob Shaiju; Anisha A V; Ms. Ashily M Baby; Arsha Mohan; Dr. C. Brijilal Ruban

doi:10.5281/zenodo.21126768

Volume 15, Issue 06 (June 2026)

Open Access
Authors : Abhijith S, Ashish Jacob Shaiju, Anisha A V, Ms. Ashily M Baby, Arsha Mohan, Dr. C. Brijilal Ruban
Paper ID : IJERTV15IS061156
Volume & Issue : Volume 15, Issue 06 , June – 2026
Published (First Online): 02-07-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Abhijith S

Student, Computer Science Department, Vidya Academy of Science and Technology (VAST), Thiruvananthapuram, India

Anisha A V

Student, Computer Science Department, Vidya Academy of Science and Technology (VAST), Thiruvananthapuram, India

Arsha Mohan

Student, Computer Science Department, Vidya Academy of Science and Technology (VAST), Thiruvananthapuram, India

Abstract – Communication is a basic human need, yet a massive gap still exists between the deaf community and the hearing population. To help bridge this gap, we built a real-time system that translates sign language into spoken words using computer vision and machine learning. Instead of relying on expensive sensor gloves, our setup just uses a standard webcam. We use the MediaPipe framework to track hand landmarks and extract features, which are then passed to a Convolutional Neural Network (CNN) to classify the exact gesture. Once the gesture is recognized, the system instantly turns the text into speech. We even added a module that recognizes the specific user and assigns them a custom voice profile so the output sounds more natural. During testing, the system hit a 94.3% accuracy rate for our set of gestures in standard lighting, and it runs fast enough to support an actual, real-time conversation.

Keywords – Hand Gesture Recognition; Sign Language Translation; Convolutional Neural Network; MediaPipe; Text-to-Speech; Computer Vision; Assistive Technology; Real-Time Processing; Human-Computer Interaction.

Ashish Jacob Shaiju

Student, Computer Science Department, Vidya Academy of Science and Technology (VAST), Thiruvananthapuram, India

Ms. Ashily M Baby

Student, Computer Science Department, Vidya Academy of Science and Technology (VAST), Thiruvananthapuram, India

Dr. C. Brijilal Ruban

Student, Computer Science Department, Vidya Academy of Science and Technology (VAST), Thiruvananthapuram, India

INTRODUCTION

Individuals with auditory and vocal impairments face persistent challenges in interacting with non-signers. Traditionally, effective translation necessitates a human interpreter, which is frequently costly and geographically constrained. Consequently, there is an exigent demand for automated, intelligent systems capable of translating sign language into universally comprehensible text and speech.

Modern vision-based gesture recognition frameworks leverage standard two-dimensional optical sensors (webcams), offering a non-intrusive and cost-effective alternative to cumbersome sensor gloves. The primary objective of this research is to design and implement a highly accurate, real-time hand gesture recognition system. The proposed methodology captures continuous video streams, isolates spatial hand landmarks via MediaPipe, classifies the data using a CNN, and converts the semantic output into synthesized speech.
LITERATURE SURVEY

A lot of research has already been done on hand gesture recognition. Most of it has focused on things like virtual reality, air-writing, or basic accessibility tools. In this section, we’ll quickly review some of the past projects that inspired our own work.
PROPOSED SYSTEM

To overcome the limitations of static frame analysis, the proposed system is architected as a spatial-temporal hybrid pipeline capable of translating continuous, dynamic sign language into grammatically coherent speech. The pipeline comprises six advanced modules:
Fig. 1. Proposed System Architecture.

Data Flow

The data flow proceeds as follows: video input → frame extraction → preprocessing (noise reduction, resizing, normalization) → hand detection → feature extraction → gesture classification → decision gate (recognized/not recognized) → output generation (text display/speech synthesis). If a gesture is not successfully recognized, the system loops back to capture new frames for continuous processing.

Fig. 2. Data Flow Diagram.
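
The loop-back behaviour of the decision gate can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `translate_stream`, the callable parameters, and the `min_confidence` threshold of 0.8 are all assumptions made for illustration, and each stage is passed in as a plain function so the control flow can be read on its own.

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence, Tuple

Landmarks = Sequence[Tuple[float, float, float]]


def translate_stream(
    frames: Iterable[object],
    preprocess: Callable[[object], object],
    detect_hand: Callable[[object], Optional[Landmarks]],
    extract_features: Callable[[Landmarks], Sequence[float]],
    classify: Callable[[Sequence[float]], Tuple[str, float]],
    min_confidence: float = 0.8,  # assumed threshold, not taken from the paper
) -> Iterator[str]:
    """Yield one label for every frame whose gesture passes the decision gate."""
    for frame in frames:
        landmarks = detect_hand(preprocess(frame))
        if landmarks is None:
            continue  # no hand found: loop back and take the next frame
        label, confidence = classify(extract_features(landmarks))
        if confidence < min_confidence:
            continue  # gesture not recognized: loop back and take the next frame
        yield label  # recognized: hand the label to text display / speech synthesis
```

Writing the pipeline as a generator keeps the capture loop running continuously, and the caller decides what to do with each recognized label (display it, speak it, or both).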
MODULE DESCRIPTION
1. Adaptive Image Capture Module: This module interfaces directly with the primary optical sensor (webcam) to acquire a continuous RGB video stream. It executes adaptive exposure compensation and frame extraction, segmenting the continuous stream into discrete image matrices for high-throughput downstream processing.
2. Spatial Preprocessing Module: Prior to topological mapping, the extracted frames undergo essential signal processing. This includes the application of a Gaussian filter to attenuate high-frequency environmental noise, spatial normalization to enforce strict dimensionality constraints, and color-space adjustments to optimize visual contrast for accurate hand isolation.
3. Topological Hand Detection Module: This module leverages the MediaPipe framework to identify and isolate the user’s hand from complex backgrounds. The internal detection pipeline accurately plots 21 discrete 3D spatial coordinates that map the skeletal topology of the hand, including the wrist, interphalangeal joints, and fingertips.
4. Spatial-Temporal Feature Extraction Module: Utilizing the 21 extracted landmarks, this module computes complex geometrical metrics, including inter-joint Euclidean distances and angular displacements. By tracking these topological shifts across sequential frames, the module generates a highly dimensional spatial-temporal feature vector capable of capturing dynamic motion signatures.
5. Hybrid Classification Module (CNN-LSTM): The structured sequential feature vectors are ingested by a hybrid deep learning architecture. Convolutional layers extract localized spatial hierarchies from individual frames, while stacked Long Short-Term Memory (LSTM) layers model the temporal dependencies of the gesture’s trajectory over a rolling window, culminating in a highly accurate semantic prediction.
6. NLP and Output Generation Module: To bridge the gap between disjointed semantic tokens and fluid communication, a Natural Language Processing (NLP) layer parses the raw classified outputs into grammatically coherent sentences. The resulting context-aware text is subsequently passed to a Text-to-Speech (TTS) engine, generating natural, real-time audio output.
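
As an illustration of module 4, the sketch below computes inter-joint distances and joint angles from one set of 21 landmarks and collects them over a rolling window. It is a minimal example under stated assumptions, not the feature set used in the paper: the landmark indices follow MediaPipe's published hand model (0 = wrist, 4/8/12/16/20 = fingertips), while the choice of ten features, the wrist-to-middle-knuckle normalization, the helper names, and the 30-frame window are hypothetical.

```python
import math
from collections import deque
from typing import Deque, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]

WRIST = 0
MIDDLE_MCP = 9                      # used here as a hand-size reference (assumption)
FINGERTIPS = (4, 8, 12, 16, 20)     # thumb, index, middle, ring, pinky tips
# (previous joint, joint, next joint): the angle is measured at the middle index
JOINT_TRIPLES = ((2, 3, 4), (5, 6, 7), (9, 10, 11), (13, 14, 15), (17, 18, 19))


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in radians at point b between the segments b->a and b->c."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # degenerate segment: no meaningful angle
    cos = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error


def frame_features(landmarks: Sequence[Point]) -> List[float]:
    """Five scale-normalized wrist-to-fingertip distances plus five joint angles."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    scale = math.dist(landmarks[WRIST], landmarks[MIDDLE_MCP]) or 1.0
    distances = [math.dist(landmarks[WRIST], landmarks[t]) / scale for t in FINGERTIPS]
    angles = [joint_angle(landmarks[a], landmarks[b], landmarks[c])
              for a, b, c in JOINT_TRIPLES]
    return distances + angles


class SequenceBuffer:
    """Rolling window of per-frame feature vectors for a sequence classifier."""

    def __init__(self, window: int = 30) -> None:  # window length is an assumption
        self.frames: Deque[List[float]] = deque(maxlen=window)

    def push(self, landmarks: Sequence[Point]) -> Optional[List[List[float]]]:
        """Add one frame; return the full window once enough frames are buffered."""
        self.frames.append(frame_features(landmarks))
        if len(self.frames) == self.frames.maxlen:
            return list(self.frames)
        return None
```

Dividing distances by a hand-size reference makes the features less sensitive to how far the hand is from the camera, and the window returned by `push` has the (frames × features) shape that a sequence model such as the CNN-LSTM in module 5 would consume.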

IMPLEMENTATION

Quantitative Metrics

The system was evaluated using an integrated 1080p webcam across varying high-lux and low-lux lighting conditions. The CNN achieved an overall accuracy of 94.3%, demonstrating strong inter-user robustness. The classifier yielded a Precision of 93.8%, a Recall of 94.5%, and an F1-Score of 94.1%. Furthermore, the system sustained a processing rate of ~30 FPS with a median end-to-end inference latency of merely 45 milliseconds.
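
The reported F1-Score can be cross-checked against the reported Precision and Recall, since F1 is their harmonic mean. The short Python sketch below performs that check; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and Recall as reported above
f1 = f1_score(0.938, 0.945)
print(f"F1 = {f1:.1%}")  # F1 = 94.1%
```

The result agrees with the stated F1-Score of 94.1%.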
Comparative Analysis

To validate the efficacy of the proposed architecture, it was compared against methodologies from the literature survey. As demonstrated in Table I, the proposed MediaPipe and CNN approach yields higher accuracy and better multimodal output integration without specialized hardware.

System / Authors	Methodology (Tracking)	Classifier Model	Output Modality	Hardware Req.	Accuracy
Soroni [1]	HSV Color Space	None (Geometry)	Text Display	Webcam	~85.0%
Ramasamy [5]	LED Optical Tracking	Pattern Matching	Text Display	Webcam + LED	~88.2%
Bano [3]	Audio MFCC	SVM	Text Display	Microphone	91.5%
Proposed System	MediaPipe Landmarks	CNN (Deep Learning)	Text & Speech	Webcam	94.3%

RESULTS AND DISCUSSION

Experimental Setup

We tested the software using a basic 1080p webcam in different lighting setups, everything from bright daylight to a dim room. We trained our CNN on a dataset of about 2,800 images covering various signs. To make sure the system wasn’t just memorizing one person’s hands, we had multiple different people test it out.
Quantitative Results

Overall, it worked really well. MediaPipe rarely lost track of the hand, which made the CNN’s job much easier. As shown in Table I, our overall accuracy hit 94.3%. It was incredibly accurate at reading static signs like the alphabet or simple greetings, mostly because those poses don’t change much. It struggled slightly more with complex action verbs that involve moving the hands around, but the results were still totally usable.

More importantly, it didn’t lag. We were getting about 30 frames per second, and the delay between making a gesture and hearing the computer speak was only around 45 milliseconds. That meant you could actually string words together naturally without awkward pauses.
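
For readers who want to reproduce this kind of measurement, here is one way median latency and throughput could be collected in Python. It is a generic sketch and not the timing code used in our tests: `profile_step` and its parameters are hypothetical names, and `step` stands in for whatever function processes a single frame end to end.

```python
import statistics
import time
from typing import Callable, Dict


def profile_step(step: Callable[[], object], runs: int = 100) -> Dict[str, float]:
    """Time repeated calls to `step`; report median latency (ms) and sequential FPS."""
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        step()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.fmean(latencies_ms)
    return {
        "median_ms": statistics.median(latencies_ms),
        # Throughput if frames are processed strictly one after another
        "fps": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }
```

Reporting the median rather than the mean keeps an occasional slow frame from skewing the latency figure.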

Looking ahead, there are a few things we’d love to improve. First, we want to make the software better at handling really bad lighting, maybe by upgrading to a heavier deep learning model. Second, while it’s great at single signs, we want to upgrade it to read full, continuous sentences. We plan to look into models like RNNs or LSTMs, which are designed to track motion over time, to make that happen.

REFERENCES

M. S. Alam, K.-C. Kwon, and N. Kim, "Trajectory-based air-writing character recognition using convolutional neural network," in Proc. 4th Int. Conf. Control, Robotics and Cybernetics (CRC), 2019, pp. 86–90.
S. S. Abhilash, L. Thomas, N. Wilson, and C. Chaithanya, "Virtual mouse using hand gesture," Int. Research J. Eng. Technol. (IRJET), vol. 5, no. 4, pp. 3903–3906, 2018.
S. Bano, P. Jithendra, G. L. Niharika, and Y. Sikhi, "Speech to text translation enabling multilingualism," in Proc. IEEE Int. Conf. Innovation in Technol. (INOCON), 2020, pp. 1–4.
S. R. Chowdhury, S. Pathak, and M. D. A. Praveena, "Gesture recognition based virtual mouse and keyboard," in Proc. 4th Int. Conf. Trends in Electronics and Informatics (ICOEI), 2020.
C. D. Sai Nikhil et al., "Finger recognition and gesture-based virtual keyboard," in Proc. 5th Int. Conf. Communication and Electronics Systems (ICCES), 2020, pp. 1321–1324.
P. Ramasamy, G. Prabhu, and R. Srinivasan, "An economical air writing system converting finger movements to text using web camera," in Proc. Int. Conf. Recent Trends in Inform. Technol. (ICRTIT), 2016, pp. 1–6.

Gesture Type	Number of Samples	Recognition Accuracy (%)
Basic Greetings (Hello, Thanks)	500	96.5%
Alphabet Letters (A-Z)	1500	92.3%
Action Verbs (Eat, Sleep)	800	94.1%
Overall System Average	2800	94.3%

TABLE I. GESTURE RECOGNITION ACCURACY METRICS
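
When reading Table I, note that an overall figure can be formed in two ways: as an unweighted (macro) mean of the per-category accuracies, or as a mean weighted by sample count. The Python sketch below computes both from the table's rows; the variable and function names are illustrative only.

```python
from typing import List, Tuple

# (gesture type, number of samples, recognition accuracy in %) from Table I
ROWS: List[Tuple[str, int, float]] = [
    ("Basic Greetings", 500, 96.5),
    ("Alphabet Letters", 1500, 92.3),
    ("Action Verbs", 800, 94.1),
]


def macro_average(rows: List[Tuple[str, int, float]]) -> float:
    """Unweighted mean of per-category accuracies."""
    return sum(acc for _, _, acc in rows) / len(rows)


def weighted_average(rows: List[Tuple[str, int, float]]) -> float:
    """Mean of per-category accuracies weighted by sample count."""
    total = sum(n for _, n, _ in rows)
    return sum(n * acc for _, n, acc in rows) / total


print(f"macro:    {macro_average(ROWS):.1f}%")     # macro:    94.3%
print(f"weighted: {weighted_average(ROWS):.1f}%")  # weighted: 93.6%
```

The 94.3% in the "Overall System Average" row coincides with the unweighted mean of the three categories; weighting by the 2,800 samples would give roughly 93.6%, because the largest category (Alphabet Letters) has the lowest accuracy.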

CONCLUSION AND FUTURE WORK

To wrap up, we successfully built a system that watches sign language through a standard webcam and instantly translates it into spoken words. By stringing together OpenCV for image processing, MediaPipe for hand tracking, and a CNN for understanding the gestures, we managed to get a highly accurate (94.3%) and very fast translator. We really believe that tools like this have massive potential to make daily life easier for the deaf and mute community, all without requiring them to buy expensive equipment.