H2V: A Real Time Sign to Text and Speech using CNN and Mediapipe

Prachi Agarwal; Paridhi Gupta; Priyanshu Chauhan; Anmol Gaur; Priyanshi Sharma

doi:10.17577/IJERTCONV14IS040049

ICTEM 2.0 -2026 (Volume 14 - Issue 04)

Open Access
Authors : Prachi Agarwal, Paridhi Gupta, Priyanshu Chauhan, Anmol Gaur, Priyanshi Sharma
Paper ID : IJERTCONV14IS040049
Volume & Issue : Volume 14, Issue 04, ICTEM 2.0 (2026)
Published (First Online) : 24-05-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Prachi Agarwal1, Paridhi Gupta2, Priyanshu Chauhan3, Anmol Gaur4, Priyanshi Sharma5

1 Assistant Professor, Computer Science and Engineering Department, MIT, Moradabad, India

reachtoprachi@gmail.com

2,3,4,5 Computer Science and Engineering Department, MIT, Moradabad, India

paridhigupta099@gmail.com, priyanshuchauhan065@gmail.com, anmolgaur99565@gmail.com, qapple672@gmail.com

ABSTRACT

Communication between hearing-impaired individuals and the general population is often limited by the lack of a shared means of communication. This paper introduces a real-time Hand-to-Voice (H2V) system that translates static sign language gestures into readable text and audible speech. The proposed approach captures live video through a standard webcam and uses MediaPipe to detect the hand and extract 21 landmark points. These landmarks are rendered as a skeletal representation, which is then classified by a Convolutional Neural Network (CNN) trained on grouped sign language gestures. To reduce misclassification among visually similar signs, a rule-based refinement step is applied after the CNN prediction. The final output is displayed as text and converted into speech by a non-blocking text-to-speech module. Practical testing indicates that the system runs in real time for a predefined set of gestures, which supports its suitability for assistive communication.

Keywords

Hand Gesture Recognition, Sign Language Translation, Convolutional Neural Network, MediaPipe, Assistive Systems

Introduction

Sign language is an essential means of communication for individuals with hearing and speech impairments. However, the absence of universal sign language knowledge among the general population often creates communication barriers. With advancements in computer vision and machine learning, automated sign language interpretation systems have gained attention as a potential solution.

The Hand-to-Voice (H2V) system proposed in this work focuses on translating hand gestures into text and speech in real time using vision-based techniques. Unlike hardware-intensive solutions that rely on gloves or sensors, the proposed system requires only a webcam. By integrating hand landmark detection, deep learning-based classification, rule-based correction, and speech synthesis, the system aims to provide an accessible and cost-effective communication aid.

Related Work

Earlier studies in sign language recognition employed sensor-based gloves, depth cameras, and marker-driven systems. Although these methods offered reasonable accuracy, they were often expensive and inconvenient for everyday use. Recent vision-based approaches utilizing convolutional neural networks and hand landmarks have shown improved flexibility and performance.

Several works have explored CNN-based gesture recognition; however, many lack real-time speech output or rely on raw image inputs that are sensitive to background variations. The proposed system differentiates itself by using skeleton-based representations, hybrid classification strategies, and a real-time graphical interface with speech feedback.

System Architecture

The proposed H2V system follows a sequential processing pipeline, beginning with video acquisition and ending with speech generation. Live video frames captured from a webcam are processed to detect hand landmarks. These landmarks are converted into skeleton images, which are classified by a CNN model. The CNN output is refined using geometric rules, after which characters are combined into meaningful text and converted into speech.
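
The landmark-to-skeleton step of this pipeline can be illustrated with a short sketch. It is not the authors' implementation: the canvas size, the margin, the crop-and-scale normalisation, and the function name `render_skeleton` are assumptions made for illustration. The bone list follows MediaPipe's 21-point hand indexing, and the (x, y) pairs are assumed to come from the landmark detector. Only the standard library is used, so the image is a nested list that could be converted to an array before being passed to a CNN.

```python
# Bones follow MediaPipe's 21-point hand indexing
# (0 = wrist; 4, 8, 12, 16, 20 = fingertips).
HAND_BONES = [
    (0, 1), (1, 2), (2, 3), (3, 4),          # thumb
    (0, 5), (5, 6), (6, 7), (7, 8),          # index finger
    (9, 10), (10, 11), (11, 12),             # middle finger
    (13, 14), (14, 15), (15, 16),            # ring finger
    (0, 17), (17, 18), (18, 19), (19, 20),   # little finger
    (5, 9), (9, 13), (13, 17),               # palm
]


def render_skeleton(landmarks, size=128, margin=8):
    """Draw 21 (x, y) landmarks as a white skeleton on a black size x size grid."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 (x, y) landmark pairs")
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]

    # Assumed normalisation: centre the hand and scale its longer side to fit
    # inside the margin, so the image does not depend on where the hand sits
    # in the camera frame.
    span = max(max(xs) - min(xs), max(ys) - min(ys), 1e-6)
    scale = (size - 1 - 2 * margin) / span
    cx, cy = (max(xs) + min(xs)) / 2, (max(ys) + min(ys)) / 2
    mid = (size - 1) / 2
    pix = [((x - cx) * scale + mid, (y - cy) * scale + mid)
           for x, y in zip(xs, ys)]

    canvas = [[0] * size for _ in range(size)]
    for a, b in HAND_BONES:
        (x0, y0), (x1, y1) = pix[a], pix[b]
        # Oversample the bone so that consecutive samples land on touching pixels.
        steps = 2 * int(max(abs(x1 - x0), abs(y1 - y0))) + 2
        for i in range(steps + 1):
            t = i / steps
            col = round(x0 + t * (x1 - x0))
            row = round(y0 + t * (y1 - y0))
            canvas[row][col] = 255
    return canvas
```

Drawing on a blank canvas rather than on the camera frame is what removes the background from the classifier's input, which matches the motivation for skeleton-based representations given in the related work.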
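
METHODOLOGY

The methodology covers hand detection and landmark extraction, skeleton representation, dataset preparation, CNN-based classification, a grouping strategy, rule-based refinement, text construction with stability control, text-to-speech output, and the graphical user interface.

GRAPHICAL USER INTERFACE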

A user-friendly graphical interface is developed using Tkinter. The interface displays the live video feed, skeletal visualization of the detected hand, recognized characters, and the generated sentence. Additional controls allow users to clear text or trigger speech output.
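
The clear and speak controls can sit on top of two small helpers, sketched here under stated assumptions rather than taken from the authors' code. `SentenceBuilder` is a hypothetical class that appends a character only after it has been predicted on `hold` consecutive frames (the value 15 is an arbitrary placeholder). `SpeechWorker` is a hypothetical wrapper that runs a blocking `speak_fn` on a background thread so that the video loop keeps running; in practice `speak_fn` would wrap whichever text-to-speech engine is used.

```python
import queue
import threading


class SentenceBuilder:
    """Append a character only after `hold` consecutive identical predictions."""

    def __init__(self, hold=15):
        self.hold = hold
        self.text = ""
        self._last = None
        self._count = 0

    def update(self, char):
        """Feed one per-frame prediction (None when no hand is visible)."""
        if char is None or char != self._last:
            self._last, self._count = char, 0
        if char is not None:
            self._count += 1
            if self._count == self.hold:  # fires once per uninterrupted run
                self.text += char
        return self.text

    def clear(self):
        """Reset the sentence, as a 'clear' button would."""
        self.text, self._last, self._count = "", None, 0


class SpeechWorker:
    """Run a blocking speak function on a daemon thread fed by a queue."""

    def __init__(self, speak_fn):
        self._speak_fn = speak_fn
        self._jobs = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def say(self, text):
        """Queue text for speech and return immediately."""
        if text.strip():
            self._jobs.put(text)

    def wait(self):
        """Block until everything queued so far has been spoken."""
        self._jobs.join()

    def _run(self):
        while True:
            text = self._jobs.get()
            try:
                self._speak_fn(text)
            except Exception:
                pass  # a failed utterance should not stop the worker
            finally:
                self._jobs.task_done()
```

In a Tkinter interface, a speak button callback could call `worker.say(builder.text)` and a clear button callback could call `builder.clear()`, while the frame loop calls `builder.update(...)` once per processed frame.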

EXPERIMENTAL OBSERVATIONS

The system was evaluated under controlled indoor lighting conditions using a standard webcam. It demonstrated real-time responsiveness and reliable recognition for static hand gestures. Performance was observed to vary with hand orientation, distance from the camera, and lighting consistency.

ADVANTAGES

Requires only a standard webcam
Operates in real time
Combines deep learning with rule-based correction
Provides both text and speech output
Easy-to-use graphical interface
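
The combination of deep learning with rule-based correction listed above can be sketched as a two-stage decision: the CNN predicts a group of look-alike signs, and a geometric rule on the landmarks selects one letter inside that group. Everything specific in this sketch is hypothetical and chosen only for illustration: the two groups, the landmark pairs, the thresholds, and the names `RULES`, `refine`, and `classify`. The paper does not state which groups or rules the system uses.

```python
import math

# Hypothetical rules, one per CNN group:
# group id -> (letter if close, letter if far, landmark a, landmark b,
#              threshold in palm-length units)
RULES = {
    0: ("U", "V", 8, 12, 0.5),  # index tip vs middle tip: together (U) or spread (V)
    1: ("D", "L", 4, 5, 0.8),   # thumb tip vs index base: tucked (D) or extended (L)
}


def refine(group_id, pts):
    """Pick one letter inside the predicted group using landmark geometry."""
    near, far, a, b, threshold = RULES[group_id]
    # Wrist (0) to middle-finger base (9) serves as a hand-size unit,
    # so the rule does not depend on the hand's distance from the camera.
    unit = math.dist(pts[0], pts[9]) or 1e-6
    gap = math.dist(pts[a], pts[b]) / unit
    return far if gap > threshold else near


def classify(group_probs, pts):
    """Hybrid step: argmax over CNN group scores, then rule-based refinement."""
    group_id = max(range(len(group_probs)), key=lambda i: group_probs[i])
    return refine(group_id, pts)
```

Keeping the rules in a table rather than in branching code makes it straightforward to add or retune a rule for another confusable group without retraining the CNN.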

LIMITATIONS

Supports only static gestures
Sensitive to poor lighting conditions
Limited gesture vocabulary

FUTURE ENHANCEMENT

Future work may extend the system to dynamic gesture recognition, incorporate two-hand interactions, enable continuous word-level recognition, and support deployment on mobile platforms with multilingual speech output.

CONCLUSION

This paper presented a real-time Hand-to-Voice (H2V) system for translating sign language gestures into text and speech. By combining MediaPipe-based hand landmark detection, CNN-driven skeleton classification, and rule-based refinement, the system offers an effective and accessible solution for assistive communication. The proposed approach demonstrates practical feasibility and provides a strong foundation for further enhancements.