DOI: 10.17577/IJERTCONV14IS010012

LipScribe: A Deep Learning-Based CNN-RNN Framework for Visual Speech Recognition

Alwisha Wilma Dsa

Student, Dept. of Computer Applications, St Joseph Engineering College, Mangaluru, India

Sumangala N.

Assistant Professor

Dept. of Computer Applications

St Joseph Engineering College, Mangaluru, India

Abstract – This paper introduces a deep learning system that recognizes speech by observing how lips move in video clips. The system combines a CNN and an RNN, which lets it capture both the spatial and the temporal characteristics of lip movements. The CNN handles the shapes and patterns of the lips, while the RNN, specifically its LSTM layers, tracks how lip movements change over time. The model was trained on a dataset of more than 55,000 video samples spanning 15 word classes and achieved a validation accuracy of 98.55%, showing it can distinguish even subtle lip movements. The system addresses challenges such as silent speech recognition, speaker variability, and background noise. It has potential uses in accessibility tools, surveillance, and soundless communication.

Keywords – Lip reading, speech recognition, deep learning, CNN-RNN, LSTM, silent communication, computer vision, speech recognition without audio.

  1. INTRODUCTION

    Recent progress in artificial intelligence, human-computer interaction, and accessibility technologies has increased demand for systems that allow communication without touch or sound. Among these, lip reading, also called visual speech recognition, has become a popular way to understand spoken words by observing how a person's lips move, without needing audio input. Traditional speech recognition systems depend heavily on sound, so they struggle in quiet environments, with poor audio quality, or when there is a lot of noise. Lip reading systems rely on visual cues instead, making them useful in situations where noise is a problem, such as voice command systems, communication aids for the hearing impaired, or security systems.

    Reading lips manually is complex and requires training, and human errors can occur, especially when people speak at different speeds or with different accents. Modern deep learning has made it possible to create automated lip reading systems, especially with the availability of more video data and powerful computers. This study introduces a Lip Reading System that uses a CNN-RNN model to capture both the time-based sequence (from video frames) and the spatial features (from individual images of the lips). CNN layers extract features from each frame, and LSTM layers track how these features change over time to form words. A large collection of short video clips with spoken word labels is used to train and test the system. The proposed model aims to provide:

    • Clear classification of words across 15 visual speech categories

    • High accuracy even when backgrounds, lip shapes, and lighting change

    • A solid foundation for future real-time lip reading systems

    This paper focuses on the system's design, training process, and how it is evaluated, highlighting its importance in the growing field of silent visual speech recognition.

  2. LITERATURE REVIEW

    From traditional image processing to advanced deep learning models that can comprehend temporal visual cues, the process of visual speech recognition has undergone significant change. The increasing availability of labelled datasets, increased processing power, and developments in neural network architectures are the main forces behind this evolution. Conventional lip reading methods mostly used hand-crafted features like motion vectors, geometric distances, and lip contours in conjunction with Hidden Markov Models (HMMs) or Support Vector Machines (SVMs) for classification. Despite being fundamental, these techniques had poor generalizability and were sensitive to head movement, lighting, and speaker variability [1], [2]. Convolutional Neural Networks (CNNs), which automatically learn hierarchical spatial features from raw image data, revolutionized the recognition of visual patterns. CNNs are useful for lip reading because they can effectively extract static features like lip position and shape from individual frames. However, static analysis by itself is unable to capture the sequential nature of lip movements because speech is a dynamic process [3].

    Recurrent Neural Networks (RNNs) and their variations, such as Long Short-Term Memory (LSTM) networks, have been used to handle temporal dependencies. In order to comprehend how a sequence of lip positions correlates to phonemes or words, these models must be able to learn temporal sequences and long-range dependencies [4]. End-to-end deep learning frameworks for word-level lip reading were introduced by recent studies like LipNet and the LRW (Lip Reading in the Wild) dataset. These models usually use LSTM or Gated Recurrent Units (GRUs) for sequence modelling and CNN layers for spatial encoding. The accuracy of these has been shown to be significantly higher than that of conventional models [5], [6].

    Hybrid CNN-RNN architectures can jointly model spatial and temporal features, which is why they have become the de facto standard in lip reading systems today. These architectures have been trained on datasets containing thousands of video clips, with encouraging outcomes in both controlled and uncontrolled settings [5], [7]. We use a similar hybrid approach in this work, constructing a deep learning pipeline that combines an LSTM for sequential learning with a CNN for frame-wise feature extraction. To maximise performance and training efficiency, our method is adapted to a smaller set of visually distinctive words while drawing inspiration from existing architectures.

  3. SYSTEM DESIGN AND IMPLEMENTATION

    1. System Architecture

      The hybrid CNN-RNN deep learning model that underpins the proposed Lip Reading System is designed to classify spoken words from video input using only lip movements. The architecture has three primary components:

      • Frame Extraction & Preprocessing: Every video input is divided into its component frames. To simplify and highlight lip contours, the frames are resized and converted to greyscale.

      • CNN-based Spatial Feature Extractor: Each frame's low-level to high-level visual features, like movement, contour, and lip shape, are extracted by a sequence of convolutional and pooling layers.

      • RNN-based Temporal Sequence Model: To record temporal dependencies and changes in lip positions over time, the sequential frame features are sent to LSTM layers.

        This dual-stream design allows the model to process both spatial and temporal information, making it well suited for recognizing word-level lip movements.

    2. Data Processing Workflow

      For consistency and accuracy during model training, the data preprocessing workflow is crucial:

      • Frame Extraction: Each video sample is broken down into 29 frames, together spanning about one second.

      • Greyscale Conversion & Resizing: To make the input data simpler, each frame is resized to 100 x 50 pixels and converted to greyscale.

      • Label Encoding: One of 15 predefined word classes is used to label each video. To be used in the softmax classification layer, these labels are one-hot encoded.

      • Train-Test Split: The dataset is divided into training and validation subsets, usually in an 80:20 ratio.
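      A minimal sketch of this per-video preprocessing is shown below. It is illustrative only: the helper name extract_frames and the even-spacing/padding strategy are assumptions, not details taken from the original implementation.

```python
import cv2
import numpy as np

FRAMES_PER_SAMPLE = 29                 # frames kept per video (about one second)
FRAME_WIDTH, FRAME_HEIGHT = 100, 50    # target frame size (width x height) in pixels

def extract_frames(video_path):
    """Read a video, keep 29 frames, and return a (29, 50, 100) greyscale
    array with pixel values scaled to [0, 1]. Hypothetical helper."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)        # greyscale conversion
        grey = cv2.resize(grey, (FRAME_WIDTH, FRAME_HEIGHT))  # resize to 100 x 50
        frames.append(grey.astype(np.float32) / 255.0)        # scale pixels to [0, 1]
    cap.release()

    # Keep exactly 29 evenly spaced frames; pad by repeating the last frame if short.
    if len(frames) >= FRAMES_PER_SAMPLE:
        idx = np.linspace(0, len(frames) - 1, FRAMES_PER_SAMPLE).astype(int)
        frames = [frames[i] for i in idx]
    else:
        frames.extend([frames[-1]] * (FRAMES_PER_SAMPLE - len(frames)))
    return np.stack(frames)    # shape: (29, 50, 100)
```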

    3. Deep Learning Model Implementation

      The following are integrated into the deep learning model:

      • CNN Layers:

        • ReLU-activated 2D convolutional layers

        • Max-pooling and batch normalization

        • A feature vector is produced for every frame.

      • RNN Layers:

        • Dropout regularization in LSTM layers

        • Bidirectional LSTM to improve comprehension of temporal context

      • Dense Layers:

        • Softmax is used for multi-class classification after fully connected layers.
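      A minimal Keras sketch of this kind of CNN-LSTM stack is shown below. The filter counts, LSTM width, and dropout rate are illustrative assumptions rather than the authors' exact values, and a single (unidirectional) LSTM layer is used here for simplicity.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 15
FRAMES, HEIGHT, WIDTH, CHANNELS = 29, 50, 100, 1

def build_model():
    # Illustrative architecture: per-frame CNN features followed by an LSTM and softmax.
    model = models.Sequential([
        layers.Input(shape=(FRAMES, HEIGHT, WIDTH, CHANNELS)),
        # CNN layers applied to every frame independently via TimeDistributed.
        layers.TimeDistributed(layers.Conv2D(32, (3, 3), activation="relu", padding="same")),
        layers.TimeDistributed(layers.BatchNormalization()),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        layers.TimeDistributed(layers.Conv2D(64, (3, 3), activation="relu", padding="same")),
        layers.TimeDistributed(layers.BatchNormalization()),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        # Flatten each frame's feature maps into a per-frame feature vector.
        layers.TimeDistributed(layers.Flatten()),
        # LSTM models how the frame features evolve over time (dropout regularization).
        layers.LSTM(128, dropout=0.3),
        # Fully connected layers with softmax for 15-way word classification.
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    return model
```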

    4. Training Configuration

      The following setup is used to train the model:

      • Loss Function: Categorical cross-entropy

      • Optimizer: Adam

      • Batch Size: 32

      • Epochs: 20

      • Framework: TensorFlow + Keras

      • Hardware: For quicker convergence, training was done in a GPU-enabled environment.

    Accuracy and loss plots for the training and validation sets are used to monitor the training process, helping to identify overfitting and assess convergence stability.

  4. IMPLEMENTATION DETAILS

    1. Technology Stack

      The Lip Reading System uses popular computer vision and deep learning tools. The main tools are:

      • Python: The main programming language

      • OpenCV: For reading videos and taking still images from them

      • TensorFlow and Keras: For building and training the deep learning model

      • NumPy and Pandas: For working with and preparing data

      • Matplotlib and Seaborn: For making charts to show how the model learns and performs

      • Scikit-learn: For converting labels and making confusion matrices

        Together, these tools help in creating, training, and checking the model effectively.

    2. Data Description

      The system uses a carefully chosen dataset of people saying specific words. The dataset has:

      • Training Samples: 55,624

      • Validation Samples: 13,899

      • Number of Classes: 15 different words

      • Frames per Sample: 29 frames (about 1 second of video)

        Each video shows a person's face, focused on the lips, saying a word. The dataset has an equal number of videos for all 15 words, so the model can learn fairly without favoring any word.

    3. Frame Processing and Sequence Formation

      Each video goes through these steps:

      • Frame Extraction: The video is split into 29 frames using OpenCV

      • Grayscale Conversion: Each frame is converted to grayscale to reduce data size

      • Resizing and Normalization: Frames are resized to 100×50 pixels, and pixel values are scaled to the range 0 to 1

      • Tensor Formatting: The data is arranged into 5D tensors: (number of samples, time steps, height, width, color channels)

      • Label Encoding: Each word label is one-hot encoded for use in the model's softmax output layer
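      These steps can be combined into a dataset-building routine along the following lines; extract_frames is the hypothetical helper sketched earlier, and the (video_path, word) sample list and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

def build_dataset(samples, class_names):
    """samples: list of (video_path, word) pairs; class_names: the 15 word labels."""
    label_to_index = {word: i for i, word in enumerate(class_names)}
    X, y = [], []
    for video_path, word in samples:
        # extract_frames: see the preprocessing sketch in the data processing section.
        frames = extract_frames(video_path)      # (29, 50, 100), values in [0, 1]
        X.append(frames[..., np.newaxis])        # add channel axis -> (29, 50, 100, 1)
        y.append(label_to_index[word])

    X = np.stack(X)      # 5D tensor: (samples, time steps, height, width, channels)
    y = np.array(y)

    # 80:20 train/validation split, stratified so all 15 classes stay balanced.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # One-hot encode the labels for the softmax output layer.
    y_train = to_categorical(y_train, num_classes=len(class_names))
    y_val = to_categorical(y_val, num_classes=len(class_names))
    return X_train, X_val, y_train, y_val
```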

    4. Model Training and Evaluation

      The model is trained with supervised learning on labelled data. The training setup includes:

      • Epochs: 20

      • Batch Size: 32

      • Loss Function: Categorical Cross-Entropy

      • Optimizer: Adam

      • Validation Split: 20% of the data is held out to evaluate how well the model generalizes

      • Early Stopping: Training stops early if the model begins to memorize the training data instead of generalizing (overfitting)

      • Model Saving: The best version of the model is saved as best_model.h5

      • Training Time:

        • Total time taken: Around 5 hours using Jupyter Notebook in Visual Studio Code (VS Code) with a CPU

    The training results, including accuracy and loss curves, are shown using Matplotlib. The model achieves a validation accuracy of 98.55%, which shows it works effectively and can correctly classify words.
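    A minimal sketch of this training setup is shown below, reusing the model and data tensors from the earlier sketches; the early-stopping patience value is an assumption.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model = build_model()
model.compile(optimizer="adam",                     # Adam optimizer
              loss="categorical_crossentropy",      # categorical cross-entropy loss
              metrics=["accuracy"])

callbacks = [
    # Stop training early if validation loss stops improving (guards against overfitting).
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    # Save the best-performing weights to best_model.h5.
    ModelCheckpoint("best_model.h5", monitor="val_accuracy", save_best_only=True),
]

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=20,
                    batch_size=32,
                    callbacks=callbacks)
```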

  5. RESULTS

    1. Performance Metrics

      A group of 13,899 videos, each labeled with one of 15 different spoken words, was used to check how well the Lip Reading System works.

      Its performance was measured primarily with categorical accuracy, the percentage of words the system predicted correctly.

      TABLE I: SYSTEM PERFORMANCE METRICS

      Metric                        Value
      Final Training Accuracy       100.00%
      Final Validation Accuracy     98.55%
      Number of Classes             15
      Validation Samples            13,899
      Epochs Trained                20

      These results show the model performs very well on the validation set, indicating it has learned to distinguish how lip movements vary over time.

    2. Training and Validation Curves

      During training, we kept track of how well the model was learning by looking at accuracy and loss over time.

      The graphs show that both training and validation accuracy increased smoothly, without the model overfitting to the training data. The loss, which measures the model's prediction error, also dropped steadily over time.

      Figure 1: Training vs Validation Accuracy

      Figure 2: Training vs Validation Loss
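      A minimal Matplotlib sketch of curves like those in Figures 1 and 2, assuming the Keras history object returned by model.fit in the training sketch:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Training vs validation accuracy (as in Figure 1).
ax1.plot(history.history["accuracy"], label="train")
ax1.plot(history.history["val_accuracy"], label="validation")
ax1.set_xlabel("epoch")
ax1.set_ylabel("accuracy")
ax1.legend()

# Training vs validation loss (as in Figure 2).
ax2.plot(history.history["loss"], label="train")
ax2.plot(history.history["val_loss"], label="validation")
ax2.set_xlabel("epoch")
ax2.set_ylabel("loss")
ax2.legend()

plt.tight_layout()
plt.show()
```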

    3. Confusion Matrix Analysis

      We made a confusion matrix to better understand how the model made its predictions for the 15 different words.

      This helps us see which words were predicted very well and which ones the model confused.

      Key observations:

      • Most words were predicted almost perfectly.

      • A few errors happened when the lip movements looked similar for different words.

      • No class was completely ignored or misclassified.

        This shows the model is good at telling the small differences between lip movements.
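      A sketch of how such a confusion matrix can be computed with Scikit-learn and drawn with Seaborn; model, X_val, y_val, and class_names follow the earlier sketches.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Convert softmax outputs and one-hot labels back to class indices.
y_pred = np.argmax(model.predict(X_val), axis=1)
y_true = np.argmax(y_val, axis=1)

# 15 x 15 matrix of true vs predicted word classes, shown as a heatmap.
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted word")
plt.ylabel("True word")
plt.show()
```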

    4. Overall Analysis

      The results show the system works well for recognizing individual words using lip movements.

      • The high validation accuracy (98.55%) means the model is very good at learning and applying what it has learned.

      • The model trained smoothly, which means it is stable and does not get stuck.

      • It works well across all 15 classes, with few mistakes.

    These results support using a hybrid model which combines CNNs and LSTMs for visual speech recognition.

  6. DISCUSSION

    1. Advantages

      The Lip Reading System has many benefits compared to older methods:

      • Hybrid Architecture: It uses CNNs to find patterns in each frame and LSTMs to track changes over time, which helps it understand how lips move.

      • High Accuracy: The model scored 98.55% accuracy on the validation set, showing it can tell apart similar word classes with few errors.

      • Simple and Flexible: The system is designed in a way that makes it easy to change for other tasks or different types of data.

      • Works Without Sound: It uses only video input, so it can work in places with no sound, such as in surveillance or for people who can't hear.

      • Can Handle More Data: The system can be scaled up to handle bigger datasets or longer phrases.

    2. Limitations

      Even though the results are good, there are still areas to improve:

      • Small Vocabulary: The model is only trained on 15 words; expanding it to recognize full sentences or continuous speech is a big challenge.

      • Speaker Differences: The model might not work as well with different people because of differences in how people move their lips or face shape.

      • Lighting and Background Issues: Strong changes in light or busy backgrounds can make the model less accurate, even after some preprocessing.

      • Real-Time Use: The model is designed for offline use, so adapting it for live video streams would need faster processing.

      • Supports Only One Language: The current system works with a single language; adding support for multiple languages could help it work in different parts of the world.

      • While this project used a unidirectional LSTM for temporal modeling, future improvements may include using a Bi-directional LSTM (Bi-LSTM), which can leverage both past and future context for each frame, potentially enhancing prediction accuracy.
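      In Keras, this future change would amount to wrapping the LSTM layer from the earlier architecture sketch in a Bidirectional wrapper, for example:

```python
from tensorflow.keras.layers import LSTM, Bidirectional

# Replace the unidirectional LSTM with a bidirectional one (illustrative sizes).
temporal_layer = Bidirectional(LSTM(128, dropout=0.3))
```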

    3. Potential Applications

      The system can be used for several real-world tasks:

      • Assistive Technology: Helps people who can't hear understand what others are saying by reading their lips.

      • Security and Surveillance: Recognizes words in places with loud noises or where there is no sound, like in security footage.

      • Silent Command Interfaces: Used in places where speaking is not allowed, like libraries or military bases.

      • Education and Accessibility: Supports learning lip reading or making captions for people who need special help.

      • Driver Monitoring: Could be used in cars to interact without needing to use audio.

  7. CONCLUSION

This paper introduces a deep learning system that uses both CNNs and LSTMs to recognize spoken words by looking at lip movements in videos. The combination of these two types of networks helps the model understand both what the lips look like at each moment and how they change over time. The system is trained on a large dataset of over 55,000 labelled video clips covering 15 different words. It reached a validation accuracy of 98.55%, showing it works very well for identifying spoken words from visual input. The accuracy and confusion matrix analysis confirm that the model can tell apart even similar lip movements with high accuracy.

This model helps solve some of the problems with traditional lip reading systems by working without audio. It can be used in several real-world situations, like helping people with hearing disabilities, improving security, or supporting silent communication. In the future, the model can be improved in these ways:

    • Recognize full sentences and longer phrases.

    • Work in real-time with fast processing for live use.

    • Support multiple languages and be trained on data from different speakers.

Overall, this system offers a strong foundation for creating new visual speech recognition systems that can help make communication more inclusive and accessible.

REFERENCES

  1. A. Katsaggelos, M. Kondoz, and T. Chen, "Video compression systems," IEEE Signal Processing Magazine, vol. 17, no. 2, pp. 45-58, 2000.
  2. C. Neti et al., "Audio-visual speech recognition," Johns Hopkins University Technical Report, 2000.
  3. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
  4. A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in IEEE ICASSP, 2013, pp. 6645-6649.
  5. Y. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: End-to-End Sentence-Level Lipreading," arXiv preprint arXiv:1611.01599, 2016.
  6. J. S. Chung and A. Zisserman, "Lip Reading in the Wild," in Asian Conference on Computer Vision, 2016, pp. 87-103.
  7. M. Wand, J. Koutník, and J. Schmidhuber, "Lipreading with long short-term memory," in IEEE ICASSP, 2016, pp. 6115-6119.
  8. S. Fenghour, D. Chen, K. Guo, B. Li, and P. Xiao, "Deep Learning-Based Automated Lip-Reading: A Survey," IEEE Access, vol. 9, pp. 121184-121205, 2021, doi: 10.1109/ACCESS.2021.3107946.
  9. V. Agrawal, M. Hazratifard, H. Elmiligi, and F. Gebali, "Electrocardiogram (ECG)-Based User Authentication Using Deep Learning Algorithms," Diagnostics (Basel), vol. 13, no. 3, p. 439, 2023, doi: 10.3390/diagnostics13030439.
  10. J. Jia, Z. Wang, L. Xu, J. Dai, M. Gu, and J. Huang, "An Interference-Resistant and Low-Consumption Lip Recognition Method," Electronics, vol. 11, no. 19, p. 3066, 2022, doi: 10.3390/electronics11193066.
  11. S. Fenghour, D. Chen, K. Guo, B. Li, and P. Xiao, "An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading," Sensors (Basel), vol. 21, no. 23, p. 7890, 2021, doi: 10.3390/s21237890.
  12. R. Itu, D. Borza, and R. Danescu, "Automatic extrinsic camera parameters calibration using convolutional neural networks," in 2017 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 2017, pp. 273-278, doi: 10.1109/ICCP.2017.8117016.

  12. R. Itu, D. Borza and R. Danescu, "Automatic extrinsic camera parameters calibration using convolutional neural networks," 2017 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 2017, pp. 273-278, doi: 10.1109/ICCP.2017.8117016.