An Application to Convert Lip Movement into Readable Text


Saakshi Bhosale, Rohan Bait, Shivangi Jotshi, Rohan Bangera, Prof. Jinesh Melvin

Department of Computer Engineering, PCE, Navi Mumbai, India – 410206

Abstract: Lip-reading is the process of interpreting spoken words by observing lip movement, with or without (noisy) audio. The input to the application is the sequence of frames from a video. The application must detect the face, and specifically the lips, in each frame, then trace and interpret the patterns of lip movement over time. This is done using computer vision for feature extraction and a deep Convolutional Neural Network (CNN) model for classification. Lip-reading systems are difficult to implement because of complex image processing, hard-to-train classifiers and long-term recognition processes. Automatic lip-reading technology is an important component of human-computer interaction and is valuable for both spoken communication and visual perception. Traditional lip-reading systems usually consist of two stages: feature extraction and classification. In the first stage, many methods use pixel values extracted from the mouth region of interest (ROI) as the visual representation. Deep learning has since made significant progress across computer vision (image representation, object detection, human behaviour recognition and video recognition), so automatic lip-reading has shifted from traditional hand-crafted feature extraction and classification methods to end-to-end deep learning architectures.

Keywords: Lip tracking, Phoneme, 3D CNN Model, facial points, active frames.

  1. INTRODUCTION

    With the increase in technological advancements, lip-reading systems have gained major attention. These systems are a great help to the hearing impaired, as they convert visible lip movements into easily readable text even when no usable audio is available. They have also proved extremely helpful in surveillance and government operations, where speech must be recovered from silent camera footage.

    The first step of a lip-reading system is to track facial points. The system must be able to isolate the lips from the rest of the face. The advantage of this method is that lips of different shapes and sizes can be tracked, giving the system an exact set of points for the lips.

    Once the lips have been tracked, the major challenge is to recognize the lip movement. Every person's lip movement differs because of the many possible lip shapes and sizes. What these movements share is that, for a specific phoneme, the distance between the two lips follows a predictable pattern. A phoneme is the smallest unit of sound produced when pronouncing a syllable in a language. The distance is measured as the Euclidean distance between lip points, and it is specific to a syllable.
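The per-frame measurement described above can be sketched as follows; the coordinates are hypothetical pixel positions, not values from the paper:

```python
import numpy as np

def lip_aperture(upper_lip, lower_lip):
    """Euclidean distance between an upper- and a lower-lip point."""
    return float(np.linalg.norm(np.asarray(upper_lip) - np.asarray(lower_lip)))

# Hypothetical pixel coordinates for two mouth states:
closed = lip_aperture((120, 200), (120, 204))  # small gap: lips nearly touching
open_ = lip_aperture((120, 195), (120, 215))   # large gap: e.g. an open vowel
```

Tracked over consecutive frames, these distances form the sequence that is later matched to a phoneme.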

    Hence, the classification of the lip movements will be done on the basis of this distance.

    Because this classification operates on video data, a three-dimensional Convolutional Neural Network (3D CNN) model is best suited for this project.

  2. THE CNN MODEL

This project uses a three-dimensional Convolutional Neural Network model for training, testing and prediction. CNNs are the go-to models for image-based problems and give the best accuracy on multimedia tasks. A CNN learns its important features automatically during training, so no hand-engineered feature design is required.

CNNs are also computationally efficient. They use operations such as convolution and pooling, and share parameters across spatial locations. A CNN model can run on almost any device, making it universally useful.

2.1 Architecture

Convolution-

The convolutional layer is the main building block of a CNN. Convolution is a mathematical operation that merges two sets of information. A convolutional filter is applied to the input data to produce a feature map: the filter slides over the input, and at every location an element-wise multiplication is performed and the results are summed. This sum goes into the feature map.
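The sliding-window operation described above can be sketched in NumPy (a plain "valid" convolution without padding, stride 1; as in most deep-learning layers, no kernel flip is applied):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`; at each position multiply element-wise
    and sum the result into the feature map ('valid' mode, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    fmap = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            fmap[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return fmap

image = np.arange(16, dtype=float).reshape(4, 4)
fmap = convolve2d(image, np.ones((2, 2)))  # 4x4 input -> 3x3 feature map
```

A 3D CNN extends the same idea with a third (temporal) axis so the filter also slides across consecutive video frames.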

Fig. Convolution of multiple layers

Pooling-

After a convolution operation, a pooling operation is performed, mainly to reduce the dimensionality of the layers. This enables us to reduce the number of parameters, which both shortens the training time and combats overfitting. Pooling layers reduce the height and width of each feature map independently but keep the depth intact.

Unlike the convolution operation, pooling has no parameters. It slides a window over its input and, in the case of max pooling, simply takes the maximum value in the window. Here too, we specify the size and stride of the window.
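A minimal max-pooling sketch, following the window-and-stride description above:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Slide a size x size window with the given stride and keep the
    maximum in each window; height and width shrink, depth is untouched."""
    oh = (fmap.shape[0] - size) // stride + 1
    ow = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = fmap[r:r + size, c:c + size].max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8],
                 [9, 2, 1, 0],
                 [3, 4, 5, 6]], dtype=float)
pooled = max_pool(fmap)  # 4x4 feature map -> 2x2
```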

Fig. Max Pooling

Architecture of model in this Application-

Fig. CNN Architecture

  3. SYSTEM ARCHITECTURE

      1. Video Input

        The Lip-Reading Application will function on the basis of the input received from motion pictures or videos. These inputs will be in the form of vectors that have traced the lips and face structures in the input frame.

      2. Tracking

        The very first step is to detect the 62 facial points of the subject, covering the lips, eyes, nose and eyebrows. From these points, the distance between the lips and the shape formed by the lips can be calculated, and from those measurements each spoken word can be predicted. The detected points are stored as vectors, and the distances between them are measured and saved.
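With dlib's widely used 68-point landmark model (an assumption for this sketch; the paper cites 62 points), the lip outline occupies indices 48-67, and the lip measurements can be sliced out as follows:

```python
# Landmark indices in dlib's 68-point face model: the lips are points 48-67,
# with 62 and 66 at the centre of the inner upper and lower lip.
LIP_START, LIP_END = 48, 68

def lip_landmarks(points):
    """Slice the 20 lip landmarks out of a full list of 68 (x, y) points."""
    return points[LIP_START:LIP_END]

def mouth_aperture(points):
    """Vertical gap between inner upper lip (62) and inner lower lip (66)."""
    (x1, y1), (x2, y2) = points[62], points[66]
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

# With dlib, the 68 points would come from (setup sketch, model file required):
#   detector = dlib.get_frontal_face_detector()
#   predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
#   shape = predictor(gray_frame, detector(gray_frame)[0])
#   points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```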

      3. Data Pre-processing

        In real-time face tracking and lip-reading systems, head pose, head movements, camera-to-head distance and head direction of the speaker are important parameters and these parameters affect lip reading performance and robustness of the system.

        It is necessary because of the following three main problems that may occur while collecting data:

        • People have different lip sizes, contour shapes and mouth height/width ratios, and these differences may cause different classifications.

        • The distance between the camera and the speaker affects the pixel distances between lip points in the image, so classification algorithms based on these distances may suffer from those differences.

        • If a speaker turns his/her head slightly to the left or right, the height/width ratio of the mouth may change, which may lead to incorrect classifications.

        Therefore, data cleaning and preprocessing are very important before classification.
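One common way to make the measured distances robust to camera distance (a sketch under the 20-lip-point layout described in the tracking step, not the paper's exact procedure) is to divide mouth height by mouth width, so absolute scale cancels out:

```python
def normalized_aperture(lip_pts):
    """Mouth height divided by mouth width. The ratio is scale-invariant,
    so camera-to-head distance and absolute lip size cancel out.
    `lip_pts` is the 20-point lip slice (dlib indices 48-67)."""
    (lx, ly), (rx, ry) = lip_pts[0], lip_pts[6]    # mouth corners (48, 54)
    (tx, ty), (bx, by) = lip_pts[14], lip_pts[18]  # inner top/bottom (62, 66)
    width = ((lx - rx) ** 2 + (ly - ry) ** 2) ** 0.5
    height = ((tx - bx) ** 2 + (ty - by) ** 2) ** 0.5
    return height / width if width else 0.0
```

Head rotation still distorts the ratio, which is why pose also has to be handled during preprocessing.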

      4. Word Segmentation

        In lip-reading systems, the first step is to determine the starting and ending points of a word in a speech video. Detecting where a word starts and ends is called word segmentation. Word segmentation takes place after lip activation has been detected.

        A computer program displays a word to the speaker, and the speaker reads that word while the camera records the data. Assume 10 frames are classified as active (1) or passive (0) in the order [0 0 0 0 0 1 1 1 1 1], i.e. 5 passive frames followed by 5 active frames. The frame where 1 first occurs, the 6th frame, is taken as the starting frame of the word. If the frame sequence is instead [1 1 1 1 1 0 0 0 0 0], i.e. 5 active frames followed by 5 passive frames, then the first occurrence of 0 is taken as the ending frame of the word. Using these starting and ending frames, the words can be segmented from the input video.
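The frame-label logic above can be sketched as follows (here the returned end index is the last active frame, i.e. just before the first passive frame):

```python
def segment_word(active):
    """Given per-frame labels (1 = lips active, 0 = passive), return the
    (start, end) frame indices of the word, or None if no frame is active."""
    if 1 not in active:
        return None
    start = active.index(1)                        # first active frame
    end = len(active) - 1 - active[::-1].index(1)  # last active frame
    return start, end
```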

      5. Classification

        Classification of data is done on the basis of lip distance for every phoneme. The aim here is to identify a sequence for every phoneme. This sequence refers to the exact movement made by the lips to produce that particular phoneme. Once this sequence has been identified for every phoneme in the English language, using a multi-dimensional array, every sequence will be added to its respective class.
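As a simplified, distance-based stand-in for the 3D CNN classifier (the template values below are purely illustrative, not taken from the paper), a lip-distance sequence can be assigned to the class whose stored sequence is nearest:

```python
def classify_phoneme(sequence, templates):
    """Assign a lip-distance sequence to the phoneme whose stored template
    sequence (same length) is nearest in Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(templates, key=lambda p: dist(sequence, templates[p]))

# Illustrative templates: an open vowel vs. a closed bilabial consonant.
templates = {"aa": [0.6, 0.7, 0.6], "m": [0.1, 0.0, 0.1]}
```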

      6. Interpreting Dataset

        According to the values received, we interpret a dataset. The dataset includes various words and the approximate lip distances required for pronouncing each word; matching the tracked values against it predicts the spoken word. The dataset is organized around the segmented words: for each word, it stores the sequence of parting distances between the lips needed to pronounce each alphabet in that word. By relating the values received from tracking to the dataset, we obtain the spoken word, which can be built up further into a phrase.

        The words in the dataset are thus grouped by these distance sequences, and depending on the distances of the tracked lips, an input is assigned to the matching group.

      7. Prediction

        The final step is a prediction of words or phrases. These predicted words will be displayed as captions under the video to enable reading simultaneously while watching.

        Figure 3.1 Flow Chart of Proposed System

  4. DATASET

    The dataset used in this project has been downloaded from GRID. GRID is a large multi-talker audiovisual sentence corpus designed to support joint computational-behavioural studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9 now".

    For each speaker category, i.e. male and female, four file formats are available. The files include:

        • Audio only

        • Video (normal quality)

        • Video (high quality)

        • Transcript

    Audio files were scaled on collection to have an absolute maximum amplitude of 1 and downsampled to 25 kHz. These signals have been endpointed. In addition, the raw original 50 kHz signals are included. Video files are provided in two formats: normal quality (360×288; ~1 kbit/s) and high quality (720×576; ~6 kbit/s).
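Each GRID transcript line follows a fixed six-word grammar (command, colour, preposition, letter, digit, adverb), so a transcript can be parsed directly:

```python
FIELDS = ["command", "colour", "preposition", "letter", "digit", "adverb"]

def parse_grid_sentence(sentence):
    """Split a GRID transcript line into its six fixed grammatical slots."""
    words = sentence.split()
    assert len(words) == len(FIELDS), "GRID sentences always have six words"
    return dict(zip(FIELDS, words))

parsed = parse_grid_sentence("bin green with y eight please")
```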

    Fig. A sample of the dataset

  5. DATA FLOW DIAGRAM OF SYSTEM

    Fig. Data Flow Diagram

    As depicted in the diagram, the user provides a video input, with or without audio. The dlib library is used to extract features from the face: features such as the lips, nose and eyes are extracted, and facial coordinates are produced. These facial coordinates are passed through the model, which is stored in a .json file along with weights stored in a .p file. These weights were saved previously, during the construction of the model. The final output of the model is an output string containing the spoken sentence as text.

    ACKNOWLEDGMENT

    We would like to extend our deepest gratitude to our Project guide Prof. Jinesh Melvin for his exemplary guidance, monitoring and constant encouragement throughout this project which helped us improve our work.

    We would also like to extend our gratitude to our Head of Computer Engineering Department Dr. Sharvari Govilkar for providing us with an opportunity and platform to carry out this project.

    We are also extremely grateful to our Principal Dr. Sandeep Joshi who provided us with this golden opportunity as well as all the facilities needed to carry out this project successfully.

    Table 4.1 Dataset Sample

    bin green with y eight please

    set white by f eight please

    lay red in c four now

    place blue at I five please

    place blue at I one soon

    set blue with l eight now

    bin white by f six soon

    place red by u one soon

    set green with c zero again

    lay white with d six soon
