Intelligent Epidemic Prevention System Based on Voice and Gesture

Haiguang Chen, Mingxing Liu, Susheng He

Shanghai Normal University

Abstract: At the end of 2019, the COVID-19 epidemic broke out, and medical personnel became one of the scarcest resources in every country. Protecting their lives is an urgent task. Taking the contactless principle as its core, this paper develops an intelligent epidemic prevention system based on voice and gestures, whose core algorithms are a speech recognition module built on the deep learning framework GRU+CTC and a gesture recognition module built on KNN and SVM. Experimental results show that both the voice module and the gesture module achieve relatively high accuracy and efficiency.

Keywords: CNN+GRU+CTC, KNN, SVM, Voice, Gesture Recognition


Since December 2019, the novel coronavirus disease (COVID-19) has spread on a large scale around the world. The "contactless economy" enabled by artificial intelligence and machine learning has been indispensable in this fight, and is widely used in community material transportation, automatic temperature measurement in public places, disinfection, and intelligent hotel services. Under the "no contact" principle, the inside and outside of an epidemic isolation ward are separated by a thick layer of glass. Medical staff wear heavy protective clothing, masks, and goggles, which makes communication between the inside and the outside very inconvenient. Using walkie-talkies at night may also disturb the rest of the patients in the isolation ward.

This paper combines the "no contact" principle with artificial intelligence to design a gesture and voice intelligent epidemic prevention system. Some routine tasks (such as injections and administering medicine) are mapped to simple gestures. Medical staff in the isolation ward make a gesture, and the system recognizes it, converts it into voice, and sends it promptly to the medical staff in the duty room, transmitting work information while keeping the isolation ward quiet. The system also has a voice dialogue function, so that patients can communicate with it. The algorithm framework of this paper is shown in Figure 1. The main contributions of this paper are:

(1) The intelligent epidemic prevention system we developed is highly usable in the current epidemic environment;

(2) We study the gesture module and the voice module together, combining the advantages of both; the system can also improve efficiency and save time under normal circumstances.

Figure 1 The basic flow of the algorithm


With the development of deep learning in recent years, speech recognition technology has also made considerable progress. DNN, CNN, and LSTM are among the mainstream directions. CNN[1] was introduced into speech recognition in 2012 by Ossama Abdel-Hamid. Initially, convolutional layers and pooling layers alternated, the convolution kernels were relatively large, and the networks had few layers; they were mainly used to process features so that these could be better used for DNN classification. As CNN excelled in the image field, VGGNet[2], GoogleNet[3], and ResNet[4,5] provided more ideas for CNN in speech recognition. For example, stacking several convolutional layers before a pooling layer and reducing the size of the convolution kernel make it possible to train deeper and better CNN models. However, CNN has no memory ability: it can only handle a specific visual task, and cannot handle new tasks based on previous memories.

Recurrent Neural Network (RNN) is based on the idea of a memory model. The network is expected to remember features that appeared earlier and to infer subsequent results from them. Because the overall network structure loops continuously, it is called a recurrent neural network. The two most widely used variants of recurrent neural networks are LSTM[6] and GRU[7]. LSTM is the long short-term memory network. As the name suggests, it still addresses short-term memory, but this short-term memory lasts relatively long, so it can solve the problem of long-term dependence to a certain extent. GRU merges the forget gate and the input gate into a single "update gate". At the same time, the network no longer maintains a separate memory state, but passes the output on as the memory state from step to step, so the input and output of the network become particularly simple.
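The GRU behaviour described above can be illustrated with a minimal sketch of a single time step. It assumes the common formulation in which the new state is (1 - z) * h_prev + z * h_candidate (some references swap the roles of z and 1 - z). The parameter names (Wz, Uz, bz, and so on) and the helper zero_params are illustrative only and are not taken from the paper's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def affine(W, U, b, x, h):
    """Compute W @ x + U @ h + b element by element."""
    return [wx + uh + bias
            for wx, uh, bias in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h_prev, p):
    """One GRU step; p is a dict of (hypothetical) weights and biases."""
    # Update gate: plays the combined role of LSTM's forget and input gates.
    z = [sigmoid(a) for a in affine(p["Wz"], p["Uz"], p["bz"], x, h_prev)]
    # Reset gate: controls how much of the old state feeds the candidate.
    r = [sigmoid(a) for a in affine(p["Wr"], p["Ur"], p["br"], x, h_prev)]
    h_reset = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a)
              for a in affine(p["Wh"], p["Uh"], p["bh"], x, h_reset)]
    # The returned vector is both the output and the next memory state.
    return [(1.0 - zi) * hp + zi * hc
            for zi, hp, hc in zip(z, h_prev, h_cand)]


def zero_params(n_in, n_hidden):
    """All-zero parameters, convenient for checking the gate arithmetic."""
    p = {}
    for gate in "zrh":
        p["W" + gate] = [[0.0] * n_in for _ in range(n_hidden)]
        p["U" + gate] = [[0.0] * n_hidden for _ in range(n_hidden)]
        p["b" + gate] = [0.0] * n_hidden
    return p
```

Because gru_step returns a single vector that serves as both the output and the next state, chaining steps only requires feeding the result back in as h_prev, which is the simplification relative to LSTM that the text refers to.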

Connectionist Temporal Classification (CTC)[8] is a temporal classification algorithm proposed by Graves et al. in 2006. Unlike the cross-entropy loss commonly used by traditional models, which requires each input frame to be aligned with a label in advance, CTC lets the model learn the alignment by itself, thereby saving time and improving efficiency. Therefore, this paper adopts a speech recognition method based on a hybrid CNN+GRU+CTC model.
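One way to see why CTC needs no pre-computed alignment is its collapsing rule: the network emits one symbol (or a special blank) per frame, and repeated symbols are merged before blanks are removed. The sketch below shows greedy (best-path) decoding under that rule. The choice of greedy decoding, the function name ctc_greedy_decode, and the toy alphabet are assumptions for illustration; the source does not state which decoder the system uses.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_probs: one probability list per frame, indexed like `alphabet`.
    alphabet:    symbols; the entry at index `blank` is the CTC blank.
    """
    # Pick the most probable symbol index in every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    decoded = []
    previous = None
    for index in best_path:
        # Merge consecutive repeats, then drop blanks.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Toy example with a three-symbol alphabet ("-" stands for the blank).
TOY_ALPHABET = ["-", "a", "b"]
TOY_FRAMES = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two "a"s
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
    [0.1, 0.1, 0.8],    # b (repeat, merged)
]
```

Here the frame-level path "aa-abb" collapses to "aab": the blank between the second and third frames is what allows a genuinely doubled letter to survive the merge step.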

The other major module of the system is the gesture recognition module. The term gesture recognition refers to the entire process of tracking human gestures, recognizing their representations, and converting them into semantically meaningful commands[9]. Generally speaking, depending on whether gesture information is collected through contact or without contact, gesture interaction systems can be divided into two types: those based on contact sensors and those based on non-contact sensors.

Gesture recognition based on contact sensors usually relies on technologies that use multiple sensors, such as data gloves, accelerometers, and multi-touch screens. In 2004, Kevin[10] et al. designed a wireless instrumented glove, "CyberGlove II", for gesture recognition; in 2007, Bourke[11] et al. proposed a recognition system that uses an accelerometer to detect normal gestures in daily activities. Gesture recognition based on non-contact sensors usually relies on optical sensing, radar detection, and similar technologies. In 2002, Bretzner[12] et al. proposed gesture recognition that uses a camera to collect multi-scale color features.

This paper adopts non-contact gesture recognition: a camera collects 10 kinds of gestures, the gesture images are then classified with the KNN and SVM algorithms so that each gesture is assigned a meaning, and that meaning is spoken aloud by voice.
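The classification step can be sketched with a minimal k-nearest-neighbours classifier. This shows only the KNN half; the source does not say how KNN and SVM are combined or which image features are used. The two-dimensional feature vectors, the gesture labels, and the label-to-message table below are all hypothetical placeholders.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort training points by Euclidean distance to the query.
    neighbours = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical mapping from a recognized gesture to the message to be spoken.
GESTURE_MEANINGS = {
    "fist": "Injection needed",
    "palm": "Medicine needed",
}


def gesture_to_message(train_features, train_labels, query, k=3):
    """Classify a gesture feature vector and look up its spoken meaning."""
    label = knn_predict(train_features, train_labels, query, k)
    return GESTURE_MEANINGS.get(label, "Unknown gesture")


# Toy training set: two well-separated clusters of 2-D feature vectors.
TOY_FEATURES = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
TOY_LABELS = ["fist", "fist", "fist", "palm", "palm", "palm"]
```

In a real system the feature vectors would come from the camera images, and the returned message would be passed to a text-to-speech component rather than returned as a string.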

During the current global COVID-19 outbreak, this intelligent epidemic prevention system can be trained on a data set of common questions fed into a deep neural network, so that it answers patients' questions and reduces the pressure on medical staff. Where contact is difficult and quiet must be maintained, it transmits information through gestures. It can also hold ordinary conversations, so it can be widely used both in special epidemic environments and on ordinary occasions.


This intelligent epidemic prevention system combines a voice module and a gesture module. The voice module is first activated by a wake-up module, which is implemented with Snowboy. The core algorithm of the voice module is a speech recognition algorithm based on the CNN+GRU+CTC framework. The gesture module is mainly implemented with KNN and SVM. The main modules are described in detail below.

3.1 Snowboy Wake-up Mechanism
3.2 CNN+GRU+CTC Framework
3.3 Gesture Recognition
4.1 Introduction to Dataset
4.2 Analysis of Experimental Results
