Emotion Recognition Based on Multimodal Physical-Bio Signals and Video Signals

- Today, detecting and classifying human emotions has become an important topic in both research and everyday life. Emotion detection and classification are becoming more detailed and accurate thanks to developments in fields such as electronics, sensors, and computer engineering. Emotion recognition has been studied with different data collection methods, and one of the most popular and effective is the use of physical-bio sensors. Sensor-based approaches can provide accurate, sustained biological information that is robust to external influences and interference, especially when compared with approaches such as image or video processing. In this paper, we describe a method for classifying and assessing emotions based on a combination of signals collected from physical-bio sensors, video capture, and machine learning methods. Specifically, we describe the platform of a physical-bio signal collection system, the data collection process, and the processing system used to characterize emotional behavior. We also show that a combination of physical-bio sensor acquisition, video capture, and machine learning can achieve a recognition accuracy of 83.2%.


I. INTRODUCTION
The problem of classifying, evaluating and recognizing human emotions is currently attracting considerable research attention. Emotion recognition can be described as an identification problem in which the inputs are physical-bio signals such as heart rate, oxygen saturation, and blood pressure, or observations collected from images, cameras, or assessments by doctors and specialists [1]. The output of processing the collected information is a conclusion about the emotional state, such as happiness, sadness, or anger. Identifying and classifying emotions is a complex problem because many different factors affect it [2]. In addition, emotion recognition is mostly based on logical reasoning from the input information, while emotions themselves often do not behave logically [3]. For example, a person in an extremely good, comfortable mood still cannot describe that state in a way that is consistent with how others would describe it. Emotion recognition therefore remains a challenging problem.
Despite these difficulties and challenges, the problem of emotion classification and recognition remains a research focus for the following reasons:
• By understanding human emotions, information systems can increase their level of interaction with people, creating entirely new user experiences. For example, an audio system can lower the music volume during periods of extreme stress or fear, or suggest suitable movies according to the viewer's mood. Computers and entertainment systems can also identify users' reactions, such as interest or annoyance, when entertainment content is shown. Entertainment and information systems thus become everyday companions rather than mere tools [4].
• Emotion evaluation and recognition can also be used as a means of physical-biological monitoring for self-assessment and tracking of human emotional states. Such monitoring has many benefits: it can improve communication, or help evaluate emotional behaviors that have negative effects on people and society, so that appropriate responses can be found to limit or improve them [5].
As mentioned above, many methods are used to collect emotional signal information, including physical-bio signals (heart rate, blood pressure, etc.), factors such as facial expressions, speech rhythm, and gestures, and objective assessments from experts and doctors [6]. Each collection method has different advantages and disadvantages. In this paper, we focus on the approach used in our current research: physical-bio sensors such as Electromyography (EMG), Electrocardiogram (ECG), and Electrodermal Activity (EDA) sensors are used to collect biomedical information. Biological sensors have several outstanding advantages compared to other methods. For example, measuring devices are becoming more and more compact as the built-in sensors shrink, so a physical-bio acquisition device can be as small as a wearable or even a piece of jewelry, giving users a sense of privacy and individuality compared to approaches in which they are "supervised" by cameras or other audio and video recording devices [7], [8].

International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, http://www.ijert.org

Our goal is to collect the physical-bio signals of a person under different real-life conditions to detect emotions automatically. We propose a method for multimodal detection of emotions using physical-bio signals. The paper is structured as follows. Section II gives a brief state of the art on multimodal emotion recognition and the different methods for merging signals. Section III explains all the steps of the proposed method in detail. Section IV compares the results of related work with those of our proposed model. Finally, conclusions and future work are reported in Section V.

II. RELATED WORKS
In general, an emotion recognition system is based on three fundamental steps:
• Acquisition of the signals,
• Feature extraction,
• Detection of emotions.
Some research has focused on the detection of emotions using facial expressions, vocal expressions, or physiological signals [9], [10], [11]; however, fewer studies have focused on multimodal emotion recognition [12].
In addition, a multimodal approach not only enhances the recognition rate but also makes the system more robust when individual modalities are acquired in a noisy environment and are less accurate [13].
In theory, there are three methods [14] to merge the signals from various sensors: fusion at the signal level (fusion of physiological signals), fusion at the feature level (fusion of features), and fusion at the decision level (fusion of decisions) [15], [16].
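To make decision-level fusion concrete, the sketch below shows a minimal weighted-voting scheme: each modality outputs its own emotion label, and the labels are combined using per-modality weights. The weights and labels here are illustrative assumptions, not values from the paper.

```python
# Illustrative decision-level fusion: each modality votes for an emotion
# label, and votes are combined with per-modality weights (made-up values).
from collections import defaultdict

def fuse_decisions(predictions: dict, weights: dict) -> str:
    """predictions: modality -> predicted label; weights: modality -> float."""
    scores = defaultdict(float)
    for modality, label in predictions.items():
        scores[label] += weights.get(modality, 1.0)
    # Return the label with the highest accumulated weight
    return max(scores, key=scores.get)

# EMG and EDA agree on "Angry", together outweighing the ECG vote
label = fuse_decisions(
    {"EMG": "Angry", "ECG": "Disgust", "EDA": "Angry"},
    {"EMG": 1.0, "ECG": 0.8, "EDA": 0.9},
)
```

The weighting step is what later allows a modality that detects a given emotion well to count more toward the final decision.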

III. METHODOLOGY
In this section, we propose a multimodal, automatic method of emotion recognition based on decision-level fusion. Our method is divided into two major phases: the Training phase and the Recognition phase.

A. Training phase
This phase consists of three steps (signal preprocessing, feature extraction, and training) in order to build a training base that is then used in the Recognition phase for the automatic detection of emotions, as shown in Figure 1.

• Preprocessing
In this step, we first acquire the physical-bio signals (we measure these signals directly and collect them in a database named UTE-EMOTICA). We isolate the part of the signal corresponding to a given emotion, since we know the period in which each emotion is expressed. We then filter the signal to remove noise, which facilitates feature extraction. We opted for filtering by convolution, which consists of convolving the signal in the spatial domain with a filter (a Hanning filter is chosen). This method is computationally inexpensive.
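A minimal sketch of this preprocessing step is shown below: a noisy signal is smoothed by convolving it with a normalized Hanning window. The window length (31 samples) and the synthetic test signal are illustrative assumptions, not values from the paper.

```python
# Smoothing a noisy physical-bio signal by convolution with a Hanning
# window (window length of 31 samples is an assumed, illustrative choice).
import numpy as np

def hanning_filter(signal: np.ndarray, window_len: int = 31) -> np.ndarray:
    """Smooth a 1-D signal by convolving it with a normalized Hanning window."""
    window = np.hanning(window_len)
    window /= window.sum()  # normalize so the filter preserves signal amplitude
    # mode="same" keeps the output aligned with the input length
    return np.convolve(signal, window, mode="same")

# Example: a noisy 2 Hz sine sampled at 250 Hz
rng = np.random.default_rng(0)
t = np.linspace(0, 2, 500)
noisy = np.sin(2 * np.pi * 2 * t) + 0.3 * rng.standard_normal(t.size)
smoothed = hanning_filter(noisy)
```

A single `np.convolve` call keeps the cost linear in the signal length, which matches the paper's motivation for choosing convolution-based filtering.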

• Feature Extraction
We then proceed to peak detection, which is done by calculating the gradient of the signal and finding sign changes in the gradient, because it is rare to find points in discrete signals where the gradient is exactly zero. A maximum is indicated by a transition from a positive gradient to a negative gradient, and a minimum by a transition from a negative gradient to a positive gradient. To detect and isolate a peak, our method detects a minimum, followed by a maximum, followed by a minimum.
Once a peak is isolated, we calculate a feature vector composed of five features: the mean, the variance, the mean of the filtered signal, the variance of the filtered signal, and the amplitude of the peak.
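The peak-isolation and feature-computation steps above can be sketched as follows. Sign changes in the gradient mark extrema, and the five features match the list in the text; the inner filter length (11 samples) and the Gaussian test segment are illustrative assumptions.

```python
# Gradient-sign-change peak detection and per-peak feature vector, as a
# minimal sketch of the method described in the text.
import numpy as np

def find_extrema(signal: np.ndarray):
    """Return indices where the gradient changes sign: (minima, maxima)."""
    grad = np.gradient(signal)
    sign_change = np.diff(np.sign(grad))
    maxima = np.where(sign_change < 0)[0]  # positive -> negative gradient
    minima = np.where(sign_change > 0)[0]  # negative -> positive gradient
    return minima, maxima

def peak_features(segment: np.ndarray) -> np.ndarray:
    """Five features per isolated peak: mean, variance, mean and variance
    of the filtered segment, and peak amplitude."""
    window = np.hanning(11)
    filtered = np.convolve(segment, window / window.sum(), mode="same")
    return np.array([
        segment.mean(), segment.var(),
        filtered.mean(), filtered.var(),
        segment.max() - segment.min(),  # amplitude of the peak
    ])

# Example: one clean peak (rise to a maximum, then fall)
t = np.linspace(0, 1, 200)
segment = np.exp(-((t - 0.5) ** 2) / 0.01)  # Gaussian bump
features = peak_features(segment)
```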
For the video signal, feature extraction for facial emotion consists of three steps. Faces are detected in the input image in step 1, then transformed and aligned using a facial landmark method in step 2, before being fed into the trained model in the final step. The first two steps are essential for preparing the input data for the CNN in the final step.
In step 1, we use the Multi-Task Cascaded Convolutional Neural Network (MTCNN), which is considered state-of-the-art for face detection. This model consists of three separate networks: the Proposal Network (P-Net), the Refine Network (R-Net), and the Output Network (O-Net), as depicted in Figure 2. After the detection step, face alignment is highly recommended before moving to the feature extraction phase. Since the recognition system depends on how the face is oriented towards the camera, the accuracy of face recognition can be increased by aligning the face through translation, rotation, and scaling.
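A small, hypothetical sketch of the rotation part of the alignment step is shown below: given the two eye landmarks produced by a face detector, the angle that makes the eye line horizontal is computed, which could then be applied as an affine warp (for example, with OpenCV). The landmark coordinates are made-up values, not output from the paper's pipeline.

```python
# Rotation angle for eye-based face alignment (illustrative sketch).
import math

def eye_alignment_angle(left_eye, right_eye):
    """Angle in degrees to rotate the face so the eye line becomes horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

# Example: the right eye sits 10 px lower than the left eye
angle = eye_alignment_angle((100, 100), (110, 110))
```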
In the final step, a 128-D feature vector is extracted for every frame. To extract the representation vector for each frame, a 6-layer Convolutional Neural Network (CNN) is used. We use a CNN model pre-trained on a public Kaggle emotion dataset.
There are 280 samples of 7 emotions in the UTE-EMOTICA database. At each frame, 5 features are extracted from each bio-signal. With 3 types of bio-signal, we obtain 15 features per frame in total. We then concatenate them with the 128-D vector extracted from the video frame at the same time step, yielding a 143-D feature vector.
After feature extraction, a linear SVM classifier is used to train the recognition model. The Support Vector Machine (SVM) has been widely used in various pattern recognition tasks, and it is believed that SVMs can achieve near-optimal separation between classes [15]. The embedded data from the representation step is used as input to the SVM classifier. In a linear SVM, a data point is viewed as an n-dimensional vector in n-dimensional space, and the SVM's goal is to separate such points with an (n − 1)-dimensional hyperplane [16]. An SVM training algorithm builds a model of the data points in space so that the points of the separate classes are divided by a clear gap that is as wide as possible [16]. Given a training set of labeled samples, the SVM tries to find a hyperplane that distinguishes the samples with the smallest error [15].
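A minimal, illustrative training sketch for this classification step is shown below, using scikit-learn's linear SVM on synthetic 143-D frame vectors that stand in for the fused features (the UTE-EMOTICA data is not public, and the class offsets are arbitrary assumptions).

```python
# Linear SVM training on synthetic fused frame vectors: one well-separated
# Gaussian blob per emotion stands in for the real training data.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)
n_per_class, dim = 40, 143
emotions = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

# Blob i is centered at 3*i in every dimension, so classes are separable
X = np.vstack([rng.standard_normal((n_per_class, dim)) + 3 * i
               for i in range(len(emotions))])
y = np.repeat(emotions, n_per_class)

clf = LinearSVC().fit(X, y)
pred = clf.predict(X[:1])  # predict the emotion of one fused frame vector
```

In the recognition phase, new fused frame vectors are passed to the same `predict` call, and the class on whose side of the hyperplane the point falls is returned.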

B. Recognition phase
In the recognition process, new input examples are mapped into the same space and predicted to belong to a class based on which side of the gap they fall on [16]. The SVM returns the label with the maximum score, which represents the confidence of the closest match within the trained data.

IV. RESULTS
For these results, we use the physical-bio signals of the UTE-EMOTICA database. For this database, three physical-bio sensors (EMG, ECG, and EDA) and facial video capture were used. During collection, 7 emotions were taken into account: "Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral", and every emotion was maintained for five minutes per day.
The results obtained by our algorithm with the unimodal emotion recognition approach are grouped in the histogram below. This approach yields a mean recognition rate of 63.52%.
We consider two models:
Modal 1: two physical-bio signals (EMG and ECG) plus video capture.
Modal 2: three physical-bio signals (EMG, ECG, and EDA) plus video capture.
As shown in Figure 3, certain emotions are better detected with certain modalities than with others. The EMG modality better detects the "Angry" and "Disgust" emotions, while the ECG modality better detects "Disgust" and "Surprise". The EDA modality allows better detection of "Angry" and "Fear". This characteristic of the modalities is crucial, because it makes it possible to weight each modality according to how well it detects a given emotion, for the purpose of more efficient detection. We subsequently extended our method to the multimodal approach to increase the emotion recognition rate. This multimodal approach achieves a recognition rate of 74.3% with Modal 1 and 83.2% with Modal 2, a considerable improvement over the unimodal approach, which achieved 63.52%. The results grouped in the table above present average recognition rates. Furthermore, our method detects each of the seven emotions with a good recognition rate: the minimum of 62.4% is obtained for the "Fear" emotion, and the maximum of 94.11% for the "Angry" emotion. Table 2 below compares our results with the different state-of-the-art methods that allow instantaneous detection of emotions.

V. CONCLUSION AND PERSPECTIVES
We have proposed an enhanced method for multimodal recognition of emotions based on the processing of physical-bio signals. These signals from 2 modalities were applied to the recognition of 7 basic emotions (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral). Our proposed multimodal recognition method based on machine learning has been developed and deployed. The results illustrate a remarkable improvement in the emotion recognition rate.
In our future work, we will create more samples for the UTE-EMOTICA database, for both the physical-bio signal and video acquisition platforms, in order to generalize our recognition algorithm. Besides, a more flexible and convenient platform for the acquisition of physical-bio signals for emotion detection will be built. Moreover, our proposed system will be improved with a more appropriate recognition base for a variety of people.