Analysis of Vocal Pattern to Determine Emotions using Machine Learning


Mrs. Veena Potdar1, Mrs. Lavanya Santosh2, Supritha M Bhatt3, Yashaswini M K4

1Associate Professor, Department of Computer Science and Engineering, Dr. Ambedkar Institute of Technology, Bengaluru, India

2Assistant Professor, Department of Computer Science and Engineering, Dr. Ambedkar Institute of Technology, Bengaluru, India

3,4Students, Department of Computer Science and Engineering, Dr. Ambedkar Institute of Technology, Bengaluru, India

Abstract:- The ability to vary vocal sounds in order to produce speech is one of the major features that sets humans apart from other living beings. Human emotion can be characterized by several attributes such as pitch, timbre, loudness, and vocal tone, and humans express their emotions by varying these attributes during speech. Hence, identifying human emotions through voice and speech analysis is practically feasible and could be beneficial in improving human conversational skills. Detection and analysis of human emotions can follow an algorithmic approach built on voice and speech processing. The proposed approach has been developed with the objective of incorporation into future machine learning systems for improving human-computer interaction: MFCC features are extracted and classified with the machine learning models SVM (Support Vector Machine) and CNN (Convolutional Neural Network). Both models are trained on the RAVDESS and TESS datasets.

Keywords: SVM, CNN, MFCC, RAVDESS and TESS datasets.


    1. INTRODUCTION

    Emotions are very important to humans, shaping perception in everyday activities such as communication, learning, and decision-making. They are communicated through speech, facial expressions, gestures, and other non-verbal actions. Human emotion recognition plays a major role in interpersonal relationships, and its automation has recently become an active research topic, encouraging several advances in the field. Because emotions are expressed through speech, hand and body gestures, and facial expressions, extracting and understanding them is highly important for enhancing interaction between humans and machines.

    Emotion recognition is the process of recognizing human emotions. The human voice is very multifaceted and carries a multitude of emotions. Emotion in speech can be used to gain additional insight into human behavior. If we analyze it further, we can better understand people's intentions, whether they are unhappy clients or cheering fans.

  2. EXISTING SYSTEM

    Speech Emotion Recognition has previously been implemented with other methodologies and datasets. Some approaches make use of different kinds of neural networks and different types of classifiers for emotion classification, and some papers [4] use different datasets such as the Berlin and Spanish speech databases.

    Berlin Emotional Speech Database: The Berlin database is extensively used in emotional speech recognition. It contains 535 utterances spoken by 10 actors (5 female, 5 male) portraying 7 emotions: anger, boredom, fear, joy, sadness, disgust, and neutral.

    Spanish Emotional Database: The INTER1SP Spanish emotional database contains utterances from two professional actors (one female and one male speaker). The dataset was recorded twice in the 6 basic emotions plus neutral (anger, joy, fear, sadness, disgust, surprise, neutral). In addition, 4 neutral variations (soft, loud, slow, and fast recordings) can be found in the dataset.

    The Jerry Joy et al. paper [2] proposed the use of a Multi-Layer Perceptron Classifier.

    Multi-Layer Perceptron Classifier (MLP Classifier): relies on an underlying neural network to perform classification. The MLP Classifier uses the Multi-Layer Perceptron (MLP) algorithm and trains the neural network with backpropagation.

    To build the MLP Classifier, they followed the steps below:

    • Define and instantiate the classifier with the required parameters.

    • Train the neural network by feeding it data.

    • Use the trained model to make predictions on new (test) data.

    • Calculate the accuracy of the predictions.
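The four steps above can be sketched with scikit-learn's MLPClassifier; the hidden-layer size, iteration count, and the random stand-in features below are illustrative choices, not the parameters used in [2].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for extracted speech features: 200 samples, 40 features, 2 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: define and instantiate the classifier with its parameters
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=0)

# Step 2: train the network (backpropagation happens inside fit())
clf.fit(X_train, y_train)

# Step 3: predict on new (test) data
pred = clf.predict(X_test)

# Step 4: calculate the accuracy of the predictions
accuracy = clf.score(X_test, y_test)
```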

      In the paper published by [1], Random Forest and Decision Tree classifiers were used, whereas we train both an SVM and a CNN model on the RAVDESS and TESS datasets. The Leila Kerkeni paper [4] uses the Spanish database with a feature combination of MFCC and MS (modulation spectral) features, reaching a recognition rate of 90.05% with an RNN.

      Recurrent Neural Networks: (RNN) are best suited to learning from time-series data. While RNN models are good at learning temporal correlations, they suffer from the vanishing gradient problem, which worsens as the length of the training sequences increases. To resolve this problem, LSTM (Long Short-Term Memory) RNNs were proposed by Hochreiter and Schmidhuber (1997); these use memory cells to store information so that long-range dependencies in the data can be exploited (Chen and Jin, 2015).

      Unlike a traditional neural network, which uses different parameters at each layer, an RNN shares the same parameters across all time steps.

      Figure 1: A basic concept of RNN and unfolding in time of the computation involved in its forward computation (Lim et al., 2017).

      The hidden state is computed as follows:

      st = f(U·xt + W·st−1)    (1)

      with:

    • xt, st, and ot are respectively the input, the hidden state, and the output at time step t;

    • U, V, and W are parameter matrices.
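Equation (1) can be written out directly in NumPy; the tanh nonlinearity and the dimensions below are illustrative choices.

```python
import numpy as np

n_in, n_hidden = 3, 5                      # illustrative sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights, shared across steps

xs = rng.normal(size=(4, n_in))            # a length-4 input sequence
s = np.zeros(n_hidden)                     # initial hidden state

states = []
for x_t in xs:
    # Equation (1): st = f(U·xt + W·st-1), with f = tanh here
    s = np.tanh(U @ x_t + W @ s)
    states.append(s)
```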

    Table 1: Recognition results using RNN classifier based on Berlin and Spanish databases


    3. PROPOSED SYSTEM

    In comparison to the existing system, we take a different approach, using the machine learning model SVM as well as the neural network model CNN; our predictions differ considerably from those of the existing system. Emotion prediction finds applications in emotional hearing aids for people with autism; in detecting an angry caller at an automated call center so the call can be transferred to a human; and in adjusting the presentation style of a computerized e-learning tutor when the student is bored.

    In this proposed system, we have used the RAVDESS and TESS datasets.

    They are recorded in English, the most widely spoken language in the world. Our datasets contain more files than the Berlin dataset, so we can train our model on a large number of audio files.

    RAVDESS Dataset: RAVDESS contains 1440 files: 60 trials per actor × 24 actors. It was recorded by 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements in a neutral North American accent. The recordings express calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal and strong), with an additional neutral expression.

    TESS Dataset: The TESS stimuli were modeled on the Northwestern University auditory tests. A collection of 200 different words was recorded in the carrier phrase "Say the word ___" by two actresses aged 26 and 64 years, with recordings made of the set expressing each of 7 emotions: happiness, pleasant surprise, neutral, anger, fear, disgust, and sadness. We have used the 1400 recorded files of the older actress in our ML models, so in total we used 2840 files to train our machine learning models.

    Working of the system:

    Feature Extraction: Each audio file in the dataset on Google Drive is loaded with the librosa.load() method, and 40 MFCC features are extracted per file. These features are appended to X (the independent variable), and the label taken from the file name is saved into y (the dependent variable). Later, for the SVM and CNN models, the data is split into X_train and y_train (and the corresponding test sets).

    SVM implementation: The extracted features are scaled using the StandardScaler library, and the data is split into training and test sets. We then build the SVM model using the SVC class of sklearn.svm.


    Then we train the model using the fit() method passing X_train and y_train.
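The scale-split-build-fit sequence described above might look like the following; the RBF kernel and the random stand-in features are assumptions, since the exact SVC parameters are not shown in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for 40 MFCC features per file across 8 emotion classes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 8, size=200)

# Scale the extracted features, then split into training and test data
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=0)

# Build the SVM model with the SVC class of sklearn.svm and train it with fit()
model = SVC(kernel="rbf")
model.fit(X_train, y_train)
pred = model.predict(X_test)
```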

    CNN implementation: The extracted features are converted to NumPy array format, and we create a CNN model with 1 convolution layer, 1 ReLU activation layer, 1 dropout layer, 1 flatten layer, a dense layer, and a softmax activation layer. We then fit X_train and y_train and use predict_classes() to make predictions.
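A Keras sketch of that layer stack follows; the filter count, kernel size, dropout rate, and class count are assumptions, and since predict_classes() has been removed from recent Keras releases, an argmax over predict() is used in its place.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Conv1D, Dense, Dropout, Flatten

n_features, n_classes = 40, 8   # 40 MFCCs per file, 8 emotion classes (assumed)

# Layer order taken from the text: convolution, relu, dropout, flatten, dense, softmax
model = Sequential([
    Conv1D(64, kernel_size=5, input_shape=(n_features, 1)),
    Activation("relu"),
    Dropout(0.2),
    Flatten(),
    Dense(n_classes),
    Activation("softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Features as a NumPy array shaped (samples, 40, 1)
X = np.random.default_rng(0).normal(size=(10, n_features, 1))

# predict_classes() was removed from recent Keras; argmax over predict() does the same
pred = np.argmax(model.predict(X, verbose=0), axis=1)
```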

    These models are then serialized to a pickle (.p) file so that they can be used for quick predictions.
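Saving and reloading a trained model as a .p file is a single pickle call each way; the dictionary and file path below are illustrative stand-ins for the trained model object and its storage location.

```python
import os
import pickle
import tempfile

# Stand-in for a trained SVM/CNN object; any picklable Python object works
model = {"name": "svm_emotion_model"}

# Serialize the trained model to a .p file for quick predictions later
path = os.path.join(tempfile.gettempdir(), "model.p")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Reload it at prediction time without retraining
with open(path, "rb") as f:
    restored = pickle.load(f)
```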

    We have extracted features, especially MFCC features, from the RAVDESS and TESS datasets and trained both the SVM and CNN models on them; both achieve good accuracy. The emotion can be predicted by uploading an audio file to the designed web page; the model and front end are connected using the Flask framework.

    Emotion prediction:

    Once you upload a proper audio file in .wav format and click the Submit button, the predicted emotion is displayed below the button. The emotion may be one of Happy, Neutral, Sad, Calm, Angry, Fearful, Surprise, or Disgust.


    4. RESULTS

    The proposed automatic emotion recognition system achieves accuracies of 65% with the SVM model and 81% with the CNN model.

    Figure 2: SVM accuracy

    Figure 3: CNN accuracy

    Figure 4: Working of Speech Emotion recognition system


    The input files used are the audio files of the trained datasets.

    The following are the results of the conducted experiment.

    Test case | Test case description | Input (audio file) | Expected output | Actual output | Results
    1 | Checking for prediction on neutral emotion | – | Displays Neutral | – | –
    2 | Checking for prediction on calm emotion | – | Displays Calm | – | –
    3 | Checking for prediction on happy emotion | – | Displays Happy | – | –
    4 | Checking for prediction on angry emotion | – | Displays Angry | – | –
    5 | Checking for prediction on disgust emotion | – | Displays Disgust | – | –
    6 | Checking for prediction on sad emotion | – | Displays Sad | – | –
    7 | Checking for prediction on fearful emotion | – | Displays Fearful | – | –
    8 | Checking for prediction on surprise emotion | – | Displays Surprise | – | –
    9 | Checking for prediction on an unsupported file | Upload an audio file of any emotion with an extension other than .wav | Displays an alert message (only .wav file allowed) | – | –

    Table 2: Result of the conducted experiment.


    5. CONCLUSION

    The proposed system successfully predicts the emotions of the audio files most of the time. It can be further enhanced and used for criminal case investigations, for call centers, and even for detecting the emotions of any individual automatically rather than having another person judge them. Manually judging a person's emotion is sometimes a difficult task for various reasons (for example, the absence of that person), so an automatic emotion recognition system plays an important role in predicting emotions while saving both time and money.


    6. FUTURE ENHANCEMENT

    In the proposed system we have only created a user interface with the option to upload an audio file. It could be enhanced with a microphone option that records audio directly and predicts the output, and with code that converts any audio format to .wav on upload so that the emotion of that file can be predicted. The machine learning models can also be trained on more datasets to improve their accuracy.


    REFERENCES

  1. Fatemeh Noroozi, Tomasz Sapinski, Dorota Kaminska, Gholamreza Anbarjafari, Survey on vocal-based emotion recognition, Springer Science+Business Media, New York, 2017.

  2. Jerry Joy, Aparna Kannan, Shreya Ram, S. Rama, Survey on Speech Emotion Recognition using Neural Network and MLP Classifier, SRM Institute of Science and Technology, Vadapalani Campus, Chennai, India.

  3. S. Lalitha, Abhishek Madhavan, Bharath Bhushan, Srinivas Saketh, Survey on Speech Emotion Recognition, in: 2014 International Conference on Advances in Electronics, Computers and Communications.

  4. Leila Kerkeni, Youssef Serrestou, Mohamed Mbarki, Kosai Raoof, Mohamed Ali Mahjoub, Survey on Speech Emotion Recognition: Methods and Cases Study, Le Mans University, France; LATIS Laboratory of Advanced Technologies and Intelligent Systems, University of Sousse, Tunisia; Higher Institute of Applied Sciences and Technology of Sousse, University of Sousse, Tunisia.
