Emotion Detection From Speech Using MFCC & GMM

K. J. Patil; P. H. Zope; S. R. Suralkar

doi:10.17577/IJERTV1IS9423

Volume 01, Issue 09 (November 2012)


Open Access
Paper ID : IJERTV1IS9423
Published (First Online): 29-11-2012
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


K. J. Patil	P. H. Zope	S. R. Suralkar
student	Asst. Professor	H.O.D
SSBTS College of Engineering	SSBTS College of Engineering	SSBTS College of Engineering
Jalgaon, M.S., India	Jalgaon, M.S., India	Jalgaon, M.S., India

Abstract

In recent years, the literature on automatic emotion recognition has grown dramatically, driven by advances in computer vision, speech analysis and machine learning. However, automatic recognition of emotions occurring in natural communication settings remains a largely unexplored and challenging problem.

Speech processing has emerged as one of the important application areas of digital signal processing. Research fields within speech processing include emotion detection from speech, speech recognition, speaker recognition, speech synthesis and speech coding. The objective of automatic emotion detection is to extract, characterize and recognize information about the speaker's emotions. Feature extraction is the first step, as in speaker recognition, and researchers have proposed many algorithms for it. In this report, the Mel Frequency Cepstrum Coefficient (MFCC) feature is used for designing an automatic emotion detection system. Some modifications to the existing MFCC feature extraction technique are also suggested to improve emotion detection efficiency.

This report presents an approach to emotion recognition from speech signals. It discusses a framework for extracting features from the speech signal that can be used to detect the emotional state of the speaker. An essential step in the generation of expressive speech synthesis is the automatic detection and classification of the emotions most likely to be present in the speech input.

Keywords: speech recognition, MFCC, GMM.

INTRODUCTION

Emotion plays a crucial role in day-to-day interpersonal human interactions. Recent findings have suggested that emotion is integral to our rational and intelligent decisions. It helps us to relate to each other by expressing our feelings and providing feedback. This important aspect of human interaction needs to be considered in the design of human-machine interfaces (HMIs). To build interfaces that are more in tune with the users' needs and preferences, it is essential to study how emotion modulates and enhances the verbal and nonverbal channels in human communication.

The tight coupling between emotional expression and human behavior is well documented: even if logical reasoning skills remain intact, the brain becomes incapable of making appropriate decisions when its emotion-controlling centers are damaged. Hence the vital importance of cogent emotion analysis in most affective computing applications, ranging from natural language interfaces to e-learning environments, educational or entertainment games, opinion mining and sentiment analysis, humor recognition, and security informatics. For example, emotion detection is an essential tool for monitoring the presence of hateful or violent rhetoric. When it comes to man-machine communication, it is thus highly desirable to take account of emotional states as an integral part of human-computer interaction, at both input and output levels. If a spoken dialog system could reliably determine that a user is upset or annoyed, for instance, it could switch to a potentially more adequate mode of interaction.

Likewise, expressive speech synthesis is expected to play a pivotal role in the widespread deployment and acceptance of future natural language interfaces.

Detecting emotion from speech can be viewed as a classification task. It consists of assigning to a speech utterance one emotion category out of a fixed set, e.g., joy, anger, boredom, sadness, fear, frustration, annoyance, satisfaction and neutral. This report presents an approach to emotion recognition from speech signals, and discusses a framework for extracting features from the speech signal that can be used to detect the emotional state of the speaker.
System design

The system consists of four major parts:
1. Speech Acquisition
2. Feature Extraction
3. Machine Learning
4. Information Fusion
For feature extraction, a spectral analysis algorithm, Mel-frequency cepstral coefficients (MFCCs), will be used. For prosody analysis, statistics of pitch and energy will be used to determine prosodic features. To determine the emotion, an information fusion algorithm will be designed. Within the fusion algorithm, a GMM will be applied to the spectral features to estimate their probability density function, and a k-NN classifier will be used for the prosodic features [1].

However, it has been found that emotion recognition algorithms that rely on prosodic features are not sufficiently accurate.

Phonetic features, taken individually, carry less information for discriminating emotions. Nevertheless, the phonetic features of speech contain more independent components than the prosodic features, so the accuracy of emotion recognition can be improved by increasing the number of independent phonetic features. Therefore, we propose an emotion recognition algorithm that focuses more on the phonetic features of speech.

Fig 1. Typical emotion recognition system [1]

The extraction and selection of the best parametric representation of acoustic signals is an important task in the design of any speech recognition system; it significantly affects the recognition performance. A compact representation is provided by a set of mel-frequency cepstrum coefficients (MFCCs), which are the result of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale. MFCCs have been reported to be more effective than alternative representations [1] [3]. Therefore, we use MFCCs for spectral feature extraction.

The acoustic features will be modeled by Gaussian mixture models (GMMs) on the frame level. The survey indicates that using GMMs on the frame level is a feasible technique for emotion classification. Gaussian modeling is also among the best methods to distinguish emotional classes in a space spanned by the following phonetic parameters: pitch, pitch range and average pitch, all measured across the entire utterance after endpointing (i.e. pause/speech boundary detection) [1]. Therefore, the GMM is a well-suited choice for spectral feature classification.

Like any other recognition system, an emotion recognition system involves two phases, namely training and testing. Training is the process of familiarizing the system with the emotion characteristics of the speakers. Testing is the actual recognition task. The block diagram of the training phase is shown in Fig. 2.1. Feature vectors representing the emotion characteristics of the speaker are extracted from the training utterances and are used for building the reference models.

During testing, similar feature vectors are extracted from the test utterance, and the degree of their match with the reference is obtained using some matching technique. The level of match is used to arrive at the decision. [3]
Feature extraction
MFCC is based on the human peripheral auditory system. Human perception of the frequency content of speech sounds does not follow a linear scale. Thus, for each tone with an actual frequency f measured in Hz, a subjective pitch is measured on a scale called the Mel scale. The Mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1 kHz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 Mels.

First, the voice data is divided into frames, and each frame is windowed using a Hamming window. Second, each analysis frame is converted to the frequency domain using a short-time Fourier transform. Third, a certain number of sub-band energies are calculated using a Mel filter bank, which is a nonlinear-scale filter bank that imitates the human auditory system. Fourth, the logarithm of the sub-band energies is calculated. Finally, the MFCCs are computed by an inverse transform of the log energies; in the steps below this is the discrete cosine transform of Step 6.

Fig. 3. MFCC Block Diagram [2]
Step 1: Preemphasis

In this step the signal is passed through a filter that emphasizes higher frequencies, which increases the energy of the signal at higher frequencies:

$$Y[n] = X[n] - a\,X[n-1]$$

With a = 0.95, 95% of any one sample is presumed to originate from the previous sample.
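As an illustration of this step, the following minimal sketch applies a first-order pre-emphasis filter, assuming the common form y[n] = x[n] - a·x[n-1] with a = 0.95. The function name `pre_emphasis` and the choice to pass the first sample through unchanged are assumptions made for this example, not details given in the paper.

```python
def pre_emphasis(signal, a=0.95):
    """Return y[n] = x[n] - a * x[n-1]; the first sample is kept as-is (assumption)."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - a * signal[n - 1])
    return out


# A constant (zero-frequency) signal is strongly attenuated after the first sample,
# while a rapidly alternating (high-frequency) signal is amplified.
constant = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
print(constant)
print(alternating)
```

The two toy inputs show the intended effect: low-frequency content shrinks towards zero, whereas sample-to-sample changes are boosted.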

Step 2: Framing

In this step the speech samples obtained from analog-to-digital conversion (ADC) are segmented into short frames with a length in the range of 20 to 40 ms. The voice signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N). Typical values are M = 100 and N = 256.
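The framing step can be sketched as follows, using the typical values N = 256 and M = 100 mentioned above. The helper name `frame_signal` is hypothetical, and dropping a trailing partial frame is an assumption (zero-padding the last frame would be an equally reasonable choice).

```python
def frame_signal(signal, frame_len=256, hop=100):
    """Split a signal into overlapping frames of frame_len samples, hop samples apart.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames


signal = [float(i) for i in range(1000)]
frames = frame_signal(signal)
print(len(frames), len(frames[0]), frames[1][0])
```

With M < N, consecutive frames overlap by N - M samples, so every sample near a frame edge also appears closer to the middle of a neighbouring frame.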

Step 3: Hamming windowing

A Hamming window is used as the window shape, taking into account the next block in the feature extraction chain; it integrates all the closest frequency lines. Let the window be defined as W(n), 0 ≤ n ≤ N-1, where

N = number of samples in each frame
Y(n) = output signal
X(n) = input signal
W(n) = Hamming window

Then the result of windowing the signal is:

$$Y(n) = X(n)\,W(n)$$

$$W(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$
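A short sketch of the windowing step, assuming the standard Hamming definition W(n) = 0.54 - 0.46 cos(2πn/(N-1)). The function names `hamming` and `apply_window` are illustrative only.

```python
import math


def hamming(N):
    """Return the N-point Hamming window W(n) = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    if N < 2:
        return [1.0] * N  # degenerate case: avoid division by zero
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window: Y(n) = X(n) W(n)."""
    window = hamming(len(frame))
    return [x * w for x, w in zip(frame, window)]


window = hamming(256)
windowed = apply_window([1.0] * 256)
print(window[0], max(window))
```

The window tapers each frame towards its edges, which reduces the spectral leakage that an abrupt cut would cause in the following Fourier transform.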

Step 4: Fast Fourier Transform

This step converts each frame of N samples from the time domain into the frequency domain. The Fourier transform converts the convolution of the glottal pulse U[n] and the vocal tract impulse response H[n] in the time domain into a multiplication in the frequency domain. This is expressed by the equation below:

$$Y(w) = \mathrm{FFT}[H(t) * X(t)] = H(w)\,X(w)$$

where X(w), H(w) and Y(w) are the Fourier transforms of X(t), H(t) and Y(t) respectively.
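To make the transform step concrete, the sketch below computes the power spectrum of one frame with a naive discrete Fourier transform. A real system would use an FFT routine, which gives the same values much faster; the O(N²) loop is used here only to keep the example free of dependencies. The names `dft` and `power_spectrum`, and the 1/N normalisation, are assumptions of this example.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) discrete Fourier transform (same result as an FFT, only slower)."""
    N = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def power_spectrum(frame):
    """Return |X[k]|^2 / N for the non-negative frequency bins k = 0 .. N/2."""
    N = len(frame)
    spectrum = dft(frame)
    return [abs(c) ** 2 / N for c in spectrum[: N // 2 + 1]]


# A sinusoid completing exactly 8 cycles in a 64-sample frame peaks in bin 8.
N = 64
frame = [math.sin(2.0 * math.pi * 8 * n / N) for n in range(N)]
ps = power_spectrum(frame)
peak_bin = max(range(len(ps)), key=ps.__getitem__)
print(peak_bin)
```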

Step 5: Mel Filter Bank Processing

The range of frequencies in the FFT spectrum is very wide, and the voice signal does not follow a linear scale. A bank of filters spaced according to the Mel scale, as shown in Figure 4, is therefore applied.

Fig. 4. Mel scale filter bank, from (Young et al., 1997) [2]

This figure shows a set of triangular filters that are used to compute a weighted sum of spectral components so that the output of the process approximates a Mel scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at the centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters [7, 8]. Each filter output is then the sum of its filtered spectral components. The following equation is used to compute the Mel value for a given frequency f in Hz:

$$F(\mathrm{Mel}) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
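The filter bank described above can be sketched as follows, assuming the widely used mapping mel = 2595·log10(1 + f/700). The number of filters (20), FFT size (256) and sampling rate (8000 Hz) are illustrative values rather than settings taken from the paper, and all function names are hypothetical.

```python
import math


def hz_to_mel(f):
    """Convert a frequency in Hz to Mel."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Build triangular filters over FFT bins 0 .. n_fft // 2."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 points equally spaced on the Mel scale (edges and centres).
    mel_points = [mel_max * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * n_bins
        for k in range(left, centre):        # rising slope
            filt[k] = (k - left) / (centre - left)
        for k in range(centre, right):       # peak (1.0) and falling slope
            filt[k] = (right - k) / (right - centre)
        bank.append(filt)
    return bank


def filterbank_energies(power, bank):
    """Weighted sum of power-spectrum bins for each filter."""
    return [sum(w * p for w, p in zip(filt, power)) for filt in bank]


bank = mel_filterbank(n_filters=20, n_fft=256, sample_rate=8000)
energies = filterbank_energies([1.0] * 129, bank)  # flat test spectrum
print(round(hz_to_mel(1000.0), 2), len(bank))
```

Because the filter centres are equally spaced in Mel rather than in Hz, the filters are narrow at low frequencies and become progressively wider towards the Nyquist frequency.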

Step 6: Discrete Cosine Transform

This step converts the log Mel spectrum back into the time domain using the Discrete Cosine Transform (DCT). The result of the conversion is called the Mel Frequency Cepstrum Coefficients. The set of coefficients is called an acoustic vector. Therefore, each input utterance is transformed into a sequence of acoustic vectors.
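A minimal sketch of this step, assuming an unnormalised type-II DCT applied to the logarithm of the filter bank energies. The function names, the small floor used to avoid log(0), and the decision to keep the zeroth coefficient are assumptions of this example; implementations differ on scaling and on whether c0 is replaced by frame energy.

```python
import math


def dct_ii(values, n_coeffs):
    """Unnormalised DCT-II: c[k] = sum_n v[n] * cos(pi * k * (n + 0.5) / N)."""
    N = len(values)
    return [
        sum(values[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
        for k in range(n_coeffs)
    ]


def cepstrum_from_energies(energies, n_coeffs=13, floor=1e-10):
    """Take the log of the filter bank energies and return the first n_coeffs DCT terms."""
    log_energies = [math.log(max(e, floor)) for e in energies]
    return dct_ii(log_energies, n_coeffs)


# A perfectly flat log spectrum has all of its content in c[0].
flat = cepstrum_from_energies([math.e] * 20, n_coeffs=13)
print(flat[0])
```

The DCT decorrelates the log energies, which is one reason diagonal-covariance Gaussians are often considered adequate for modelling the resulting coefficients.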

Step 7: Delta Energy and Delta Spectrum

The voice signal changes from frame to frame, for example in the slope of a formant at its transitions. Therefore, there is a need to add features related to the change in cepstral features over time. To the 13 static features (12 cepstral features plus energy), 13 delta or velocity features and 13 double-delta or acceleration features are added, giving 39 features in total. The energy in a frame for a signal x in a window from time sample t1 to time sample t2 is given by the equation below:

$$\mathrm{Energy} = \sum_{t=t_1}^{t_2} x^2[t]$$

Each of the 13 delta features represents the change between frames in the corresponding cepstral or energy feature, while each of the 13 double-delta features represents the change between frames in the corresponding delta feature.
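The sketch below illustrates frame energy and delta features. It assumes a simple central difference, d[t] = (c[t+1] - c[t-1]) / 2, with the first and last frames repeated at the edges; the paper's exact delta formula is not recoverable from the text, and regression over a wider window is also common. The two-dimensional toy vectors stand in for the 13 static features.

```python
def frame_energy(frame):
    """Energy of a frame: the sum of squared samples."""
    return sum(x * x for x in frame)


def delta(features):
    """Central-difference deltas over a sequence of feature vectors (edge frames repeated)."""
    T = len(features)
    out = []
    for t in range(T):
        prev = features[max(t - 1, 0)]
        nxt = features[min(t + 1, T - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


# Toy static features: 4 frames, 2 coefficients each.
static = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]
d1 = delta(static)   # velocity
d2 = delta(d1)       # acceleration
full = [s + v + a for s, v, a in zip(static, d1, d2)]
print(frame_energy([1.0, 2.0, 3.0]), d1[1], len(full[0]))
```

Applying the same operator twice yields the double-delta features, so a 13-dimensional static vector grows to 39 dimensions once both sets are appended.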
CLASSIFICATION
Figure 7 shows the HMM with one emitting state. An utterance starts from a start state, stays at the emitting state for a while and finally ends at an end state. While staying in the emitting state, it generates several observations (features) that follow a Gaussian mixture model (GMM) probability. The feature vectors extracted from the speech can be described using this model: they follow the GMM probability in the emitting state, and each person has a unique probability model.

4.2.1 Illustration of GMM

At present, the Gaussian mixture model (GMM) is often used for speaker recognition, where it shows good recognition ability. In this work, the GMM is adopted to represent the distribution of the features. Under the assumption that the feature vector sequence is an independent and identically distributed (i.i.d.) sequence, the estimated distribution of the D-dimensional feature vector x is a weighted sum of M component densities, given by the form

$$p(x \mid \lambda) = \sum_{i=1}^{M} c_i\, b_i(x)$$

where x is a D-dimensional random vector, $b_i(x)$, i = 1, ..., M, are the component densities and $c_i$, i = 1, ..., M, are the mixture weights. Each component density is a D-variate Gaussian function of the form

$$b_i(x) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_i)^{T}\,\Sigma_i^{-1}\,(x-\mu_i)\right)$$

with mean vector $\mu_i$ and covariance matrix $\Sigma_i$.

The mixture weights satisfy the constraint that:

$$\sum_{i=1}^{M} c_i = 1$$

The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights from all component densities. These parameters are collectively represented by the notation:

$$\lambda = \{c_i, \mu_i, \Sigma_i\}, \quad i = 1, \ldots, M$$

In a speaker recognition system, each speaker is represented by such a GMM and is referred to by this model λ.

For a sequence of T test vectors X = x_1, x_2, ..., x_T, the standard approach is to calculate the GMM likelihood in the log domain as:

$$\log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(x_t \mid \lambda)$$
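The log-domain likelihood can be sketched as below. For brevity the example assumes diagonal covariance matrices, so each component is described by a mean vector and a vector of variances; this is a simplification chosen for the illustration, not something the paper specifies. The log-sum-exp trick is used so that very small frame probabilities do not underflow.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (var holds the variances)."""
    D = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (D * math.log(2.0 * math.pi) + log_det + maha)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def gmm_log_likelihood(X, weights, means, variances):
    """Sum over frames of log(sum_i c_i * b_i(x_t))."""
    total = 0.0
    for x in X:
        terms = [
            math.log(c) + log_gaussian_diag(x, m, v)
            for c, m, v in zip(weights, means, variances)
        ]
        total += log_sum_exp(terms)
    return total


# Two frames at the mean of a single standard-normal component.
single = gmm_log_likelihood([[0.0], [0.0]], [1.0], [[0.0]], [[1.0]])
# Splitting that component into two identical halves must give the same value.
split = gmm_log_likelihood([[0.0], [0.0]], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
print(single, split)
```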

The emotion-specific GMM parameters are estimated with the EM algorithm, using the HTK toolkit, from training data uttered by the corresponding speaker.

Each model $\lambda_m$ consists of Gaussian densities $N_i(x)$, each parameterized by a mean vector $\mu_i$ and covariance matrix $K_i$; the mixture density for the model $\lambda_m$ is the weighted sum of these densities, in the form given above for a GMM.

The expectation-maximization (EM) algorithm is used to obtain maximum-likelihood estimates of the mixture parameters, as explained below.

Given a collection of training feature vectors, maximum-likelihood model parameters are estimated using the iterative EM algorithm. The EM algorithm iteratively refines the GMM parameters to monotonically increase the likelihood of the estimated model for the observed feature vectors. Generally, five iterations are sufficient for parameter convergence. The EM equations for training a GMM can be found in the reference papers.
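As a rough illustration of EM training, the sketch below runs five iterations on a one-dimensional, two-component mixture and records the log-likelihood after each step. It is a toy stand-in for the HTK-based training mentioned in the text: the data, the initial parameters, the variance floor and the function names are all invented for the example.

```python
import math


def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a 1-D GMM; returns updated (weights, means, variances)."""
    M = len(weights)
    # E-step: responsibility of each component for each sample.
    resp = []
    for x in data:
        joint = [weights[i] * gauss(x, means[i], variances[i]) for i in range(M)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate parameters from the soft assignments.
    new_w, new_mu, new_var = [], [], []
    for i in range(M):
        n_i = sum(r[i] for r in resp)
        mu = sum(r[i] * x for r, x in zip(resp, data)) / n_i
        var = sum(r[i] * (x - mu) ** 2 for r, x in zip(resp, data)) / n_i
        new_w.append(n_i / len(data))
        new_mu.append(mu)
        new_var.append(max(var, var_floor))  # guard against collapsing variances
    return new_w, new_mu, new_var


data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # deliberately rough initial guess
history = [log_likelihood(data, *params)]
for _ in range(5):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
print(params[1], history[0], history[-1])
```

On this well-separated toy data the means settle near -2 and 2 within a couple of iterations, and the recorded log-likelihood never decreases, which is the monotonic behaviour described above.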

After parameter estimation, we determine which category the test emotional speech belongs to. By computing the likelihood under every emotional speech model and finding the model with the maximum likelihood value, we can categorize the test sample of speech. The likelihood of speaker model λ is

$$p(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda)$$

where T is the number of frames and $x_t$ is the feature vector from the t-th frame. The probability of $x_t$ given the speaker model λ is

$$p(x_t \mid \lambda) = \sum_{i=1}^{M} c_i\, b_i(x_t)$$

where $b_i$ is the i-th Gaussian density and $c_i$, $\Sigma_i$ and $\mu_i$ denote the weight, the covariance matrix and the mean vector of the i-th Gaussian of the speaker model λ, respectively.
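The decision rule itself, picking the emotion model with the highest likelihood, can be sketched as follows. One-dimensional GMMs are used to keep the example short, and the two emotion labels with their (weight, mean, variance) triples are made-up values for illustration, not trained models.

```python
import math


def frame_log_prob(x, model):
    """log p(x | model) for a 1-D GMM given as a list of (weight, mean, variance)."""
    terms = [
        math.log(c) - 0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        for c, mu, var in model
    ]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def utterance_log_likelihood(frames, model):
    """Log of the product over frames, computed as a sum of per-frame log probabilities."""
    return sum(frame_log_prob(x, model) for x in frames)


def classify(frames, models):
    """Return the label whose model gives the highest log-likelihood, plus all scores."""
    scores = {label: utterance_log_likelihood(frames, m) for label, m in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical emotion models (not estimated from real data).
models = {
    "anger": [(0.6, 3.0, 1.0), (0.4, 5.0, 2.0)],
    "sadness": [(0.5, -2.0, 1.0), (0.5, 0.0, 1.5)],
}
label, scores = classify([2.8, 3.5, 4.9, 3.1], models)
print(label)
```

Working with summed log probabilities rather than the raw product avoids numerical underflow when an utterance contains hundreds of frames.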
k-NN: The Nearest Neighbor

Fig 5. Vector classification using k-NN [1]

Classification using k-NN with k = 3: the test sample T is classified as x, because the hypersphere surrounding T contains two elements from x and only one from o.

Assume we have sets D_i representing c classes, an integer k > 0 and a test sample x. We want to classify x as a member of one of the classes D_i. k-NN does this very simply:
- find the k vectors closest to the test vector x, and let their distances from x be r_1, r_2, ..., r_k
- form a hypersphere C_n around x with radius r = max_{i=1,...,k} r_i
The last step is to classify the input vector x as a member of the class that has the majority in the hypersphere C_n.
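The steps above can be sketched in a few lines. Euclidean distance and the toy two-dimensional training points are assumptions made for the illustration; the labels "x" and "o" mirror the figure caption.

```python
import math
from collections import Counter


def knn_classify(x, training, k=3):
    """Classify x by majority vote among its k nearest training vectors.

    training is a list of (vector, label) pairs. Returns (label, radius), where
    radius is the distance to the k-th nearest neighbour.
    """
    by_distance = sorted(training, key=lambda item: math.dist(x, item[0]))
    neighbours = by_distance[:k]
    radius = math.dist(x, neighbours[-1][0])
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0], radius


training = [
    ((1.0, 0.0), "x"),
    ((0.0, 1.0), "x"),
    ((-1.5, 0.0), "o"),
    ((5.0, 5.0), "o"),
    ((6.0, 5.0), "o"),
]
label, radius = knn_classify((0.0, 0.0), training, k=3)
print(label, radius)
```

Here two of the three nearest neighbours carry the label "x", so the test point is assigned to "x" even though class "o" has more training samples overall.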

In k-NN, prior to testing a sample against the data, all samples from the same person should be removed, to ensure that no match occurs due to similarity of voice rather than emotion. However, when tested, this has only a slight effect on the results.

3) Information Fusion Algorithm:

Starting from simple binary classification theory, a classifier function C(·) takes an input speech s and yields a result as to which hypothesis the speech belongs to, based on a threshold comparison of the scores S_m, where S_m represents the likelihood that the speech s belongs to hypothesis H_m.

In this work, we set the hypotheses as:

- H0: the input speech is of one emotional status
- H1: the input speech is of another emotional status.

The decisions arising from the spectral and prosodic feature classifiers need to be combined in order to obtain a unique and more accurate classification. Many algorithms have been proposed to deal with multiple modalities. One of the simplest and most popular methods is a weighted sum of the likelihoods from the different modalities, with a weighting factor that is determined empirically.
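A weighted-sum fusion of the two classifiers might look like the sketch below. The weighting factor `alpha`, the score dictionaries and the assumption that both classifiers produce scores on a comparable scale (for example normalised to sum to one) are illustrative choices; the paper only states that the weight is to be found empirically.

```python
def fuse_scores(spectral, prosodic, alpha=0.7):
    """Combine per-emotion scores as alpha * spectral + (1 - alpha) * prosodic."""
    labels = spectral.keys() & prosodic.keys()
    return {
        label: alpha * spectral[label] + (1.0 - alpha) * prosodic[label]
        for label in labels
    }


def decide(fused):
    """Pick the emotion with the highest fused score."""
    return max(fused, key=fused.get)


# Hypothetical normalised scores from the GMM (spectral) and k-NN (prosodic) classifiers.
spectral = {"anger": 0.6, "neutral": 0.4}
prosodic = {"anger": 0.3, "neutral": 0.7}

favour_spectral = decide(fuse_scores(spectral, prosodic, alpha=0.7))
favour_prosodic = decide(fuse_scores(spectral, prosodic, alpha=0.3))
print(favour_spectral, favour_prosodic)
```

The two calls show how the weighting factor settles a disagreement between the classifiers, which is why it has to be tuned on held-out data rather than fixed in advance.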
RESULT

In the referenced paper [5], some 1500 features were obtained, which partly consist of frequently used features but also introduce new, experimentally designed features into the analysis. All features were calculated at a 10 ms frame shift. Table 1 shows the different feature information sources and the number of features calculated from them. Many methods have been developed for feature extraction, but the table indicates that MFCC gives better accuracy than the other methods.

Table 1. Information sources, number of features calculated, and average accuracy. [5]
CONCLUSION

Automatic detection of emotions will be evaluated using standard Mel-frequency cepstral coefficients (MFCCs). These acoustic features will be modeled by Gaussian mixture models (GMMs) on the frame level. The survey indicates that using GMMs on the frame level is a feasible technique for emotion classification. Gaussian modeling is also among the best methods to distinguish emotional classes by the following phonetic parameters: pitch, pitch range and average pitch, all measured across the entire utterance.

Because the shape of the human vocal tract changes during the generation of different emotions, the resonance frequencies of the vocal tract (the formants) also change. Using this phenomenon, we can extract voice features for each emotion and implement an emotion detection system.

References

[1] S. D. Shirbahadurkar, A. P. Meshram, Ashwini Kohok and Smita Jadhav, An Overview and Preparation for Recognition of Emotion from Speech Signal with Multi Modal Fusion, IEEE Proceedings, Vol. 5, 2010.
[2] Lindasalwa Muda, Mumtaj Begam and I. Elamvazuthi, Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques, Journal of Computing, Volume 2, Issue 3, ISSN 2151-9617, March 2010.
[3] Vibha Tiwari, MFCC and its applications in speaker recognition, International Journal on Emerging Technologies, ISSN 0975-8364, 2010.
[4] Mahdi Shaneh and Azizollah Taheri, Voice Command Recognition System Based on MFCC and VQ Algorithms, World Academy of Science, Engineering and Technology, 2009.
[5] Florian Metze, Tim Polzehl and Michael Wagner, Fusion of Acoustic and Linguistic Speech Features for Emotion Detection, IEEE International Conference on Semantic Computing, 2009.
[6] Ashish Jain, Hohn Harris, Speaker identification using MFCC and HMM based techniques, University of Florida, April 25, 2004.
[7] Cheong Soo Yee and Abdul Manan Ahmad, Malay Language Text Independent Speaker Verification using NNMLP classifier with MFCC, 2008 International Conference on Electronic Design.
[8] P. Lockwood, J. Boudy, Experiments with a Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and the Projection, for Robust Speech Recognition in Cars, Speech Communication, 1992.
[9] A. Rosenberg, C. H. Lee, F. Soong, Cepstral Channel Normalization Techniques for HMM-Based Speaker Verification, 1994.
[10] Dr Philip Jackson, Features extraction 1.ppt, University of Surrey, Guildford GU2 7XH.
[11] Zaidi Razak, Noor Jamilah Ibrahim, Emran Mohd Tamil, Mohd Yamani Idna Idris, Mohd Yaakob Yusoff, Quranic verse recitation feature extraction using mel frequency cepstral coefficient (MFCC), Universiti Malaya.
[12] http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html, downloaded on 3rd March 2010.
[13] Jamal Price, sophomore student, Design an automatic speech recognition system using MATLAB, University of Maryland Eastern Shore, Princess Anne.
[14] Ahmad Kamarul, Ariff Bin Ibrahim, Biomedical engineering laboratory student pack, UTM Johor.
[15] E. C. Gordon, Signal and Linear System Analysis, John Wiley & Sons Ltd., New York, USA, 1998.
[16] Stan Salvador and Philip Chan, FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space, Florida Institute of Technology, Melbourne.
