Depression Detection from Speech


Arun Thomas Varghese

Department of Computer Science and Engineering Mar Athanasius College of Engineering Ernakulam, India

Vinay Gregory John

Department of Computer Science and Engineering Mar Athanasius College of Engineering Ernakulam, India

Vishnu Vidyadharan

Department of Computer Science and Engineering Mar Athanasius College of Engineering Ernakulam, India

Neethu Subash
Assistant Professor, Department of Computer Science and Engineering Mar Athanasius College of Engineering
Ernakulam, India

Abstract—Depression (Major Depressive Disorder) has become one of the most prevalent mental illnesses across the globe. It can be described as a state in which a person feels sad or loses interest in everyday activities that are normally considered enjoyable. It affects one's thoughts and the way one feels about the people one comes across. More than 260 million people throughout the world suffer from depression, spanning all age groups. Depression can lead a person to hopelessness and subsequently to suicide. It is estimated that depression will be among the most common diseases by 2030, and nearly 20% of people with untreated depressive disorder commit suicide. Ways to detect depression are therefore a major concern. Current diagnoses are highly inconsistent and expensive. The most common technique used to detect depression is comparing the words in a person's response to a questionnaire against a database of terms commonly used by depressed patients. This method has been found to be inefficient both in prediction accuracy and in the time consumed in diagnosis. In this project, a new method for early detection of depression is implemented. The features of speech associated with signs of depression can be traced by neural networks; Convolutional Neural Networks, in particular, can be applied to identify depression indicators in speech. Spectrograms, which mark the intensities of the frequencies of speech, can be given as input to Convolutional Neural Networks designed to learn the patterns common to depressed audio.

Index Terms—Convolutional Neural Networks (CNN), Mel Frequency Cepstral Coefficients (MFCC), Feature Extraction, Short-Time Fourier Transform (STFT), Rectified Linear Unit (ReLU), Multilayer Perceptron (MLP).


    Depression affects people's well-being and quality of life. It is one of the most widespread and prevalent of the major psychiatric disorders. Depression currently accounts for about 4% of the global burden of disease, and it is predicted to be the leading cause of disease burden in high-income countries by 2030. Early detection and treatment of depression is therefore extremely important in promoting remission, preventing relapse, and reducing the emotional burden of the disease. Current diagnoses are mainly subjective and lack consistency across professionals. They are also very expensive, making them an additional burden for an individual who may be in urgent need of help. Moreover, early signs of depression are difficult to detect and evaluate. These early signs can be identified with the help of machine learning algorithms. Furthermore, such algorithms could be implemented in a wearable artificial intelligence (AI) device or a home device.


    While most existing emotion detection research focuses on the semantic content of audio signals for depression detection, this project tries to focus on prosodic features, which have also been found to be potential predictors of depression. Prosodic features are those that a listener can generally characterize as pitch, tone, voice quality, articulation, rhythm, stress, etc. Other features used in research include sentence length and rhythm, intonation, fundamental frequency, and Mel-frequency Cepstral Coefficients (MFCCs).

    1. Dataset

      All audio recordings and associated depression metrics for this project were provided by the DAIC-WOZ database. It was assembled by USC's Institute for Creative Technologies and later released as part of the 2016 Audio/Visual Emotion Challenge and Workshop, titled "Depression, Mood and Emotion". The dataset consists of 189 sessions, each averaging 15 minutes, between a participant and a virtual interviewer called Ellie, who was controlled by a human interviewer in another room using a "Wizard of Oz" approach. Prior to the interview, each participant completed a psychiatric questionnaire, and from the responses a binary ground-truth classification (depressed, not depressed) was derived.

    2. Segmentation

      The first step in analyzing a person's prosodic features of speech is segmentation: extracting the person's speech from silence, other speakers (in our case, the interviewer), and noise. Fortunately, each participant in the DAIC-WOZ study wore a close-proximity microphone, and the sessions were held in low-noise environments. This allowed fairly complete segmentation in approximately 85% of interviews using the pyAudioAnalysis segmentation module.

      Fig. 1. Virtual interview with Ellie.

      When implementing the algorithm in a wearable device, speaker diarization (speaker identification) and background noise removal would require further development of the product. However, in the interest of quickly establishing a minimum viable product, this further development was not addressed in our current effort.
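The segmentation idea can be sketched with a simple short-term energy threshold. This is only an illustrative stand-in for the pyAudioAnalysis segmentation module (which uses smoothing and a trained model); the function name and its parameters are our own.

```python
import numpy as np

def simple_silence_removal(signal, sr, frame_ms=50, energy_ratio=0.25):
    """Crude energy-threshold segmentation: return (start, end) sample
    indices of contiguous regions whose short-term frame energy exceeds
    a fraction of the mean frame energy."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(float) ** 2).mean(axis=1)
    threshold = energy_ratio * energy.mean()
    segments, start = [], None
    for i, e in enumerate(energy):
        if e > threshold and start is None:
            start = i * frame_len            # speech onset
        elif e <= threshold and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:                    # speech runs to end of signal
        segments.append((start, n_frames * frame_len))
    return segments
```

For example, a signal consisting of silence, a tone, and silence again yields a single segment covering the tone.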

    3. Feature Extraction

      There are several ways to extract acoustic features, the most crucial component of a successful approach. One method extracts short-term and mid-term audio features such as MFCCs, chroma vectors, and zero-crossing rate, and feeds them as inputs to a Support Vector Machine (SVM) or Random Forest. Since the pyAudioAnalysis module makes short-term feature extraction reasonably streamlined, our first approach to this classification problem was to build feature matrices of the 34 short-term features available in pyAudioAnalysis, computed on 50 ms audio segments. Since these features are lower-level representations of audio, the concern arises that subtle speech characteristics displayed by depressed individuals would go undetected.

      It was found that running a Random Forest on the 34 short-term features yielded an encouraging F1 score of 0.59 with minimal tuning. Since this approach has already been employed by others, we treated it as a baseline against which to develop and evaluate a completely new approach involving convolutional neural networks (CNNs) with spectrograms, which we felt could be quite promising and powerful.
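As a hedged illustration of this baseline, a Random Forest over a 34-dimensional feature matrix might look like the following. The data here is synthetic stand-in data, not the actual DAIC-WOZ features.

```python
# Hedged baseline sketch: a Random Forest over 34 short-term features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 34))                 # 200 segments x 34 features
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # toy labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
score = f1_score(y_te, clf.predict(X_te))      # metric comparable to the 0.59 baseline
```

On real features the F1 score would of course depend on the dataset and tuning; the sketch only shows the shape of the pipeline.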

      CNNs require a visual input. In this effort, speech is represented via a spectrogram. A spectrogram is a visual representation of sound, displaying the amplitude of the frequency components of a signal over time. Unlike MFCCs and other transformations that represent lower-level features of sound, spectrograms maintain a high level of detail (including noise, which can present challenges to neural network learning).

      Fig. 2. Spectrogram of a plosive followed by a second of silence; the words spoken are "Welcome to DepressionDetect."

      The spectrograms are generated with a Short-Time Fourier Transform (STFT). The STFT is a short-term processing technique that breaks the signal into overlapping frames using a moving window and computes the Discrete Fourier Transform (DFT) of each frame. It should be noted that the trade-off between frequency and time resolution has not been broadly explored in this effort. A Hann window with a window length of 513 was assumed.
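A minimal STFT spectrogram can be sketched in NumPy as below. The FFT size of 1024 (giving 513 frequency bins, consistent with the 513-row inputs used later) and the hop length of 512 are assumptions, not values stated in the text.

```python
import numpy as np

def stft_spectrogram(signal, n_fft=1024, hop=512):
    """Magnitude spectrogram from a Hann-windowed STFT. n_fft=1024 gives
    n_fft // 2 + 1 = 513 frequency bins; the hop length is an assumption."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    spec = np.empty((n_fft // 2 + 1, n_frames))
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + n_fft] * window  # windowed frame
        spec[:, i] = np.abs(np.fft.rfft(frame))           # DFT magnitude
    return spec
```

A pure 440 Hz tone sampled at 16 kHz, for instance, produces a spectrogram whose energy concentrates in the frequency bin nearest 440 × n_fft / sr.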

    4. Convolutional Neural Networks

      Convolutional neural networks (CNNs) can be considered variants of the better-known Multilayer Perceptron (MLP), with node connections inspired by the visual cortex. CNNs have proven to be an effective tool in video analysis, natural language processing, and image recognition. More pertinent to the current effort, they have also been applied successfully to analyzing speech.

      Below is a quick primer on CNNs in the context of our project. CNNs usually take images as input; here, the input is a spectrogram represented in grayscale, where the intensity of gray encodes the audio power level at a given frequency and time. A filter is then passed over the image, and features characteristic of depressed and non-depressed individuals are learned with the help of the dataset labels.

      The CNN starts by learning simple features such as lines, but in subsequent layers begins to learn features such as the shape of a frequency-time curve (perhaps representing intonation). These learned features may form a representation of the various prosodic features of speech, which in turn reflect underlying differences between depressed and non-depressed speech.

      Fig. 3. General CNN architecture.

      Because of the incredibly detailed speech representations in spectrograms, the network can inconveniently pick up false signals such as plosives, ambient noise, and unsegmented audio. One way to work around this problem is to use regularization techniques such as pooling layers, dropout, and L1 penalties. However, unless abundant training data is available, it is very hard for the network to distinguish real predictors of depression from false signals.

      1. Class Imbalance: The number of depressed subjects in the dataset is almost four times smaller than the number of non-depressed ones, which can bias classification toward the non-depressed class. Additionally, interview durations differ greatly, with the shortest interview being 7 minutes long and the longest 34 minutes. This can introduce speaker bias: a larger amount of signal from one interviewee can over-emphasize features unique to that person.

        To overcome these problems, each spectrogram was divided into 4-second slices, and slices were then randomly sampled in a 50/50 proportion from the depressed and non-depressed classes. To ensure that the CNN saw interviews of equal duration, a predetermined number of slices was sampled from every participant. Initially there were 35 hours of segmented audio; after random sampling, this was reduced to less than 3 hours, which is adequate for this analysis.
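The slicing and per-participant sampling step can be sketched as below; the slice width of 125 STFT frames (about 4 seconds, matching the 513×125 inputs) is an assumption, and the function name is our own.

```python
import numpy as np

def slice_and_sample(spectrogram, n_slices, slice_width=125, seed=0):
    """Cut a (513, T) spectrogram into non-overlapping slices of
    slice_width frames and randomly sample a fixed number of them,
    so every participant contributes equal audio."""
    n_total = spectrogram.shape[1] // slice_width
    slices = [spectrogram[:, i * slice_width:(i + 1) * slice_width]
              for i in range(n_total)]
    rng = np.random.default_rng(seed)
    chosen = rng.choice(n_total, size=n_slices, replace=False)
    return [slices[i] for i in chosen]
```

Sampling the same number of slices from each participant enforces both the class balance and the speaker balance described above.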

        Various sampling methods were tried to increase the size of the training data, but most resulted in highly biased models in which most predictions were of the non-depressed class. Revised sampling methods should be considered a high priority in future work in order to increase the training sample size.

      2. Model Architecture: A convolutional neural network (CNN) model consisting of 6 layers was employed: 2 convolutional layers with max-pooling and 2 fully connected layers. Every spectrogram image has dimension 513×125, representing 4 seconds of audio with frequencies from 0 to almost 8 kHz. Because most human speech lies in the range 0.3-3 kHz, the frequency range was tuned as a hyperparameter. Every input is normalized with reference to decibels relative to full scale (dBFS).

        The architecture employed was inspired by a paper on Environmental Sound Classification with CNNs. Figure 4 shows the network architecture from that paper, and Figure 5 shows the DepressionDetect architecture.

        Fig. 4. Environmental Sound Classification CNN architecture.

        Fig. 5. DepressionDetect CNN architecture.

        The CNN used here begins with the input layer being convolved with 32 3×3 filters to create 32 feature maps, followed by a ReLU activation function. Next, the feature maps undergo dimensionality reduction in a max-pooling layer, which uses a 4×3 filter with a stride of 1×3.

        A second, similar convolutional layer with 32 3×3 filters follows, again with a max-pooling layer using a 1×3 filter and a stride of 1×3.

        This is followed by two dense layers. After the second dense layer, dropout of 0.5 is applied (meaning each neuron in the second dense layer has a 50% chance of being deactivated on each batch update).

        Finally, a softmax function returns the probability that a spectrogram belongs to the depressed or not-depressed class; the two class probabilities sum to 1. The batch size was set to 32, and an Adadelta optimizer, which adapts the learning rate based on the gradient, was used.

      3. Training the Model: The model was created using Keras with a Tensorflow backend.
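The architecture described above can be sketched in Keras as follows. Filter and pooling sizes follow the text; the dense-layer width of 128 is an assumption, since the text does not state it.

```python
# Hedged Keras sketch of the DepressionDetect CNN; the 128-unit dense
# layers are an assumption where the text is silent.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(513, 125, 1)),              # grayscale spectrogram slice
    layers.Conv2D(32, (3, 3), activation="relu"),  # 32 3x3 filters + ReLU
    layers.MaxPooling2D(pool_size=(4, 3), strides=(1, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(1, 3), strides=(1, 3)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                           # 50% dropout after 2nd dense
    layers.Dense(2, activation="softmax"),         # P(depressed), P(not depressed)
])
model.compile(optimizer="adadelta", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training would then call `model.fit` with a batch size of 32, per the text.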

    Fig. 6. ROC curve of the CNN model.

    Training was done on 40 randomly selected audio segments of dimension 513×125 from 31 participants in each class, a total of 2,480 spectrograms. This represents almost 3 hours of audio, chosen to adhere to strict class-balancing (depressed, not depressed) and speaker-balancing (160 seconds per subject) constraints. The model was trained for 7 epochs, after which overfitting occurred based on the training and validation loss curves.


    The model was assessed and the hyperparameters were tuned based on the AUC score and F1 score on a training and validation set. Since precision and recall can be misleading, especially when test sets have unbalanced classes, we use AUC scores, which are commonly used to evaluate emotion detection models.

    The test set consisted of 560 spectrograms from 14 participants, with 40 spectrograms per participant. To begin with, predictions were made on each of the 4-second spectrograms. Ultimately, however, a majority vote of the 40 spectrogram predictions was used to classify the participant as either depressed or not depressed.
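The majority-vote aggregation can be sketched as a few lines of NumPy; the function name and the 0.5 probability threshold are assumptions.

```python
import numpy as np

def majority_vote(slice_probs, threshold=0.5):
    """Participant-level label from per-slice depressed-class
    probabilities: each 4-second slice casts a vote; the majority wins."""
    votes = (np.asarray(slice_probs) > threshold).astype(int)
    return int(votes.sum() > len(votes) / 2)
```

For a participant with 40 slice predictions, 21 or more "depressed" votes yields a depressed classification.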

    Figure 7 summarizes the predictive power using individual 4-second spectrograms, and Figure 8 summarizes it using the majority-vote approach.

    Fig. 7. Model performance on individual 4-second spectrograms.

    Fig. 8. Model performance using the majority-vote approach.

    As mentioned above, a majority vote over the 40 spectrograms per interviewee was also explored as a method of predicting whether a particular participant was depressed. Because the test set contained only 14 participants (the rest of the data was reserved to increase the size of the training set), statistics on both the individual spectrogram predictions and the majority-vote approach are included. Predictably, the evaluation statistics improved with the majority-vote approach, though it must be kept in mind that the sample size is quite small.

    Most emotion detection models, which make use of the lower-level features mentioned earlier, have AUC scores of around 0.7; this model achieved an AUC score of 0.58. These results nevertheless suggest an encouraging new direction for using spectrograms in the early detection of depression.


    Ultimately, we envision the model being implemented in a wearable device such as an Apple Watch or in a home device such as an Amazon Echo. The device could prompt the person to answer a simple question in the morning and another at night, daily. The model would then store the predicted depression score and track it over time. If a certain threshold were surpassed, it would notify the person to seek help or, in extreme cases, notify an emergency contact.

    This model provides a strong foundation and encouraging signs for detecting depression from spectrograms. Further work should use a larger training set. Low-level audio transformations do a good job of reducing noise in the data, which allows robust models to be trained on smaller sample sizes; however, it is evident that they overlook subtleties in depressed speech.

    Fig. 9. Distribution of PHQ-8 scores.

    We would prioritize future efforts as follows:

    1. Sampling methods to increase training size without introducing class or speaker bias.
    2. Treating depression detection as a regression problem (see below).
    3. Introducing network recurrence (LSTM).
    4. Incorporating Vocal Tract Length Perturbation (VTLP).

    Depression is not binary; it moves along a spectrum.

    So, deriving a binary classification such as depressed or not depressed from a single test like the PHQ-8 is somewhat naïve and perhaps unrealistic. The threshold for a depression classification was a score of 10, but there is little difference in depression-related speech between a score of 9 (classified as non-depressed) and a score of 10 (classified as depressed). Hence, this problem may be better approached with regression techniques that predict participants' PHQ-8 scores, scoring the model by RMSE.
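The proposed regression scoring can be written out explicitly; the PHQ-8 scale runs from 0 to 24, and the helper below is our own illustrative sketch.

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true PHQ-8 scores
    (0-24 scale), the proposed regression metric."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))
```

Unlike a binary threshold at 10, RMSE penalizes a prediction of 12 for a true score of 10 only mildly, reflecting the spectrum nature of depression severity.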


Depression is a dysregulation of the brain functions that control emotions (or moods). It is commonly referred to as Major Depressive Disorder, or MDD, and is characterized by strong and continuous negative emotions. These emotions have a large negative impact on people's lives and can cause social, educational, personal, and family difficulties. It is a serious medical condition that affects the way a person's mood is controlled by the brain, and it negatively impacts the way a person thinks, feels, and acts. Due to this condition, a person comes to experience the world through a negative lens. Depression can often last for weeks or even months; this is commonly called an episode of depression. The majority of people who have this condition will experience many such episodes during their lifetime. In a given year, approximately 7 percent of people will have experienced depression. It is more prevalent in women and in young adults, though men are also prone to this affliction, and the first episode often begins in the teen years or early adulthood.

Major depression is characterized by a multitude of symptoms that severely disrupt a person's ability to eat, sleep, work, maintain interpersonal relationships, or engage in activities they might once have enjoyed. Major depressive disorder can become so debilitating that a person is unable to function normally in daily life.

Over 322 million people worldwide are affected by this disease. It is the most common cause of disability, affecting nearly 16% of the global population, and Major Depressive Disorder (MDD) attracts increasing attention for this reason. However, the underlying mechanism of the disorder is largely not understood. According to published reports from the World Health Organization (WHO), MDD is projected to be the leading cause of disability in the world by 2030.

Depression, if left unchecked, can even lead to death, if the negative symptoms of the disease result in a person deciding to take their own life. Depression can make people feel helpless and devoid of hope, leading them to the unfortunate conclusion that suicide is the only way to end their misery.

The current methods of diagnosing and treating depression have been found to be largely ineffective and highly expensive, and they are inconsistent across professionals. Our project introduced a way to apply a new technology, Convolutional Neural Networks, to trace common features or identifiers of depression. The model was able to learn the prosodic features of a set of depressed audio given as input in the form of spectrograms.


  1. Gratch, Artstein, Lucas, Stratou, Scherer, Nazarian, Wood, Boberg, DeVault, Marsella, Traum. The Distress Analysis Interview Corpus of human and computer interviews. In LREC 2014, May (pp. 3123-3128).
  2. Girard, Cohn. Automated Depression Analysis. Curr Opin Psychol. 2015 August; 4: 75-79.
  3. Ma, Yang, Chen, Huang, Wang. DepAudioNet: An Efficient Deep Model for Audio-based Depression Classification. ACM International Conference on Multimedia (ACM-MM) Workshop: Audio/Visual Emotion Challenge (AVEC), 2016.
  4. Giannakopoulos, Pikrakis. Introduction to Audio Analysis: A MATLAB Approach. Oxford: Academic Press, 2014.
  5. Piczak. Environmental Sound Classification with Convolutional Neural Networks. Institute of Electronic Systems, Warsaw University of Technology, 2015.
