Speaker Identification by Combining MFCC and Kohonen Neural Networks in Noisy Environments

DOI : 10.17577/IJERTV7IS040045


Cătălin Mircea Dumitrescu

Department of System Engineering UPB,

Splaiul Independentei no. 313, Sector 6, Bucharest, 060042, Romania

Professor, Ph.D. Ioan Dumitrache

Department of System Engineering UPB,

Splaiul Independentei no. 313, Sector 6, Bucharest, 060042, Romania

Abstract: The Mel-Frequency Cepstral Coefficients (MFCC) are widely used as acoustic features for speaker recognition algorithms, but they are not very robust in the presence of additive noise.

In this paper, we present and test a robust text independent architecture for speaker recognition in noisy environments. The system uses Kohonen neural networks (Self-Organizing Maps – SOM) for speaker modelling and MFCCs as acoustic features. By using the Spearman distance as a metric function for computing the distortion between the known SOM speaker models and the unknown speech input data, we achieve a system recognition rate of 80% in noisy conditions.

We compare the performance of the system for different noise sources and signal to noise ratios.

Keywords: Robust speaker recognition; Self-Organizing Maps (SOMs); Mel-Frequency Cepstral Coefficients (MFCCs)

  1. INTRODUCTION

    Speaker identification is the task of determining the identity of a speaker from a known population using their voice, based on the individual acoustic features contained in the speech.

    There are two categories of speaker identification methods: text-dependent and text-independent. In text-dependent systems the speaker is identified using a specific phrase, such as a password or an access code. Text-independent methods, on the other hand, rely only on the characteristics of the speaker's voice for identification.

    Fig. 1. Speaker identification system (block diagram: feature extraction from the input speech, similarity computation against each speaker model 1..N, maximum selection, identity output)

    All speaker identification systems contain three modules: an acoustic feature extraction module, a speaker modelling technique and a feature matching module.

    The acoustic feature extraction module converts the speech signal from a waveform into a parametric representation. Speech is a quasi-stationary signal, so when it is examined over a short period of time (usually between 5 and 100 ms), its characteristics are stationary. As a result, a common way to characterize the speech signal is short-time spectral analysis. The most used representation of the short-term power spectrum of a human sound is the Mel-frequency cepstrum (MFC), which is a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. Mel-Frequency Cepstral Coefficients (MFCCs) are the coefficients that collectively make up an MFC. By using the Mel scale, a perceptual scale of pitches judged by listeners to be equal in distance from one another, the MFC approximates the human auditory system more closely than the normal cepstrum.
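    As an illustrative sketch (not part of the original paper), the commonly used analytic form of the Hz-to-Mel mapping can be written as below; note that the Slaney filter bank used later in this work employs a different linear/logarithmic spacing, so the exact constants here are an assumption.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common O'Shaughnessy-style Hz-to-Mel mapping: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Example: points equally spaced on the Mel scale are non-linearly spaced in Hz.
mels = np.linspace(hz_to_mel(133.0), hz_to_mel(3400.0), 10)
print(np.round(mel_to_hz(mels), 1))
```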

    This paper describes a speaker identification system that uses MFCCs combined with delta MFCCs (DMFCCs) as features and a speaker modelling technique based on SOMs. To determine the similarity, we compute the pair-wise distance between the feature vectors of the speech sample and the known speaker models.

    The paper is organized as follows. Section 2 reviews state-of-the-art speaker identification algorithms and systems. In Section 3 we present the algorithm developed in our research. Section 4 is a case study in which different parameters are tuned and the proposed algorithm is tested and validated; in this section we determine the optimal frame rate and a noise-robust metric. Finally, Section 5 presents our conclusions.

  2. STATE OF THE ART

    State-of-the-art speaker identification algorithms are based on statistical models of short-term acoustic measurements. The most popular feature extraction methods are Perceptual Linear Coding (PLC), Linear Predictive Coding (LPC) and Mel-Frequency Cepstral Coding (MFCC). Spectral methods, like MFCC, have an interesting property: they mimic the functional properties of the human ear by using a logarithmic scale, and methods that use spectral analysis of the speech therefore tend to give better performance.

    Modelling techniques used in speaker identification are: Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Support Vector Machines (SVM), Vector Quantization (VQ), Dynamic Time Warping (DTW) and, recently, Self-Organizing Maps (SOMs, i.e. Kohonen neural networks).

    In order to establish the identity, template model techniques, like VQ or DTW, directly compare the training feature vectors against the unknown feature vectors. The distortion between them represents their degree of similarity.

    L. Wang et al. [1] present a method that combines MFCC and phase information in order to perform speaker identification in noisy environments. The speaker model is computed by combining two GMMs, one for the MFCCs and one for the phase information. By doing so, the error rate is reduced by 20% to 70%. On clean speech, the system has a detection rate of 99.3%, but it drops to 59.7% in a noisy environment with a signal-to-noise ratio (SNR) of 10 dB.

    A method that uses Self-Organizing Maps for speaker modelling is presented in [2] by E. Monte et al. The system uses the VQ function of the SOM and its topological property, i.e. the fact that neighbouring code-words in the feature space are neighbours in the topological map. The feature vectors were computed using two methods, LPC and MFCC (24 coefficients for both), with an analysis window of 30 ms and a 10 ms overlap between windows. In the training stage, an occupancy histogram is computed for each speaker and then smoothed with a low-pass filter. In the identification stage, the occupancy histogram of the unknown speaker is computed and compared with the known models from the database. Using the relative entropy, the system determines the degree of similarity between the unknown speaker's histogram and the reference.

    The system performance was as follows: 98.2% for clean speech using LPC, 100% for clean speech using MFCC, 8% in a noisy environment with an SNR of 10 dB using LPC and 19.5% in a noisy environment with an SNR of 10 dB using MFCC.

    Another method was proposed by R. Hasan in [3]. It uses MFCC for feature vector extraction and VQ to create the speaker model. The system was tested on clean speech from 21 speakers with different codebook sizes and framing windows, and achieved a 100% recognition rate for a codebook size of 64.

  3. PROPOSED ALGORITHM

    The system that we propose uses the Mel-Frequency Cepstral Coefficients as feature vectors and Self-Organizing Maps for creating the speaker models. The training of the neural networks is done using noise-free recordings, while the testing of the system is performed using noise-corrupted speech files.

    Fig. 2. Proposed algorithm (block diagram: feature extraction from the input speech, distortion computation against each speaker SOM 1..N, minimum selection, identity output)

    The algorithm uses the VQ functionality of the self-organizing maps in order to establish the identity of the speaker. We compute the mean quantization distortion between the unknown speech input data and all the known speaker models from the database. The identity of the unknown speaker corresponds to the SOM for which the quantization distortion has the smallest value.

    A. Feature extraction

      The feature extraction module generates a parametric representation of the speech waveform. In the proposed algorithm, we use Mel-Frequency Cepstral Coefficients (MFCCs) and delta MFCCs (DMFCCs) as speech features.

      There are many implementations of the original MFCC algorithm, introduced in 1980 by S. B. Davis and P. Mermelstein [4]. They mainly differ in the number of filters, the shape of the filters, the way the filters are spaced, the bandwidth of the filters and the manner in which the spectrum is warped. The algorithm used in this work is the one implemented by Slaney in [5].

      Fig. 3. Block diagram of MFCC extraction (speech -> pre-emphasis -> frame blocking -> windowing -> DFFT -> Mel-frequency warping -> Log() -> DCT -> cepstrum extraction -> MFCC)

      The speech waveform, sampled at 8 kHz, 16 bit, is sent to a second-order high-pass filter in the pre-emphasis stage. The aim of this stage is to boost the high-frequency components of the human voice that are attenuated during speech production.

    After that, the signal is blocked into frames of N samples with a frame overlap of M samples. Typical values are 30ms frame width and 10ms overlap. The proposed algorithm uses a 32ms frame, but the optimal overlap between frames is determined through testing.

    To minimize discontinuities between frames, a windowing function is applied to each frame. We use the Hamming window in this work:

    w(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1        (2)

    where N is the frame length in samples.
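    A minimal sketch (our own illustration, not the authors' code) of the frame blocking and Hamming windowing steps, assuming an 8 kHz signal, 32 ms frames (256 samples) and a 5 ms frame step (40 samples):

```python
import numpy as np

def frame_and_window(signal, frame_len=256, frame_step=40):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    frame_len=256 and frame_step=40 correspond to 32 ms frames advanced every
    5 ms for 8 kHz audio (200 frames per second).
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_step)
    idx = (np.arange(frame_len)[None, :]
           + frame_step * np.arange(n_frames)[:, None])
    frames = np.asarray(signal, dtype=float)[idx]
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window  # shape: (n_frames, frame_len)
```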

    After applying the windowing function, each frame is converted from the time domain into the frequency domain using the DFFT algorithm. The power spectrum obtained in the previous step is mapped onto the Mel scale using a filter bank. The algorithm uses a filter bank of 40 equal-area filters (see Fig. 4), implemented by Slaney in [5]. It covers the frequency range of 133-6854 Hz and consists of:

      • 13 linearly spaced filters, range 100-1000 Hz, with a step of 133.33 Hz

      • 27 logarithmically spaced filters, range 1071-6854 Hz, with a logarithmic step of 1.0711703

        Fig. 4. MFCC filter bank

        In the final step, the logarithmic Mel-spectrum is converted back into time domain using the discrete cosine transformation (DCT).

        The proposed algorithm uses a feature vector of 64 components, made up of the first 32 MFCCs and their corresponding DMFCCs, in order to add dynamic information to the static cepstral features.
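        A minimal sketch of this 64-dimensional feature extraction using the librosa package (an assumption on our side; the paper uses Slaney's Auditory Toolbox [5], whose 40-filter bank librosa's Slaney-style mel filters only approximate):

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=8000, n_mfcc=32):
    """Return a (64, n_frames) matrix of MFCCs stacked with their delta-MFCCs.

    32 ms frames (n_fft=256) with a 5 ms step (hop_length=40) give the
    200 fps frame rate selected later in the case study.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=256, hop_length=40, n_mels=40)
    dmfcc = librosa.feature.delta(mfcc)      # first-order deltas (DMFCC)
    return np.vstack([mfcc, dmfcc])          # shape: (64, n_frames)
```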

    B. Speaker modeling

    For the speaker model we use Kohonen neural networks, or Self-Organizing Maps (SOMs): for each speaker we train an 8×8 SOM using the 64-component feature vectors.

    Self-Organizing Maps provide a way of representing multidimensional data in much lower dimensional spaces, usually one or two dimensions. They consist of components called nodes or neurons, arranged in a 1D or 2D (usually hexagonal or rectangular) grid or, less commonly, in higher-order structures. Each node has an associated weight vector and a specific topological position in the output map space; the dimension of the weight vector matches the dimension of the input data vector. Distances between neurons are calculated from their positions with a distance function.

    Fig. 5. SOM architecture

    The main advantage of using Self-Organizing Maps is that they preserve the topological properties of the input space.

    Being a neural network, a SOM needs to be trained. The training is unsupervised and it uses a competitive learning algorithm. At each training step, an input vector is chosen and fed to the network. Using a distance function, a winning neuron is identified, and then the weights of the winning neuron and its neighbours are updated. Thus, the winning neuron and its close neighbours move towards the input vector. The training stage ends after a predefined number of epochs.

    The proposed algorithm associates with each speaker an 8×8 hexagonal SOM with 64 inputs. As neighbourhood distance function we use the link distance (linkdist), and the number of training epochs is set to 500.
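    A minimal sketch of training one speaker model with the third-party minisom package (an assumption on our side: the paper uses a MATLAB-style SOM with linkdist and hexagonal topology, which this sketch only approximates with a rectangular grid and Gaussian neighbourhood):

```python
import numpy as np
from minisom import MiniSom

def train_speaker_som(features, map_size=8, epochs=500, seed=0):
    """Train an 8x8 SOM on (n_frames, 64) feature vectors and return it."""
    features = np.asarray(features, dtype=float)
    som = MiniSom(map_size, map_size, features.shape[1],
                  sigma=1.5, learning_rate=0.5, random_seed=seed)
    som.random_weights_init(features)
    som.train_random(features, num_iteration=epochs * len(features))
    return som

# Identification idea (Euclidean variant): the unknown speaker is the model
# with the smallest mean quantization error over the test feature vectors, e.g.
#   best = min(models, key=lambda name: models[name].quantization_error(test_features))
```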

  4. CASE STUDY

    In the following section, we determine the optimal frame step for the MFCC extraction algorithm, identify a robust metric for computing the similarity between the unknown test speaker and the speaker models from the database, and test the proposed algorithm for different types of noise and SNRs.

    The database used in this work is The CHAINS corpus [6]. It contains 36 speakers, with 2 recordings for each speaker. The different recording sessions are about two months apart and present different speaking styles. For our algorithm, we train the SOMs using the solo reading recordings, where subjects simply read a prepared text at a comfortable rate. The testing of the system is done using the retell recordings, where, after reading the Cinderella fable in the solo condition, subjects were asked to retell the story in their own words. The sound files are down-sampled from 44.1 kHz to 8 kHz, 16 bit mono PCM.

    A. Determining the optimal frame rate

      In order to compare two Self-Organizing Maps, a performance index must be defined. In this work, we use the mean quantization error of the SOM as performance index. One of the goals of a SOM is to quantize the input values into a finite number of centroids or output vectors. The quantization error is defined as the distance between an input vector and its nearest centroid, and the sum of the quantization errors over the input data represents the network distortion:

      E = Σ_{i=1..N} d(x_i, w_c(x_i))        (3)

      As mentioned earlier, we use the mean value of the distortion as performance index for the speaker model:

      Q = E / N = (1/N) · Σ_{i=1..N} d(x_i, w_c(x_i))        (4)

      where N is the number of input vectors, x_i an input vector, w_c(x_i) the weight vector of its nearest centroid (winning neuron) and d a distance function.
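      A minimal sketch (our own illustration, not the authors' code) of the performance index in (4), computed with SciPy for the three vector-space metrics; the Spearman case can be obtained by rank-transforming the vectors first (a separate sketch below illustrates this).

```python
import numpy as np
from scipy.spatial.distance import cdist

def mean_quantization_error(features, codebook, metric="euclidean"):
    """Mean distance between each feature vector and its nearest SOM codeword.

    features: (n_frames, 64) test vectors; codebook: (n_neurons, 64) SOM weights.
    metric: 'euclidean', 'cityblock' or 'chebyshev', matching Eqs. (5)-(7).
    """
    d = cdist(features, codebook, metric=metric)   # (n_frames, n_neurons)
    return float(d.min(axis=1).mean())             # Eq. (4)
```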

      To determine the optimal frame rate, we generate speaker models using several frame overlaps (12.5 ms, 10 ms, 8 ms, 6 ms, 5 ms, 4 ms and 3 ms) and afterwards compute, for each of the models, the mean quantization error on the training data. The SOM performance index is computed using the following metrics: Euclidean distance, Cityblock distance, Chebychev distance and Spearman distance.

      Given two feature vectors x_s and x_t, treated as n-dimensional row vectors, the four distances are defined as:

      • Euclidean distance: d_st = sqrt( Σ_{j=1..n} (x_sj - x_tj)² )        (5)

      • Cityblock distance: d_st = Σ_{j=1..n} |x_sj - x_tj|        (6)

      • Chebychev distance: d_st = max_j |x_sj - x_tj|        (7)

      • Spearman distance: d_st = 1 - ρ_st        (8)

        where ρ_st is the sample Spearman rank correlation between x_s and x_t, computed on the ranks r_sj and r_tj of x_sj and x_tj taken over the components of x_s and x_t, respectively.

        Fig. 6. Quantization error using Euclidean distance (evolution of the mean quantization error per speaker model as a function of frame rate, 80-300 fps)
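        A short sketch of the Spearman distance in (8) (our illustration, assuming the usual rank-correlation definition): rank-transform each vector and take one minus the Pearson correlation of the ranks.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_distance(x, y):
    """Spearman distance d = 1 - rho, as in Eq. (8)."""
    rho, _ = spearmanr(x, y)
    return 1.0 - rho

# Equivalent view: Pearson correlation computed on the ranks of the components.
x = np.array([0.3, -1.2, 0.7, 2.0])
y = np.array([0.1, -0.4, 0.5, 1.1])
rx, ry = rankdata(x), rankdata(y)
rho_manual = np.corrcoef(rx, ry)[0, 1]
assert abs((1.0 - rho_manual) - spearman_distance(x, y)) < 1e-9
```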

        Fig. 7. Quantization error using Cityblock distance (evolution of the mean quantization error per speaker model as a function of frame rate)

        The evolution of the average quantization error for each of the speaker models is presented in figures 6 to 9. By analysing them, we can observe that as the frame rate increases (i.e. the frame overlap decreases), the quantization error for each metric function decreases. But with the increase of the frame rate, the number of MFCCs also increases, and thus the system's processing time grows. In order to determine an optimal frame rate, we must therefore define an optimization criterion.

        We propose an optimization criterion that relates the variation (in %) of the performance index defined in (4) to the variation (in %) of the number of MFCCs caused by the frame rate increase.

        Fig. 8. Quantization error using Chebychev distance (evolution of the mean quantization error per speaker model as a function of frame rate)

        Fig. 9. Quantization error using Spearman distance (evolution of the mean quantization error per speaker model as a function of frame rate)

        As can be seen from the previous figures, the mean quantization error for each speaker model decreases as the frame rate increases (higher frame rates give a lower quantization error). This fact can also be observed in figure 10, where the evolution of the mean performance index over all the speaker models is shown.

        Fig. 10. Evolution of the mean quantization error over all speaker models vs. frame rate (system mean quantization error, log scale, for the Euclidean, Cityblock, Chebychev and Spearman distances)

        From figure 11, which illustrates the evolution of the optimization criterion in relation to the frame rate, we can conclude that the optimal frame rate for extracting the MFCCs is 200 fps (a frame overlap of 5 ms). The Spearman metric presents a maximum when the frame rate is 200 fps; all the other metrics have a downward trend, i.e. their performance increases more slowly than the number of Mel-Frequency Cepstral Coefficients.

        Fig. 11. Optimization criterion evolution (optimization criterion value vs. frame rate, 100-300 fps, for the Euclidean, Cityblock, Chebychev and Spearman metrics)

    B. Noise robust metric identification

      To determine a noise-robust metric, we test the speaker models with a noise-corrupted version of the training data. Because the CHAINS corpus is recorded in a sound-proof booth (no noise is present), we have altered the training recordings by adding Gaussian white noise at different SNRs (40 dB, 35 dB, 30 dB, 25 dB, 20 dB, 15 dB, 10 dB, 5 dB and 0 dB). In these tests we use a frame width of 32 ms and a frame overlap of 5 ms, as determined in section A.
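      A minimal sketch of mixing a noise signal into clean speech at a prescribed SNR (an illustration of the corruption procedure; the exact tooling used by the authors is not specified):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`."""
    speech = np.asarray(speech, dtype=float)
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: Gaussian white noise mixed in at 20 dB SNR.
rng = np.random.default_rng(0)
noisy = add_noise_at_snr(rng.standard_normal(8000), rng.standard_normal(8000), snr_db=20)
```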

      By analysing the system outputs in figure 12 (blue represents lower values, red higher values) for the case where the input data is without noise, we can observe that the minimum values for the mean quantization error are obtained when the speech data and the speaker model come from the same person (the minimum values are on the minor diagonal), and thus we have a recognition rate of 100%.

      Fig. 12. System's mean quantization error, no noise: (a) Euclidean, (b) Cityblock, (c) Chebychev, (d) Spearman norm (speaker ID vs. model ID)

      By adding noise, the minimum values for the performance index no longer appear on the minor diagonal (where the speaker model and the speech sample correspond to the same person), as can be seen in figure 13, which shows the system outputs for an SNR of 20 dB, and so the identification rate decreases.

      Fig. 13. System's mean quantization error, SNR = 20 dB: (a) Euclidean, (b) Cityblock, (c) Chebychev, (d) Spearman norm (speaker ID vs. model ID)

      By analysing figures 12(a)-(d) and 13(a)-(d), we can establish that one of the four norms used to compute the performance index is much less influenced by the noise. When the mean quantization error is computed using the Spearman metric as distance function, the system presents roughly the same distribution of output values in the two cases (no noise, figure 12(d); 20 dB SNR, figure 13(d)).

      The identification rate for an SNR of 20 dB is 97% when the Spearman metric is used; for the other metrics the rate is below 50%.
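      A short sketch of the identification decision and the resulting identification rate, given the matrix of mean quantization errors between every test speaker and every speaker model (our illustration; the variable names are hypothetical):

```python
import numpy as np

def identification_rate(distortion):
    """distortion[i, j] = mean quantization error of test speaker i against model j.

    Each test speaker is assigned to the model with the smallest distortion;
    the identification rate is the fraction of speakers assigned to themselves.
    """
    distortion = np.asarray(distortion, dtype=float)
    predicted = distortion.argmin(axis=1)
    return float(np.mean(predicted == np.arange(distortion.shape[0])))

# Toy example with 3 speakers and one confusion -> rate = 2/3.
D = np.array([[0.40, 0.55, 0.60],
              [0.52, 0.44, 0.58],
              [0.49, 0.47, 0.50]])
print(identification_rate(D))
```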

      Figure 14 shows the evolution of the identification rate, for all the four metric functions, in relation to the SNR value. It can be easily seen that:

      • the performance of the system, when the Chebychev metric is used, is heavily influenced by the Gaussian white noise;

      • using the Cityblock metric, the system outperforms the case when the performance index is computed using the Euclidean metric;

      • the Spearman metric is the least affected by noise.

      Fig. 14. Identification rate vs. SNR using training data as input (identification rate, %, vs. SNR, dB, for the Euclidean, Cityblock, Chebychev and Spearman metrics)

    C. System validation

    We evaluate the system performance using the retell recordings of the CHAINS corpus and different noise sources (Gaussian white noise, airport noise, restaurant noise and street noise) with multiple SNRs. The Mel-Frequency Cepstral Coefficients are extracted using a frame width of 32 ms and a frame overlap of 5 ms (section A). The quantization error is computed using the Spearman metric as distance function.

    As a benchmark, to compare the degradation of the system's performance with noise, we first determine the system's mean quantization error using the retell speech files without noise corruption. Afterwards, the input files are mixed with the different noise sources at multiple signal-to-noise ratios. All the files are down-sampled to 8000 Hz, near telephone-line quality.

    Fig. 15. Mean quantization error for retell recordings (without noise) (speaker ID vs. model ID)

    Looking at figures 15 and 12(d), we see that the system's quantization error is nearly the same in the two cases (input data taken from the retell sound files and from the solo reading sound files).

    Fig. 17. System mean quantization error using retell sound files and an SNR of 20 dB: (a) airport noise, (b) restaurant noise, (c) street noise 1, (d) street noise 2, (e) Gaussian white noise (speaker ID vs. model ID)

      Figure 16 shows the evolution of the system's identification rate for the different types of noise sources and signal-to-noise ratios. We can see that the system presents a 100% identification rate down to an SNR value of 25 dB, for all types of noise sources. The system is influenced most by the Gaussian white noise: the identification rate drops to 85% at an SNR of 20 dB and the performance decrease is more rapid for this type of noise. In the other cases, the system maintains an identification rate of 100% until an SNR of 15-10 dB. The highest rate at 0 dB SNR is nearly 65%, obtained with street noise as the corruption source; the lowest value is 15%, when using Gaussian white noise.

      Fig. 16. Identification rate vs. SNR using the retell recordings (identification rate, %, vs. SNR, dB, for airport, restaurant, street 1, street 2 and Gaussian white noise)

      Fig. 18. Influence of noise on the mean Spearman distance between the speaker model and the corresponding test data (mean Spearman distance vs. SNR, dB, for the five noise sources)

      According to (8), the Spearman distance is one minus the sample Spearman's rank correlation between observations, treated as sequences of values. The rank correlation coefficient is between -1 and 1:

      • -1 if the correlation between the two rankings is perfect and one ranking is the reverse of the other;

      • 0 if the rankings are completely independent;

      • 1 if the correlation between the two rankings is perfect and the two rankings are the same.

    A high rank correlation coefficient implies a strong correlation between the rankings of the two observations: the speaker model and the speech feature vectors have similar shapes.

  5. CONCLUSIONS

Mel-Frequency Cepstral Coefficients (MFCCs) are the most popular speech features used in speaker identification algorithms, but they are not very robust to noise. The degradation of the MFCC values due to noise implies a degradation of the system performance.

By using the Spearman distance to compute the mean distortion between the known speaker models and the unknown speech input data, the system's identification rate is less influenced by noise. This is due to the fact that the Spearman distance is computed on the ranks of the vectors, not on their values.

For a low SNR (15 dB to 0 dB), the mean distance is greater than 0.5, which indicates a moderate correlation between the input data (the speaker's retell recording) and the speaker model.

Analysing the above-mentioned results, we can conclude that the system presented in this work is noise robust and has a high identification rate.

When the input data corruption is performed using real noise sources (like airport, restaurant and street noise), the system outperforms those presented in [1] and [3].

Depending on the noise source and SNR value, the system presents an identification rate greater than 80% (for SNRs greater than 10 dB). The identification rate drops to 20% when the speech is corrupted using airport noise with an SNR value of 0 dB.

REFERENCES

  1. Wang, L., Minami, K., Yamamoto, K., Nakagawa, S., 2010. Speaker Identification by Combining MFCC and Phase Information in Noisy Environments. Acoustics Speech and Signal Processing (ICASSP), Dallas.

  2. Monte, E., Hernando, J., Miró, X., Adolf, A., 1996. Text Independent Speaker Identification on Noisy Environments by Means of Self Organizing Maps. Fourth International Conference on Spoken Language Processing (ICSLP), vol. 3, pp. 1804-1807.

  3. Hasan, R., Jamil, M., Rabbani, G., Rahman, S., 2004. Speaker Identification Using Mel Frequency Cepstral Coefficients. 3rd International Conference on Electrical & Computer Engineering (ICECE).

  4. Davis, S. B., Mermelstein, P., 1980. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366.

  5. Slaney, M., 1998. Auditory Toolbox, Version 2. Technical Report #1998-010, Interval Research Corporation.

  6. Cummins, F., Grimaldi, M., Leonard, T., Simko, J., 2006. The CHAINS corpus: CHAracterizing INdividual Speakers. Proceedings of SPECOM06, St. Petersburg.

  7. Faundez-Zanuy, M., Monte-Moreno, E., 2005. State Of The Art On Speaker Recognition. IEEE Aerospace and Electronic Systems Magazine, vol. 20, no. 5, pp. 7-12.

  8. Kohonen, T., Honkela, T., 2007. Kohonen network [Online]. Available: http://www.scholarpedia.org/article/Kohonen_network.

  9. Bodt, E., Cottrell, M., Verleysen, M., 2002. Statistical tools to assess the reliability of self-organizing maps. Neural networks: the official journal of the International Neural Network Society, vol. 15, pp. 967-978.

  10. Ganchev, T., Fakotakis, N., Kokkinakis, G., 2005. Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task. SPECOM.

  11. Kumar, S. C., Rao, P. M., 2011. Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm. International Journal on Computer Science and Engineering (IJCSE), vol. 3, no. 8, pp. 2942 – 2954.
