Automatic Speech Recognition for Telephone Voice Dialling in Yoruba

DOI : 10.17577/IJERTV1IS4039


T.S. Ibiyemi, Dept. of Electrical & Electronics Engineering, University of Ilorin, Ilorin, Nigeria

A.G. Akintola, Dept. of Computer Science, University of Ilorin, Ilorin, Nigeria


Human-computer interaction is largely carried out through electromechanical devices. These interaction media are far from user-friendly and are often risky, even life-threatening, in applications such as dialling a phone while driving. This paper presents our work on telephone auto-dialling in the Yorùbá language. The speech recognition algorithm was coded in C and run on a 2.6 GHz dual-core Pentium PC with 2 GB of RAM, with a GSM set and a multimedia headset attached to the PC. The experiments yielded a 94% speaker recognition rate and an 82% phone sentence recognition rate.

  1. Introduction

    Human-computer interaction, HCI, is largely carried out through electromechanical devices such as the keyboard, mouse, joystick, printer, and monitor. These interaction media are far from user-friendly and are often risky, even life-threatening, in applications such as dialling a phone while driving. A more natural form of human-machine interaction that removes this risk is voice dialling. However, voice dialling by anybody who has access to the phone is not good enough; the phone must also be able to authenticate its authorised users. To extend this technology to the grass roots, speaker authentication prior to telephone voice auto-dialling is implemented here in the Yorùbá language.

    Yorùbá is one of the three dominant local languages in Nigeria and is spoken by about 22 million people. It is a tonal language: the tone with which a Yorùbá word is pronounced determines its meaning. This is unlike non-tonal languages, where the spelling of a word suffices to infer its meaning. The problem is compounded by the fact that Yorùbá is full of homographs, that is, words with the same spelling but different meanings depending on their pronunciation tones. The Yorùbá alphabet has 25 letters (a, b, d, e, ẹ, f, g, gb, h, i, j, k, l, m, n, o, ọ, p, r, s, ṣ, t, u, w, y): 7 of them are vowels (a, e, ẹ, i, o, ọ, u) and the remaining 18 are consonants (b, d, f, g, gb, h, j, k, l, m, n, p, r, s, ṣ, t, w, y). There are three tone levels, namely low, mid, and high. The low and high tones are marked by the grave accent (`) and the acute accent (´) respectively; the mid tone carries no accent. An accent, where required, is only allowed on the vowel of a Yorùbá syllable. The ten Yorùbá numerals (ofo, ení, eji, ẹta, ẹrin, àrún, èfà, eje, ẹjọ, ẹsan) correspond to the ten numerals (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) respectively. Mobile telephone numbers in Nigeria consist of 11 digits drawn from these numerals, and the first digit is always 0 for calls within the country.
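The numbering rule above (11 digits, with a leading 0) can be checked mechanically before a recognised digit string is handed to the GSM set. The following is a minimal C sketch of such a check, not the authors' code; the function name `is_valid_local_mobile_number` is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Returns true when s consists of exactly 11 decimal digits and the
   first digit is '0'. The function name is hypothetical. */
bool is_valid_local_mobile_number(const char *s) {
    assert(s != NULL);               /* s must be a NUL-terminated string */
    if (strlen(s) != 11 || s[0] != '0')
        return false;
    for (size_t i = 0; i < 11; ++i)
        if (!isdigit((unsigned char)s[i]))
            return false;
    return true;
}
```

A recogniser could call this on the concatenated digit string and refuse to dial when it returns false, for example when a digit was missed and only 10 digits were recognised.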

    Man-machine interfacing by speech is a paradigm shift towards the most natural and user-friendly form of interaction. This shift is made possible by the automatic speech recognition, ASR, system.

  2. Automatic Speech Recognition System

    Fig. 1 shows the basic structure of an automatic speech recognition system [1, 2]. A microphone transforms the air pressure variation caused by speech into an analogue electrical signal, which is then fed into an analogue-to-digital converter for conversion into digital samples. These speech samples are pre-processed to put them in a form from which their discriminating characteristics can easily be extracted and represented as feature vectors. An ASR system has two phases, namely the training phase and the query or operational phase. During the training phase, the feature vectors that characterise each word and/or speaker are stored in a database as that word's or speaker's reference template.

    After the training phase, the system is ready to be deployed for speech recognition. The word to be recognised passes through the microphone, pre-processing, and feature extraction units as in the training session. This time, however, the extracted feature vectors are not stored but are compared with each word's feature vectors in the reference template database. The result of the comparison is checked against a threshold to decide whether the word is recognised or not. The different words used in the training phase constitute the vocabulary of words that can be recognised.

  3. Hardware Model

    The hardware model of this ASR system consists of a PC with an in-built sound card, a microphone, and an RS-232 GSM set. A speaker makes an utterance in Yorùbá, which the microphone captures and transduces from a sound wave into an analogue electrical signal. The output of the microphone is fed into the in-built sound card of the PC for conversion into digital form. During training, the digitised speech data are stored for offline processing. The software commonly used to drive the sound card allows some flexibility in configuring the bit rate, the number of audio channels, and the number of bits per sample. During the operational phase, however, speech data are captured online, that is, no intermediate storage of speech data is required.

  4. Software Model

    The software model of the automatic speech recognition system consists of a series of algorithms for realising the speech recognition, as described next.

    1. Pre-processing

      The pre-processing implementation consists of the following steps:

      1. Voice Activity Detection, VAD

        The VAD, also known as word endpoint detection, removes the inter-word silence periods. The energy and zero crossing rate, ZCR, of a word are used in conjunction with thresholds to segment the active region of the word from the silent background. The VAD algorithm is defined as follows.

        Let $s_i$, $i = 1, 2, \ldots, N$, be a frame of speech samples, where $N$ = no. of samples.

        Frame Energy:

        $$E = \frac{1}{N}\sum_{i=1}^{N} |s_i| \tag{1}$$

        Frame Zero Crossing Rate:

        $$z = \frac{1}{N}\sum_{j=2}^{N} \frac{|\mathrm{sgn}(s_j) - \mathrm{sgn}(s_{j-1})|}{2} \tag{2}$$

        $$\mathrm{sgn}(s(\cdot)) = \begin{cases} 1, & s(\cdot) \ge 0 \\ -1, & s(\cdot) < 0 \end{cases}$$

        Energy Upper Threshold: $T_u$ is set from the mean energy of the frames assumed to be noise,

        $$\bar{E} = \frac{1}{L}\sum_{i=1}^{L} E_i \tag{3}$$

        where $E_i$ = $i$th frame energy, and $L$ = no. of frames assumed as noise, 10 in our case.

        Energy Lower Threshold:

        $$T_l = 0.25\,T_u \tag{4}$$

        Zero Crossing Rate Threshold:

        $$T_z = \frac{1}{L}\sum_{i=1}^{L} z_i \tag{5}$$

        where $z_i$ = $i$th frame ZCR, and $L$ = no. of frames assumed as noise, 10 in our case.

      2. Pre-Emphasis Filter

        The high-pass FIR filter implemented has a transfer function of:

        $$H(z) = 1 - a z^{-1} \tag{6}$$

        where $a$ is the pre-emphasis filter coefficient.

        The time response of this filter is:

        $$y(n) = s(n) - a\,s(n-1)$$
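The VAD quantities and the pre-emphasis filter described above can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the frame energy is taken as the mean absolute amplitude, the sample before the first one is taken as zero in the filter, and `upper_scale` is a hypothetical tuning factor relating the upper energy threshold to the mean noise energy. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Frame energy, taken here as the mean absolute amplitude (an assumption). */
double frame_energy(const short *s, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += (s[i] < 0) ? -(double)s[i] : (double)s[i];
    return sum / (double)n;
}

/* sgn(s) = 1 for s >= 0 and -1 otherwise. */
static int sgn(short s) { return (s >= 0) ? 1 : -1; }

/* Frame zero crossing rate: each sign change contributes |2| / 2 = 1. */
double frame_zcr(const short *s, size_t n) {
    assert(n > 0);
    int crossings = 0;
    for (size_t j = 1; j < n; ++j)
        if (sgn(s[j]) != sgn(s[j - 1]))
            ++crossings;
    return (double)crossings / (double)n;
}

/* Thresholds from the first L frames, which are assumed to contain only
   noise. upper_scale is a hypothetical factor applied to the mean energy. */
void vad_thresholds(const double *energy, const double *zcr, size_t L,
                    double upper_scale, double *tu, double *tl, double *tz) {
    assert(L > 0);
    double e_sum = 0.0, z_sum = 0.0;
    for (size_t i = 0; i < L; ++i) {
        e_sum += energy[i];
        z_sum += zcr[i];
    }
    *tu = upper_scale * e_sum / (double)L;  /* energy upper threshold */
    *tl = 0.25 * (*tu);                     /* energy lower threshold */
    *tz = z_sum / (double)L;                /* ZCR threshold */
}

/* Pre-emphasis y(n) = s(n) - a * s(n-1), with s(-1) taken as 0. */
void pre_emphasis(const short *s, double *y, size_t n, double a) {
    double prev = 0.0;
    for (size_t i = 0; i < n; ++i) {
        y[i] = (double)s[i] - a * prev;
        prev = (double)s[i];
    }
}
```

In use, the energy and ZCR of the first 10 frames would be passed to `vad_thresholds`, and later frames would be compared against the resulting thresholds to locate the word endpoints.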


      3. Frame Blocking

        The samples of a word are partitioned into short blocks (frames), over which the signal can be treated as stationary, according to eqn (7):

        $$x_{i,j} = s\big((K - M)\,i + j\big), \quad j = 0, 1, \ldots, K-1; \; i = 0, 1, \ldots, L-1 \tag{7}$$

        where:

        $N$ = no. of samples per word

        $L = (N - M)/(K - M)$ = no. of frames per word

        $M$ = no. of overlapped samples per frame

        $K$ = no. of samples per frame

      4. Windowing

        Frame blocking with a rectangular window, as defined in eqn (7), leads to ringing in the frequency response, also known as the Gibbs phenomenon, as a result of the sharp frame edges. Hence, the Hamming window, which is a raised cosine window, is used to smooth the edges. The Hamming window is defined by eqn (8):

        $$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{8}$$

        where $w(n)$ is the Hamming window applied to the speech frame samples.

    2. Feature Extraction

      The characterisation of speech signal data for the purpose of simultaneously recognising what is said and who said it, that is, speech recognition and speaker verification respectively, is most efficiently handled by mel frequency cepstral coefficients, MFCC. This is because MFCC characterisation is very similar to that of human aural perception. Hence, MFCCs are used as feature vectors in this work for simultaneous speaker authentication and speech recognition [5,6,7,8,9].

      The process of obtaining the MFCCs involves transforming the windowed frame of speech data from the time domain to the frequency domain and then back to the time domain after processing. Firstly, the spectrum magnitude of the windowed speech signal data is obtained on a linear frequency scale by FFT. This magnitude is converted to a power magnitude, which is weighted by the magnitude frequency response of a filter bank on the mel-frequency scale. To convert the resulting mel-scaled power spectrum to the time domain, its logarithm is taken, followed by the inverse discrete cosine transform, DCT. The outputs of the inverse DCT are the mel frequency cepstral coefficients. The algorithm for obtaining the MFCCs is described in eqn (9) to eqn (13).

      Apply the DFT to each windowed speech frame:

      $$Y(n) = \sum_{k=0}^{N-1} w(k)\,x(k)\,e^{-j 2\pi k n / N}, \quad n = 0, 1, \ldots, N-1 \tag{9}$$

      Get the power spectrum of eqn (9):

      $$|Y(n)|^2 = \big(Y_{real}(n)\big)^2 + \big(Y_{imag}(n)\big)^2, \quad n = 0, 1, \ldots, N-1 \tag{10}$$

      Convert the power spectrum of eqn (10) from linear frequency to mel-scale frequency:

      $$P_{mel}(m) = \sum_{k=0}^{N-1} |Y(k)|^2\, H(k, m), \quad m = 1, 2, \ldots, L \tag{11}$$

      where $L$ = no. of mel filters, and

      $$H(k, m) = \begin{cases} 0, & f(k) < f_c(m-1) \\ \dfrac{f(k) - f_c(m-1)}{f_c(m) - f_c(m-1)}, & f_c(m-1) \le f(k) < f_c(m) \\ \dfrac{f(k) - f_c(m+1)}{f_c(m) - f_c(m+1)}, & f_c(m) \le f(k) < f_c(m+1) \\ 0, & f(k) \ge f_c(m+1) \end{cases}$$

      Here $f(k)$, $k = 1, 2, \ldots, N$, is the linear speech frequency of the $k$th spectral sample, spaced uniformly between $f_{min}$ and $f_{max}$, the minimum and maximum speech frequencies. $f_c(m)$, $m = 1, 2, \ldots, L$, is the centre frequency of the $m$th mel filter; the centres are spaced uniformly on the mel scale between the minimum and maximum mel frequencies, with a step of $(f^{mel}_{max} - f^{mel}_{min})/(L + 1)$.

      Obtain the logarithm of the mel-scale power spectrum:

      $$\hat{P}_{mel}(m) = \log_{10} P_{mel}(m), \quad m = 1, 2, \ldots, L \tag{12}$$

      Convert the logarithmic mel-frequency power spectrum to time-domain cepstral coefficients (MFCCs) using the inverse DCT:

      $$C_i = \sum_{j=1}^{L} \hat{P}_{mel}(j)\,\cos\!\left(\frac{\pi\, i\,(j - 0.5)}{M}\right), \quad i = 1, 2, \ldots, M \tag{13}$$

      where: $M$ = no. of coefficients, and $L$ = no. of mel filters.

    3. Vector Quantisation, VQ

      In speaker and speech recognition involving multiple utterances of words, it is usual to generate a very large number of feature vectors per word during the training phase. The total number of feature vectors can therefore easily become unmanageable, both in terms of storage as templates and in terms of matching computation. These two problems can render speech recognition in embedded applications unrealisable. Hence, it is imperative to use vector quantisation as a data compression method [3,4,5]. The VQ problem definition is:

      VQ Problem Definition:

      Given: $T = \{X_1, X_2, \ldots, X_N\}$ and the no. of desired codevectors $M$.

      Find: $C = \{c_1, c_2, \ldots, c_M\}$ and the codevectors' partition regions $P = \{s_1, s_2, \ldots, s_M\}$ such that the average distortion is minimised.

      This problem is solved, in our case, using the LBG-VQ algorithm of Fig. 2, whose steps are:

      step0: Codebook Initialisation

      -Input $N$, $M$, $X_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,k}\}$, $i = 1, 2, \ldots, N$; $k$ = dimension of feature vector

      (* $N$ = total no. of feature vectors in training set; $X$ = training set feature vectors; $M$ = no. of codevectors/vocabulary words *)

      -Calculate 1-codevector codebook:

      $$c_1 = \frac{1}{N}\sum_{i=1}^{N} x_i$$

      -Set $\varepsilon = 0.01$

      -Set $m = 1$

      -Set $n(h) = 0$, $h = 1, 2, \ldots, M$

      -Calculate distortion, $D$:

      $$D = \frac{1}{N k}\sum_{i=1}^{N}\sum_{j=1}^{k} \lVert x_{i,j} - c_{1,j} \rVert$$

      step1: Double Codebook Size by Splitting

      for $i = 1$ to $m$ do

      $$c_{m+i} = (1 - \varepsilon)\,c_i, \qquad c_i = (1 + \varepsilon)\,c_i, \quad i = 1, 2, \ldots, m$$

      $$m = 2m$$

      step2: Distribute feature vectors by clustering

      -Set $D' = D$

      for $i = 1$ to $N$ do

      ( for $j = 1$ to $M$ do

      ( $j^* = \arg\min_j \lVert x_i - c_j \rVert$ )

      $s_{j^*} \leftarrow x_i$;

      $n(j^*) = n(j^*) + 1$

      )

      step3: Update Centroids/Codevectors

      $$c_j = \frac{1}{n_j}\sum_{h=1}^{n_j} s_j(h), \quad j = 1, 2, \ldots, M$$

      step4: Calculate new Distortion

      $$D = \frac{1}{N M k}\sum_{i=1}^{N}\sum_{h=1}^{M}\sum_{j=1}^{k} \lVert x_{i,j} - c_{h,j} \rVert$$

      if $(D' - D)/D > \varepsilon$ then goto step2, otherwise goto step5

      step5: Repeat until desired number of codewords

      if ($m < M$) then goto step1, otherwise goto step6

      step6: Output codebook

      Output $c_j$, $j = 1, 2, \ldots, M$

      step7: stop

    4. Recognition

      The recognition phase is implemented with a simple Euclidean distance measure and an empirically determined threshold, as given in eqn (14):

      $$d = \min_{i} \lVert x - c_i \rVert_2^2, \quad i = 1, 2, \ldots, N \tag{14}$$

      if $d \le T$: Recognised, otherwise: Not Recognised.

  5. Experiment and Result

    Speech carries more information than what is said; it also carries information on the speaker, accent, gender, and age group. Hence, one single process suffices to authenticate the speaker and to recognise the word. Experiments were conducted in which 20 native Yorùbá speakers pronounced each of the 13 words in the vocabulary 10 times. These 2,600 words were all used for training. For the recognition phase, the 20 speakers that participated in the training each pronounced the telephone sentence (pè fnú + 11 digit phone number of their choice) and the word (gbé fnú) once in Yorùbá. Each word is sampled at 8000 samples per second, with each sample quantised to 16 bits. Fig.2. shows speech waveforms of a Nigerian mobile phone number 08034265239 pronounced as a sentence in Yorùbá. The pre-emphasis filter coefficient used is 0.97, with a frame window of 256 samples and an overlap of 128 samples. The MFCC feature extraction method is used with a bank of 20 mel filters and 16-dimensional feature vectors.

    There are 20 codebooks, one for each speaker. Each codebook has 13 codebooklets, one for each word of the vocabulary. A codebooklet contains 10 codevectors representing the 10 utterances per word per speaker. The codebooks, codebooklets, and codevectors are appropriately populated during training. An adaptive threshold is determined for each codevector, that is, each word utterance, during training. A simple Euclidean distance measure is used in matching a test utterance with the templates in the codebooks. The computed distance is compared with the stored thresholds to determine the speaker and to recognise the spoken sentence for auto-dialling of the GSM set. The system is coded in C and run on a 2.6 GHz dual-core Pentium PC with 2 GB of RAM on board. The PC has a GSM set and a multimedia headset attached to it. However, the final system will be an embedded front-end interfaced to a GSM set. The experiments yielded a 94% speaker recognition rate and an 82% phone sentence recognition rate.
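The LBG-VQ training steps of Section 4.3 and the thresholded Euclidean match of Section 4.4 can be sketched in C as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the distortion is taken as the mean squared distance of each training vector to its nearest codevector (the usual LBG choice), the target codebook size is assumed to be a power of two, and the caps `MAX_M` and `MAX_K`, the iteration limit, and all function names are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

#define MAX_M 64   /* hypothetical cap on codebook size */
#define MAX_K 32   /* hypothetical cap on feature dimension */

/* Squared Euclidean distance between two k-dimensional vectors. */
static double dist2(const double *a, const double *b, size_t k) {
    double d = 0.0;
    for (size_t j = 0; j < k; ++j)
        d += (a[j] - b[j]) * (a[j] - b[j]);
    return d;
}

/* Index of the codevector nearest to x; its squared distance goes to *best. */
size_t nearest(const double *x, const double *code, size_t m, size_t k,
               double *best) {
    size_t arg = 0;
    *best = DBL_MAX;
    for (size_t h = 0; h < m; ++h) {
        double d = dist2(x, code + h * k, k);
        if (d < *best) { *best = d; arg = h; }
    }
    return arg;
}

/* LBG training: x holds n vectors of dimension k (row-major); code must
   have room for m_target * k values. m_target must be a power of two. */
void lbg_train(const double *x, size_t n, size_t k, size_t m_target,
               double eps, double *code) {
    double sum[MAX_M * MAX_K];
    size_t count[MAX_M];
    assert(n > 0 && k > 0 && k <= MAX_K);
    assert(m_target >= 1 && m_target <= MAX_M);
    assert((m_target & (m_target - 1)) == 0);

    /* step0: the 1-codevector codebook is the mean of the training set */
    for (size_t j = 0; j < k; ++j) {
        code[j] = 0.0;
        for (size_t i = 0; i < n; ++i) code[j] += x[i * k + j];
        code[j] /= (double)n;
    }
    for (size_t m = 1; m < m_target; ) {
        /* step1: split every codevector into (1+eps)c and (1-eps)c */
        for (size_t i = 0; i < m * k; ++i) {
            code[m * k + i] = (1.0 - eps) * code[i];
            code[i] = (1.0 + eps) * code[i];
        }
        m *= 2;
        double prev = DBL_MAX;
        for (int iter = 0; iter < 100; ++iter) {  /* cap is a safeguard */
            /* step2: assign each vector to its nearest codevector */
            for (size_t h = 0; h < m; ++h) count[h] = 0;
            for (size_t i = 0; i < m * k; ++i) sum[i] = 0.0;
            double d_total = 0.0;
            for (size_t i = 0; i < n; ++i) {
                double d;
                size_t h = nearest(x + i * k, code, m, k, &d);
                for (size_t j = 0; j < k; ++j) sum[h * k + j] += x[i * k + j];
                ++count[h];
                d_total += d;
            }
            /* step3: move each non-empty codevector to its cluster mean */
            for (size_t h = 0; h < m; ++h)
                if (count[h] > 0)
                    for (size_t j = 0; j < k; ++j)
                        code[h * k + j] = sum[h * k + j] / (double)count[h];
            /* step4: stop once the relative drop in distortion is <= eps;
               the distortion is measured against the codevectors that
               were used for this pass's assignment */
            double dist = d_total / (double)(n * k);
            if (dist == 0.0 || (prev - dist) / dist <= eps) break;
            prev = dist;
        }
    }
}

/* Thresholded match: returns the nearest codevector index, or -1 when
   the squared distance exceeds the threshold t. */
int recognise(const double *x, const double *code, size_t m, size_t k,
              double t) {
    double d;
    size_t h = nearest(x, code, m, k, &d);
    return (d <= t) ? (int)h : -1;
}
```

For the configuration reported here, `k` would be 16 and `t` would be a per-word threshold determined during training. Note that the multiplicative split leaves a codevector at the origin unchanged, so training data centred on zero would need an additive perturbation instead.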

  6. Conclusion

    A user-friendly human-computer interface based on speech recognition for telephone auto-dialling in Yorùbá was developed. The speech recognition algorithm was coded in C and run on a 2.6 GHz dual-core Pentium PC with 2 GB of RAM, with a GSM set and a multimedia headset attached to the PC. The experiments yielded a 94% speaker recognition rate and an 82% phone sentence recognition rate. Though the system was developed on a PC, the target is an embedded front-end unit interfaced to a GSM set.

  7. Acknowledgement

    We acknowledge with great appreciation the generous research and development grant received from the Federal Government of Nigeria through the STEP-B project to execute this work.

  8. References

  1. Lipeika Antanas, Lipeikiene Joana, Telksnys Laimutis, Development of Isolated Word Speech Recognition, Informatica, vol. 13, no. 1, 2002, pp. 37-46

  2. E-Hocine Bourouba, et al, Isolated Words Recognition System Based on Hybrid Approach DTW/GHMM, Informatica 30, 2006, pp. 373-384

  3. Allam Musa, MareText Independent Speaker Identification based on K-Means Algorithm, International Journal on Electrical Engineering and Informatics, vol. 3, no. 1, 2011, pp. 100-108

  4. Srinivasan A., Speaker Identification and Verification using Vector Quantisation and Mel Frequency Cepstral Coefficients, Research Journal of Applied Sciences, Engineering and Technology, vol. 4, no. 1, 2012, pp. 33-40

  5. Satyahad Singh, and Rajan E.G., MFCC VQ based Speaker Recognition and its Accuracy Affecting Factors, International Journal of Computer Applications, vol. 21, no. 6, 2011, pp. 1-6

  6. Kekre H.B., and Vaishali Kulkarni, Performance Comparison of Speaker Recognition using Vector Quantization by LBG and KFCG, International Journal of applications, vol. 3, no. 10, 2010, pp. 32-37

  7. Rashidul Hasan, Mustafa Jamil, Golam Rabbani, Saifur Rahman, Speaker Identification using Mel Frequency Cepstral Coefficients, Proc. 3rd International Conference on Electrical and Computer Engineering, ICECE 2004, 28-30 December, Dhaka, Bangladesh, 2004, pp. 565-568

  8. Linde Y., Buzo A., Gray R.M., An Algorithm for Vector Quantiser Design, IEEE Trans. on Communications, vol. COM-28, no. 1, 1980, pp. 84-95

  9. Wael Al-Sawalmeh, Khaled Daqrouq, Omar Daoud, Abdel-Rahman Al-Qawasmi, Speaker Identification System based Mel Frequency and Wavelet Transform using Neural Network Classifier, European Journal of Scientific Research, vol. 41, no. 4, 2010, pp. 515-525

  10. Srinivassan A., Speech Recognition using Hidden Markov Model, Applied Mathematical Science, vol. 5, no. 79, 2011, pp. 3943-3948
