Single Channel Speech Source Separation using Complex Matrix Factorization


P Naga Yamini

Reg.1116468

Final Year M.Tech, Signal Processing

Sri Venkateswara University College of Engineering Tirupati- 517502

Prof. Dr. T.Sreenivasulu Reddy

Dept. Of E.C.E,

Sri Venkateswara University College of Engineering Tirupati – 517502

Abstract

Single channel speech source separation has been a challenging area of research in signal processing. Unlike many conventional methods of speech source separation that use information from multiple sensors, single channel source separation deals with separating sources from the output of only one sensor. Conventional non-negative matrix factorization (NMF) methods use only the spectral magnitude and ignore the spectral phase of the individual sources in the separation process. In this paper, we investigate single channel speech source separation under a linear instantaneous mixing model. A novel method of complex matrix factorization (CMF) that decomposes a complex spectrogram matrix into a complex base matrix and a real coefficient matrix is developed. The spectral phase of the sources is therefore incorporated in the process of source separation.

The proposed separation method comprises a learning stage followed by a separation stage. In the learning stage, basis vectors are obtained from the training data of the speakers taken from the corpus and a dictionary is created. In the separation stage, the dictionary is used to estimate the weights through complex matrix factorization in order to separate the sources. Experiments are performed using different mixture signals at various Signal-to-Signal Ratios (SSR) to evaluate the performance of the proposed method. The accuracy of source separation is measured using the objective measures Log Likelihood Ratio (LLR) and Weighted Slope Spectral Distance (WSS). The proposed CMF method exhibits reasonably better performance than other NMF methods and their variants available in the literature.

Single channel source separation (SCSS) is the extreme case of under-determined source separation, where only one mixture signal of two or more source signals is available. In many practical applications, only one observation is available from the hardware, and in such cases conventional source separation techniques that require more than one observation are not appropriate. Hence the problem of single channel source separation has attracted wide interest.

  1. PROBLEM FORMULATION

    Let z[n] be the mixed signal, which comprises two speakers z1[n] and z2[n]. Single channel speech source separation aims at obtaining estimates of z1[n] and z2[n] using the single mixture signal z[n]:

    $$z[n] = z_1[n] + z_2[n]$$

    We convert the time domain speech signal into the frequency domain by computing the short-time Fourier transform (STFT). Let $Z(k,\omega)$, $Z_1(k,\omega)$ and $Z_2(k,\omega)$ denote the STFT of z[n], z1[n] and z2[n] respectively, where $k$ is the time frame index and $\omega$ is the frequency bin index in the STFT domain. Therefore we can write

    $$Z(k,\omega) = Z_1(k,\omega) + Z_2(k,\omega)$$

    or, in polar form,

    $$|Z(k,\omega)|\, e^{j\angle Z(k,\omega)} = |Z_1(k,\omega)|\, e^{j\angle Z_1(k,\omega)} + |Z_2(k,\omega)|\, e^{j\angle Z_2(k,\omega)}$$
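The additive mixing model in the STFT domain can be checked numerically. The sketch below is only an illustration: the `stft` helper, its window length, hop size and the random stand-in signals are assumptions made for the example, not the settings used in the paper. It shows that the complex STFTs add exactly, whereas the magnitudes alone do not, which is the motivation for keeping the phase.

```python
import numpy as np

def stft(x, win_len=256, hop=128, n_fft=256):
    """Minimal STFT (illustrative helper): Hann-windowed frames + real FFT.
    Returns a complex matrix of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, n=n_fft, axis=0)

rng = np.random.default_rng(0)
z1 = rng.standard_normal(4000)   # stand-in for speaker 1 (not real speech)
z2 = rng.standard_normal(4000)   # stand-in for speaker 2 (not real speech)
z = z1 + z2                      # linear instantaneous mixture

Z, Z1, Z2 = stft(z), stft(z1), stft(z2)

# The complex spectrograms add (up to floating-point rounding) ...
complex_gap = np.max(np.abs(Z - (Z1 + Z2)))
# ... but the magnitude spectrograms do not, because the phases differ.
magnitude_gap = np.max(np.abs(np.abs(Z) - (np.abs(Z1) + np.abs(Z2))))

print(f"max |Z - (Z1 + Z2)|       = {complex_gap:.2e}")
print(f"max ||Z| - (|Z1| + |Z2|)| = {magnitude_gap:.2e}")
```

Here the rows of the matrices index frequency bins and the columns index time frames; this layout is a choice of the sketch.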

    1. INTRODUCTION

      Source separation has been a topic of investigation for over two decades. The problem of source separation refers to separating the individual sources underlying a mixture of more than one source. A classical example is the cocktail party problem, in which a person is able to focus on a single conversation while surrounded by a number of separate conversations. Separation can be classified as blind or non-blind. When no prior information about the sources is known, the problem is one of blind source separation. In contrast, non-blind or supervised separation methods use prior information about the sources to train the separation model. Source separation methods can also be classified as over-determined or under-determined, based on the number of sensors and sources. In the over-determined case, the number of sensors is greater than the number of sources, and vice versa in the under-determined case.


      Various methods in the literature use training data to construct a set of basis vectors for all the sources present in the mixture. With the pre-learnt bases, weights are estimated for a mixture using matrix factorization methods, from which the sources can be separated. From each signal in the training set of clean speech, we extract a set of basis vectors Xtrain, which can be used in the separation process to calculate the weights. Both learning the basis vectors and estimating the weights require complex matrix factorization. So the problem reduces to finding an accurate technique to estimate the complex bases Xtrain and the corresponding weights Hi such that

      $$Z_i \approx X_{\text{train}} H_i$$

  2. COMPLEX MATRIX FACTORIZATION APPROACH TO THE SPEECH SOURCE SEPARATION

    Non-negative matrix factorization (NMF) is a linear basis decomposition technique subject to non-negativity constraints on the data. It decomposes a non-negative matrix Z into the product of two matrices X and H, constrained such that all the elements of X and H are non-negative. The matrix X is termed the basis vector matrix, H is termed the weight matrix or coefficient matrix, and E is the residual matrix, i.e. the error in the approximation.

    $$Z = XH + E \approx XH$$

    A cost function C measures the divergence between Z and XH. The factorization can be expressed as

    $$(X, H) = \arg\min_{X, H \ge 0} C(Z; X, H)$$
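As an illustration of the NMF problem stated above, the following sketch uses the multiplicative update rules of Lee and Seung [1] for the squared Euclidean cost. The function name `nmf`, the rank, the iteration count, the small constant `eps` and the synthetic test matrix are assumptions of this example rather than settings from the paper.

```python
import numpy as np

def nmf(Z, rank, n_iter=200, eps=1e-12, seed=0):
    """Illustrative NMF: Z (non-negative, n x m) ~ X @ H with X, H >= 0.
    Uses the Lee-Seung multiplicative updates for ||Z - XH||^2."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    X = rng.random((n, rank)) + eps   # basis vector matrix
    H = rng.random((rank, m)) + eps   # weight (coefficient) matrix
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (X.T @ Z) / (X.T @ X @ H + eps)
        X *= (Z @ H.T) / (X @ H @ H.T + eps)
        errors.append(np.linalg.norm(Z - X @ H) ** 2)
    return X, H, errors

# Synthetic non-negative matrix of exact rank 5 (toy data, not a spectrogram).
rng = np.random.default_rng(1)
Z = rng.random((20, 5)) @ rng.random((5, 30))

X, H, errors = nmf(Z, rank=5)
print(f"squared error after first iteration: {errors[0]:.4f}")
print(f"squared error after last iteration:  {errors[-1]:.4f}")
```

Because each update multiplies by a ratio of non-negative quantities, non-negativity of X and H is preserved without an explicit projection step.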

    NMF is widely used for speech source separation. Decomposition of a mixed signal into basis vectors and estimation of the corresponding weights is known to work well for single channel mixtures. In general, NMF based separation assumes that the phase of each source signal is either equal to that of the mixed signal or constant. In the subsequent sections we propose a complex matrix factorization based source separation method that also includes the phase information in the separation process, thereby making the separation more accurate.

    A. Proposed Complex Matrix Factorization Method

    Consider Z to be a complex matrix which needs to be factorized into the product of a basis vector matrix X and a weight matrix H,

    $$Z = XH$$

    where the base matrix X is complex and the weight matrix H is real.

    The minimization problem is stated as

    $$(X, H) = \arg\min_{X, H} C(Z; X, H)$$

    If the cost function used is the squared Euclidean distance, then

    $$(X, H) = \arg\min_{X, H} \|Z - XH\|^2 \qquad (3.0)$$

    If the cost function used is the KL-divergence, then

    $$(X, H) = \arg\min_{X, H} D_{KL}(Z \,\|\, XH)$$

    Let $\hat{Z} = XH$ be the approximated matrix. This factorization problem is a complex matrix factorization problem. We apply a simple transformation to convert the complex matrix factorization problem into a non-negative matrix factorization problem and solve the task of source separation in the NMF framework. The transformation is given as follows:

    $$Z = Z_r^+ - Z_r^- + j\,(Z_i^+ - Z_i^-) \qquad (3.1)$$

    where

    $$Z_r^+ = \max(0, \mathrm{real}(Z)), \qquad Z_r^- = -\min(0, \mathrm{real}(Z)),$$

    $$Z_i^+ = \max(0, \mathrm{imag}(Z)), \qquad Z_i^- = -\min(0, \mathrm{imag}(Z)).$$

    Here max, min, real and imag denote element-wise functions operating on matrices, which calculate the maximum, minimum, real part and imaginary part of each element of the matrix. The complex matrices Z and X are separated using the transformation defined in Eq. 3.1. Hence,

    $$Z = Z_r^+ - Z_r^- + j\,(Z_i^+ - Z_i^-) \qquad (3.2)$$

    $$X = X_r^+ - X_r^- + j\,(X_i^+ - X_i^-) \qquad (3.3)$$

    The weight matrix H is separated as

    $$H = H^+ - H^- \qquad (3.4)$$

    where

    $$H^+ = \max(0, \mathrm{real}(H)), \qquad H^- = -\min(0, \mathrm{real}(H)).$$

    Through the transformations defined above, we decompose a complex matrix into non-negative matrices: $Z_r^+, Z_r^-, Z_i^+, Z_i^-, X_r^+, X_r^-, X_i^+, X_i^-, H^+, H^-$ are the non-negative matrices obtained using the above transformations. Using equations 3.2, 3.3 and 3.4 and substituting in $Z = XH$, we get

    $$\hat{Z}_r^+ = X_r^+ H^+ + X_r^- H^-$$

    $$\hat{Z}_r^- = X_r^+ H^- + X_r^- H^+$$

    $$\hat{Z}_i^+ = X_i^+ H^+ + X_i^- H^-$$

    $$\hat{Z}_i^- = X_i^+ H^- + X_i^- H^+$$

    Let

    $$\hat{Z}_1 = \hat{Z}_r^+, \quad \hat{Z}_2 = \hat{Z}_r^-, \quad \hat{Z}_3 = \hat{Z}_i^+, \quad \hat{Z}_4 = \hat{Z}_i^-$$

    Also, for convenience, let

    $$Z_1 = Z_r^+, \quad Z_2 = Z_r^-, \quad Z_3 = Z_i^+, \quad Z_4 = Z_i^-$$

    With the notation described above, we convert complex matrix factorization into non-negative matrix factorization. On applying the triangle inequality to equation 3.0, we obtain

    $$\min_{X,H} \|Z - XH\|^2 \le \min_{X,H} \sum_{k=1}^{4} \|Z_k - \hat{Z}_k\|^2 \qquad (3.5)$$

    We know that $Z_k$, $\hat{Z}_k$ are independent of each other. Hence equation 3.5 becomes

    $$\min_{X,H} \|Z - XH\|^2 \le \sum_{k=1}^{4} \min_{X,H} \|Z_k - \hat{Z}_k\|^2 \qquad (3.6)$$

    Hence the problem now reduces to

    $$\min_{X,H} \|Z_k - \hat{Z}_k\|^2 \quad \text{for all } k \in \{1, 2, 3, 4\}.$$
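The transformation into non-negative parts can be illustrated with a few lines of NumPy. In this sketch the helper name `split_nonneg` and the random test matrices are assumptions for illustration; the negative parts are taken as max(0, -x) so that every piece is non-negative. It checks that the four non-negative products recombine to the complex product XH.

```python
import numpy as np

def split_nonneg(A):
    """Return (A_r^+, A_r^-, A_i^+, A_i^-), all non-negative, such that
    A = A_r^+ - A_r^- + 1j * (A_i^+ - A_i^-)."""
    re, im = np.real(A), np.imag(A)
    return (np.maximum(0, re), np.maximum(0, -re),
            np.maximum(0, im), np.maximum(0, -im))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3)) + 1j * rng.standard_normal((6, 3))  # complex basis
H = rng.standard_normal((3, 8))                                     # real weights
Z = X @ H

Xrp, Xrm, Xip, Xim = split_nonneg(X)
Hp, Hm, _, _ = split_nonneg(H)       # H is real, so its imaginary parts are zero

# The four non-negative products obtained by substituting the splits into XH.
Zhat1 = Xrp @ Hp + Xrm @ Hm          # positive real part
Zhat2 = Xrp @ Hm + Xrm @ Hp          # negative real part
Zhat3 = Xip @ Hp + Xim @ Hm          # positive imaginary part
Zhat4 = Xip @ Hm + Xim @ Hp          # negative imaginary part

Z_rebuilt = (Zhat1 - Zhat2) + 1j * (Zhat3 - Zhat4)
print("recombined product matches X @ H:", np.allclose(Z_rebuilt, Z))
```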

    In fact the term on the R.H.S of the equation 3.6 represents the upper bound to the solution of the optimization problem in equation 3.0. Therefore convergence of R.H.S of the equation 3.6 guarantees convergence of the cost function in equation 3.0.
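One way to make the upper bound explicit is sketched below, assuming that the norm is the Frobenius norm and writing $E_k = Z_k - \hat{Z}_k$ for the four real residual matrices (a symbol introduced here only for the derivation). Worked out this way, the bound carries a constant factor of 2, which does not change the minimizers.

```latex
\begin{align*}
Z - XH &= (E_1 - E_2) + j\,(E_3 - E_4), \\
\|Z - XH\|_F^2 &= \|E_1 - E_2\|_F^2 + \|E_3 - E_4\|_F^2 \\
 &\le \bigl(\|E_1\|_F + \|E_2\|_F\bigr)^2 + \bigl(\|E_3\|_F + \|E_4\|_F\bigr)^2
   && \text{(triangle inequality)} \\
 &\le 2 \sum_{k=1}^{4} \|E_k\|_F^2
   && \text{since } (a + b)^2 \le 2\,(a^2 + b^2).
\end{align*}
```

Driving all four residuals to zero therefore drives the complex approximation error to zero as well.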

    Sequential solving of these optimization problems will lead to a bias towards the first optimization problem. Hence we combine the sub-matrices into a single matrix and solve them concurrently. This is shown as follows:

    $$\begin{bmatrix} Z_1 & Z_2 \\ Z_3 & Z_4 \end{bmatrix} = \begin{bmatrix} X_1 & X_2 \\ X_3 & X_4 \end{bmatrix} \begin{bmatrix} H_1 & H_2 \\ H_3 & H_4 \end{bmatrix}$$

    We denote the matrix on the L.H.S. as Zc and the matrices on the right side as Xc and Hc respectively. Also, X1 = X+r, X2 = X-r, X3 = X+i, X4 = X-i. Hence we have

    $$Z_c = X_c H_c$$

    As H1 = H4 and H2 = H3, we perform an update after every NMF iteration as shown below.

    $$H_1, H_4 \leftarrow \frac{H_1 + H_4}{2} \quad \text{and} \quad H_2, H_3 \leftarrow \frac{H_2 + H_3}{2}$$

    Hence the CMF problem is reduced to an NMF problem of the form

    $$\min_{X_c, H_c} \|Z_c - X_c H_c\|^2$$

    where Zc, Xc and Hc are non-negative matrices. Using the transformations, we have converted a complex matrix factorization problem into a non-negative matrix factorization problem. The approach described above is illustrated in figure 4.1.

    Fig. CMF Approach for joint modeling of Magnitude and Phase

    1. Algorithm for computing X and H using CMF method

      1: Input : Complex matrix Z and its transformed matrix Zc.

      2: Initialization : The matrices X+r, X-r, X+i, X-i, H+ and H- are assigned random non-negative values.

      3: Rearrange the elements of these sub-matrices to form Xc and Hc as shown.

      4: Alternating multiplicative updates : Update the elements of the sub-matrices Xc and Hc by using the alternating multiplicative updates of NMF.

      5: Update of weight matrix : H1, H4 ← (H1 + H4)/2 and H2, H3 ← (H2 + H3)/2.

      6: Repeat : Steps 4 to 5 for a number of iterations to minimize the error between Zc and XcHc.

      7: Termination : Reconstruct the original matrices X and H from Xc and Hc by doing X = (X1 - X2) + j(X3 - X4) and H = H1 - H2.

      8: Output : Complex matrix X and real matrix H.

    2. Dictionary Learning

      In order to estimate z1[n] and z2[n], we first form overcomplete dictionaries that represent the basis vectors of the speech of both speakers. After extracting the basis vectors, we form the dictionary by concatenating the basis vectors of both speakers. Then we decompose the mixed signal using this dictionary and calculate the corresponding weight matrix. The following algorithm describes the process of dictionary learning for two speakers in one mixed signal.

    3. Algorithm for Dictionary Learning

      1: Input : Clean speech of the speakers, z1[n] and z2[n].

      2: Short Time Fourier Transform (STFT) : Calculate Short Time Fourier Transform for each signal in the training data, say Z, which is a complex matrix.

      3: Decomposition using transformations : Find the matrix Zc from the complex STFT matrix Z using the transformations shown in figure 4.1.

      4: Random initialization of basis vector matrix and weight matrix : The matrices X+r, X-r, X+i, X-i, H+ and H- are assigned random non-negative values.

      5: Rearrange the elements of these sub-matrices to form Xc and Hc.

      6: Update the elements of the sub-matrices Xc and Hc using the multiplicative update rules.

      7: Obtain the final non-negative matrices Xc and Hc.

      8: Basis matrix extraction : The basis vector matrices Xc of all the speech signals of the same speaker are concatenated together to form the overall base matrix of that speaker.

      9: Forming the dictionary : All the base matrices obtained in the previous step for all the speakers are concatenated together to form the overall dictionary X.

      10: Output : Overall dictionary matrix X, which is used in the separation process.
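The CMF updates and the dictionary learning steps above can be sketched together as follows. This is a minimal illustration under several assumptions: the function names (`split_nonneg`, `cmf`, `learn_dictionary`), the block layout of Zc, Xc and Hc, the squared Euclidean multiplicative updates, the small constant `eps`, and the random matrices standing in for training spectrograms are all choices made for the example. In particular, the sketch concatenates the reconstructed complex bases X column-wise, which is one possible reading of the concatenation step.

```python
import numpy as np

def split_nonneg(A):
    """Non-negative parts (A_r^+, A_r^-, A_i^+, A_i^-) of a matrix A."""
    re, im = np.real(A), np.imag(A)
    return (np.maximum(0, re), np.maximum(0, -re),
            np.maximum(0, im), np.maximum(0, -im))

def cmf(Z, rank, n_iter=200, eps=1e-12, seed=0):
    """Illustrative CMF: complex Z (n x m) ~ X @ H, X complex, H real."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    Z1, Z2, Z3, Z4 = split_nonneg(Z)
    Zc = np.block([[Z1, Z2], [Z3, Z4]])              # non-negative, 2n x 2m
    Xc = rng.random((2 * n, 2 * rank)) + eps         # blocks [[X1, X2], [X3, X4]]
    H1 = rng.random((rank, m)) + eps
    H2 = rng.random((rank, m)) + eps
    Hc = np.block([[H1, H2], [H2, H1]])              # tied blocks: H1 = H4, H2 = H3
    for _ in range(n_iter):
        # Alternating multiplicative NMF updates on the stacked matrices.
        Hc *= (Xc.T @ Zc) / (Xc.T @ Xc @ Hc + eps)
        Xc *= (Zc @ Hc.T) / (Xc @ Hc @ Hc.T + eps)
        # Re-impose the ties by averaging the paired weight blocks.
        H1 = 0.5 * (Hc[:rank, :m] + Hc[rank:, m:])
        H2 = 0.5 * (Hc[:rank, m:] + Hc[rank:, :m])
        Hc = np.block([[H1, H2], [H2, H1]])
    # Rebuild the complex basis and the real weights from the blocks.
    X = (Xc[:n, :rank] - Xc[:n, rank:]) + 1j * (Xc[n:, :rank] - Xc[n:, rank:])
    return X, H1 - H2

def learn_dictionary(training_stfts, rank):
    """training_stfts: {speaker: [complex STFT matrices]}.
    Returns the overall dictionary and the column range of each speaker."""
    blocks, columns, start = [], {}, 0
    for speaker, stfts in training_stfts.items():
        bases = [cmf(Z, rank, seed=i)[0] for i, Z in enumerate(stfts)]
        X_speaker = np.hstack(bases)                 # per-speaker base matrix
        blocks.append(X_speaker)
        columns[speaker] = slice(start, start + X_speaker.shape[1])
        start += X_speaker.shape[1]
    return np.hstack(blocks), columns

# Toy stand-ins for training spectrograms (random complex matrices, not speech).
rng = np.random.default_rng(0)
def fake_stft(n_bins=17, n_frames=12):
    return rng.standard_normal((n_bins, n_frames)) + 1j * rng.standard_normal((n_bins, n_frames))

training = {"speaker1": [fake_stft() for _ in range(3)],
            "speaker2": [fake_stft() for _ in range(3)]}
X_dict, columns = learn_dictionary(training, rank=2)
print("dictionary shape:", X_dict.shape)
print("columns per speaker:", columns)
```

Keeping track of which dictionary columns belong to which speaker is what later allows the weight matrix to be split into per-speaker sub-matrices.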

    4. Decomposition of the mixed signal

    Using CMF, the spectrogram (STFT) of the mixture signal Zmixture is decomposed into a product of a complex matrix X and a real coefficient matrix H following Algorithm A (the CMF algorithm for computing X and H), but with a fixed basis matrix obtained from dictionary learning:

    $$Z_{\text{mixture}} = [(X_{\text{basis}})_1 \;\; (X_{\text{basis}})_2]\, H$$

    The same equation after the transformations turns out to be

    $$Z_{c,\text{mixture}} = [(X_{c,\text{basis}})_1 \;\; (X_{c,\text{basis}})_2]\, H_c$$

    where (Xcbasis)1 and (Xcbasis)2 are the bases obtained from dictionary learning. They are fixed during the decomposition of the mixture spectrogram. Only the coefficient matrix is updated, using the update rule defined in Algorithm A. The basis matrix is kept fixed and H is initialised with random positive noise. The decomposition is run for a fixed number of iterations or until the convergence criterion is satisfied. After the decomposition, the spectrograms of the sources are estimated by converting Xc and Hc into X and H respectively by

    $$X = (X_1 - X_2) + j\,(X_3 - X_4), \qquad H = H_1 - H_2$$

    E. Reconstruction of source signals

    In the previous section, we decomposed a complex STFT matrix into a product of a complex base matrix and a real coefficient matrix. From the decomposed matrices we now need to estimate the complex STFT of the individual sources. The weight matrix H can be split into two sub-matrices H1 and H2, each containing the coefficients for a particular speaker in the mixture. Hence, the estimated complex STFTs of the source signals in the mixture are given as

    $$\hat{Z}_1 = (X_{\text{basis}})_1 H_1$$

    $$\hat{Z}_2 = (X_{\text{basis}})_2 H_2$$

    where (Xbasis)1 and (Xbasis)2 are the base matrices of speaker 1 and speaker 2, extracted using Algorithm C (the dictionary learning algorithm), and $\hat{Z}_1$, $\hat{Z}_2$ are the complex STFTs of the sources present in the mixture. To reconstruct the signals in the time domain, take the inverse short-time Fourier transform (ISTFT) of $\hat{Z}_1$ and $\hat{Z}_2$ to estimate the time domain source signals.

  3. PERFORMANCE EVALUATION

    1. Database

      For all the experiments, the audio signals are taken from the GRID corpus [6], an audio-visual corpus used in speech perception and automatic speech recognition studies. The corpus consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers (18 male and 16 female). Each utterance is a sentence of the form "bin blue at L 6 please", which follows the structure <command:1><color:1><preposition:1><letter:L><digit:6><adverb:3>.

      The data set consists of single channel speech signals in .wav format with a sampling frequency of 25 kHz. The corpus also contains video files of the speakers, but since our interest is only in speech source separation, we have taken only the audio files. The complete corpus and transcriptions are freely available for research use.

    2. Evaluation Criteria

    To measure the performance of the proposed method for single channel speech source separation, both subjective and objective quality measures exist in the literature. Subjective measures are obtained by collecting ratings from a group of listeners, so evaluation relies on the human auditory system and its perception of the reconstructed audio. The main disadvantage of subjective measures is that they are time-consuming and expensive, although they have high validity. The second approach is to evaluate objective measures, which compute a metric of speech quality between the reconstructed speech signal and the original reference signal using mathematical techniques. In our experiments we use objective measures for the performance analysis.

    1. Objective Evaluation Methods

      In objective evaluation, we assess speech quality from physical parameters extracted from the reconstructed speech signals. The reference speech signals used to form the mixed signals are compared with the reconstructed speech signals after source separation. Objective measures quantify the improvement in speech quality before and after separation of the source signals. In this section we describe the objective measures used to compare the quality of source separation. We used the Log Likelihood Ratio (LLR), a linear prediction coefficient based measure, and the Weighted Slope Spectral Distance (WSS) for the performance analysis of our separation process. The lower the values of LLR and WSS, the better the performance.
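As an illustration of the LLR measure, the sketch below computes it for a single frame using the form commonly given in the speech quality literature (see e.g. [8]): the log of the ratio between the prediction error energy of the processed-signal LPC vector and that of the clean-signal LPC vector, both evaluated on the clean autocorrelation matrix. The LPC order, the absence of windowing, the helper names and the synthetic test signals are assumptions of this example, not the evaluation setup of the paper.

```python
import numpy as np

def lpc_autocorr(frame, order):
    """Autocorrelation-method LPC (illustrative helper).
    Returns a = [1, a_1, ..., a_p], the minimizer of a @ R @ a with a[0] = 1,
    and the (order + 1) x (order + 1) Toeplitz autocorrelation matrix R."""
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    idx = np.arange(order + 1)
    R = r[np.abs(idx[:, None] - idx[None, :])]
    a_tail = np.linalg.solve(R[1:, 1:], -r[1:])      # normal equations
    return np.concatenate(([1.0], a_tail)), R

def llr(clean_frame, processed_frame, order=10):
    """Log likelihood ratio between a clean and a processed frame."""
    a_c, R_c = lpc_autocorr(clean_frame, order)
    a_p, _ = lpc_autocorr(processed_frame, order)
    return float(np.log((a_p @ R_c @ a_p) / (a_c @ R_c @ a_c)))

# Synthetic test signals: a first-order autoregressive "clean" frame and a
# noisy copy standing in for an imperfectly separated source.
rng = np.random.default_rng(0)
excitation = rng.standard_normal(800)
clean = np.zeros(800)
for t in range(1, 800):
    clean[t] = 0.9 * clean[t - 1] + excitation[t]
noisy = clean + rng.standard_normal(800)

llr_same = llr(clean, clean)     # identical frames give zero
llr_noisy = llr(clean, noisy)    # a degraded frame gives a positive value
print(f"LLR(clean, clean) = {llr_same:.6f}")
print(f"LLR(clean, noisy) = {llr_noisy:.6f}")
```

Since the clean LPC vector minimizes the prediction error on the clean autocorrelation matrix, the ratio is never below one, so the LLR is non-negative and smaller values indicate a closer spectral envelope.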

      Experiments are performed using audio files taken from the GRID database described above. Audio files of two speakers, one male and one female, are taken from the corpus. The dataset is divided into two parts, a training dataset and a testing dataset. We used 300 speech audio files of both speakers for training the bases. 10 mixture signals from the same speakers at signal-to-signal ratios (SSR) of 0 dB, 4 dB, 8 dB and 10 dB are used for testing the accuracy of separation. From each signal in the training data set, 10 basis vectors are extracted. This number is variable and set by the user. Separation is done by the non-negative matrix factorization method and the complex matrix factorization method. Results obtained through the objective measures are presented in the following sections. The results are compared with existing source separation techniques based on non-negative matrix factorization and tabulated. In all the experiments, we fixed the window size to 800 samples (32 ms at 25 kHz) with 50% overlap between adjacent frames, and we take a 512 point DFT for each frame. Experiments are carried out for parameter values from 0.1 to 1 in steps of 0.1 and for values of 2, 5 and 10. We used randomly generated mixture signals of different signal-to-signal ratios (SSR) from the GRID corpus. Experiments are also conducted in which the parameter is varied at every iteration.
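A mixture at a prescribed SSR can be formed by scaling one source before adding it to the other. The sketch below assumes that SSR is defined as the power ratio of source 1 to source 2 in decibels; the function name `mix_at_ssr` and the random stand-in signals are illustrative. It also confirms the frame duration quoted above.

```python
import numpy as np

def mix_at_ssr(s1, s2, ssr_db):
    """Scale s2 so that 10 * log10(P1 / P2) equals ssr_db, then add it to s1.
    Returns the mixture and the gain applied to s2."""
    p1, p2 = np.mean(s1 ** 2), np.mean(s2 ** 2)
    gain = np.sqrt(p1 / (p2 * 10.0 ** (ssr_db / 10.0)))
    return s1 + gain * s2, gain

rng = np.random.default_rng(0)
s1 = rng.standard_normal(25000)          # stand-in for speaker 1 (1 s at 25 kHz)
s2 = 0.3 * rng.standard_normal(25000)    # stand-in for speaker 2

achieved = {}
for ssr_db in (0, 4, 8, 10):
    mixture, gain = mix_at_ssr(s1, s2, ssr_db)
    achieved[ssr_db] = 10 * np.log10(np.mean(s1 ** 2) / np.mean((gain * s2) ** 2))
    print(f"target {ssr_db:2d} dB -> achieved {achieved[ssr_db]:.3f} dB")

# Frame settings quoted in the text: 800-sample window at 25 kHz, 50% overlap.
win_ms = 1000.0 * 800 / 25000            # window duration in milliseconds
hop = 800 // 2                           # hop size in samples
print(f"window = {win_ms} ms, hop = {hop} samples")
```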

    2. Results Obtained using Log Likelihood Ratio (LLR)

      The lower the value of LLR, the better the performance. As observed in table 5.2, the LLR values are lower for our CMF method than for NMF.

      Table 5.2: Objective measures results using LLR at different SSRs

      Figure 5.1: LLR of speaker 1 at different Signal-to-signal ratios (SSR)

      Figure 5.2: LLR of speaker 2 at different Signal-to-signal ratios (SSR)

    3. Results Obtained using Weighted Slope Spectral Distance (WSS)

      The lower the value of WSS, the better the performance. As observed in table 5.3, the WSS values are lower for our CMF method than for NMF.

      Table 5.3: Objective measures results using WSS at different SSRs

      Figure 5.3: WSS of speaker 1 at different Signal-to-signal ratios (SSR)

      Figure 5.4: WSS of speaker 2 at different Signal-to-signal ratios (SSR)

  4. CONCLUSION

This paper proposed a novel method for single channel speech source separation using complex matrix factorization. In CMF, we decomposed a complex matrix factorization problem into a non-negative matrix factorization problem and were hence able to model both the magnitude and the phase of the short-time Fourier transform (STFT) of the speech waveforms. This method therefore overcomes the disadvantage of traditional source separation techniques based on non-negative matrix factorization, which use only magnitude information and neglect the phase information. The detailed procedure for converting a CMF problem into an NMF problem is discussed in the preceding sections. All the experiments are done using the GRID corpus. The proposed technique is compared with existing techniques that operate only on the magnitude and ignore the phase information, and lower LLR and WSS values are observed for the proposed method. This work can be further extended by considering noisy mixture signals recorded in reverberant surroundings, so that the proposed method can be applied in real-time scenarios, where it is expected to give better results for single channel source separation.

REFERENCES

    1. D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in NIPS, pp. 556-562, MIT Press, 2000.

    2. M. N. Schmidt and R. K. Olsson, "Single-channel speech separation using sparse non-negative matrix factorization," in International Conference on Spoken Language Processing (INTERSPEECH), 2006.

    3. E. M. Grais and H. Erdogan, "Single channel speech music separation using nonnegative matrix factorization with sliding windows and spectral masks," in INTERSPEECH, pp. 1773-1776, ISCA, 2011.

    4. J. Eggert and E. Korner, "Sparse coding and NMF," in Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, vol. 4, pp. 2529-2533, 2004.

    5. Z. Tang, S. Ding, Z. Li, and L. Jiang, "Dictionary learning based on nonnegative matrix factorization using parallel coordinate descent," Abstract and Applied Analysis, vol. 2013, 2013.

    6. M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, pp. 2421-2424, November 2006.

    7. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP 01). 2001 IEEE International Conference on, vol. 2, pp. 749-752, 2001.

    8. Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229-238, 2008.

    9. R. Crochiere, J. Tribolet, and L. Rabiner, "An interpretation of the log likelihood ratio as a measure of waveform coder performance," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 3, pp. 318-323, 1980.
