
2014 IEEE 8th Sensor Array and Multichannel Signal Processing Workshop (SAM)

Estimation of Inter-Channel Phase Differences using Non-Negative Matrix Factorization

Hendrik Kayser and Jörn Anemüller
Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, 26111 Oldenburg, Germany
Email: {hendrik.kayser, joern.anemueller}@uni-oldenburg.de

Kamil Adiloğlu
HörTech gGmbH, Center of Competence Research and Development, Marie-Curie-Str. 2, 26129 Oldenburg, Germany
Email: [email protected]

Abstract—Estimation of the non-linearities in the phase differences between two or more channels of an audio recording leads to more precise spatial information in audio signal enhancement applications. In this work, we propose the estimation of these non-linearities in multi-channel, multi-source audio mixtures in reverberant environments. For this task, we compute short-term cross-correlation functions between the channels and extract the non-linear inter-channel phase differences as well as a measure of activation for each source. This is achieved by decomposing the cross-correlation matrix using a non-negative matrix factorization method. Our evaluation shows that the estimated inter-channel phase differences capture the non-linearities. Furthermore, the estimated activations reflect the time instances at which the sources are active. In audio source separation experiments, the proposed method outperforms a state-of-the-art approach based on linear phase differences by 30% in terms of relative improvement.

Fig. 1. Example of the IPD for one source of a three-source mixture. The difference to the estimated linear IPD (Δ lin. phase) is shown over frequency (0–8000 Hz). The zero line (black) is given by the linear IPD, the yellow line denotes the IPD computed from the GCC function of the undisturbed sound source (available separately as ground truth), and the red line indicates the non-linear IPD estimated by our proposed approach.

I. INTRODUCTION

In the task of spatial signal enhancement of multi-source mixtures, the ideal case assumes that the sources are point sources and that no reverberation is present in the room; the impulse response describing the transmission to a sensor then resembles a Dirac δ-function. In a real scenario, however, the impulse response has a significant influence on the mixing process. In particular, early reflections introduce small variations, i.e., non-linearities, into the inter-channel phase differences (IPDs) between the signals captured by a pair of sensors, which are ideally linear in frequency. Estimating these non-linearities therefore improves the performance of signal enhancement approaches that take the spatial configuration of the sound sources into account. Tasks like automatic speech recognition [1] potentially benefit from a precise estimation in the signal enhancement steps of the processing chain.

IPDs and their non-linearities are captured by generalized cross-correlation (GCC, [2]) in multi-channel recordings. GCC has been shown to be robust in the estimation of the time delay of arrival (TDOA) [3]. In this context, a correctly estimated TDOA refers to the direct-path TDOA of the sound source. In our contribution, we extend this approach by estimating additional components associated with the main directional component of the sound source and present a novel method for the estimation of the non-linear IPDs. We regard the spatial representation of a given multi-channel, multi-source audio signal in the GCC domain and incorporate non-negative matrix



factorization (NMF) for estimating the directional characteristics of each source within the given mixture. Typically, NMF is employed to decompose the spectrogram representation of a given audio signal, yielding spectral features and their activations as the resulting components. This approach has been successfully applied in the audio signal processing domain for various tasks such as speech enhancement [4], [5], audio source separation [6] and music transcription [7]. Here, we utilize NMF in combination with the GCC for the estimation of the directional characteristics of each source image. We show how these characteristics can be used to estimate the non-linear IPDs and demonstrate the performance of the proposed method in a basic source separation scenario. An example comparing non-linear IPDs to the linear approximation is shown in Figure 1.

II. PROBLEM FORMULATION AND METHODS

We regard the mixing process resulting in a multi-channel, multi-source signal as defined by the expression

$$\mathbf{x}_{fn} = \mathbf{A}_f \mathbf{s}_{fn},\qquad(1)$$

where $\mathbf{x}_{fn} = [x_{1,fn}, \cdots, x_{J,fn}]^T$ denotes the short-time Fourier transform (STFT) coefficients of the mixture, $\mathbf{s}_{fn} = [s_{1,fn}, \cdots, s_{K,fn}]^T$ the source STFT coefficients, and $\mathbf{A}_f = [a_{jkf}]_{jk}$ the STFT coefficients of a causal, stationary mixing system. Here and in the remainder of the paper, $f = 1, \ldots, F$ denotes the frequency index, $n = 1, \ldots, N$ the time frame index, $j = 1, \ldots, J$ the channel index and $k = 1, \ldots, K$ the source index. In the present work we focus on the under-determined scenario, for which $J < K$. We also assume that the number of sources $K$ is known in advance.

Our approach, described in detail in the following sections, consists of these processing steps:
1) Features for the IPD estimation are given by a matrix containing the GCC functions of each time frame of a signal mixture.
2) An initial TDOA estimation is conducted, which is used to initialize the NMF procedure and at the same time serves as the benchmark approach in this work.
3) The GCC matrix is decomposed into the contributions of each sound source of the mixture using NMF.
4) Based on the outcome of the NMF, the non-linear IPD is estimated for each source.
5) This information is used to approximate the mixing system $\mathbf{A}$ given in (1), which is in turn incorporated into a basic binary-masking source separation approach to estimate each source in the given mixture.
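To make the mixing model (1) concrete, the following NumPy sketch (all shapes, values and names are illustrative assumptions of ours, not from the paper) applies a separate J × K mixing matrix A_f to the source vector s_fn in every frequency bin:

```python
import numpy as np

rng = np.random.default_rng(0)

F, N = 513, 100   # frequency bins and time frames (illustrative values)
J, K = 2, 3       # channels and sources; J < K is the under-determined case

# Complex STFT coefficients of the K sources, shape (K, F, N)
s = rng.standard_normal((K, F, N)) + 1j * rng.standard_normal((K, F, N))

# Frequency-dependent mixing system: one J x K matrix A_f per bin f
A = rng.standard_normal((F, J, K)) + 1j * rng.standard_normal((F, J, K))

# Eq. (1): x_{fn} = A_f s_{fn}, evaluated for all f and n at once
x = np.einsum('fjk,kfn->jfn', A, s)

assert x.shape == (J, F, N)
```

The `einsum` call is just the per-bin matrix product written once for all frames; a double loop over f and n with `A[f] @ s[:, f, n]` would give the same result.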

A. Generalized Cross Correlation

A well-known and successful variant of the GCC approach to TDOA estimation is the phase transform (PHAT) weighting, which yields an equally weighted contribution of each frequency bin to the GCC function. It is computed from the spectra $x_{1,fn}$ and $x_{j,fn}$ of channels 1 and $j$ of the given mixture signal as

$$G_{j,ln} = \frac{1}{F}\sum_{f=0}^{F-1} \Psi_{j,fn}\, x_{1,fn}\, x_{j,fn}^{*}\, e^{2\pi i f l/F},\qquad(2)$$

where $l$ is the delay in samples between the two channels, defined on $[-L_j, L_j]$, with $L_j$ denoting the maximum delay in samples that can occur between channels 1 and $j$. $\Psi$ is a frequency weighting, which takes the following form in the PHAT case:

$$\Psi_{j,fn} = \frac{1}{\left| x_{1,fn}\, x_{j,fn}^{*} \right|}.\qquad(3)$$

This yields the real-valued (due to the inverse Fourier transform in (2)) GCC matrix $G_j$ of size $(2L_j+1) \times N$. Its entries lie between $-1$ and $1$; adding 1 to each element therefore ensures non-negativity. We slightly adapt this approach by an additional frequency weighting applied after PHAT: a Hanning window is used to avoid artifacts in the resulting GCC matrix that the effective convolution with a sinc function introduces if no windowing is applied prior to the inverse Fourier transform in (2). This operation broadens the main peaks of the resulting GCC function, but it suppresses the additional peaks that would otherwise accompany the main peak.

B. Non-Negative Matrix Factorization

We utilize NMF to decompose $G_j$ of size $(2L_j+1) \times N$ into two lower-dimensional non-negative matrices $C_j$ of size $(2L_j+1) \times K$ and $E_j$ of size $K \times N$. The columns $C_{j,k}$ of $C_j$ are called basis vectors and indicate the correlation pattern of a source $k$; the rows $E_{j,k}$ of $E_j$ are the corresponding activations, depicting the activity level of that source over time. This factorization is usually expressed as a minimization problem,

$$\min_{C_j, E_j \ge 0} D(G_j | C_j E_j),\qquad(4)$$

where the cost function $D$ is defined as

$$D(G_j | C_j E_j) = \sum_{l=-L_j}^{L_j} \sum_{n=1}^{N} d_{IS}\!\left([G_j]_{ln} \,\middle|\, [C_j E_j]_{ln}\right).\qquad(5)$$

In this formulation, $d_{IS}(\cdot|\cdot)$ is the Itakura-Saito divergence,

$$d_{IS}(x|y) = \frac{x}{y} - \log\frac{x}{y} - 1.\qquad(6)$$

In order to induce a certain behavior in the two matrices beyond non-negativity, penalty terms are added to the cost function in (5). We aim at forcing each basis vector to contain information about one isolated sound source. Ideally, the main TDOA is given by the location of a single maximum in the corresponding basis vector, surrounded by smaller peaks originating from early reflections in the room, which are likely to impinge from different, but not heavily diverging, directions. We introduce two penalty terms, the first being a flattening term,

$$\mathcal{F}(C_j) = \sum_{k} [C_j^T C_j]_{kk}.\qquad(7)$$

This term penalizes large-valued elements of $C_j$, whereas small values are hardly affected. Furthermore, we want the basis vectors to be mutually orthogonal, under the assumption that each source impinges from a different direction. The penalty term enforcing orthogonality of $C_j$ is defined as

$$\mathcal{O}(C_j) = \sum_{k \neq k'} [C_j^T C_j]_{kk'}.\qquad(8)$$

By enforcing this behavior of $C_j$, the activation vectors $E_{j,k}$ are also driven to be assigned to a single source each: they are supposed to show higher weights along the time axis when the corresponding source is active. Incorporating these penalties, the complete cost function reads

$$C(G_j | C_j E_j) = D(G_j | C_j E_j) + \alpha \mathcal{O}(C_j) + \beta \mathcal{F}(C_j) = \sum_{l=-L_j}^{L_j} \sum_{n=1}^{N} d_{IS}\!\left([G_j]_{ln} \,\middle|\, [C_j E_j]_{ln}\right) + \alpha \sum_{k \neq k'} [C_j^T C_j]_{kk'} + \beta \sum_{k} [C_j^T C_j]_{kk},\qquad(9)$$
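As a hedged illustration of (2)–(3), a per-frame GCC-PHAT matrix with the additional Hanning weighting and the +1 non-negativity offset described above might be computed as follows (function and variable names are our own; this is a sketch, not the authors' implementation):

```python
import numpy as np

def gcc_phat_matrix(X1, Xj, L, eps=1e-12):
    """GCC-PHAT between channels 1 and j, eq. (2)-(3).

    X1, Xj : complex spectra of shape (F, N), one column per time frame
             (assumed here to cover the full FFT length F).
    L      : maximum delay in samples between the two channels.
    Returns a non-negative GCC matrix of shape (2L+1, N).
    """
    F, N = X1.shape
    cross = X1 * np.conj(Xj)              # x_{1,fn} x*_{j,fn}
    cross /= np.abs(cross) + eps          # PHAT weighting Psi, eq. (3)
    cross *= np.hanning(F)[:, None]       # extra window against sinc artifacts
    # The inverse DFT over frequency realizes eq. (2) (ifft includes the 1/F).
    g = np.real(np.fft.ifft(cross, axis=0))
    # Keep lags -L..L; negative lags wrap to the end of the IFFT output.
    G = np.vstack([g[-L:, :], g[:L + 1, :]])
    return G + 1.0                        # shift into the non-negative range
```

For a pure inter-channel delay of d samples, the maximum of each column sits d rows away from the zero-lag row (row L); the Hanning weighting broadens that peak, as discussed above.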

where $\alpha$ and $\beta$ are factors weighting the two penalty terms in the optimization procedure. We use multiplicative updates [7] in order to obtain fast update rules:

$$C_j \leftarrow C_j \cdot \frac{\left([C_j E_j]^{.[-2]} \cdot G_j\right) E_j^T + 2\alpha C_j}{[C_j E_j]^{.[-1]} E_j^T + 2\alpha C_j \mathbf{1} + 2\beta C_j},\qquad(10)$$

$$E_j \leftarrow E_j \cdot \frac{C_j^T \left([C_j E_j]^{.[-2]} \cdot G_j\right)}{C_j^T [C_j E_j]^{.[-1]}}.\qquad(11)$$

In these update equations, $\cdot$ denotes the Hadamard (entry-wise) product, the division is likewise entry-wise, $\bullet^{.[n]}$ denotes the matrix obtained by raising each entry of the matrix $\bullet$ to the power $n$, and $\mathbf{1}$ is a $K \times K$ matrix of ones.

C. IPD Estimation

Estimation of the IPDs corresponding to one source in a given mixture is based on the assumption that the correlation pattern computed in a single time frame is dominated by the most active source. The information about the activity of a source $k$ is provided by the corresponding row $E_{j,k}$ of the NMF activation matrix. In each time frame $n$, the most active source is selected; by this means, each time frame in the GCC matrix is assigned to one source together with an activation measure. To obtain an individual GCC function $\hat{G}_{j,k}$ for each source $k$, the GCC matrices are averaged over all time frames assigned to the same source. More precisely, a weighted average is taken, with the activity measures as weights:

$$\hat{G}_{j,k} = \frac{\sum_{n \in N_{j,k}} E_{j,kn}\, G_{j,n}}{\sum_{n \in N_{j,k}} E_{j,kn}},\qquad(12)$$

where $N_{j,k} = \left\{ n \,\middle|\, k = \arg\max_{\zeta} E_{j,\zeta n} \right\}$, $\zeta = 1, \ldots, K$.

To emphasize the area around the main peak, and also to limit the potentially detrimental effect of a misestimation of the GCC function, the estimate is multiplied with a Gaussian window centered at the estimated TDOA of the source. The width of the window is defined by its standard deviation $\sigma$ and can be expressed in units of time, since its domain is the delay time of the GCC function. A constant $c$ is added to the window so that its tails converge to $c$, and the window is then divided by its maximum. This procedure assigns smaller weights to IPDs that differ noticeably from the values corresponding to the main TDOA. An example of the outcome of this procedure is shown in Figure 2 for the case of three superimposed speech sources impinging from different directions.

For the computation of the non-linear, frequency-dependent IPD, the estimated GCC function $\hat{G}_{j,k}$ of source $k$ is transformed back into the frequency domain and the phase angle is computed from the resulting complex spectrum:

$$\varphi_{j,k,f} = \arg\left( \sum_{l=-L_j}^{L_j} \hat{G}_{j,kl}\, e^{-2\pi i f l/(2L_j+1)} \right).\qquad(13)$$

Fig. 2. GCC matrix (upper panel) of a mixture of three speech sources and the activations estimated by our approach (lower panel). The activity of each source over time is denoted by a different color.

D. Source Separation

In this contribution, we focus on the effect of a precise estimation of the spatial characteristics of sound sources in reverberant environments. We hypothesize that considering the non-linearities in the IPDs improves spatial signal enhancement of any kind. For this purpose, we demonstrate the proposed method in a source separation scenario using a simple binary-masking¹ approach, in order to avoid the effect of the non-linearities being masked by specific properties of more sophisticated source separation algorithms (e.g., spatial characteristics in beamforming, statistical properties of Bayesian approaches). We compute the mixing matrix from the IPDs obtained with (13) as

$$A_{k,f} = \left[1, e^{-i\varphi_{1,k,f}}, \cdots, e^{-i\varphi_{J-1,k,f}}\right]^T.\qquad(14)$$

Note that the entry for the first channel is set to one and the IPDs estimated for the remaining channels are used for the other entries.
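The following sketch shows how the penalized multiplicative updates (10)–(11) and the IPD extraction of (12)–(13) could look in NumPy. It is a simplified illustration under our own assumptions: a single channel pair, random initialization instead of the TDOA-based one, a hard argmax frame assignment, and no Gaussian lag window; all names are hypothetical.

```python
import numpy as np

def penalized_is_nmf(G, K, alpha, beta, n_iter=200, eps=1e-12, seed=0):
    """Factor the non-negative GCC matrix G ((2L+1) x N) into basis
    vectors C ((2L+1) x K) and activations E (K x N) by minimizing the
    Itakura-Saito cost (9) with the multiplicative updates (10)-(11)."""
    rng = np.random.default_rng(seed)
    M, N = G.shape
    C = rng.random((M, K)) + eps
    E = rng.random((K, N)) + eps
    ones = np.ones((K, K))                      # the matrix "1" in eq. (10)
    for _ in range(n_iter):
        V = C @ E + eps
        C *= ((V ** -2 * G) @ E.T + 2 * alpha * C) / \
             (V ** -1 @ E.T + 2 * alpha * (C @ ones) + 2 * beta * C + eps)
        V = C @ E + eps
        E *= (C.T @ (V ** -2 * G)) / (C.T @ V ** -1 + eps)
    return C, E

def ipd_per_source(G, E, L):
    """Eqs. (12)-(13): activation-weighted average of the GCC frames
    assigned to each source, then the phase of its spectrum."""
    K = E.shape[0]
    assign = np.argmax(E, axis=0)               # dominant source per frame
    phis = []
    for k in range(K):
        idx = np.flatnonzero(assign == k)
        w = E[k, idx]
        G_k = (G[:, idx] * w).sum(axis=1) / (w.sum() + 1e-12)  # eq. (12)
        # Reorder lags -L..L into FFT order so that lag 0 is index 0;
        # the forward DFT then realizes the sum in eq. (13).
        spec = np.fft.fft(np.roll(G_k, -L))
        phis.append(np.angle(spec))
    return np.array(phis)                       # shape (K, 2L+1)
```

For a source whose averaged GCC function is a single peak at some lag, the resulting phase is linear in frequency; deviations from that line are the non-linear IPD components the method is after.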

III. EXPERIMENTAL EVALUATION

We evaluate the performance of both our proposed non-linear IPD estimation method and a single-TDOA-based linear IPD method in terms of the Signal-to-Distortion Ratio (SDR) [8] in dB. The performance limit is given by the results obtained with the ground-truth GCC functions. These were computed from the separately available source images, using the same parameters as in the whole setup for computing G, with subsequent time-averaging. Additionally, we compare the methods in terms of relative improvement, which denotes the reduction of the difference to the ground truth achieved by our approach compared to the reference.

¹ http://www.loria.fr/~evincent/soft.html: "SiSEC Reference Software"

A. Data

We considered speech source mixtures from the development dataset of the 2008 Signal Separation Evaluation Campaign (SiSEC, http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Under-determined+speech+and+music+mixtures). This data set contains convolutive stereo


mixtures with two different reverberation times of 130 ms and 250 ms. There are 8 configurations of 3 sources and 12 configurations of 4 sources. Each configuration was captured with microphone distances of 5 cm and 1 m, resulting in a total of 40 mixtures with a duration of 10 s each.

B. Algorithmic Settings

For all methods under comparison the same processing settings were used. G was computed on the basis of a short-term Fourier transform with a window length of 1024 samples and a shift of 512 samples. For the linear IPD approach and the proposed method, the TDOAs $\tau_k$ were estimated from the physically plausible part of G along the delay-time dimension, according to the maximum possible delay between the microphones. The TDOA estimation method used here is comparable to the GCC-PHAT approach described in [3]: maxima were taken over time and the K highest peaks were selected as source TDOAs. The NMF procedure was initialized with these results and applied to the same part of G after adding 1 to all entries. The resulting E is then used to estimate $\varphi$ from the whole GCC matrix. For the linear IPD method, $\varphi$ is derived directly from the estimated $\tau_k$: $\varphi_{j,k,f} = 2\pi f \tau_k$. The weights of the penalties, $\alpha$ and $\beta$, and the parameters determining the shape of the Gaussian window, $\sigma$ and $c$, applied prior to the computation of $\varphi$, were optimized in a grid search. The results presented here were obtained with the following settings: $\alpha = 2.0$, $\beta = 6.2$, $\sigma = 20\,\mu s$, $c = 0.2$.

C. Results

The source separation results are shown in Table I. Relative to the ground truth, both methods achieve a fairly good separation of the signals compared to the unprocessed case. However, our method outperforms the linear IPD approach on average by 0.3 dB, corresponding to a relative improvement of 30%. The improvement is largest in the setups with longer reverberation times. This is due to the fact that in these scenarios reflections from the environment are stronger and inject more distinct non-linearities into the IPDs. As the non-linearities are captured by our method, this helps to improve the source separation performance.

TABLE I
SDR in dB of (left to right) the unprocessed signals (unproc.), after applying the TDOA-based source separation (linear), the proposed method (non-lin.) and using the ground truth (GT). The last column shows the relative improvement gained by the proposed approach. The rows break the results down into average performance on different subsets of the data according to the number of sources (# src) and reverberation time (T60), as indicated by the first two columns.

| # src | T60 (ms)   | unproc. | linear | non-lin. | GT   | Rel. Imp. |
|-------|------------|---------|--------|----------|------|-----------|
| 3     | 130        | -0.29   | 2.82   | 3.25     | 4.15 | 32.8%     |
| 3     | 250        | -0.44   | 1.79   | 2.11     | 2.69 | 34.8%     |
| 4     | 130        | -0.54   | 1.75   | 1.96     | 2.93 | 17.8%     |
| 4     | 250        | -0.63   | 0.81   | 1.06     | 1.58 | 32.5%     |
| 3     | {130, 250} | -0.36   | 2.31   | 2.68     | 3.42 | 33.3%     |
| 4     | {130, 250} | -0.60   | 1.12   | 1.36     | 2.03 | 26.4%     |
| Grand average |    | -0.51   | 1.60   | 1.89     | 2.59 | 29.3%     |

IV. CONCLUSION AND FUTURE DIRECTIONS

We proposed a novel NMF-based method for estimating the non-linear, frequency-dependent IPDs from short-term GCC-PHAT functions of a given multi-channel, multi-source signal. We introduced an NMF algorithm with penalty terms that decomposes the GCC matrix into basis vectors, which resemble the GCC functions of the single sound sources in a mixture, and their activations, and we successfully demonstrated the benefit of incorporating the estimated non-linear characteristics of the IPDs in a source separation approach. The results indicate that the proposed method proves its advantages especially in more complex environments. In this contribution, we demonstrated this in a source separation scenario; however, the proposed method can be applied to any kind of spatial signal enhancement approach and is not limited in the number of input channels or the number of target sources.

In many audio signal processing applications, such as hearing aids, cochlear implants, and audio/video conferencing, the processing needs to be done online and in real time. In future work, we will redesign the NMF decomposition in an online manner by processing the input signal causally and by adopting a sliding-window approach, which will reduce the size of the matrix to be decomposed and in turn enable real-time processing of the input signal.

ACKNOWLEDGMENTS

The research leading to these results has received funding from the German Research Foundation (DFG) through the Research Unit FOR 1732 "Individualized Hearing Acoustics" (H.K., J.A.) and the European Union's Seventh Framework Programme (FP7/2007-2013) under ABCIT grant agreement no. 304912 (K.A.).

REFERENCES

[1] A. Ozerov, M. Lagrange, and E. Vincent, "Uncertainty-based learning of acoustic models from noisy data," Computer Speech and Language, vol. 27(3), pp. 874–894, 2013.
[2] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. on Acoustics, Speech and Signal Processing, 1976.
[3] C. Blandin, A. Ozerov, and E. Vincent, "Multi-source TDOA estimation in reverberant audio using angular spectra and clustering," Signal Processing, vol. 92(8), pp. 1950–1960, 2012.
[4] N. Lyubimov and M. Kotov, "Non-negative matrix factorization with linear constraints for single-channel speech enhancement," in Proc. Interspeech, Lyon, France, 2013.
[5] N. Mohammadiha, P. Smaragdis, and A. Leijon, "Supervised and unsupervised speech enhancement using nonnegative matrix factorization," IEEE Trans. on Audio, Speech and Language Processing, vol. 21(10), pp. 2140–2151, 2013.
[6] A. Ozerov and C. Févotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Trans. on Audio, Speech and Language Processing, vol. 18(3), pp. 550–563, 2010.
[7] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis," Neural Computation, vol. 21(3), pp. 793–830, 2009.
[8] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. on Audio, Speech and Language Processing, vol. 14(4), pp. 1462–1469, 2006.
