BLIND SOURCE SEPARATION AND DECONVOLUTION BY DYNAMIC COMPONENT ANALYSIS

H. Attias* and C.E. Schreiner†

Sloan Center for Theoretical Neurobiology, University of California, San Francisco, CA 94143-0444

Abstract. We derive new unsupervised learning rules for blind separation of mixed and convolved sources. These rules are nonlinear in the signals and thus exploit high-order spatiotemporal statistics to achieve separation. The derivation is based on a global optimization formulation of the separation problem, yielding a stable algorithm. Different rules are obtained from frequency- and time-domain optimization. We illustrate the performance of this method by successfully separating convolutive mixtures of speech signals.

1 INTRODUCTION

In the problem of linear square blind separation [1], one considers $L$ independent signal sources $x_i(t)$ (e.g., different speakers in a room) and $L$ sensors $y_i(t)$ (e.g., microphones at several locations). Each sensor receives a mixture of the source signals. The task is to recover the original sources from the observed sensor signals. The separation is termed blind because it must be achieved without any information about the sources, apart from their statistical independence.

*Corresponding author. E-mail: [email protected]. Phone: (415) 476-1576. Fax: (415) 502-4848.
†E-mail: [email protected]. Phone: (415) 476-2591. Fax: (415) 502-4848.


Blind separation algorithms can have many applications in areas involving processing of multi-sensor signals, such as speech enhancement (the 'cocktail party' problem) and the analysis and interpretation of biomedical signals (e.g., EKG, EEG [8]). Most of the separation methods that have been proposed aim at a simplified version of the problem where the mixing process is linear and instantaneous (memoryless). In that case we seek a separating transformation $g_{ij}$ that, when applied to the sensor signals $y_i(t)$, will recover the sources, possibly scaled and permuted: $\hat{x}_i(t) = \sum_j g_{ij} y_j(t)$. In particular, independent component analysis (ICA) algorithms [2-7] can identify $g_{ij}$ fast and efficiently in many cases. However, the mixing in realistic situations is not memoryless, due to multipath propagation and the impulse response of the medium and of the sensors. The resulting 'convolutive' mixtures cannot be separated by ICA methods. In this paper we present a novel unsupervised learning algorithm for blind separation of linear, time-invariant mixtures with memory, termed dynamic component analysis (DCA). The separation in this case requires a transformation with a dynamic impulse response $g_{ij}(t)$ (a matrix of filters),

$$\hat{x}_i(t) = \sum_{j=1}^{L} \int_0^{\infty} dt' \, g_{ij}(t') \, y_j(t - t') \,, \qquad (1)$$

where $\hat{x}_i(t)$ are the recovered source signals.
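For concreteness, in discrete time (1) becomes a sum of causal FIR convolutions. The following is a minimal numpy sketch of this separating transformation; the array layout and function name are our own illustration, not part of the paper:

```python
import numpy as np

def apply_filter_matrix(g, y):
    """Apply an L x L matrix of FIR separating filters to sensor signals.

    g : array (M, L, L) -- g[m, i, j] is the m-th tap of filter g_ij
    y : array (T, L)    -- sensor signals y_j(t)
    Returns xhat, array (T, L) -- recovered sources per eq. (1)
    """
    M, L, _ = g.shape
    T = y.shape[0]
    xhat = np.zeros((T, L))
    for i in range(L):
        for j in range(L):
            # causal FIR convolution: sum_m g[m, i, j] * y[t - m, j]
            xhat[:, i] += np.convolve(y[:, j], g[:, i, j])[:T]
    return xhat
```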

More generally, the signals $y_i(t)$ may be taken from any temporal multi-sensor data set; the new signals $\hat{x}_i(t)$ are termed the dynamic components (DC) of those data. Like the original sources, the DC's are characterized by their statistical independence, and consequently by the property that their joint moments factorize. In the time domain, this implies

$$\langle \hat{x}_i(t)^m \, \hat{x}_j(t+\tau)^n \rangle = \langle \hat{x}_i(t)^m \rangle \, \langle \hat{x}_j(t+\tau)^n \rangle \,,$$

for $i \neq j$ and all orders $m, n$ at any time lag $\tau$; the average is taken over time $t$. Note that in contrast, the independent components found by ICA algorithms satisfy this property only for $\tau = 0$; a numerical check of this criterion is sketched below.
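The factorization criterion can be checked empirically on candidate outputs. A minimal sketch (the function and its usage are our own, not from the paper):

```python
import numpy as np

def moment_factorization_gap(x1, x2, m, n, tau):
    """Gap between <x1(t)^m x2(t+tau)^n> and <x1(t)^m> <x2(t+tau)^n>.

    Near zero for all m, n and lags tau iff the joint moments factorize,
    which is the defining property of dynamic components; ICA only
    guarantees a small gap at tau = 0.
    """
    a = x1[:-tau] ** m if tau > 0 else x1 ** m
    b = x2[tau:] ** n if tau > 0 else x2 ** n
    return np.mean(a * b) - np.mean(a) * np.mean(b)

# e.g., moment_factorization_gap(xhat[:, 0], xhat[:, 1], m=2, n=2, tau=5)
```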

In order to find the separating transformation $g_{ij}(t)$, one could impose the joint moment factorization as a condition on the resulting signals $\hat{x}_i(t)$. Rather than imposing it explicitly, which can practically be done only for low-order moments [14], an effective way to impose this condition implicitly and to all orders is to formulate the separation task as an optimization problem via the use of a latent-variable model [10]. Specifically, we construct a model for the joint distribution of the sensor signals over N-point time blocks, $p_Y[\mathbf{y}(t_0), \ldots, \mathbf{y}(t_{N-1})]$, parametrized by the separating filter matrix $\mathbf{g}(t_0), \ldots, \mathbf{g}(t_{M-1})$. Next, we define the 'distance' between our model sensor distribution and the observed distribution using the Kullback-Leibler (KL) distance [11], an information theory-based measure of the distance between two distributions. The model parameters are then optimized to minimize this distance by the stochastic gradient descent method, yielding the DCA learning rules for $g_{ij}(t)$.

This global optimization formulation of the problem can be given in either the frequency domain or the time domain. Section 2 presents the frequency-domain formulation and the associated learning rules, whereas the time-domain version is given in Section 3. The performance of DCA is illustrated in Section 4 by successfully separating convolutive mixtures of speech signals.

Notation: we work in discrete time $t_n$. Lower-case symbols are used for time-domain quantities and upper-case symbols for their frequency-domain counterparts. We use subscripts to refer to discrete times and frequencies, e.g., $x_n = x(t_n)$ and $X_k = X(\omega_k)$. Vectors and matrices are boldfaced.

2 FREQUENCY-DOMAIN OPTIMIZATION

Let $\mathbf{x}_n$ be the L-dimensional model source vector, whose elements $x_{i,n} = x_i(t_n)$ are the source activities at time $t_n$; these are the latent variables. Let $\mathbf{y}_n$ be the L-dimensional model sensor vector. We work with N-point time blocks $\{t_n\}$, $n = 0, \ldots, N-1$. The two are related by

$$\mathbf{x}_n = \sum_{m=0}^{M-1} \mathbf{g}_m \, \mathbf{y}_{n-m} \,, \qquad \mathbf{X}_k = \mathbf{G}_k \mathbf{Y}_k \,, \qquad (2)$$

where the separating transformation $\mathbf{g}_m$ is a matrix of filters of length $M \leq N$, and $\mathbf{G}_k = \mathbf{G}(\omega_k)$ is its N-point DFT. We focus first on the frequency-domain formulation (r.h.s. of (2)), where the separation problem factorizes.
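The r.h.s. of (2) can be computed for one block by zero-padding the filters to $N$ points and multiplying frequency by frequency. A sketch, with names of our own choosing:

```python
import numpy as np

def block_transform(g, y_block):
    """Frequency-domain form of eq. (2) for one N-point block.

    g       : array (M, L, L), time-domain separating filters, M <= N
    y_block : array (N, L), one block of sensor signals
    Returns X, array (N, L), with X[k] = G_k @ Y_k at each DFT frequency.
    """
    N = y_block.shape[0]
    G = np.fft.fft(g, n=N, axis=0)       # (N, L, L): G_k, filters zero-padded to N
    Y = np.fft.fft(y_block, axis=0)      # (N, L):    Y_k
    return np.einsum('kij,kj->ki', G, Y)  # X_k = G_k Y_k, one k at a time
```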

To construct a model sensor distribution $p_Y(\{\mathbf{Y}_k\})$ we must start with a model source distribution $p_X(\{\mathbf{X}_k\})$. We use a factorial frequency-domain model,

$$p_X(\{\mathbf{X}_k\}) = \prod_{i=1}^{L} \prod_{k=1}^{N/2-1} P_{i,k}(X_{i,k}) \,, \qquad (3)$$

where $P_{i,k}$ is the joint distribution of $\mathrm{Re}\, X_{i,k}$ and $\mathrm{Im}\, X_{i,k}$. From (2) we obtain $p_Y = \prod_k \sqrt{\det(\mathbf{G}_k \mathbf{G}_k^{\dagger})} \, p_X$, which depends on the separating parameters $\mathbf{g}_m$ and the parameters used to describe $P_{i,k}$ (see below).


Denoting the observed sensor signals by $\bar{Y}_k$, we now define a distance measure $D$ between their joint distribution $p_{\bar{Y}}$ and our model distribution $p_Y$. For this purpose we adopt the KL distance function [11], which can be shown to satisfy $D(p_{\bar{Y}}, p_Y) = -H_{\bar{Y}} - \langle \log p_Y \rangle_{\bar{Y}}$; the second term on the r.h.s. is evaluated by averaging $\log p_Y(\mathbf{Y})$ using the observed distribution $p_{\bar{Y}}$. Since $H_{\bar{Y}}$, the entropy of the observed signals, is independent of the mixing model parameters, minimizing $D$ is equivalent to maximizing the log-likelihood of the data, $\langle \log p_Y \rangle_{\bar{Y}}$, with respect to $\mathbf{g}_m$. It follows that

$$D(p_{\bar{Y}}, p_Y) = -\frac{1}{N} \sum_{k=1}^{N/2-1} \left[ \log \det \mathbf{G}_k \mathbf{G}_k^{\dagger} + \sum_{i=1}^{L} \log P_{i,k}(X_{i,k}) \right] \,, \qquad (4)$$

after dropping the average sign and terms independent of $\mathbf{g}_m$. Before deriving the learning rules we make a few simplifications in the model (3) by omitting the frequency dependence of $P_{i,k}$ and using the same parametrized functional form for all sources. In addition, we restrict $P_{i,k}(X_{i,k})$ to depend only on the squared amplitude $|X_{i,k}|^2$. These simplifications are made for convenience, but a more complicated parametrization can be used in situations where the actual source distribution depends non-trivially on the frequency or phase. Note that our model sources are white, in anticipation of the whitening effect discussed below. Hence $P_{i,k}(X_{i,k}) = P(|X_{i,k}|^2; \boldsymbol{\theta}_i)$, where $\boldsymbol{\theta}_i$ is a vector of parameters for source $i$. For instance, $P$ may be a mixture of Gaussian distributions whose means, variances and weights are contained in $\boldsymbol{\theta}_i$. The frequency-domain DCA learning rules for the separating filters $\mathbf{g}_m$ and the source distribution parameters $\boldsymbol{\theta}_i$ are now obtained using a stochastic gradient descent minimization of the KL distance (4):

$$\delta \mathbf{G}_k = \epsilon \left[ \mathbf{I} - \boldsymbol{\Phi}(\mathbf{X}_k) \mathbf{X}_k^{\dagger} \right] \mathbf{G}_k \,, \qquad (5)$$

where $\delta \mathbf{g}_m$ are obtained from $\delta \mathbf{G}_k$ by inverse DFT for $0 \leq m \leq M-1$ and are set to zero for $m \geq M$. The vector $\boldsymbol{\Phi}(\mathbf{X}_k)$ above is related to the model source distribution by

$$\Phi(X_{i,k}; \boldsymbol{\theta}_i) = -X_{i,k} \, \frac{\partial}{\partial a} \log P(a; \boldsymbol{\theta}_i) \bigg|_{a = |X_{i,k}|^2} \,. \qquad (6)$$

The learning rate is set by $\epsilon$.
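A compact numpy sketch of one update of rules (5)-(6) on a single block follows. It assumes, for illustration only, a simple fixed heavy-tailed prior $P(a) \propto e^{-\sqrt{a}}$ in place of the paper's parametrized mixture of Gaussians (so the $\boldsymbol{\theta}_i$ update is omitted), and all names are our own:

```python
import numpy as np

def phi(X):
    # Score vector (6) for the illustrative prior P(a) ~ exp(-sqrt(a)):
    # d/da log P = -1/(2 sqrt(a)), hence Phi(X) = -X d/da log P |_{a=|X|^2}
    # = X / (2 |X|).  (A stand-in for the parametrized mixture of Gaussians.)
    return X / (2.0 * np.abs(X) + 1e-12)

def dca_step(g, y_block, eps=1e-3):
    """One frequency-domain DCA update of the separating filters, eq. (5).

    g       : array (M, L, L), current filters g_m
    y_block : array (N, L), one block of observed sensor signals
    """
    M, L, _ = g.shape
    N = y_block.shape[0]
    G = np.fft.fft(g, n=N, axis=0)                                # G_k
    X = np.einsum('kij,kj->ki', G, np.fft.fft(y_block, axis=0))   # X_k = G_k Y_k
    # delta G_k = eps [I - Phi(X_k) X_k^dagger] G_k
    A = np.eye(L)[None] - np.einsum('ki,kj->kij', phi(X), X.conj())
    dG = eps * np.einsum('kij,kjl->kil', A, G)
    # project: inverse DFT, keep the M causal taps (real, since g is real),
    # implicitly zeroing delta g_m for m >= M
    dg = np.fft.ifft(dG, axis=0).real[:M]
    return g + dg
```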

We point out that to derive the rule for $\mathbf{g}_m$ we used $\delta \mathbf{g}_m = -\epsilon \, \partial D / \partial \mathbf{g}_m$, but since the resulting $\delta \mathbf{G}_k$ required matrix inversion for each $\omega_k$ at each iteration, we multiplied it by the positive-definite matrix $\mathbf{G}_k^{\dagger} \mathbf{G}_k$ to get the less expensive rule (5). It can be shown [9] that this rule indeed decreases $D$ at each iteration in the small-$\epsilon$ limit. Furthermore, it can be shown to satisfy the property of equivariance (see [6,7] for equivariant algorithms for instantaneous mixing), which guarantees uniform performance across all invertible mixing processes. We emphasize the importance of using a time block that is sufficiently longer than our model filters. As is evident in the frequency-domain formulation (r.h.s. of (2) and the $\delta \mathbf{G}_k$ rule of (5)), we are effectively solving $N$ individual mixing problems, one at each $\omega_k$, and risk recovering the sources with a different ordering permutation at different frequencies, possibly reducing the separation quality. The key point here is that these $N$ problems (or $N$ increments $\delta \mathbf{G}_k$) are not independent, since the minimization of the distance function with respect to the $M$ time-domain coefficients $\mathbf{g}_m$ couples them and solves them simultaneously. Consequently, to minimize the freedom of arbitrary permutations by exploiting this coupling we must choose $M \ll N$.
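The effect of this coupling can be seen in a toy numpy experiment (entirely our own illustration): projecting arbitrary per-frequency matrices onto $M \ll N$ time-domain taps forces $\mathbf{G}_k$ to vary smoothly with $k$, leaving no room for an ordering flip at an isolated frequency.

```python
import numpy as np

M, N, L = 8, 256, 2
rng = np.random.default_rng(0)
# independent random G_k, as if each frequency were solved in isolation
G = rng.standard_normal((N, L, L)) + 1j * rng.standard_normal((N, L, L))
# project onto M causal real taps and back, as the dca_step update does
g = np.fft.ifft(G, axis=0).real[:M]
G_proj = np.fft.fft(g, n=N, axis=0)
# after projection, neighbouring frequencies are nearly identical,
# so a source permutation cannot occur at a single isolated k
print(np.abs(G_proj[1] - G_proj[2]).max())  # small
print(np.abs(G[1] - G[2]).max())            # O(1)
```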