Natural Gradient Algorithm for Blind Separation of Overdetermined Mixture with Additive Noise

L.-Q. Zhang, A. Cichocki and S. Amari
Brain-style Information Systems Group, BSI, The Institute of Physical and Chemical Research, Wako-shi, Saitama 351-0198, Japan
Abstract - In this letter we study the natural gradient approach to blind separation of overdetermined mixtures. First we introduce a Lie group structure on the manifold of non-square demixing matrices and endow the manifold with a Riemannian metric based on the properties of the Lie group. Then we derive the natural gradient on the manifold using the isometry of the Riemannian metric. Using the natural gradient, we present a new learning algorithm based on the minimization of mutual information.
I. Introduction

Recently, blind separation of independent sources has become an increasingly important research area due to its rapidly growing applications in various fields, such as telecommunication systems, image enhancement, and biomedical signal processing [1]-[11]. It has been shown that the natural gradient dramatically improves the learning efficiency in blind separation [1]-[8]. For the standard case, where the number of sources is equal to the number of sensors, the natural gradient algorithm was developed by Amari et al. [3] and, independently, as the relative gradient by Cardoso and Laheld [7]. However, in most practical cases the number of active source signals is unknown and changes over time, so in general the mixing and demixing matrices are neither square nor invertible. Blind separation of more sources than mixtures was discussed in [10] using overcomplete representations. In this letter we study blind separation of overdetermined mixtures, where the number of sensors is not less than the number of sources.

The main objective of this letter is to extend the idea of the natural gradient to overdetermined mixtures and to apply it to derive an efficient learning algorithm. Surprisingly, the natural gradient algorithm that is optimal in the sense of minimizing the effect of noise on the output signals has the same form as in the standard case where the number of sources equals the number of sensors. This makes it plausible to use overdetermined mixtures to improve blind source separation algorithms in extracting the signals of interest from mixtures.

II. Blind Separation of Overdetermined Mixtures

Assume that the unknown source signals s(t) = (s1(t), ..., sn(t))^T are zero-mean and mutually statistically independent, and that x(t) = (x1(t), ..., xm(t))^T is an available sensor vector which is a linear instantaneous mixture of the sources,

x(t) = A s(t) + n(t),   (1)

where A ∈ R^{m×n} is an unknown mixing matrix of full rank and n(t) is a vector of additive white Gaussian noise. In this letter we consider the overdetermined case, m ≥ n. The blind separation problem is to recover the original signals s(t) from the observations x(t) without prior knowledge of the source signals or the mixing matrix A, except for the independence of the source signals. The demixing model is a linear transformation of the form

y(t) = W x(t),   (2)

where y(t) = (y1(t), ..., yn(t))^T is an estimate of the source signals s(t) and W ∈ R^{n×m} is the demixing matrix to be determined. The general solution to the blind separation problem is to find a matrix W such that WA = ΛP, where Λ ∈ R^{n×n} is a nonsingular diagonal matrix and P ∈ R^{n×n} is a permutation matrix.
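As a concrete illustration of the mixing and demixing models (1)-(2), the following minimal NumPy sketch generates an overdetermined mixture and applies a demixing matrix. The dimensions, source distribution, and noise level are illustrative assumptions, not values from the letter.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, T = 3, 5, 1000            # sources, sensors (m >= n), samples (assumed)
sigma = 0.05                    # assumed noise level

s = rng.laplace(size=(n, T))    # zero-mean, mutually independent sources
A = rng.standard_normal((m, n))                    # unknown full-rank mixing matrix
x = A @ s + sigma * rng.standard_normal((m, T))    # overdetermined mixture, eq. (1)

W = rng.standard_normal((n, m))                    # demixing matrix to be learned
y = W @ x                                          # demixing model, eq. (2)
print(y.shape)                                     # (n, T)
```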
III. Natural Gradient

In this section we discuss some geometrical structures, such as the Lie group and the Riemannian metric, on the manifold of demixing matrices, defined as

Gl(n, m) = { W ∈ R^{n×m} | rank(W) = min(n, m) }.

For W ∈ Gl(n, m), there exists an orthogonal matrix Q ∈ R^{m×m} such that

W = [W1, W2] Q,   (3)

where W1 ∈ R^{n×n} is nonsingular and W2 ∈ R^{n×(m−n)}.

A. Lie Group Gl(n, m)

The Lie group plays a crucial role in deriving the natural gradient on the manifold Gl(n, n). Here we introduce a Lie group structure on the manifold Gl(n, m). It is easy to verify that Gl(n, m) is a C^∞ manifold of dimension nm. The operations on the manifold Gl(n, m) are defined as follows:

X ∗ Y = [X1 Y1, X1 Y2 + X2] Q,   (4)
X† = [X1^{-1}, −X1^{-1} X2] Q,   (5)

where X = [X1, X2] Q and Y = [Y1, Y2] Q are in Gl(n, m), ∗ is the multiplication operator of the two matrices X and Y, and † is the inverse operator on Gl(n, m). The identity element of Gl(n, m) is E = [In, 0] Q. It is easy to prove that both the multiplication and the inverse mappings are C^∞ mappings. The inverse operator satisfies X ∗ X† = X† ∗ X = E. Therefore the manifold Gl(n, m) equipped with these operations forms a Lie group.
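The group operations (4)-(5) can be checked numerically. The sketch below is an illustrative implementation (the helper names split, group_mul, and group_inv are ours, not from the letter) that verifies X ∗ X† = E for a random element of Gl(n, m).

```python
import numpy as np

def split(X, Q, n):
    """Recover the blocks X1 (n x n) and X2 (n x (m-n)) of X = [X1, X2] Q."""
    XQ = X @ Q.T
    return XQ[:, :n], XQ[:, n:]

def group_mul(X, Y, Q, n):
    """Lie group multiplication on Gl(n, m), eq. (4): X * Y = [X1 Y1, X1 Y2 + X2] Q."""
    X1, X2 = split(X, Q, n)
    Y1, Y2 = split(Y, Q, n)
    return np.hstack([X1 @ Y1, X1 @ Y2 + X2]) @ Q

def group_inv(X, Q, n):
    """Lie group inverse on Gl(n, m), eq. (5): X† = [X1^{-1}, -X1^{-1} X2] Q."""
    X1, X2 = split(X, Q, n)
    X1inv = np.linalg.inv(X1)
    return np.hstack([X1inv, -X1inv @ X2]) @ Q

n, m = 3, 5
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))         # an arbitrary orthogonal Q
E = np.hstack([np.eye(n), np.zeros((n, m - n))]) @ Q     # identity element E = [I_n, 0] Q

X = rng.standard_normal((n, m))
print(np.allclose(group_mul(X, group_inv(X, Q, n), Q, n), E))   # True: X * X† = E
```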
B. Riemannian Metrics

The Lie group has the important property that it admits an invariant Riemannian metric. Let TW be the tangent space of Gl(n, m) at W, and let X, Y ∈ TW be tangent vectors. We introduce an inner product on TW with respect to W, denoted ⟨X, Y⟩_W. Since Gl(n, m) is a Lie group, any Z ∈ Gl(n, m) defines an onto mapping W → W ∗ Z. This multiplication transformation maps a tangent vector X at W to a tangent vector X ∗ Z at W ∗ Z. We can therefore define a Riemannian metric on Gl(n, m) such that right multiplication is an isometry, that is, it preserves the Riemannian metric on Gl(n, m). Explicitly,

⟨X, Y⟩_W = ⟨X ∗ Z, Y ∗ Z⟩_{W∗Z}.   (6)

If we define the inner product at the identity E by ⟨X, Y⟩_E = tr(X Y^T), then ⟨X, Y⟩_W is automatically induced by

⟨X, Y⟩_W = ⟨X ∗ W†, Y ∗ W†⟩_E.   (7)
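To make the invariance property (6)-(7) concrete, the following self-contained sketch (with our own helper names and an arbitrary random Q, both assumptions) numerically checks that the induced inner product is preserved under right multiplication.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 5
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))

def blocks(M):                      # M = [M1, M2] Q  ->  (M1, M2)
    MQ = M @ Q.T
    return MQ[:, :n], MQ[:, n:]

def mul(X, Y):                      # group multiplication, eq. (4)
    X1, X2 = blocks(X); Y1, Y2 = blocks(Y)
    return np.hstack([X1 @ Y1, X1 @ Y2 + X2]) @ Q

def inv(X):                         # group inverse, eq. (5)
    X1, X2 = blocks(X); X1i = np.linalg.inv(X1)
    return np.hstack([X1i, -X1i @ X2]) @ Q

def inner_W(X, Y, W):               # eq. (7): <X, Y>_W = <X * W†, Y * W†>_E
    Wd = inv(W)
    return np.trace(mul(X, Wd) @ mul(Y, Wd).T)

W, Z = rng.standard_normal((n, m)), rng.standard_normal((n, m))
X, Y = rng.standard_normal((n, m)), rng.standard_normal((n, m))

# Right-invariance, eq. (6): <X, Y>_W equals <X * Z, Y * Z>_{W * Z}
print(np.isclose(inner_W(X, Y, W), inner_W(mul(X, Z), mul(Y, Z), mul(W, Z))))  # True
```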
C. Natural Gradient

For a cost function l(W) defined on the manifold Gl(n, m), the natural gradient ∇̃l(W) is the steepest ascent direction of l(W) as measured by the Riemannian metric on Gl(n, m); it is the contravariant form of the partial derivatives ∇l(W) = ∂l(W)/∂W. The natural gradient of the function l(W) is defined by [1]

⟨X, ∇̃l(W)⟩_W = ⟨X, ∇l(W)⟩_E   (8)

for any X ∈ TW. Comparing both sides of (8), we have

∇̃l(W) = ∇l(W) (W^T W + N_I),   (9)

where N_I = Q^T [0, 0; 0, I_{m−n}] Q ∈ R^{m×m} is a block-diagonal matrix whose upper-left block is the n×n zero matrix and I_{m−n} is the (m−n)×(m−n) identity matrix. It is worth noting that the natural gradient on the manifold Gl(n, m) has an additional term compared with the one on the manifold Gl(n, n). In the overdetermined case the matrix W^T W is singular, while W^T W + N_I is positive definite for any W ∈ Gl(n, m). This property ensures that the natural gradient descent algorithm keeps the same equilibria of the learning system as the ordinary gradient descent algorithm.

Remark 1. It is easy to see that N_I is a projection matrix. This indicates that the natural gradient for overdetermined mixtures is not unique; it depends on the orthogonal matrix Q. This redundancy makes it possible to choose an optimal projection for the learning algorithm.
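As a small sketch of (9), under the same illustrative assumptions as above, we build the projection N_I from an arbitrary orthogonal Q, form the natural gradient from a stand-in ordinary gradient, and confirm that W^T W is singular while W^T W + N_I is positive definite.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 5
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))

# Projection matrix N_I = Q^T diag(0, I_{m-n}) Q from eq. (9)
D = np.zeros((m, m))
D[n:, n:] = np.eye(m - n)
N_I = Q.T @ D @ Q

W = rng.standard_normal((n, m))          # a point on Gl(n, m)
grad = rng.standard_normal((n, m))       # stand-in for the ordinary gradient dl/dW

nat_grad = grad @ (W.T @ W + N_I)        # natural gradient, eq. (9)

# W^T W alone has rank n < m, but W^T W + N_I is positive definite
print(np.linalg.matrix_rank(W.T @ W), np.all(np.linalg.eigvalsh(W.T @ W + N_I) > 0))
```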
IV. Learning Algorithm

Assume that p(y, W) and p_i(y_i, W) are the joint probability density function of y and the marginal pdf of y_i (i = 1, ..., n), respectively. Our goal is to make the components of y as mutually independent as possible. To this end, we employ the Kullback-Leibler divergence as a risk function [3]:

l(W) = −H(y, W) + Σ_{i=1}^{n} H(y_i, W),   (10)

where H(y, W) = −∫ p(y, W) log p(y, W) dy and H(y_i, W) = −∫ p_i(y_i, W) log p_i(y_i, W) dy_i. The divergence l(W) measures the mutual independence of the output signals y_i(k); the output signals y are mutually independent if and only if l(W) = 0. In order to develop an efficient on-line learning algorithm, we simplify (10) into the following cost function l(y, W):

l(y, W) = −log |det(W E^T)| − Σ_{i=1}^{n} log p_i(y_i(k), W),   (11)

where E is the identity element of the Lie group Gl(n, m) and det(W E^T) is the determinant of the matrix W E^T. In the following discussion we use the decomposition

W = [W1, W2] Q,   x = Q^T [x1; x2],   (12)

where W1 ∈ R^{n×n} and x1 ∈ R^n. The ordinary gradient of l(y, W) with respect to W is given by

dl(y, W)/dW1 = −W1^{−T} + φ(y) x1^T,   dl(y, W)/dW2 = φ(y) x2^T,   (13)

where φ(y) is a vector of nonlinear activation functions, φ_i(y_i) = −d log p_i(y_i)/dy_i = −p_i′(y_i)/p_i(y_i). Therefore the natural gradient learning algorithm on Gl(n, m) can be implemented as

ΔW = η [ (I − φ(y) y^T) W − φ(y) x^T N_I ],   (14)

where η > 0 is the learning rate.
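Below is a hedged sketch of one online pass of the update (14). The tanh activation and the learning rate are common choices for super-Gaussian sources and are our assumptions; the letter only specifies φ_i = −p_i′/p_i.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, T = 3, 5, 5000
eta = 0.01                                   # assumed learning rate

# Synthetic overdetermined mixture, eq. (1)
s = rng.laplace(size=(n, T))
A = rng.standard_normal((m, n))
x = A @ s + 0.01 * rng.standard_normal((m, T))

# Projection N_I built from an arbitrary orthogonal Q, as in eq. (9)
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
D = np.zeros((m, m)); D[n:, n:] = np.eye(m - n)
N_I = Q.T @ D @ Q

phi = np.tanh                                # assumed activation (letter: phi_i = -p_i'/p_i)

W = 0.1 * rng.standard_normal((n, m))
for t in range(T):
    xt = x[:, [t]]                           # one sample, shape (m, 1)
    y = W @ xt                               # demixing output, eq. (2)
    # Natural gradient update on Gl(n, m), eq. (14)
    W += eta * ((np.eye(n) - phi(y) @ y.T) @ W - phi(y) @ xt.T @ N_I)
```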
V. Optimization of the Learning Algorithm

The demixing model projects the sensor signals into R^n, and the projection depends on the matrix N_I. In this section we consider the optimization of this projection. Decompose the mixing matrix in the following form:

A = Q^T [A1; 0],   (15)

where Q is an orthogonal matrix and A1 ∈ R^{n×n} is a nonsingular matrix. The mixing model transforms the source signals into the hyperplane S = { x | x = Q^T [x1; 0], x1 ∈ R^n }. The orthogonal complement of S is denoted by S⊥. The question here is which projection matrix N_I is best for the learning algorithm in the sense of minimizing the influence of the noise. To this end, we assume that the noise is Gaussian with covariance matrix E(n n^T) = σ² I and independent of the sources. Decomposing the demixing matrix with the same orthogonal matrix Q as in (15), W = [W1, W2] Q, we have

y = W A s + W n = W1 A1 s + W n.   (16)

This means that W2 does not contribute to the main term. To minimize the effect of the noise on the output y, we introduce the following cost functional:

F(W2) = E(||W n||²) = E(n^T W^T W n).   (17)

On the other hand, we decompose the noise vector as n = Q^T v = Q^T [v1; v2]. It is easy to derive E(v v^T) = Q E(n n^T) Q^T = σ² I. Then the cost functional (17) can be rewritten as

F(W2) = E(v1^T W1^T W1 v1) + σ² tr(W2^T W2).   (18)

The minimum of the cost functional F(W2) is attained at W2 = 0. This means that the transform y = W x should be orthogonal to the normal space S⊥, that is, W z = 0 for any z ∈ S⊥. In this case our natural gradient learning algorithm (14) simplifies to the standard form

ΔW = η (I − φ(y) y^T) W,   (19)

where W is, in general, a non-square n×m demixing matrix (n ≤ m). It should be noted that the learning algorithm (19) was originally proposed for the special case where the number of sensors is exactly equal to the number of sources [3], [8]. The learning algorithm (19) clearly has the equivariance property in the sense of the Lie multiplication. It has been proved that the natural gradient improves the learning efficiency in blind separation [1]. Here we have presented a rigorous geometric interpretation of why the algorithm (19) works efficiently for overdetermined mixtures. Using the theory of information geometry, we can also analyze the effect of noise on the performance of the learning algorithm; due to limited space, this analysis is left for a full paper.
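For illustration, the following sketch runs the simplified rule (19) with a non-square demixing matrix on a synthetic overdetermined mixture and checks, via correlations, that each output aligns with one source. The activation, learning rate, and source model are our assumptions rather than prescriptions from the letter.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, T = 3, 5, 20000
eta = 0.01                                   # assumed learning rate

s = rng.laplace(size=(n, T))                 # independent super-Gaussian sources
A = rng.standard_normal((m, n))
x = A @ s + 0.01 * rng.standard_normal((m, T))

phi = np.tanh                                # assumed activation for super-Gaussian sources
W = 0.1 * rng.standard_normal((n, m))        # non-square n x m demixing matrix

for t in range(T):
    y = W @ x[:, [t]]
    W += eta * (np.eye(n) - phi(y) @ y.T) @ W    # simplified natural gradient rule, eq. (19)

# Each recovered component should correlate strongly with exactly one source
y = W @ x
C = np.corrcoef(np.vstack([y, s]))[:n, n:]
print(np.round(np.abs(C), 2))
```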
VI. Conclusion
In this letter we have discussed some geometrical structures on the manifold of non-square demixing matrices and derived the natural gradient on this manifold. Using the natural gradient, we have presented a novel learning algorithm for blind separation of overdetermined mixtures. The learning algorithm works efficiently in blind separation. The detailed derivation of the natural gradient algorithm and computer simulations will be given in a full paper.

References

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.
[2] S. Amari and A. Cichocki. Adaptive blind signal processing - neural network approaches. Proceedings of the IEEE, 86(10):2026-2048, 1998.
[3] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 8 (NIPS*95), pages 757-763, Cambridge, MA, 1996. The MIT Press.
[4] A. J. Bell and T. J. Sejnowski. An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.
[5] J.-F. Cardoso and S. Amari. Maximum likelihood source separation: Equivariance and adaptivity. In Proceedings of SYSID'97, pages 1063-1068, 1997.
[6] J.-F. Cardoso. Blind signal separation: Statistical principles. Proceedings of the IEEE, 86(10):2009-2025, 1998.
[7] J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation. IEEE Trans. Signal Processing, 44(12):3017-3030, Dec. 1996.
[8] A. Cichocki and R. Unbehauen. Robust neural networks with on-line learning for blind identification and blind separation of sources. IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications, 43(11):894-906, 1996.
[9] S. Douglas, A. Cichocki, and S. Amari. Multichannel blind separation and deconvolution of sources with arbitrary distributions. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (NNSP'97), pages 436-445, Florida, USA, September 1997.
[10] T.-W. Lee, M. S. Lewicki, and M. Girolami. Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 6(4):87-90, April 1999.
[11] L. Zhang, S. Amari, and A. Cichocki. Natural gradient approach to blind separation of over- and undercomplete mixtures. In Proceedings of Independent Component Analysis and Signal Separation (ICA'99), pages 455-460, Aussois, France, 1999.