In ICANN'96, Lecture Notes in Computer Science, Vol. 1112, pp. 287-292, Springer-Verlag, 1996.
Two Gradient Descent Algorithms for Blind Signal Separation

H. H. Yang and S. Amari
Lab. for Information Representation, FRP, RIKEN
Hirosawa 2-1, Wako-shi, Saitama 351-01, JAPAN
E-mail: [email protected]

Abstract. Two algorithms are derived based on the natural gradient of the mutual information of the linearly transformed mixtures. These algorithms can be easily implemented on a neural-network-like system. Two performance functions are introduced based on two approximation methods for evaluating the mutual information. These functions depend only on the outputs and the de-mixing matrix, which makes them very useful for comparing the performance of different blind separation algorithms. The performance of the new algorithms is compared to that of some well-known algorithms by using these performance functions. The new algorithms generally perform better because they minimize the mutual information directly, which is verified by the simulation results.
1 Introduction
The mutual information (MI) is one of the best contrast functions for designing blind separation algorithms because it has several invariance properties from the information-geometrical point of view [1]. However, it is generally very difficult to obtain the exact functional form of the MI since the probability density functions (pdfs) of the outputs are unknown. The Edgeworth expansion and the Gram-Charlier expansion were applied in [7] and [3], respectively, to approximate the pdfs of the outputs in order to estimate the MI. The estimated MI is useful not only for deriving blind separation algorithms, but also for evaluating their performance. In the literature on blind separation, the performance of an algorithm is often measured by the cross-talk, defined as the sum of all non-dominant elements in the product of the mixing matrix and the de-mixing matrix. However, the cross-talk cannot be evaluated in practice since the mixing matrix is unknown. We therefore recommend the MI for measuring the performance of different blind separation algorithms. The algorithms in [3] and in this paper are derived by applying the natural gradient descent method to minimize the estimated MI. The information-maximization approach [4] is another way to find the independent components.
The output is first transformed by a non-linear function, and then a gradient descent algorithm is applied to maximize the entropy of the transformed output. Although this approach yields a concise algorithm for blind source separation, it does not directly minimize the dependence among the outputs except when the non-linear transforms happen to be the cumulative distribution functions of the unknown sources [5]. In this paper, we derive two blind separation algorithms based on two functions that approximate the MI. The performance of the two algorithms is compared to that of other existing algorithms, including the info-max algorithm in [4].
2 Approximations of the MI
Let $s(t) = [s_1(t), \ldots, s_n(t)]^T$ be $n$ unknown independent sources. Each source is stationary, has zero mean, and has moments of every order. The model for the mixtures $x(t) = [x_1(t), \ldots, x_n(t)]^T$ is

$$x(t) = A s(t)$$

where $A \in R^{n \times n}$ is an unknown non-singular mixing matrix. To recover the original signals from the observations $x(t)$ we apply the linear transform

$$y(t) = W x(t)$$

where $y(t) = [y^1(t), \ldots, y^n(t)]^T$ and $W = [w_{ka}] \in R^{n \times n}$ is a de-mixing matrix driven by a learning algorithm which has access to neither the sources nor the mixing matrix.
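For concreteness, a minimal numerical sketch of the model (the source waveforms, dimensions, and random seed below are hypothetical and used only for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 10000                          # number of sources and samples (hypothetical)
s = rng.uniform(-1.0, 1.0, size=(n, T))  # stand-in for the unknown independent sources
A = rng.uniform(-1.0, 1.0, size=(n, n))  # unknown non-singular mixing matrix
x = A @ s                                # observed mixtures x(t) = A s(t)
W = np.eye(n)                            # de-mixing matrix, adapted by a learning rule
y = W @ x                                # outputs y(t) = W x(t)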
The basic idea of the ICA [7] is to minimize the MI

$$I(y|W) = \int p(y) \log \frac{p(y)}{\prod_{a=1}^{n} p_a(y^a)}\, dy = -H(y|W) + \sum_{a=1}^{n} H(y^a|W)$$

where $p_a(y^a)$ is the marginal pdf, $H(y|W) = -\int p(y) \log p(y)\, dy$ is the joint entropy, and $H(y^a|W) = -\int p_a(y^a) \log p_a(y^a)\, dy^a$ is the marginal entropy.
It is easy to compute the joint entropy $H(y|W) = H(x) + \log|\det(W)|$, but the computation of the marginal entropy $H(y^a|W)$ is difficult. Applying the Gram-Charlier expansion to approximate the marginal pdf $p_a(y^a)$, we find the approximation of the marginal entropy in [3]:

$$H(y^a) \approx \frac{1}{2}\log(2\pi e) - \frac{(\kappa_3^a)^2}{2 \cdot 3!} - \frac{(\kappa_4^a)^2}{2 \cdot 4!} + \frac{3}{8}(\kappa_3^a)^2 \kappa_4^a + \frac{1}{16}(\kappa_4^a)^3 \qquad (1)$$
where $\kappa_3^a = m_3^a$, $\kappa_4^a = m_4^a - 3$, and $m_k^a = E[(y^a)^k]$. Let $F_1(\kappa_3, \kappa_4)$ denote the right-hand side of the approximation (1); then

$$I(y|W) \approx -H(x) - \log|\det(W)| + \sum_{a=1}^{n} F_1(\kappa_3^a, \kappa_4^a). \qquad (2)$$
The Edgeworth expansion is used in [7] to approximate the pdf $p_a(y^a)$. The approximation of the marginal entropy is found to be

$$H(y^a) \approx \frac{1}{2}\log(2\pi e) - \frac{1}{2 \cdot 3!}(\kappa_3^a)^2 - \frac{1}{2 \cdot 4!}(\kappa_4^a)^2 - \frac{7}{48}(\kappa_3^a)^4 + \frac{1}{8}(\kappa_3^a)^2 \kappa_4^a. \qquad (3)$$

But the cube of the 4th-order cumulant, which is crucial for separation, is not included in this approximation. We shall use the following approximation instead of (3) to derive the learning algorithm:

$$H(y^a|W) \approx \frac{1}{2}\log(2\pi e) - \frac{1}{2 \cdot 3!}(\kappa_3^a)^2 - \frac{1}{2 \cdot 4!}(\kappa_4^a)^2 + \frac{1}{8}(\kappa_3^a)^2 \kappa_4^a + \frac{1}{48}(\kappa_4^a)^3. \qquad (4)$$
From this formula, we obtain another approximation for the MI:

$$I(y|W) \approx -H(x) - \log|\det(W)| + \sum_{a=1}^{n} F_2(\kappa_3^a, \kappa_4^a) \qquad (5)$$

where $F_2(\kappa_3, \kappa_4)$ is defined by the right-hand side of (4). Our blind separation algorithms are derived by minimizing the MI based on (2) and (5).
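For reference, the approximations (1), (2), (4) and (5) can be evaluated numerically from sample cumulants. A minimal sketch (the function names are our own; it assumes, as the expansions do, that each output has approximately zero mean and unit variance):

import numpy as np

def cumulants(y):
    # y has shape (n, T); per-output sample estimates of kappa_3 and kappa_4,
    # assuming zero-mean, unit-variance outputs
    k3 = np.mean(y**3, axis=1)
    k4 = np.mean(y**4, axis=1) - 3.0
    return k3, k4

def F1(k3, k4):
    # right-hand side of (1): Gram-Charlier approximation of the marginal entropy
    return (0.5*np.log(2*np.pi*np.e) - k3**2/12.0 - k4**2/48.0
            + 3.0/8.0*k3**2*k4 + k4**3/16.0)

def F2(k3, k4):
    # right-hand side of (4): modified Edgeworth-based approximation
    return (0.5*np.log(2*np.pi*np.e) - k3**2/12.0 - k4**2/48.0
            + k3**2*k4/8.0 + k4**3/48.0)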
3 Algorithms
To minimize $I(y|W)$, we use the following natural gradient descent algorithm:

$$\frac{dW}{dt} = -\eta(t) \frac{\partial I(y|W)}{\partial W} W^T W \qquad (6)$$

where $\eta(t)$ is a learning rate function. From (2) and (6), we can derive the first algorithm:
$$\frac{dW}{dt} = \eta(t)\{I - (f_1(\kappa_3, \kappa_4) \circ y^2) y^T - (g_1(\kappa_3, \kappa_4) \circ y^3) y^T\} W \qquad (A1)$$

where $\circ$ denotes the Hadamard product of two matrices, $C \circ D = [c_{ij} d_{ij}]$, $I$ is the identity matrix, $y^k = [(y^1)^k, \ldots, (y^n)^k]^T$ for $k = 2, 3$, $f_1(\kappa_3, \kappa_4) = [f_1(\kappa_3^1, \kappa_4^1), \ldots, f_1(\kappa_3^n, \kappa_4^n)]^T$ with $f_1(y, z) = -\frac{1}{2}y + \frac{9}{4}yz$, and $g_1(\kappa_3, \kappa_4) = [g_1(\kappa_3^1, \kappa_4^1), \ldots, g_1(\kappa_3^n, \kappa_4^n)]^T$ with $g_1(y, z) = -\frac{1}{6}z + \frac{3}{2}y^2 + \frac{3}{4}z^2$.
In (A1), the cumulants $\kappa_3^a$ and $\kappa_4^a$ are driven by the following equations:

$$\frac{d\kappa_3^a}{dt} = -\mu(t)(\kappa_3^a - (y^a)^3), \qquad \frac{d\kappa_4^a}{dt} = -\mu(t)(\kappa_4^a - (y^a)^4 + 3) \qquad (7)$$

where $\mu(t)$ is a learning rate function.
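A discrete-time sketch of one step of (A1) together with the cumulant tracking (7); the Euler discretization and the constant step sizes eta and mu are our own simplifications:

import numpy as np

def f1(k3, k4):
    return -0.5*k3 + 9.0/4.0*k3*k4

def g1(k3, k4):
    return -k4/6.0 + 1.5*k3**2 + 0.75*k4**2

def step_A1(W, x, k3, k4, eta=0.01, mu=0.01):
    # one Euler step of (A1) and of (7) for a single observation vector x of shape (n,)
    y = W @ x
    n = y.size
    dW = (np.eye(n)
          - np.outer(f1(k3, k4)*y**2, y)
          - np.outer(g1(k3, k4)*y**3, y)) @ W
    W = W + eta*dW
    k3 = k3 - mu*(k3 - y**3)          # tracks kappa_3, eq. (7)
    k4 = k4 - mu*(k4 - y**4 + 3.0)    # tracks kappa_4, eq. (7)
    return W, k3, k4, y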
Using (5) instead of (2), we have the second algorithm:

$$\frac{dW}{dt} = \eta(t)\{I - (f_2(\kappa_3, \kappa_4) \circ y^2) y^T - (g_2(\kappa_3, \kappa_4) \circ y^3) y^T\} W \qquad (A2)$$

where $f_2(\cdot, \cdot)$ and $g_2(\cdot, \cdot)$ are vector functions defined from $f_2(y, z) = -\frac{1}{2}y + \frac{3}{4}yz$ and $g_2(y, z) = -\frac{1}{6}z + \frac{1}{2}y^2 + \frac{1}{4}z^2$. The equation (7) is used again to trace the cumulants.
Both algorithms (A1) and (A2) can be rewritten as

$$\frac{dW}{dt} = \eta(t)\{I - h(y) y^T\} W \qquad (A)$$

where $h(y) = [h_1(y^1), \ldots, h_n(y^n)]^T$ and $h_a(y^a) = f_i(\kappa_3^a, \kappa_4^a)(y^a)^2 + g_i(\kappa_3^a, \kappa_4^a)(y^a)^3$ for $i = 1$ or $2$.
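In this form, (A1) and (A2) differ only in the pair $(f_i, g_i)$ that builds the data-dependent activation $h$. A sketch of the unified update, with f2 and g2 spelled out and f1, g1 as in the previous fragment:

import numpy as np

def f2(k3, k4):
    return -0.5*k3 + 0.75*k3*k4

def g2(k3, k4):
    return -k4/6.0 + 0.5*k3**2 + 0.25*k4**2

def step_A(W, x, k3, k4, f, g, eta=0.01):
    # one Euler step of (A): dW/dt = eta(t) {I - h(y) y^T} W,
    # with h_a = f(k3_a, k4_a) (y_a)^2 + g(k3_a, k4_a) (y_a)^3
    y = W @ x
    h = f(k3, k4)*y**2 + g(k3, k4)*y**3
    W = W + eta*(np.eye(y.size) - np.outer(h, y)) @ W
    return W, y

# (A1): step_A(W, x, k3, k4, f1, g1);  (A2): step_A(W, x, k3, k4, f2, g2)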
The info-max algorithm in [4] has the following form:

$$\frac{dW}{dt} = \eta(t)\{(W^{-1})^T - f(y) x^T\} \qquad (8)$$
which is different from the algorithm (A). But if the natural gradient of the entropy is used, the algorithm (8) becomes

$$\frac{dW}{dt} = \eta(t)\{I - f(y) y^T\} W \qquad (B)$$
where the activation function $f(y)$ is determined by the sigmoid function used in transforming the outputs. For example, if the transform function is $\tanh(x)$, the activation function is $2\tanh(x)$. Note that the algorithm (B) also works well for some other activation functions, such as those proposed in [2, 4, 6, 8]. Both algorithms (A) and (B) have the same "equivariant" property as the algorithms in [6]. A significant difference between (A) and (B) is that the activation function employed in the former is data dependent, while the activation functions employed in the latter are data independent. The data dependence of the activation function makes the algorithm (A) more adaptive; the tracking of $\kappa_3^a$ and $\kappa_4^a$ by the equation (7) is part of the algorithm (A).
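For comparison, a sketch of one step of (B) with the fixed activation $f(y) = 2\tanh(y)$; replacing the activation with y**3 gives the cubic variant referred to as (B1) in the simulations. The step size and names are ours:

import numpy as np

def step_B(W, x, eta=0.01, f=lambda y: 2.0*np.tanh(y)):
    # one Euler step of (B): dW/dt = eta(t) {I - f(y) y^T} W, with a data-independent f
    y = W @ x
    W = W + eta*(np.eye(y.size) - np.outer(f(y), y)) @ W
    return W, y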
To compare the performance of the algorithms (A) and (B), we define

$$I_p(y|W) = -\log|\det(W)| + \sum_{a=1}^{n} F_1(\kappa_3^a, \kappa_4^a).$$
It is the part of the MI that changes as $W$ is updated. By using this function, we compare the performance of the algorithm (A) to that of the algorithm (B).
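A sketch of $I_p$ as a practical monitor, reusing cumulants and F1 from the earlier fragment; the batch-based cumulant estimate is our simplification of the on-line tracking:

import numpy as np

def I_p(W, y):
    # I_p(y|W) = -log|det W| + sum_a F1(kappa_3^a, kappa_4^a),
    # with the cumulants estimated from the current outputs y of shape (n, T)
    k3, k4 = cumulants(y)
    return -np.log(abs(np.linalg.det(W))) + np.sum(F1(k3, k4))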
4 Simulation
In the rest of this paper, we use (B1) and (B2) to refer to the two versions of the algorithm (B) obtained when the activation functions $x^3$ and $2\tanh(x)$ are used, respectively. We choose four modulated signals and one uniformly distributed noise source as the unknown sources in the simulation. The five sources are mixed by a $5 \times 5$ mixing matrix $A$ whose elements are randomly chosen in $[-1, +1]$. The learning rate decreases exponentially to zero, and the same learning rate schedule is chosen for all algorithms in every simulation.
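A sketch of this simulation under the stated settings; the concrete source waveforms, sample count and rate constants are not specified in the paper, so the values below are purely illustrative, and step_A, f1, g1, I_p refer to the earlier fragments:

import numpy as np

rng = np.random.default_rng(1)
n, T = 5, 20000
t = np.arange(T)/float(T)
s = np.vstack([np.sin(2*np.pi*80*t)*np.sin(2*np.pi*6*t),         # AM tone (hypothetical)
               np.sin(2*np.pi*50*t + 4*np.sin(2*np.pi*4*t)),      # FM tone (hypothetical)
               np.sign(np.sin(2*np.pi*30*t))*np.sin(2*np.pi*3*t), # modulated square wave
               np.cos(2*np.pi*120*t)*np.cos(2*np.pi*10*t),        # AM tone (hypothetical)
               rng.uniform(-1.0, 1.0, T)])                        # uniform noise
A = rng.uniform(-1.0, 1.0, (n, n))                                # random mixing matrix in [-1, +1]
x = A @ s

W = np.eye(n)
k3, k4 = np.zeros(n), np.zeros(n)
eta0, mu, decay = 0.05, 0.05, 5.0
for k in range(T):
    eta = eta0*np.exp(-decay*k/T)                     # exponentially decreasing learning rate
    W, y = step_A(W, x[:, k], k3, k4, f1, g1, eta)    # algorithm (A1); pass f2, g2 for (A2)
    k3 = k3 - mu*(k3 - y**3)                          # cumulant tracking, eq. (7)
    k4 = k4 - mu*(k4 - y**4 + 3.0)
print(I_p(W, W @ x))                                  # performance function after training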
The simulation results are shown in Figure 1 and Figure 2. In each sub-figure of Figure 1, the curves for four outputs are shifted upwards from the zero level for better illustration, and all the outputs shown there are within the same time window [0.1, 0.2]. In Figure 2, the whole history of $I_p$ when running (A1), (A2), (B1) and (B2) is displayed to compare their performance. From the simulation results shown in these figures, we have the following observations:

1. The algorithm (A) is faster in speed and better in quality than the algorithm (B) in separating the mixed sources.
2. The performance of blind separation algorithms can be measured objectively by the function $I_p(y|W)$ by tracking the moments of the outputs.
Fig. 1. The comparison of the separation by the algorithms (A1), (A2), (B1) and (B2). (Four panels, titled "By algorithm (A1)", "By algorithm (A2)", "By algorithm (B1)" and "By algorithm (B2)"; horizontal axis: t, from 0.1 to 0.2.)
5 Conclusion
In this paper, the performance function $I_p(y|W)$ is introduced to evaluate the separation quality of blind separation algorithms. The algorithm (A) is derived based on the natural gradient descent method to minimize the MI in the most efficient way. It is an on-line learning algorithm with the "equivariant" property. The activation functions used in this algorithm are data dependent, and this data dependence makes the algorithm more adaptive. The simulation shows that the algorithm (A) achieves separation faster and with better quality than the algorithm (B) with the two most popular activation functions, $x^3$ and $2\tanh(x)$.
Fig. 2. The history of $I_p$ in running the algorithms (A1), (A2), (B1) and (B2). (Curves labelled "Ip of (A1)", "Ip of (A2)", "Ip of (B1)" and "Ip of (B2)"; horizontal axis: t, from 0 to 0.2.)
References

1. S. Amari. Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics, vol. 28. Springer, 1985.
2. S. Amari, A. Cichocki, and H. H. Yang. Recurrent neural networks for blind separation of sources. In Proceedings 1995 International Symposium on Nonlinear Theory and Applications, volume I, pages 37-42, December 1995.
3. S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems 8, eds. David S. Touretzky, Michael C. Mozer and Michael E. Hasselmo, MIT Press: Cambridge, MA (to appear), 1996.
4. A. J. Bell and T. J. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.
5. A. J. Bell and T. J. Sejnowski. Fast blind separation based on information theory. In Proceedings 1995 International Symposium on Nonlinear Theory and Applications, volume I, pages 43-47, December 1995.
6. J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation. To appear in IEEE Trans. on Signal Processing, 1996.
7. P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287-314, 1994.
8. C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991.