Neural Estimation of Basis Vectors in Independent Component Analysis

Juha Karhunen, Liuyue Wang, and Jyrki Joutsensalo
Helsinki University of Technology, Laboratory of Computer and Information Science,
Rakentajanaukio 2 C, FIN-02150 Espoo, Finland.
email: Juha.Karhunen@hut.  Fax: +358-0-4513277

Abstract:

Independent Component Analysis (ICA) is a recently developed, useful extension of standard Principal Component Analysis (PCA). The associated linear model is used mainly in source separation, where only the coefficients of the ICA expansion are of interest. In this paper, we propose a neural structure related to nonlinear PCA networks for estimating the basis vectors of ICA. This ICA network consists of whitening, separation, and estimation layers, and yields good results in test examples. We also modify our previous nonlinear PCA algorithms so that their separation capabilities are greatly improved.

1. Introduction

Currently, there is a growing interest among neural network researchers in unsupervised learning beyond PCA, often called nonlinear PCA. Such methods take into account higher-order statistics, and are often more competitive than standard PCA when realized neurally [9, 10]. Nonlinear PCA type methods can be developed from various starting points, usually leading to mutually different solutions. We have previously derived various robust and nonlinear extensions of PCA, starting either from maximization of the output variances or from minimization of the mean-square representation error [9, 10].

Independent Component Analysis (ICA or INCA) is a useful extension of PCA that was developed some years ago in the context of source or signal separation applications [8, 5]. Its starting point is the uncorrelatedness property of PCA: instead of requiring that the transformed coefficients, or the coefficients of a linear expansion of the data vectors, be merely uncorrelated, ICA requires that they be mutually statistically independent (or as independent as possible). A precise definition of ICA and many results are given in a recent fundamental paper [5].

The few neural papers related to ICA deal with separation of sources from their linear mixtures. Most of them utilize the seminal HJ algorithm [8, 3] introduced by Herault and Jutten. In this paper, we concentrate on the related problem of estimating the basis vectors of ICA. They are the counterparts of the PCA eigenvectors, but in many cases characterize the data better. ICA basis vectors should be useful for example in exploratory projection pursuit [6], where one tries to project the full-dimensional data onto directions that reveal as much as possible about the structure of the data. Directions on which the distribution of the projected data is furthest from Gaussian are often regarded as the most revealing ones. In a sense, ICA basis vectors provide such directions.
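To make the difference between uncorrelatedness and independence concrete, the short numerical sketch below (ours, not part of the paper; Python with NumPy is assumed) constructs two variables that are uncorrelated yet clearly dependent. This is exactly the kind of structure that PCA ignores and ICA exploits.

```python
import numpy as np

rng = np.random.default_rng(0)

# A radius-dependent pair: x = r*cos(phi), y = r*sin(phi) with random phase.
# The coordinates are uncorrelated (zero covariance), but they are dependent,
# because x**2 + y**2 = r**2 ties them together through the common radius r.
n = 100_000
r = rng.uniform(0.5, 1.5, size=n)
phi = rng.uniform(0.0, 2.0 * np.pi, size=n)
x = r * np.cos(phi)
y = r * np.sin(phi)

print("correlation:", np.corrcoef(x, y)[0, 1])                  # close to 0
print("E{x^2 y^2} - E{x^2}E{y^2}:",
      np.mean(x**2 * y**2) - np.mean(x**2) * np.mean(y**2))     # clearly nonzero
```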

2. ICA and source separation

In current ICA methods, the following data model is usually assumed. The L-dimensional data vector x_k at time k is of the form

    x_k = A s_k + n_k = Σ_{i=1}^{M} s_k(i) a(i) + n_k,    (1)

where s_k = [s_k(1), ..., s_k(M)]^T is an M-vector whose component s_k(i) denotes the ith independent component (source signal) at time k, A = [a(1), ..., a(M)] is an L × M 'mixing matrix' whose columns a(i) are the basis vectors of ICA, and n_k denotes noise. The following assumptions are typically made [2, 12]:

1. A is a constant matrix with full column rank. Thus the number of sources M ≤ L. Usually M is assumed to be known.
2. The coefficients (source signals) s_k(i) must be mutually statistically independent at each time instant k, or as independent as possible.
3. Each source signal sequence {s_k(i)} is a stationary zero-mean stochastic process. At most one source signal is allowed to be Gaussian.

Especially the second assumption is strong, but on the other hand very little information is required on the ICA basis matrix A. The noise term n_k is often omitted from (1), because it is impossible to distinguish it from the source signals.

Adaptive source separation [2, 12] consists of updating an M × L separating matrix B_k so that the M-vector

    y_k = B_k x_k    (2)

is an estimate y_k = ŝ_k of the original independent source signals. Under the assumptions made above, the estimate ŝ_k(i) of the ith source signal may appear in any component y_k(j) of y_k. It is also impossible to determine the amplitudes of the source signals s_k(i) from the model (1) without additional assumptions. In [2, 12], the authors assume that each source signal s_k(i) has unit variance.

Several source separation algorithms utilize the fact that if the data vectors x_k are first preprocessed by whitening (sphering) them, the separating matrix B_k in (2) becomes orthogonal: B_k B_k^T = I_M, where I_M denotes the M × M unit matrix. Whitening can be done in many ways (see [11]); PCA is often used, because one can then simultaneously compress the information optimally.

Assume now that the matrix B_k has converged to a separating solution B. Knowing B, we can estimate the basis vectors a(i), i = 1, ..., M, of ICA by using the theory of pseudoinverses. In the general case where B is not orthogonal, the minimum-norm solution of (2) is

    x̂_k = Â y_k = B^T (B B^T)^{-1} y_k = Σ_{j=1}^{M} ŝ_k(j) â(j).    (3)

Here â(j) denotes the jth column of the L × M matrix Â = B^T (B B^T)^{-1}. Comparing this with the ICA expansion (1), we see that the vectors â(j) are the desired estimates of the basis vectors of ICA. However, they may appear in any order, like the estimated source signals ŝ_k(j) in the vector y_k, and their norms may vary. A relevant way to fix the expansions (1) and (3) more uniquely is to normalize the basis vectors a(j) or â(j), and to order the terms, in a similar way as in standard PCA, according to decreasing powers E{s_k(i)^2} or E{ŝ_k(j)^2}.
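As an illustration of this post-processing step (ours, not from the paper; Python with NumPy is assumed), the sketch below forms Â = B^T(BB^T)^{-1} from a converged separating matrix B, orders the terms of the expansion by their estimated power, and normalizes the basis vectors. The ordering criterion and scaling convention are our own illustrative choices.

```python
import numpy as np

def estimate_ica_basis(B, X):
    """Estimate ICA basis vectors from a converged M x L separating matrix B.

    B : (M, L) separating matrix, X : (N, L) data vectors as rows.
    Returns (A_hat, Y) with the columns of A_hat normalized to unit length and
    ordered by the decreasing power carried by each term of the expansion (3).
    """
    Y = X @ B.T                                   # estimated sources, eq. (2)
    A_hat = B.T @ np.linalg.inv(B @ B.T)          # minimum-norm solution, eq. (3)
    power = np.mean(Y**2, axis=0) * np.sum(A_hat**2, axis=0)   # E{s(j)^2} ||a(j)||^2
    order = np.argsort(power)[::-1]
    A_hat, Y = A_hat[:, order], Y[:, order]
    norms = np.linalg.norm(A_hat, axis=0)
    return A_hat / norms, Y * norms               # rescaling keeps each product s(j) a(j) unchanged
```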

3. Neural structures for ICA

We can now relate the above model to PCA and PCA-type neural networks as follows. Let W_k = [w_k(1), ..., w_k(M)] denote the L × M weight matrix of a PCA-type neural network [9, 3, 14] consisting of M neurons, where w_k(i) is the weight vector of the ith neuron at the kth learning step. Then the outputs y_k(i) = w_k(i)^T x_k of the neurons are the elements of the output vector

    y_k = W_k^T x_k.    (4)

Clearly, W_k^T in (4) corresponds to B_k in (2). The optimal weight vectors w(i) become the PCA eigenvectors c(i), i = 1, ..., M, under the following constraints [10]: uncorrelatedness of the outputs, E{y_k(i) y_k(j)} = 0 for i ≠ j; orthonormality of the weight vectors, W_k^T W_k = I; and sequential maximization of the output powers E{y(i)^2} for i = 1, ..., M. In ICA, the last two constraints are replaced by the requirement that the output signals y_k(i) must be mutually independent (or as independent as possible). Except for Gaussian outputs, this fixes the directions of the final weight vectors w(i) uniquely [5].

In Section 4, we present some neural algorithms for estimating the separating matrix B. Naturally, one can then compute the basis vectors of ICA from (3), but this requires non-neural postprocessing because of the matrix inversion. A completely neural method for estimating the basis vectors of ICA can be developed by modifying the standard 3-layer linear feedforward network shown in Fig. 1. This network consists of input and output layers with L neurons each, and of a middle hidden layer with M neurons. Let R denote the M × L weight matrix between the input and hidden layers, and Q the L × M weight matrix between the hidden and output layers. Consider now encoding and decoding of the data vectors x by using the network of Fig. 1 in autoassociative mode. If M < L, data compression takes place in the hidden layer, and the output x̂ = QRx of the network is generally only an approximation of the input vector x.

Figure 1: (Left) The standard linear feedforward network. When used as an ICA network, the outputs of the hidden layer must be mutually independent.
Figure 2: (Right) The proposed ICA network, consisting of whitening, separation, and basis vector estimation layers.

Typically, the network is trained by minimizing the mean-square approximation error J_1(Q, R) = E{||x - QRx||^2}, using for example the backpropagation algorithm. It is well known [1] that the optimal solution is given by any matrix of the form R = (Q^T Q)^{-1} Q^T, where the columns of Q constitute some arbitrary linearly independent basis of the M-dimensional PCA subspace of the input vectors x. This PCA subspace is defined by the M principal eigenvectors of the data covariance matrix E{x x^T}. Assume now that n_k in (1) is a zero-mean white noise vector with covariance matrix E{n_k n_k^T} = σ² I_L, where σ² is the common variance of the components of n_k, and that n_k is uncorrelated with the sources s_k(i). Under these assumptions, the covariance matrix of the data vectors (1) is [18]

    E{x_k x_k^T} = Σ_{i=1}^{M} E{s_k(i)^2} a(i) a(i)^T + σ² I_L.    (5)

Furthermore, it is easy to see [18] that the basis vectors a(i), i = 1, ..., M, of ICA theoretically lie in the M-dimensional PCA subspace of (5). The central idea behind our ICA network is to utilize the freedom in choosing the matrix Q (and R). This is done by imposing the additional constraint that the components of the vector y = Rx must be as independent as possible. This forces the network of Fig. 1 to converge, in theory, to a minimizing solution in which the columns of Q are the desired basis vectors a(i) of ICA. In fact, we need not use the structure R = (Q^T Q)^{-1} Q^T, because the independence constraint on y and minimization of the MSE error E{||x - Qy||^2} are sufficient for estimating the basis vectors of ICA. Note that if Q = A and y_k = s_k, minimizing E{||x_k - Q y_k||^2} is equivalent to minimizing E{||n_k||^2} in the ICA model (1).

Thus we can seek R in the form R = W^T V, where V is any M × L matrix that whitens the data vectors x_k and possibly drops their dimension from L to M, and W^T is an orthonormal M × M separating matrix. Figure 2 shows the resulting ICA network structure. As usual, feedback connections (not shown) are needed in the learning phase, but after learning the network becomes purely feedforward. Even though the ICA networks of Figs. 1 and 2 are linear after learning, nonlinearities must be used in learning the matrix B or W^T. They introduce higher-order statistics into the computations, which is necessary for estimating the ICA expansion. The structure of Fig. 2 is used with our robust or nonlinear PCA learning algorithms, which require whitening of the input data to yield good separation results. In the PFS/EASI algorithm (see the next section), the separating matrix B = R performs all the tasks of whitening, dropping the dimension, and separating the sources. In this case, one can use the simpler original network of Fig. 1, but on the other hand the learning algorithm is more complicated.
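The data flow through the network of Fig. 2 after learning can be summarized in a few lines. The sketch below (ours, not from the paper; Python with NumPy is assumed, and the matrices V, W, Q are taken as already learned) simply spells out the whitening, separation, and estimation layers.

```python
import numpy as np

def ica_network_forward(x, V, W, Q):
    """Forward pass of the ICA network of Fig. 2 after learning.

    x : (L,) input vector
    V : (M, L) whitening (and possibly compressing) matrix
    W : (M, M) orthonormal separating matrix, so that y = W^T v
    Q : (L, M) basis-vector estimation layer; its columns approximate the a(i)

    Returns the whitened vector v, the separated outputs y (source estimates),
    and the reconstruction x_hat = Q y of the input.
    """
    v = V @ x          # whitening layer
    y = W.T @ v        # separation layer: outputs as independent as possible
    x_hat = Q @ y      # estimation layer: minimizes E{||x - Q y||^2} after learning
    return v, y, x_hat
```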

In both ICA network structures, the matrix Q can be learned by minimizing the MSE error E{||x_k - Q y_k||^2}. Omitting the expectation, the gradient of ||x - Qy||^2 with respect to Q is -2(x - Qy)y^T, which yields the stochastic gradient algorithm

    Q_{k+1} = Q_k + μ_k (x_k - Q_k y_k) y_k^T    (6)

for learning Q. Here μ_k > 0 is a small learning parameter. We use this algorithm for estimating the basis vectors of ICA in connection with various separating algorithms.
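A single step of rule (6) might look as follows in code (an illustrative sketch, not from the paper; Python with NumPy is assumed, and the learning parameter value is arbitrary).

```python
import numpy as np

def update_Q(Q, x, y, mu=0.01):
    """One stochastic gradient step of rule (6): Q <- Q + mu (x - Q y) y^T.

    Q  : (L, M) current estimate of the ICA basis matrix
    x  : (L,) data vector, y : (M,) output of the separating part for x
    mu : small positive learning parameter (illustrative value)
    """
    e = x - Q @ y                  # reconstruction error of the estimation layer
    return Q + mu * np.outer(e, y)
```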

4. Neural separating and whitening algorithms

Whether the goal is separation of the independent sources (components) alone, estimation of the basis vectors of ICA, or both, we always need a separating algorithm. In the following, we list some relevant neural possibilities. Such algorithms are discussed in more detail in [11].

1. The HJ algorithm [8, 3]. Here, the separating matrix B in Fig. 1 is sought in the form B = (I + S)^{-1} (or B = I - S), and the off-diagonal elements of S are updated using the rule Δs_ij = μ_k g_1(y_k(i)) g_2(y_k(j)), where g_1(t) and g_2(t) are two different odd functions, and μ_k > 0. This seminal algorithm is simple and local, but often fails in separating more than two sources.

2. The EASI (or PFS) algorithm [2, 12]. The update formula for the separating matrix B = R in the network of Fig. 1 is

    B_{k+1} = B_k - μ_k [y_k y_k^T - I - y_k g(y_k)^T + g(y_k) y_k^T] B_k.    (7)

Here g(y) is a column vector whose ith component is g(y(i)). If g(t) grows faster than linearly, the algorithm (7) must be stabilized in practice; see [2, 11, 12].

3. The bigradient algorithm [19]. The learning algorithm for the orthogonal matrix W in Fig. 2 is

    W_{k+1} = W_k + μ_k v_k g(y_k)^T + γ_k W_k (I - W_k^T W_k),    (8)

where the outputs y_k = W_k^T v_k, and the inputs v_k = V_k x_k are obtained by whitening and possibly compressing the original data vectors x_k. Here γ_k is another gain parameter, usually about 0.5 or 1 in practice.

4. Oja's nonlinear PCA subspace rule [9, 10, 14]. This is used quite similarly to (8), but the update formula for W in Fig. 2 is different (μ_k > 0):

    W_{k+1} = W_k + μ_k [v_k - W_k g(y_k)] g(y_k)^T.    (9)

Probably the simplest algorithm for computing the whitening matrix V_k in the network of Fig. 2 is [2, 12, 16]

    V_{k+1} = V_k - μ_k [v_k v_k^T - I] V_k.    (10)
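For concreteness, minimal sketches of one step of the whitening rule (10) and of the bigradient rule (8) are given below (ours, not from the paper; Python with NumPy is assumed, and the gain values are merely illustrative).

```python
import numpy as np

def whiten_step(V, x, mu=0.01):
    """One step of the whitening rule (10): V <- V - mu (v v^T - I) V, with v = V x.

    Returns the updated V and the whitened vector v computed with the old V.
    """
    v = V @ x
    M = V.shape[0]
    return V - mu * (np.outer(v, v) - np.eye(M)) @ V, v

def bigradient_step(W, v, mu=0.01, gamma=0.5, g=np.tanh):
    """One step of the bigradient rule (8) for the orthonormal separating matrix W.

    W : (M, M), v : (M,) whitened input; y = W^T v are the separated outputs.
    mu, gamma : gain parameters (illustrative values; the paper suggests gamma
    of about 0.5 or 1). g : odd nonlinearity, e.g. tanh for negative-kurtosis sources.
    """
    y = W.T @ v
    W = W + mu * np.outer(v, g(y)) + gamma * W @ (np.eye(W.shape[0]) - W.T @ W)
    return W, y
```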

In [16], Plumbley has proposed more sophisticated neural learning algorithms which simultaneously whiten the original data vectors and compress them into the PCA subspace. Whitening of the data often dramatically improves the performance of the algorithms (8) and (9) in source separation. The same phenomenon has also been observed in other applications of robust and nonlinear PCA, such as clustering [17] and exploratory projection pursuit [7].

The separation properties of the bigradient algorithm (8) can be justified as follows. It can be shown [13] that if the data vectors are prewhitened and all the source signals s_k(i) have a negative kurtosis,

    E{s_k(i)^4} - 3 [E{s_k(i)^2}]^2 < 0,    (11)

then it is sufficient to minimize the sum of fourth moments Σ_{i=1}^{M} E{y_k(i)^4} for achieving separation. Minimizing the sum of fourth moments corresponds to using μ_k < 0 and the cubic nonlinearity g(t) = t^3 in (8) [19]. In practice, a fast-growing learning function such as g(t) = t^3 often causes stability problems without extra normalization. This can be avoided by using the sigmoidal nonlinearity g(t) = tanh(t) with μ_k > 0 in (8), which leads to approximate minimization of the sum of fourth moments (see [7, 11, 19]).
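In practice, one can check the sign of the kurtosis (11) from the data before choosing the sign of μ_k and the nonlinearity. A small sketch of such a check (ours, not from the paper; Python with NumPy is assumed):

```python
import numpy as np

def empirical_kurtosis(s):
    """Sample version of E{s^4} - 3 [E{s^2}]^2 for a zero-mean signal, as in (11)."""
    s = np.asarray(s, dtype=float)
    return np.mean(s**4) - 3.0 * np.mean(s**2)**2

# Uniform noise has negative kurtosis, so with g(t) = tanh(t) one would use
# mu > 0 in (8); Laplacian noise has positive kurtosis, so the sign of mu
# would be reversed with the same nonlinearity.
rng = np.random.default_rng(0)
print(empirical_kurtosis(rng.uniform(-1, 1, 100_000)))   # negative (theory: 1/5 - 1/3 = -0.133)
print(empirical_kurtosis(rng.laplace(0, 1, 100_000)))    # positive (theory: 24 - 12 = +12)
```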

Figure 3: (Left) The parallelogram data, the true basis vectors of ICA (dashed lines), and the estimates of the ICA basis vectors given by the bigradient algorithm (solid lines).
Figure 4: (Right) Original source signals in Comon's example.

Correspondingly, if all the sources have a positive kurtosis, the sum of fourth moments must be maximized for prewhitened data. This can be done in the bigradient algorithm (8) simply by changing the sign of μ_k with the above choices of g(t). The same combinations of μ_k and g(t) can also be applied to the EASI algorithm (7), but the sign of μ_k must be opposite to that used in (8). Oja's nonlinear rule (9) is best suited to separating sources that have a negative kurtosis, with the choices g(t) = tanh(t) and μ_k > 0, because this algorithm is not stable if μ_k < 0. Its separation properties are analyzed mathematically in detail in [15].

5. Experimental results

We have simulated the above algorithms using the parallelogram data described in [9] and Comon's test example [4]. In the following, we present a typical simulation result for both of these test cases. Other separation experiments, also with more realistic speech and image data, are described in [11, 19, 20].

In Fig. 3, the dots are 2-dimensional data vectors x_k uniformly distributed inside the parallelogram. The dashed lines show the directions of the true ICA basis vectors a(1) and a(2). The solid lines are the estimated ICA basis vectors â(1) and â(2). They were obtained by first whitening the data using (10), and then computing the separating matrix W_k^T from the bigradient algorithm (8) using the nonlinearity g(t) = tanh(t). The estimated basis vectors are the columns of the matrix Q learned through (6). The nonlinear PCA algorithm (9) and the EASI/PFS algorithm (7) perform equally well. Estimation of the ICA basis vectors from (3) with B = W^T V yielded roughly similar results.

In Comon's example [4], the three original source signals s_k(i) shown in Fig. 4 are uniformly distributed noise, a ramp, and a sinusoid. All the sources have negative kurtosis. Fig. 5 depicts the three components of the 100 data vectors x_k used in the experiment. These were formed from the model (1), where the true normalized basis vectors of ICA were a(1) = [0.0891, -0.8909, 0.4454]^T, a(2) = [0.3906, -0.6509, 0.6509]^T, and a(3) = [-0.3408, 0.8519, -0.3976]^T. The data vectors were used several times in the learning phase to achieve sufficient convergence. After this, the weight matrices were frozen, and the data vectors were fed to the network again. Fig. 6 shows the separated outputs y_k(1), y_k(2), and y_k(3) given by the nonlinear PCA rule (9). These output signals are good estimates of the original sources in Fig. 4. The algorithms (8) and (7) also yield good separation results. Using the same procedure as for the parallelogram data together with (9) yielded the estimated ICA basis vectors â(1) = [-0.1054, 0.8917, -0.4401]^T, â(2) = [0.3918, -0.6541, 0.6470]^T, and â(3) = [0.3319, -0.8519, 0.4073]^T.

Figure 5: (Left) The three components of the data vectors x_k.
Figure 6: (Right) The separated outputs given by the nonlinear PCA algorithm.
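To give a feel for how the pieces of Sections 3 and 4 fit together on data of this kind, the sketch below (ours, not from the paper) mixes three sub-Gaussian sources, loosely modelled on Comon's example and using the mixing matrix A quoted above, whitens the data in batch mode (an offline stand-in for rule (10)), runs the nonlinear PCA rule (9), and learns the basis-vector matrix Q with rule (6). The exact source shapes, learning rate, and number of passes are illustrative choices and may need tuning; Python with NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, M = 100, 3, 3
t = np.arange(N)

# Three negative-kurtosis, roughly unit-variance sources in the spirit of Fig. 4:
# uniform noise, a ramp, and a sinusoid (the exact shapes are our own choice).
S = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), N),
               np.sqrt(3) * (2.0 * (t % 50) / 49.0 - 1.0),
               np.sqrt(2) * np.sin(2.0 * np.pi * t / 25.0)])       # (M, N)

A = np.array([[ 0.0891,  0.3906, -0.3408],
              [-0.8909, -0.6509,  0.8519],
              [ 0.4454,  0.6509, -0.3976]])                        # columns = a(1), a(2), a(3)
X = A @ S                                                          # noiseless ICA model (1)
X = X - X.mean(axis=1, keepdims=True)                              # remove the small sample mean

# Batch whitening, used here as an offline stand-in for rule (10).
d, E = np.linalg.eigh(np.cov(X))
V = np.diag(1.0 / np.sqrt(d)) @ E.T                                # Cov(V x) = I

W = np.linalg.qr(rng.standard_normal((M, M)))[0]                   # random orthonormal start
Q = 0.1 * rng.standard_normal((L, M))
mu = 0.005
for epoch in range(300):                                           # data reused many times, as in the paper
    for k in rng.permutation(N):
        v = V @ X[:, k]
        y = W.T @ v
        W = W + mu * np.outer(v - W @ np.tanh(y), np.tanh(y))      # nonlinear PCA rule (9)
        y = W.T @ v
        Q = Q + mu * np.outer(X[:, k] - Q @ y, y)                  # basis-vector rule (6)

A_hat = Q / np.linalg.norm(Q, axis=0)
print(np.round(A_hat, 3))   # columns should approximate +-a(i), possibly permuted
```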

References

[1] P. Baldi and K. Hornik, Neural Networks, Vol. 2, 1989, pp. 53-58.
[2] J.-F. Cardoso and B. Laheld, "Equivariant Adaptive Source Separation". Submitted to IEEE Trans. on Signal Processing, October 1994.
[3] A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing. Wiley, 1993.
[4] P. Comon, in Proc. of Workshop on Higher-Order Spectral Analysis, Vail, Colorado, June 1989, pp. 174-179.
[5] P. Comon, Signal Processing, Vol. 36, 1994, pp. 287-314.
[6] J. Friedman, J. American Statistical Assoc., Vol. 82, No. 397, 1987, pp. 249-266.
[7] C. Fyfe, D. McGregor, and R. Baddeley, Univ. of Strathclyde (Glasgow, Scotland), Dept. of Computer Science, Research Report/94/160, October 1994.
[8] C. Jutten and J. Herault, Signal Processing, Vol. 24, 1991, pp. 1-10.
[9] J. Karhunen and J. Joutsensalo, Neural Networks, Vol. 7, No. 1, 1994, pp. 113-127.
[10] J. Karhunen and J. Joutsensalo, Neural Networks, Vol. 8, No. 4, 1995, pp. 549-562.
[11] J. Karhunen, L. Wang, and R. Vigario, "Nonlinear PCA Type Approaches ...", to appear in Proc. 1995 IEEE Int. Conf. on Neural Networks, Perth, Australia, November 1995.
[12] B. Laheld and J.-F. Cardoso, in M. Holt et al. (Eds.), Signal Processing VII: Theories and Applications. EURASIP, 1994, Vol. 2, pp. 183-186.
[13] E. Moreau and O. Macchi, in Proc. IEEE Signal Processing Workshop on Higher Order Statistics, Lake Tahoe, USA, June 1993, pp. 215-219.
[14] E. Oja, H. Ogawa, and J. Wangwivattana, in T. Kohonen et al. (Eds.), Artificial Neural Networks (Proc. ICANN-91), North-Holland, 1991, pp. 385-390.
[15] E. Oja, Helsinki Univ. of Technology, Lab. of Computer and Inform. Science, Report A26, August 1995.
[16] M. Plumbley, in Proc. IEE Conf. Artificial Neural Networks, ANN'93, Brighton, UK, May 1993, pp. 86-90.
[17] A. Sudjianto and M. Hassoun, in Proc. 1994 IEEE Int. Conf. on Neural Networks, Orlando, Florida, June 1994, Vol. II, pp. 1247-1252.
[18] C. Therrien, Discrete Random Signals and Statistical Signal Processing. Prentice-Hall, 1992.
[19] L. Wang, J. Karhunen, and E. Oja, "A Bigradient Optimization Approach ...", to appear in Proc. 1995 IEEE Int. Conf. on Neural Networks, Perth, Australia, November 1995.
[20] L. Wang et al., "Blind Separation of Sources ...", to appear in Int. Conf. on Neural Networks and Signal Processing, Nanjing, China, December 1995.