Negentropy and Kurtosis as Projection Pursuit Indices Provide Generalised ICA Algorithms

Mark Girolami
Department of Computing and Information Systems, University of Paisley, Scotland
[email protected]

Colin Fyfe
Department of Computing and Information Systems, University of Paisley, Scotland
[email protected]

Abstract

We develop a generalised form of the independent component analysis (ICA) algorithm introduced by Bell and Sejnowski [1], Amari et al [2], and more recently by Pearlmutter and Parra [3] and MacKay [4]. Motivated by information-theoretic indices for exploratory projection pursuit (EPP), we show that maximisation by natural gradient ascent of the divergence of a multivariate distribution from normality, using negentropy as a distance measure, yields a generalised ICA. We introduce an inherently simple form of nonlinearity which exhibits the Bussgang property [30] within the algorithm, and show that this is sufficient to perform ICA on data whose latent variables exhibit either unimodal or bimodal probability density functions (pdfs), or both. Kurtosis has been used as a moment-based projection pursuit index and as a contrast for ICA [5, 6, 7]. We introduce a simple adaptive nonlinearity, formed by on-line estimation of the latent variable kurtosis, and demonstrate the removal of the standard ICA constraint that all latent variable pdfs share the same modality.

1 INTRODUCTION

The term Independent Component Analysis was first used by Jutten and Herault [8] to describe the linear transformation of a random vector onto a basis which minimised the statistical dependence between its components. It was considered as an extension of Principal Components Analysis (PCA), which minimises the statistical dependence between vector components only up to second order. Comon [9] provided an analysis of the ICA transformation and developed a batch algorithm yielding an orthogonal rotation which minimised an approximation of the mutual information between the components. The sum of squares of fourth-order marginal cumulants was proposed by Comon as an approximate contrast for ICA, and the nonlinear PCA algorithms of Oja and of Karhunen et al have been shown to implicitly maximise this contrast in performing an ICA [10, 11, 12, 13, 14, 15]. Recently, Girolami and Fyfe [16] have shown that the nonlinear PCA algorithm can be considered as a neural implementation of Comon's batch algorithm, providing additional justification for the nonlinear PCA approach to ICA.

Bell and Sejnowski developed the entropy maximisation algorithm for ICA [1]; although no explicit pdf approximations were made in terms of Edgeworth or Gram-Charlier expansions, the use of the sigmoid as the activation function imposed an a priori assumption that the underlying data had super-gaussian pdfs such as the Laplacian. Amari et al [2] develop an algorithm similar in form to that of Bell and Sejnowski; however, they utilise a Gram-Charlier expansion of the data entropy and propose a non-monotonic activation function. The use of the natural gradient [21] removes the burdensome matrix inversion required by the original algorithm and introduces the equivariant property detailed by Cardoso [19]. Pearlmutter and Parra [3] consider ICA from a maximum likelihood perspective, as does MacKay [4]. They make no assumption regarding the form of the data pdf, and hence none regarding the algorithm nonlinearity; it is therefore incumbent on the user to have a priori knowledge of the data pdf, as a parametric form for it is required during the learning period.

Exploratory Projection Pursuit (EPP) [17] is a statistical tool which allows structure in high-dimensional data to be identified. This is achieved by projecting the data onto a low-dimensional subspace and searching for structure in the projection. By defining indices which measure how 'interesting' a given projection is, projection of the data onto the subspace which maximises the given index will provide a maximally 'interesting' direction. Departures from a Gaussian distribution are viewed as 'interesting', as skewed or multi-modal distributions indicate certain structures within the data. If we use an index which is a function of the direction of projection, index maximisation will then provide a direction furthest from gaussian. Girolami and Fyfe utilise a nonlinear form of the negative feedback network, which approximates a stochastic maximisation of the normalised fourth-order cumulant (i.e. kurtosis) of the data, in performing an EPP and link this with ICA [18].

The rest of this paper considers EPP and two specific indices which yield a generalised ICA. Section 2 considers an index based on negentropy; a stochastic natural gradient ascent algorithm is developed to identify directions which are furthest from gaussian. Section 3 considers two simple parameterised forms of nonlinearity which provide ICA for data which is sub-gaussian and data which is super-gaussian. Section 4 provides examples of the algorithm performance using computer simulations. Section 5 considers a hierarchical network architecture which performs an EPP based on kurtosis maximisation; a nonlinearity which is shaped by on-line estimation of the latent data kurtosis is proposed. Section 6 reports on one significant simulation showing the powerful performance possible using this form of network and learning.

2 NEGENTROPY AS AN EPP INDEX

As noted by Marriot [20], clustered projections which are approximately symmetrical and mesokurtic can sometimes be difficult to identify with indices based on third and fourth moments. This suggests the use of indices based on information-theoretic criteria; we shall consider negentropy as a direct measure of distance from normality. Negentropy is defined in [9] as $J(p_u) = H(p_G) - H(p_u)$, where $H(p_u)$ is the entropy of the density of the data $u$ and $H(p_G)$ is the entropy of a Gaussian density with the same mean and covariance as $p_u$. Gibbs' second theorem shows that a multivariate normal distribution maximises the entropy over all other distributions with the same covariance [20], so negentropy as defined will always be positive for non-normal distributions.

An observation of a random vector $x = (x_1, x_2, \ldots, x_N)^T$ is made such that it consists of $N$ latent variables $s = (s_1, s_2, \ldots, s_N)^T$ projected onto a set of unknown vectors,

$$x(t) = \Big(\sum_{j=1}^{N} a_{1j} s_j, \ldots, \sum_{j=1}^{N} a_{Nj} s_j\Big)^T = As$$

where $A$ is an $N \times N$ full-rank matrix. An EPP will then transform the vector $x$ via a matrix $W$ which maximises the distance of the transformed data from normality, the transformed data being given as $u = Wx$. The entropy of a multivariate normal distribution is defined as

$$H\big(p_G(u)\big) = -\int p_G(u)\,\log p_G(u)\, du \qquad (1)$$

where, for the zero-mean case $E\{u\} = 0$,

$$p_G(u) = \frac{1}{(2\pi)^{N/2}\big(\det(C_{uu})\big)^{1/2}}\; e^{-\frac{1}{2} u^T C_{uu}^{-1} u} \qquad (2)$$

where $C_{uu}$ is the covariance matrix of the transformed vector $u$. Using (2) in (1), after some manipulation,

$$H\big(p_G(u)\big) = \tfrac{1}{2}\log\!\big[(2\pi e)^N \det(C_{uu})\big] \qquad (3)$$
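As a quick numerical cross-check of (3) (not part of the original paper), the closed-form Gaussian entropy can be compared with scipy's value for an arbitrary example covariance matrix:

```python
# Sketch (ours): numerical cross-check of eq. (3) against scipy's Gaussian entropy.
import numpy as np
from scipy.stats import multivariate_normal

C = np.array([[2.0, 0.3],
              [0.3, 1.0]])            # arbitrary example covariance C_uu
N = C.shape[0]

h_formula = 0.5 * np.log((2 * np.pi * np.e) ** N * np.linalg.det(C))   # eq. (3)
h_scipy = multivariate_normal(mean=np.zeros(N), cov=C).entropy()

print(h_formula, h_scipy)             # the two values agree
```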

As negentropy is always positive and a function of the transforming matrix $W$, we employ natural gradient ascent for its maximisation [2, 21]:

$$J\big(p_u(u), W\big) = H\big(p_G(u)\big) - H\big(p_u(u)\big) \qquad (4)$$

Using (3) in (4), and noting that the entropy of a distribution can be written as $H\big(p_u(u)\big) = -E\big\{\log p_u(u)\big\}$, we have for the multivariate negentropy

$$J\big(p_u(u), W\big) = \tfrac{1}{2}\log\!\big[(2\pi e)^N \det(C_{uu})\big] + E\big\{\log p_u(u)\big\} \qquad (5)$$

Taking the simple gradient of (5) and using instantaneous values yields

$$\frac{dJ\big(p_u(u), W\big)}{dW} = \frac{d}{dW}\Big\{\tfrac{1}{2}\log\!\big[(2\pi e)^N \det(C_{uu})\big]\Big\} + \frac{\nabla_u p_u(u)}{p_u(u)}\, x^T \qquad (6)$$

Noting that $C_{uu} = E\{uu^T\} = W E\{xx^T\} W^T = W C_{xx} W^T$, and using the matrix determinant identities $\det(AB) = \det(A)\det(B)$ and $\det(A) = \det(A^T)$, so that $\det(C_{uu}) = \det(C_{xx})\det(W)^2$, we then have

$$\frac{d}{dW}\Big\{\tfrac{1}{2}\log\!\big[(2\pi e)^N \det(C_{uu})\big]\Big\} = \frac{d}{dW}\Big\{\tfrac{1}{2}\log\!\big[(2\pi e)^N \det(C_{xx})\big] + \log\det(W)\Big\} = \frac{\mathrm{adj}(W)^T}{\det(W)} = \big[W^T\big]^{-1} \qquad (7)$$

Inserting (7) into (6) gives

$$\frac{dJ\big(p_u(u), W\big)}{dW} = \big[W^T\big]^{-1} + \frac{\nabla_u p_u(u)}{p_u(u)}\, x^T$$

Employing the now familiar natural gradient [2, 4, 19, 21] we then have

$$\frac{dJ\big(p_u(u), W\big)}{dW}\, W^T W = \left(\big[W^T\big]^{-1} + \frac{\nabla_u p_u(u)}{p_u(u)}\, x^T\right) W^T W = \left(I + \frac{\nabla_u p_u(u)}{p_u(u)}\, u^T\right) W$$

finally giving a stochastic weight update of

$$\Delta W \propto \frac{dJ\big(p_u(u), W\big)}{dW}\, W^T W = \left(I + \frac{\nabla_u p_u(u)}{p_u(u)}\, u^T\right) W \qquad (8)$$

We now have a stochastic gradient ascent algorithm which maximises the negentropy of the transformed output data, that is, maximises the distance from normality of the transformed data. This algorithm is similar in form to that of Pearlmutter and Parra [3] for contextual ICA, which indicates that an EPP maximising negentropic distance will also perform an ICA. This is in fact the case, given that independence of the multivariate components denotes an upper bound on the entropy [22], that is

$$H\big(p_u(u)\big) \le \sum_{i=1}^{N} H\big(p_u(u_i)\big)$$

with equality existing if $I(u_i; u_j) = 0\ \forall\, i \ne j:\ i, j \in 1, \ldots, N$, where $I(u_i; u_j)$ denotes the mutual information between components of $u$. So the negentropic distance of $u$ from normality is maximised when the individual components of $u$ are independent, assuming the latent variable $s$ has independent components.
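As an illustration only (this code is ours, not the authors'), the stochastic update (8) can be written directly once a model phi for the score term $\nabla_u p_u(u)/p_u(u)$ is supplied; the function name, learning rate and sweep count below are arbitrary choices.

```python
# Sketch (ours, not the paper's code) of the stochastic natural-gradient
# ascent on negentropy, eq. (8):  dW ∝ (I + phi(u) u^T) W,  with  u = W x.
# phi models the score term  ∇_u p_u(u) / p_u(u)  and must be chosen by the user.
import numpy as np

def negentropy_ica(X, phi, eta=0.01, n_sweeps=50, seed=0):
    """X: (T, N) array of observations x(t); phi: elementwise score model."""
    rng = np.random.default_rng(seed)
    T, N = X.shape
    W = np.eye(N)
    for _ in range(n_sweeps):
        for t in rng.permutation(T):
            u = W @ X[t]                                        # u = W x
            W += eta * (np.eye(N) + np.outer(phi(u), u)) @ W    # eq. (8)
    return W
```

A concrete choice of phi, such as the tanh-based forms developed in Section 3, turns this sketch into a practical ICA algorithm.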

3 PARAMETERISATION OF THE OUTPUT PDF

A parametric form for the $\frac{\nabla_u p_u(u)}{p_u(u)}$ term in (8) is required. The output pdf can be modelled by the conditional pdfs

$$p_u(u) = \prod_{i=1}^{N} p\big(u_i \mid u_{i-1}, \ldots, u_1\big)$$

If we use a product of univariate independent densities to parametrise $p_u(u)$ we are effectively over-estimating the output entropy initially; however, as we are seeking a differential distance this should not be a concern. So we have

$$\frac{\nabla_u p_u(u)}{p_u(u)} = \left( \frac{p'_u(u_1)}{p_u(u_1)}, \ldots, \frac{p'_u(u_i)}{p_u(u_i)}, \ldots, \frac{p'_u(u_N)}{p_u(u_N)} \right)^T$$

There is a problem in applying the ICA algorithm to sub-gaussian data such as the uniform pdf, as detailed in [1]. For sub-gaussian densities the range of shapes runs from uniform to bimodal. We now wish to develop a parametric form of the nonlinearity in (8) which will be suitable for both sub-gaussian and super-gaussian data. This will allow the use of (8) to perform EPP or ICA on data whose latent marginal distributions exhibit both sub-gaussian and super-gaussian forms. If we assume the form of the Gram-Charlier expansion [23], the density will be a summation of products of normal densities and orthogonal Hermite polynomials. The explicit form of the Gram-Charlier approximation to order four for a symmetric pdf such that $E\{u\} = 0$, $E\{u^2\} = 1$, $E\{u^3\} = 0$, and kurtosis $E\{u^4\} - 3 = \kappa_4$, is derived as

$$p_u(u) = p_G(\alpha u)\left(\tfrac{7}{8} - \tfrac{\kappa_4 u^2}{2} + \tfrac{\kappa_4 u^4}{24}\right) \quad\text{where}\quad p_G(\alpha u) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{\alpha u^2}{2\sigma^2}} \qquad (9)$$

For the above conditions the Gram-Charlier and Edgeworth expansions are identical to fourth order. Now

$$\frac{p'_u(u)}{p_u(u)} = \frac{-\alpha u\, p_G(\alpha u)\left(\tfrac{7}{8} - \tfrac{\kappa_4 u^2}{2} + \tfrac{\kappa_4 u^4}{24}\right) + p_G(\alpha u)\left(-\kappa_4 u + \tfrac{\kappa_4 u^3}{6}\right)}{p_G(\alpha u)\left(\tfrac{7}{8} - \tfrac{\kappa_4 u^2}{2} + \tfrac{\kappa_4 u^4}{24}\right)} = -\alpha u + \frac{-\kappa_4 u + \tfrac{\kappa_4 u^3}{6}}{\tfrac{7}{8} - \tfrac{\kappa_4 u^2}{2} + \tfrac{\kappa_4 u^4}{24}}$$

When the kurtosis is negative, by tedious polynomial division we then have

$$\frac{p'_u(u)}{p_u(u)} \cong u - \frac{2u^3}{3} + \frac{3u^5}{8} - \frac{7u^7}{32} + \frac{u^9}{64} + \cdots - \alpha u \qquad (10)$$

and noting that $\tanh(u) \cong u - \tfrac{u^3}{3} + \tfrac{2u^5}{15}$, an approximation of (10) is

$$\frac{p'_u(u)}{p_u(u)} = -\alpha u + \tanh(u)$$

When the kurtosis value is positive, noting that $\mathrm{sech}(u) \cong 1 - \tfrac{u^2}{2!} + \tfrac{5u^4}{4!} - \tfrac{61u^6}{6!} + \cdots$, (9) can be approximated by $p_u(u) = p_G(\alpha u)\,\mathrm{sech}(u)$, so that

$$\frac{p'_u(u)}{p_u(u)} = -\alpha u - \tanh(u)$$

For suitably scaled data, where the truncated expansions are valid over $[-1, +1]$, we can then write a generalised approximation to the nonlinear term in (8):

$$\frac{p'_u(u)}{p_u(u)} = -\mathrm{sign}(\kappa_4)\tanh(u) - \alpha u$$
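The quality of this approximation can be inspected numerically. The sketch below is ours: it assumes the reconstructed coefficients of (9) above, together with α = σ = 1 and κ4 = -1, and compares the exact score of that density with -sign(κ4)tanh(u) - αu over the stated range [-1, 1].

```python
# Sketch (ours): compare the exact score of the reconstructed Gram-Charlier
# model (9) with the approximation -sign(k4)*tanh(u) - alpha*u on [-1, 1].
import numpy as np

alpha, sigma, k4 = 1.0, 1.0, -1.0                      # assumed sub-gaussian case
u = np.linspace(-1.0, 1.0, 201)

poly = 7.0 / 8.0 - k4 * u**2 / 2.0 + k4 * u**4 / 24.0  # series factor in (9)
p = np.exp(-alpha * u**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2) * poly

score_exact = np.gradient(np.log(p), u)                # numerical d/du log p_u(u)
score_approx = -np.sign(k4) * np.tanh(u) - alpha * u

print(np.max(np.abs(score_exact - score_approx)))      # worst-case gap on [-1, 1]
```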

For heavy-tailed pdfs we can then write (8) as

$$\Delta W \propto \big[I - \tanh(u)\,u^T - \alpha\, u u^T\big] W$$

and for flat or bimodal pdfs (8) takes the form

$$\Delta W \propto \big[I + \tanh(u)\,u^T - \alpha\, u u^T\big] W$$

If we wish to identify marginal distributions of varying pdf form within a multivariate density, we can use the generalised form of (8):

$$\Delta W \propto \big[I - K_4 \tanh(u)\,u^T - \alpha\, u u^T\big] W \qquad (11)$$

where $K_4$ is the diagonal matrix whose entries determine the directions in which the projection pursuit is to move. Alternatively, we can consider the matrix $K_4$ as indicating the shape of the latent marginal pdfs to be extracted. Figure 1 shows the form of the pdf given by (9) within the valid data range. It is interesting to note the similarity in form between (11) and the EASI algorithms [19].
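A direct reading of (11) as code is given below (our sketch, not the authors' implementation); K4 is supplied as a vector of ±1 entries indicating the assumed modality of each latent marginal, and alpha and the learning rate are illustrative choices.

```python
# Sketch (ours) of one stochastic step of the generalised update (11):
#   dW ∝ (I - K4 tanh(u) u^T - alpha u u^T) W
import numpy as np

def generalised_ica_step(W, x, k4_diag, alpha=1.0, eta=0.005):
    """W: (N, N) unmixing matrix; x: (N,) observation;
    k4_diag: (N,) vector of +1 (super-gaussian) / -1 (sub-gaussian) entries."""
    N = W.shape[0]
    u = W @ x
    K4 = np.diag(k4_diag)
    dW = (np.eye(N) - K4 @ np.outer(np.tanh(u), u) - alpha * np.outer(u, u)) @ W
    return W + eta * dW
```

With every entry of k4_diag set to -1 this reduces to the flat/bimodal form above; with +1 entries it gives the heavy-tailed form.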


Figure 1: PDFs given by (9) with alpha = 1, kurtosis = -1 and kurtosis = 1, plotted over the valid data range.

Although the form of (9) provides, in some cases, a crude approximation of the data pdf, our simulations have shown excellent results in transforming the data pdf to independent marginal pdfs; these are reported in the next section. The results reported by Cichocki et al [24, 25] confirm our simulations. We also note that for certain values of alpha the nonlinearity takes on a non-monotonic form, similar to that of the information-theoretic nonlinearity derived by Amari, Cichocki and Yang [2]. If we consider (11) in expectation we notice an interesting form arising:

$$E\{\Delta W\} = E\big\{\big[I - K_4 \tanh(u)\,u^T - \alpha\, u u^T\big]\big\}\, W \rightarrow 0$$

$$\cong E\big\{\big[I - \alpha\, u u^T\big]\big\} = E\big\{K_4 \tanh(u)\,u^T\big\} \;\Rightarrow\; \alpha E\{u_i u_j\} = \kappa_4 E\big\{u_i \tanh(u_j)\big\} \quad \forall\, i \ne j \qquad (12)$$

We note that (12) is essentially the Bussgang property for blind deconvolution:

$$E\{u_{i+k}\, u_i\} = E\big\{u_{i+k}\, f(u_i)\big\} \qquad (13)$$

In expectation, the off-diagonal terms of the covariance matrix will tend to zero, as will the off-diagonal terms of the cross-covariance matrix of the data and an odd functional of the data. That is, higher-order odd cross-moments are being set to zero, and so each component becomes independent.
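Relation (12) can be sanity-checked by sampling: for independent zero-mean components, both the ordinary cross-covariance and the odd higher-order cross-moment vanish. The sketch below is ours; the uniform and Laplacian choices are arbitrary examples of sub- and super-gaussian components.

```python
# Sketch (ours): sampling check of the Bussgang-type condition (12) for
# already-independent components: E{u_i u_j} ~ 0 and E{u_i tanh(u_j)} ~ 0, i != j.
import numpy as np

rng = np.random.default_rng(1)
T = 200_000
u_i = rng.uniform(-1, 1, T)          # sub-gaussian component (example choice)
u_j = rng.laplace(0, 1, T)           # super-gaussian component (example choice)

print(np.mean(u_i * u_j))            # ordinary cross-covariance, close to zero
print(np.mean(u_i * np.tanh(u_j)))   # odd higher-order cross-moment, close to zero
```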

4 NEGENTROPY MAXIMISATION SIMULATIONS

What is now considered a benchmark simulation for ICA-type algorithms is the separation of two uniform distributions from their unknown mixture. Bell and Sejnowski [1] report on the inability of the original ICA algorithm to deal with uniform and sub-gaussian distributions. Pearlmutter and Parra [3] utilise a complex parameterised nonlinearity, based on the summation of six sigmoidal functions, to separate a mixture of two uniform distributions; although they filter the mixtures, the use of memory in the contextual algorithm deals with the filtering. We use the simple form of (11) for the separation of both uniform distributions and images (which are essentially sub-gaussian).

Two zero-mean source vectors, each of five thousand data points, were generated from two uniform distributions, and a mixture was formed using the matrix A. Using (11) we can maximise the negentropic distance of the transformed data, and so perform an ICA; the output of the learning is shown in Figure 2. The mixing matrix A and the permutation and scaling matrix WA = PD are given below.

Figure 2: Original, Mixed and Extracted distributions

$$A = \begin{bmatrix} 3.93 & 0.12 \\ -0.11 & 4.03 \end{bmatrix} \quad\text{and}\quad WA = \begin{bmatrix} 0.3497 & 0.2149 \\ 0.3424 & 0.6207 \end{bmatrix}$$

We now consider a mixture of five images and noise; again, applying (11) to maximise the output data distance from normality yields an ICA and good separation. The images were 202 x 202 pixels with an 8-bit (256-level) grey scale; the noise was uniformly distributed and had a 3:1 power ratio to all the other images. Figure 3 shows the original images and their respective histograms, which indicate the data pdfs. Figure 4 shows the mixed images and their respective histograms; we can see clearly that the images have been significantly degraded, and we also note that the mixed data histograms tend towards the Gaussian, so the negentropy of the mixture has fallen and tends to zero as central limit effects start to dominate. Finally, Figure 5 shows the transformed output images and their histograms; it is clear that they have been restored to their original states. It is important to note that although the Gram-Charlier approximation employed for the pdf in (11) is crude, the simulations show quite clearly that the algorithm is capable of maximally driving the transformed data pdf from normality and so to independent marginal distributions.

Figure 3: Original Images and Histograms.

Figure 4: Mixed Images and Histograms.

Figure 5: Separated Images and Histograms.

We can see clearly that the images have been restored and that the output data pdfs are indeed furthest from normal. An algorithm has been developed to perform an EPP based on the negentropic distance from Gaussianity; the form of this algorithm is similar to those found in [2, 3, 4]. Using a form of nonlinearity based on a Gram-Charlier approximation to a generic non-gaussian pdf, we have shown that an EPP performed on the data is equivalent to an ICA. We have performed simulations on data mixtures which have both sub-gaussian and super-gaussian pdfs; the form of (11) is capable of performing an ICA on this data.

5 A LATERALLY INHIBITED DEFLATIONARY EPP NETWORK

We now consider a moment-based criterion for projection pursuit, namely kurtosis as an EPP index. The neural network model shown in Figure 6 is proposed.

Figure 6: Extended Exploratory Projection Pursuit Network Architecture (inputs x, input lateral weights U, feedforward weights W, output lateral weights V, outputs y).

The input to the network is $x(t) = [x_1(t), \ldots, x_N(t)]^T$ as defined in Section 2. The input layer of the network removes all second-order correlations in the data, and so effectively whitens it. This is required for the moment-based criterion; it should be pointed out that the negentropy-based algorithm does not require whitened data, due to its information-theoretic basis. The output of the layer in matrix notation is given as

$$z = [I + U]\,x \equiv U_I\, x$$

and the learning rule for the $U$ weight matrix is

$$\Delta U = \alpha\big[I - zz^T\big] \qquad (14)$$

so that as $\Delta U = \alpha(I - C_{zz}) \rightarrow 0 \Rightarrow C_{zz} \rightarrow I$, where $C_{zz}$ is the covariance matrix of $z$, which thus becomes a white vector. The $z$ values are fed forward through the $W$ weights to the output neurons, where there is a second layer of lateral weights. Before the activation is passed through this layer, however, it is passed back to the originating $z$ values in a hierarchical manner as inhibition, and then a nonlinear function of the inputs is calculated. This linear negative feedback is important and we shall examine it more fully in the next section. The linear weighted sum at the output neurons is given as

$$\mathrm{act} = W^T U_I A s \qquad (15)$$

The output neurons are nonlinear and a nonlinear function is applied to the weighted sum. The outputs also have lateral connections to the nonlinear output neurons, so that $y_i = f(\mathrm{act}_i) + \sum_{j=1}^{N} v_{ij} f(\mathrm{act}_j)$, or in matrix format

$$y = V_I\, f(\mathrm{act}) \qquad (16)$$

Simple hierarchical Hebbian learning is used to update the feedforward weight values; in vector notation

$$\Delta W = \beta\big(z y^T - W \times \mathrm{upper}\big[W^T z \times y^T\big]\big) \qquad (17)$$

where the $\mathrm{upper}[\,\cdot\,]$ operator sets the matrix argument upper triangular. Similarly, anti-Hebbian learning is applied at the output as at the input:

$$\Delta V \cong \gamma\big[I - yy^T\big] \qquad (18)$$
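Read as code, one combined learning step of the network looks roughly as follows. This is our sketch, not the authors' implementation: the learning rates and the choice of output nonlinearity f are placeholders, and the feedforward pass follows z = U_I x together with (15) and (16).

```python
# Sketch (ours) of one combined learning step of the laterally inhibited
# deflationary EPP network: whitening (14), hierarchical Hebbian feedforward
# learning (17) and anti-Hebbian output learning (18).
import numpy as np

def epp_network_step(x, U, W, V, f=np.tanh, alpha=0.01, beta=0.01, gamma=0.01):
    N = x.shape[0]
    z = (np.eye(N) + U) @ x          # z = U_I x: decorrelated (whitened) input
    act = W.T @ z                    # linear output activation, eq. (15)
    y = (np.eye(N) + V) @ f(act)     # laterally connected nonlinear outputs, eq. (16)

    U = U + alpha * (np.eye(N) - np.outer(z, z))                          # eq. (14)
    W = W + beta * (np.outer(z, y) - W @ np.triu(np.outer(W.T @ z, y)))   # eq. (17)
    V = V + gamma * (np.eye(N) - np.outer(y, y))                          # eq. (18)
    return U, W, V
```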

It is shown in [18] that the net effect of the learning in (14), (17) and (18) is minimisation of the mutual information at the output neurons, by performing an orthogonal rotation which maximises the sum of fourth-order marginal cumulants in conjunction with minimisation of the sum of fourth-order cross-cumulants. This can be considered as an EPP finding the direction with maximal kurtosis [7].

If we now consider multi-unit hierarchical learning [15], assume that at time t the s'th output neuron has converged to a scaled version of one of the original source signals, say the q'th. Using (15) we can write the residual as

$$r = z - W\,\mathrm{act}_s \qquad (19)$$

where $\mathrm{act}_s \equiv [s_1, \ldots, s_i, \ldots, s_N]^T$ is the vector whose s'th value is a scaled version of the q'th original source component; the scaling can be linked to the feedforward weights by (21). When full separation occurs we can write

$$W^T U_I A s = PDs \qquad (20)$$

It can be shown [10] that the form of (17) can be derived from maximisation of a function of the linear neuron activation under orthonormal weight constraints, so that for a square matrix $W$ we have $W^T W = W W^T = I$. We note, however, that the derivation is approximate and as such the algorithm is not mathematically guaranteed to converge; in practice this has not proved to be a problem. At time t we have

$$\mathrm{act}_s = W^T U_I A s = P_s D s \qquad (21)$$

where $P_s$ is the matrix whose s'th row is equal to that of the final permutation matrix $P$. Noting that $W^T W = W W^T = I$, then from (19) and using (20) and (21),

$$r = z - W W^T U_I A s = z - W P_s D s \equiv U_I A \big(s - s_q\big)$$

where the vector $s - s_q$ satisfies $s_i - s_{qi} \equiv s_i\ \forall\, i \ne q$ and $s_i - s_{qi} \equiv 0$ for $i = q$. Elementwise,

$$r_p = \sum_{l=1}^{N} u_{pl} \sum_{j=1}^{N} a_{lj} s_j - \sum_{l=1}^{N} u_{pl} \sum_{j=1}^{N} a_{lj} s_{qj} = \sum_{l=1}^{N} u_{pl} \sum_{\substack{j=1 \\ j \ne q}}^{N} a_{lj} s_j \qquad (22)$$

and in vector format

$$r = \left(\sum_{l=1}^{N} u_{1l} \sum_{\substack{j=1 \\ j \ne q}}^{N} a_{lj} s_j, \;\ldots,\; \sum_{l=1}^{N} u_{Nl} \sum_{\substack{j=1 \\ j \ne q}}^{N} a_{lj} s_j\right)^T \qquad (23)$$

We can see that the residual used in the Hebbian learning (17) for subsequent output neurons will be a whitened mixture of the original sources minus the q'th source. As the second neuron converges to another scaled source, the new residual will be deflated by yet another source, and this continues until all original sources are removed. This is a generalisation of Sanger's [26] hierarchic generalised hebbian learning algorithm (GHA), used for extracting the actual principal components from a data set. The form of Hebbian learning in (17), for the linear feedback network, is similar to the robust hierarchic PCA learning derived by Karhunen and Joutsensalo [10]. As has been discussed [10], the robust PCA algorithm has insufficient nonlinearities to successfully separate mixtures of more than three sources. The nonlinear PCA algorithm has an additional nonlinearity and as such is potentially capable of separating larger mixtures of sources. Due to stability criteria the nonlinear PCA algorithm has been restricted to the separation of sub-gaussian sources; to allow separation of either sub- or super-gaussian sources, Girolami and Fyfe [16] propose a form of nonlinearity which removes this restriction.

Without the nonlinear lateral inhibition at the output, the network in Figure 6 is a standard nonlinear version of the negative feedback network with a learning algorithm similar to the robust PCA algorithm. This network was used by Fyfe and Baddeley [7] for exploratory projection pursuit, and by Girolami and Fyfe [5] for blind separation of sources. To overcome the problem of poor separation when considering more than three sources, nonlinear feedback could be considered. This would yield an algorithm similar to the nonlinear PCA learning rule, with the attendant increase in separating performance. However, the nonlinear feedback may then introduce errors into the deflation of the residual (19), due to the pdf 'flattening' effect of the nonlinearity [29]; to overcome this problem, additional nonlinearities are included at the output neurons in the form of nonlinear lateral inhibitory learning. This still allows the use of linear negative feedback within the network and yet increases the separating performance. Karhunen and Pajunen [15] use a transition from nonlinear PCA learning to robust PCA learning when separating sources; this technique outperforms the sole use of either the nonlinear or the robust PCA algorithm. Heuristically, we can consider this as allowing a final exact orthogonal rotation after the nonlinear algorithm has approximately maximised the sum of fourth-order cumulants.

As mentioned in the previous section, the form of the nonlinearity is critical for the BSS performance and dynamic stability of the algorithm (17), and it depends on the higher-order statistics of the original source signals. In [27] Girolami and Fyfe show that the following form of nonlinearity allows separation of both sub- and super-gaussian sources, and also satisfies all dynamic stability requirements of Oja's nonlinear PCA algorithm; note the similarity between (24) and the generalised nonlinearity found in (11). We can also apply this to (17):

$$f(u_i) = \rho\, u_i - \mathrm{sign}\big(K_s^{(4)}\big) \tanh\big(\theta\, u_i\big) \qquad (24)$$

Here $\rho$ and $\theta$ are constants and $\mathrm{sign}(K_s^{(4)})$ is the sign of the original source signal kurtosis: positive for super-gaussian signals and negative for sub-gaussian. A priori knowledge of the source signal statistics has been required to choose the sign in (24); with this knowledge, mixtures of both sub- and super-gaussian signals have been separated [6, 15]. In keeping with the 'blind' form of the proposed separation we adopt the online kurtosis estimation independently proposed by Cichocki et al [28] and Hyvarinen and Oja [14], that is

$$m_p\big[u_i(t+1)\big] = \big(1 - \eta(t)\big)\, m_p\big[u_i(t)\big] + \eta(t)\, \big[u_i(t)\big]^p \qquad (25)$$

$$k_4\big[u_i(t+1)\big] = \frac{m_4\big[u_i(t)\big]}{m_2^2\big[u_i(t)\big]} - 3 \qquad (26)$$

Equation (25) estimates online the p-th order moments of the data, with $\eta(t)$ a small learning constant, and (26) is an estimate of the kurtosis for zero-mean data. Noting that

$$\mathrm{sign}\big(k_4[u_i(t)]\big) = \frac{k_4\big[u_i(t)\big]}{\big|k_4\big[u_i(t)\big]\big|} \qquad (27)$$

we can then use the adaptive nonlinearity

$$f\big[u_i(t)\big] = \rho\, u_i(t) - \frac{k_4\big[u_i(t)\big]}{\big|k_4\big[u_i(t)\big]\big|}\, \tanh\big(\theta\, u_i(t)\big) \qquad (28)$$

as the output neuron activation function at each time step t.
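The moment trackers (25)-(26) and the adaptive activation (28) translate directly into a small stateful function. The sketch below is ours; ρ, θ, the learning constant and the Gaussian-moment initialisation are illustrative choices.

```python
# Sketch (ours): online moment/kurtosis tracking (25)-(26) and the adaptive
# activation (28) for a single output u_i(t).
import numpy as np

class AdaptiveNonlinearity:
    def __init__(self, rho=1.0, theta=1.0, eta=0.01):
        self.rho, self.theta, self.eta = rho, theta, eta
        self.m2, self.m4 = 1.0, 3.0            # moment estimates, Gaussian start

    def __call__(self, u):
        # eq. (25): leaky online estimates of the 2nd and 4th order moments
        self.m2 = (1 - self.eta) * self.m2 + self.eta * u**2
        self.m4 = (1 - self.eta) * self.m4 + self.eta * u**4
        k4 = self.m4 / self.m2**2 - 3.0                    # eq. (26), zero-mean data
        sign_k4 = np.sign(k4) if k4 != 0 else 0.0          # eq. (27)
        return self.rho * u - sign_k4 * np.tanh(self.theta * u)   # eq. (28)
```

Driven by samples from a sub-gaussian source the tracked kurtosis drifts negative and the activation tends to ρu + tanh(θu); for a heavy-tailed source it tends to ρu − tanh(θu), matching (24) without prior knowledge of the kurtosis sign.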

6 DEFLATIONARY EPP NETWORK SIMULATION

This simulation deals with time-varying signals: two sources of natural speech, two sinusoidal tones of differing frequency, and white noise. This is a problem with signals of varying kurtosis sign, and what is of interest here is the blind extraction of the original signals with no a priori knowledge of the signals or their statistics. Five seconds of speech was sampled at 8 kHz, two sinusoids of frequency 400 Hz and 200 Hz were also sampled at 8 kHz for five seconds, and finally five seconds of uniformly distributed noise was recorded. The signal amplitude ranges were [-1, +1]. These signals were mixed using a randomly generated 5 x 5 matrix. As the hierarchic feedback removes the original sources in a phased manner, sources with identical kurtosis should not present a problem to the algorithm. We give the details of the source signals, the mixtures and the extracted sources in Table 1; it is clear from the kurtosis of the mixed signals that they resemble gaussian noise.

Source   | Kurtosis | Mixed Signal | Kurtosis | Retrieved Signal | Kurtosis | Error
Speech 1 | +6.293   | 1            | -0.679   | Out 4. Speech 1  | +6.161   | 2.1%
Speech 2 | +5.588   | 2            | -0.578   | Out 5. Speech 2  | +5.559   | 0.5%
200 Hz   | -1.500   | 3            | -0.620   | Out 1. 200 Hz    | -1.484   | 1.1%
400 Hz   | -1.500   | 4            | -0.478   | Out 2. 400 Hz    | -1.498   | 0.2%
Noise    | -1.200   | 5            | -0.658   | Out 3. Noise     | -1.197   | 0.2%

Table 1: Original, mixed and extracted signal statistics

Figure 7 shows the kurtosis development at the network outputs; the extraction is almost perfect, with a final contrast of 98% of the original. Visual and audible checks of the separated outputs indicate the near-perfect level of separation (Figure 8).

Figure 7: Kurtosis development (kurtosis against learning epoch) at the individual network outputs, Out 1 to Out 5.

Figure 8: Extracted original signals (scaled) at network outputs.

7 CONCLUSIONS

Negentropy has been introduced as a projection pursuit index whose maximisation, in the case of data with independent latent variables, yields an ICA. We have proposed simple nonlinearities which approximately model the shape of the latent variable pdf, and have reported simulations which demonstrate the performance of the nonlinearity. A laterally inhibited deflationary EPP network and an associated adaptive nonlinearity have been proposed. Although the nonlinearities are similar in form, the fundamental operations differ: one is based on a pdf approximation, the other acts as an implicit moment generating function. The requirement of pre-whitened data for (24) is indicative of the moment-based EPP criterion, whereas the lack of this requirement for (11) reflects the information-theoretic EPP index. Further simulations using complex mixtures of ten data sources with mixed sign of kurtosis are reported in [27].

REFERENCES

[1] Bell, A. and Sejnowski, T. An Information Maximisation Approach to Blind Separation and Blind Deconvolution. Neural Computation 7, 1129-1159, 1995.
[2] Amari, S., Cichocki, A. and Yang, H. A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems, Vol. 8, MIT Press, 1995.
[3] Pearlmutter, B. and Parra, L. A Context Sensitive Generalisation of ICA. International Conference on Neural Information Processing, Hong Kong, Sept. 24-27, 1996. Springer.
[4] MacKay, D. Maximum likelihood and covariant algorithms for independent component analysis. Draft 3.1, Cavendish Laboratory, University of Cambridge, Aug 17, 1996.
[5] Girolami, M. and Fyfe, C. Blind Separation of Sources Using Exploratory Projection Pursuit Networks. International Conference on the Engineering Applications of Neural Networks (Ed. A. Bulsari), ISBN 952-90-7517-0, 249-252, 1996.
[6] Girolami, M. and Fyfe, C. Higher Order Cumulant Maximisation Using Nonlinear Hebbian and Anti-Hebbian Learning for Adaptive Blind Separation of Source Signals. IWSIP-96, International Workshop on Signal and Image Processing, Advances in Computational Intelligence, Manchester, November 1996.
[7] Fyfe, C. and Baddeley, R. Non-linear data structure extraction using simple hebbian networks. Biological Cybernetics, 72(6):533-541, 1995.
[8] Jutten, C. and Herault, J. Blind Separation of Sources, Part 1: An Adaptive Algorithm Based on Neuromimetic Architecture. Signal Processing 24, 1-10, 1991.
[9] Comon, P. Independent Component Analysis, A New Concept? Signal Processing, 36, 287-314, 1994.
[10] Karhunen, J. and Joutsensalo, J. Representation and separation of signals using nonlinear PCA type learning. Neural Networks 7(1), pp. 113-127, 1994.
[11] Karhunen, J. Neural approaches to independent component analysis and source separation. Proc. ESANN'96 (4th European Symposium on Artificial Neural Networks), Bruges, Belgium, April 24-26, 1996.
[12] Oja, E. The nonlinear PCA learning rule and signal separation - mathematical analysis. Research Report A26, Helsinki University of Technology, ISBN 951-22-2706-1, 1995.
[13] Wang, L., Karhunen, J. and Oja, E. A bigradient optimisation approach for robust PCA, MCA, and source separation. Proc. IEEE Int. Conf. on Neural Networks and Signal Processing, Perth, Australia, 1995.
[14] Hyvarinen, A. and Oja, E. Simple neuron models for independent component analysis. Technical report, Helsinki University of Technology, Laboratory of Computer and Information Science, 1996.
[15] Karhunen, J. and Pajunen, P. Hierarchic nonlinear PCA algorithms for neural blind source separation. NORSIG-96, IEEE Nordic Signal Processing Symposium, Espoo, Finland, September 24-27, 1996.
[16] Girolami, M. and Fyfe, C. Stochastic ICA contrast maximisation using Oja's nonlinear PCA algorithm. International Journal of Neural Systems, submitted, 1996.
[17] Friedman, J. H. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249-266, 1987.
[18] Girolami, M. and Fyfe, C. An Extended Exploratory Projection Pursuit Network with Linear and Nonlinear Anti-Hebbian Connections Applied to the Cocktail Party Problem. Submitted to Neural Networks, May 1996.
[19] Cardoso, J.-F., Belouchrani, A. and Laheld, B. A new composite criterion for adaptive and iterative blind source separation. Proceedings of ICASSP-94, Vol. 4, 273-276.
[20] Jones, M. C. and Sibson, R. What is projection pursuit? Journal of the Royal Statistical Society, 1987.
[21] Amari, S. I. Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics, Vol. 28. Springer, 1985.
[22] Deco, G. and Obradovic, D. Linear ICA in arbitrary input distribution. In An Information-Theoretic Approach to Neural Computing (pp. 95-107). New York: Springer-Verlag, 1996.
[23] Masters, T. Advanced Algorithms for Neural Networks. John Wiley & Sons, 1995.
[24] Amari, S., Cichocki, A. and Yang, H. Recurrent Neural Networks for Blind Separation of Sources. International Symposium on Nonlinear Theory and Applications, Vol. 1, 37-42, 1995.
[25] Cichocki, A., Kasprzak, W. and Amari, S. Multi-layer neural networks with local adaptive learning rules for blind separation of source signals. International Symposium on Nonlinear Theory and Applications, Vol. 1, 61-65, 1995.
[26] Sanger, T. Optimal unsupervised learning in a single-layer linear feedforward network. Neural Networks, Vol. 2, pp. 459-473, 1989.
[27] Girolami, M. and Fyfe, C. Extraction of Independent Signal Sources using a Deflationary Exploratory Projection Pursuit Network with Lateral Inhibition. Submitted to IEE Proceedings on Vision, Image and Signal Processing, Sept. 1996.
[28] Cichocki, A., Kasprzak, W. and Amari, S. Neural network approach to blind separation and enhancement of images. Proceedings, EUSIPCO-96, Trieste, Italy, 10-13 Sept., 1996.
[29] Sudjianto, A. and Hassoun, M. Nonlinear hebbian rule: A statistical interpretation. IEEE International Conference on Neural Networks, Orlando, Florida, Vol. 2, 1247-1252, 1994.
[30] Haykin, S. Adaptive Filter Theory, 2nd ed., Prentice Hall, Englewood Cliffs, NJ, 1991.
