SELF-ADAPTIVE INDEPENDENT COMPONENT ANALYSIS for SUB-GAUSSIAN and SUPER-GAUSSIAN MIXTURES with an UNKNOWN NUMBER of SOURCES and ADDITIVE NOISE

Andrzej CICHOCKI 1, Ireneusz SABALA 2, Seungjin CHOI 3, Bruno ORSIER 1 and Ryszard SZUPILUK 2

1 Brain Science Institute RIKEN, Laboratory for Open Information Systems, JAPAN, cia@brain.riken.go.jp
2 Warsaw University of Technology, POLAND, [email protected]
3 Chungbuk National University, KOREA, [email protected]

Abstract - In this paper we derive and analyze unsupervised adaptive on-line algorithms for instantaneous blind separation of sources (BSS) in the case when the sensor signals are noisy and they are mixtures of an unknown number of independent source signals with unknown statistics. Nonlinear activation functions are rigorously derived assuming that the sources have generalized Gaussian, Cauchy or Rayleigh distributions. Extensive computer simulations confirmed that the proposed family of learning algorithms is able to separate sources from mixtures of sub- and super-Gaussian sources.

I. INTRODUCTION

The problem of independent component analysis (ICA) and/or blind separation or extraction of source signals from their mixtures has become increasingly important, due both to still open theoretical problems and to many potential applications, e.g. in speech recognition and enhancement, telecommunication, and biomedical signal analysis and processing (EEG, MEG, ECG). While several recently developed algorithms have shown promise for solving practical tasks, they may fail to separate on-line (nonstationary) signal mixtures containing both sub- and super-Gaussian distributed source signals, especially when the number of sources is unknown and changes dynamically over time. The problem of on-line estimation of sources when the number of sources is unknown is relevant in many practical applications, like the analysis of EEG signals and the "cocktail party problem", where the number of source signals usually changes over time. In this paper we propose a solution to this problem under the assumption that the number of sources is less than or equal to the number of sensors.

Mathematically the problem is formulated as follows: the mixing model is described by the matrix equation [1]-[17]

x(k) = H(k) s(k) + ν(k),   (1)

where s(k) is an n-dimensional vector of unknown source signals, x(k) is an m-dimensional observable sensor vector (with m ≥ n), H(k) is a full rank m × n mixing matrix, and ν(k) is an m-dimensional Gaussian noise vector, uncorrelated with and stochastically independent of the source signals. It is assumed that only the sensor vector x(k) is available, and it is necessary to design a feed-forward neural network and an associated adaptive learning algorithm which enables estimation of the sources and/or identification of the mixing matrix H with good tracking abilities. The problem is often referred to as noisy ICA (independent component analysis): the ICA of a noisy random vector x = [x_1, ..., x_m]^T is obtained by finding an n × m, full rank, linear transformation (un-mixing) matrix W such that the output signal vector y = [y_1, ..., y_n]^T, defined as y = W x, contains components that are as independent as possible, as measured by an information-theoretic cost function such as the minimum Kullback-Leibler divergence.

In other words, it is required to adapt the synaptic weights w_ij of the linear system y(t) = W(t) x(t) (often referred to as a single-layer feed-forward neural network) to combine the observation signals x_i(t) to form estimates of the source signals s_j(t) = y_j(t) = Σ_{i=1}^{m} w_ji x_i(t). The optimal weights correspond to statistical independence of the output signals y_j(t). In the special case when the number of sources is known, we can assume that the un-mixing matrix W is n × m; however, in the general case when the number of sources is unknown, we assume that W is an m × m square matrix.
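As a small illustration of this setting (the dimensions, source model and noise level below are arbitrary choices, not taken from the paper), the mixing and un-mixing stages can be written in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 5, 1000                     # illustrative sizes: n sources, m >= n sensors

s = rng.laplace(size=(n, T))             # placeholder independent source signals
H = rng.standard_normal((m, n))          # unknown full-rank mixing matrix
nu = 0.01 * rng.standard_normal((m, T))  # additive Gaussian sensor noise
x = H @ s + nu                           # observed sensor signals, Eq. (1)

W = np.eye(m)                            # square un-mixing matrix (number of sources unknown)
y = W @ x                                # output signal vector y = W x
```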

II. DERIVATION OF BASIC ADAPTIVE ALGORITHMS

In order to solve the above formulated ICA problem, a key task is to formulate an appropriate loss (cost) function, which should be a function of the parameters of the specified neural network model. Minimization of such a loss function should lead to satisfying the desired conditions (stochastic independence and/or temporal and spatial mutual de-correlation) of the extracted output signals. Recently several researchers (see e.g. Amari [1, 2, 3], Inouye [15], Matsuoka [17], Cardoso [7], Bell and Sejnowski [6], Girolami and Fyfe [14]) have proposed useful criteria and loss functions (also called contrast, cost or energy functions) for BSS. Differential entropy maximization (DEM), independent component analysis (ICA) and maximization of the likelihood (ML) lead to the same type of expected loss function, which is a measure of the mutual stochastic independence of the output signals [6, 12, 3, 7]. The natural gradient search method developed by Amari [1, 2, 3] has emerged as a particularly useful technique for solving iterative optimization problems. A suitable expected loss or risk function can be defined as the Kullback-Leibler divergence [1, 3]

E{ρ(y, W)} = ∫ p_y(y) log [ p_y(y) / p_M(y) ] dy,   (2)

where p_y(y) is the joint probability distribution of the output signals and p_M(y) = Π_i p_i(y_i) is the marginal distribution of y. Such a risk function, which is in fact equal to the mutual information among the output components of y, leads to a simple loss (cost) function

ρ(y; W) = - log det(W) - Σ_{i=1}^{m} log p_i(y_i),   (3)

where p_i(y_i) are the probability density functions (p.d.f.) of the output signals, det(W) denotes the determinant of the matrix W, and (.)^T is the transpose operator.
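A minimal sketch of evaluating the loss (3) on a batch of sensor vectors; the unit-Laplacian log-density used for log p_i is only an illustrative assumption, not the model prescribed by the paper:

```python
import numpy as np

def ica_loss(W, x, log_p=lambda y: -np.abs(y) - np.log(2.0)):
    """Sample estimate of the loss (3) for a square un-mixing matrix W.

    W : (m, m) un-mixing matrix;  x : (m, T) batch of sensor vectors.
    log_p is the assumed log-density of each output component.
    """
    y = W @ x
    return -np.log(np.abs(np.linalg.det(W))) - np.mean(np.sum(log_p(y), axis=0))
```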

Taking into account that the gradient of the loss function can be expressed as

∇_W ρ(W) = ∂ρ(W)/∂W = f(y) x^T - W (W^T W)^{-1},   (4)

and applying the natural gradient approach developed by Amari, we can derive the basic learning rule [1, 3, 4, 8, 10]

W(k+1) = W(k) - η(k) [∂ρ(W)/∂W] W^T W = W(k) + η(k) [Λ(k) - f(y(k)) y^T(k)] W(k),   (5)

where Λ(k) = diag{λ_1(k), ..., λ_n(k)} is any positive-definite diagonal scaling matrix, e.g. Λ(k) = diag{f(y(k)) y^T(k)} or Λ = (1 + σ_ν^2) I (where σ_ν^2 represents the variance of the additive Gaussian noise), and f(y) = [f_1(y_1), ..., f_n(y_n)]^T with the nonlinearities f_i(y_i) = -d log(p_i(y_i))/dy_i = -ṗ_i(y_i)/p_i(y_i).
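A per-sample sketch of one update of the rule (5), assuming a tanh activation and the simple choice Λ = (1 + σ_ν^2) I; the function name and default values are ours, not the paper's:

```python
import numpy as np

def natural_gradient_step(W, x, eta=0.01, sigma2=0.0, f=np.tanh):
    """One update of the basic learning rule (5) for a single sensor sample x.

    W      : current un-mixing matrix (m x m)
    x      : sensor vector of length m
    eta    : learning rate eta(k)
    sigma2 : assumed variance of the additive Gaussian noise
    f      : activation function applied componentwise to y
    """
    y = W @ x
    fy = f(y)
    Lam = (1.0 + sigma2) * np.eye(W.shape[0])   # diagonal scaling matrix Lambda
    W_new = W + eta * (Lam - np.outer(fy, y)) @ W
    return W_new, y
```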

Alternatively, we can use the following (pre-conditioning) filtering gradient [5, 8, 10]:

ΔW(k) = -η(k) W [∂ρ(W)/∂W]^T W = η(k) [Λ(k) - y(k) f(y^T(k))] W(k).   (6)

It can be proved that for the specific nonlinearity f(y_i) = α_i y_i + y_i^3 the algorithm (5) separates sub-Gaussian signals, while the algorithm (6) separates super-Gaussian signals. Analogously, for f(y_i) = tanh(β_i y_i) the learning rule (5) is able to successfully separate the signals if all of them are super-Gaussian, while the learning rule (6) can separate them if all of them are sub-Gaussian. However, if the measured signals x_i(k) contain mixtures of both sub-Gaussian and super-Gaussian sources, then these algorithms may fail to separate the signals reliably (cf. [13, 14]). The above two learning rules can, however, be combined to build up a more general and flexible (universal) learning rule (see Cichocki et al. [8, 10]):

ΔW(k) = η(k) [Λ(k) - f(y(k)) g(y^T(k))] W(k),   (7)

where f(y) and g(y) are suitably designed nonlinear functions. In this paper we propose a new rigorous strategy for designing flexible, near optimal activation functions that enable source signals with arbitrary non-Gaussian distributions to be extracted from the measurements of the mixed signals.



III. DESIGN OF ACTIVATION FUNCTIONS

The performance of the learning algorithms strongly depends on the shape of the activation functions. The optimal selection of the nonlinearities depends on the p.d.f. of the source signals. Let us assume that the source signals have generalized Gaussian distributions of the form

p_i(y_i) = exp(-λ_0i - 1) exp(-λ_1i |y_i|^{q_i}),

where the Lagrange multipliers and the generalized variance are expressed as follows: exp(-λ_0i - 1) = q_i λ_1i^{1/q_i} / (2 Γ(1/q_i)), λ_1i = 1/(q_i σ_i^{q_i}) and σ_i^{q_i} = ⟨|y_i|^{q_i}⟩. The locally optimal flexible normalized nonlinear activation functions can in this case be expressed as

f_i(y_i) = -d log(p_i(y_i))/dy_i = |y_i|^{q_i - 1} sign(y_i),   q_i ≥ 1.   (8)

The parameter q_i can change from 1 (the Laplace distribution), through q_i = 2 (the standard Gaussian distribution), to q_i going to infinity (the uniform distribution). In the general case, when we do not have a priori knowledge about the distribution of the sources, we can start from the standard Gaussian density (q_i = 2, i.e. linear activation functions) and adaptively change these parameters depending on the estimated distance of the density of the actual output signals y_i(k) from Gaussianity.

An analogous nonlinearity can be derived, for example, for the generalized Cauchy distribution as f_i(y_i) = [(ν q_i + 1)/(ν |A(p)|^{q_i} + |y_i|^{q_i})] |y_i|^{q_i - 1} sign(y_i), and for the generalized Rayleigh distribution as f_i(y_i) = |y_i|^{q_i - 2} y_i for complex-valued signals and coefficients.
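The activation (8) and its generalized Cauchy variant are straightforward to implement; in the following sketch the function names are ours, and the constant `a` merely stands in for the |A(p)| term of the text:

```python
import numpy as np

def generalized_gaussian_activation(y, q):
    """Locally optimal activation of Eq. (8): f(y) = |y|^(q-1) * sign(y), q >= 1.

    q = 1 corresponds to a Laplacian source model, q = 2 to a Gaussian one
    (linear activation), and large q approaches a uniform source model.
    """
    return np.abs(y) ** (q - 1.0) * np.sign(y)

def generalized_cauchy_activation(y, q, nu=1.0, a=1.0):
    """Analogous activation for the generalized Cauchy model (illustrative defaults)."""
    return (nu * q + 1.0) / (nu * a ** q + np.abs(y) ** q) * np.abs(y) ** (q - 1.0) * np.sign(y)
```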

Alternatively, we can consider a novel 'robust' generalized Gaussian distribution

p_i(y_i) = e^{-λ_0i - 1} exp(-λ_1i |log(cosh(β_i y_i))|^{q_i}),   1 ≤ q_i ≤ 2,   (9)

where, analogously to the previous model, λ_1i = 1/(q_i σ̃_i^{q_i}) with the generalized variance σ̃_i^{q_i} = ⟨|log(cosh(β_i y_i))|^{q_i}⟩. Such a model of the p.d.f. (for q_i = 1) is especially useful for noisy natural speech signals. For a small variance of the signal y (silent periods) we have log(cosh(β y)) ≈ β^2 y^2 / 2, so we can approximate the distribution (9) by the standard Gaussian distribution p_y(y) = (1/(√(2π) σ_y)) exp(-y^2/(2 σ_y^2)), where σ_y^2 = ⟨y^2⟩. For a large variance of the signal, log(cosh(β y)) ≈ β |y|, so we can approximate the distribution (9) by the Laplace distribution p_y(y) = (1/(2 σ̃)) exp(-|y|/σ̃), where σ̃ = ⟨|y|⟩.
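Numerically, the two regimes of the log-cosh model are easy to check; note that for large amplitudes the more precise asymptote is β|y| - log 2, of which β|y| is the leading term used above (β and the test amplitudes below are arbitrary):

```python
import numpy as np

beta = 2.0
y_small, y_large = 0.05, 20.0

# Small amplitude: log(cosh(beta*y)) ~ (beta*y)**2 / 2  (Gaussian-like behaviour)
print(np.log(np.cosh(beta * y_small)), (beta * y_small) ** 2 / 2)

# Large amplitude: log(cosh(beta*y)) ~ beta*|y| - log(2)  (Laplacian-like behaviour)
print(np.log(np.cosh(beta * y_large)), beta * abs(y_large) - np.log(2))
```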

Using this model and Amari's natural gradient approach for noisy data, we derived a new algorithm of the form (7) with the entries of the diagonal matrix Λ(k) given by

λ_i(k) = σ_ν^2 + (1/β_i) ⟨y_i log(cosh(β_i y_i(k)))⟩,   (10)

and flexible (adaptive) nonlinear activation functions

f_i(y_i) = tanh(β_i y_i) for κ_4(y_i) > δ,   f_i(y_i) = β_i y_i for κ_4(y_i) < δ,   (11)

g_i(y_i) = β_i y_i for κ_4(y_i) > 0,   g_i(y_i) = tanh(β_i y_i) for κ_4(y_i) < 0,   (12)

where κ_4(y_i) = E{y_i^4}/E^2{y_i^2} - 3 is the normalized value of the kurtosis and δ ≥ 0 is a threshold. The value of the kurtosis can be evaluated on-line as

E{y_i^q(k+1)} = (1 - η_0) E{y_i^q(k)} + η_0 |y_i(k)|^q,   (q = 2, 4).   (13)

The above learning algorithm (7), (10)-(13) monitors and estimates the statistics of each output signal and, depending on the sign or value of its normalized kurtosis, automatically selects (or switches) suitable nonlinear activation functions, such that successful (stable) separation of all non-Gaussian source signals is possible. In this approach the activation functions are adaptive, time-varying nonlinearities. It can be shown by mathematical analysis and computer simulation experiments that it is sufficient to use robust (with respect to outliers and spiky noise) nonlinearities of the form f_i(y_i) = tanh(β_i y_i) or g_i(y_i) = tanh(β_i y_i), and it is not necessary to use nonlinearities of the form f_i(y_i) = sign(y_i)|y_i|^{q_i - 1}, which are very sensitive to outliers for q_i ≥ 3.
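A sketch of the kurtosis-driven switching described above; the exact branch assignment follows our reading of Eqs. (11)-(12) (tanh assigned to f_i for super-Gaussian outputs and to g_i for sub-Gaussian ones), and the helper names and defaults are ours:

```python
import numpy as np

def update_moments(m2, m4, y, eta0=0.1):
    """On-line moment estimates, Eq. (13): E{y^q} <- (1 - eta0) E{y^q} + eta0 |y|^q."""
    m2 = (1.0 - eta0) * m2 + eta0 * np.abs(y) ** 2
    m4 = (1.0 - eta0) * m4 + eta0 * np.abs(y) ** 4
    return m2, m4

def switching_nonlinearities(y, m2, m4, beta=10.0, delta=0.0):
    """Select f_i and g_i per output from the sign of the normalized kurtosis,
    in the spirit of Eqs. (11)-(12)."""
    kurt = m4 / np.maximum(m2 ** 2, 1e-12) - 3.0          # normalized kurtosis
    super_gauss = kurt > delta
    f = np.where(super_gauss, np.tanh(beta * y), beta * y)
    g = np.where(super_gauss, beta * y, np.tanh(beta * y))
    return f, g
```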

IV. COMPUTER SIMULATIONS

Extensive computer simulation experiments confirm the validity and high performance of the proposed algorithms. In the case where m ≥ n and the mixing and sensor signals are noiseless, n of the output signals estimate all n unknown sources, while the additional (m - n) outputs decay quickly to zero.

Fig. 1. Exemplary computer simulation results: waveforms of the source signals (S1-S3), the noisy sensor signals (X1-X7) and the output signals (Y1-Y7) versus the number of iterations.

In the more realistic scenario of noisy sensor signals, the additional (m - n) outputs collect only the additive noise (under the condition that the noise is not too large, usually less than 3%), while the other outputs estimate the desired source signals with reduced noise. This feature is a very desirable property of the proposed learning algorithms. Fig. 1 illustrates typical simulation results. Three unknown acoustical signals (two natural speech signals with positive kurtosis and a single tone signal with negative kurtosis) were mixed using a randomly selected 7 × 3 full rank mixing matrix H. To the sensor signals 2% i.i.d. Gaussian noise was added. The mixing matrix, the number of sources as well as their statistics were assumed to be completely unknown. The learning algorithm (7), (10)-(13), with the self-adaptive learning rate η(k) [9] and fixed parameters β_i = 10 for all i, η_0 = 0.1 and σ_ν^2 = 0.025, was able to successfully estimate the number of active sources and their waveforms, and also to 'shift' the noise signals to the free channels.
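An end-to-end sketch in the spirit of this experiment, with surrogate sources and illustrative constants (the actual speech recordings, the self-adaptive learning rate of [9] and the exact parameter values are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Surrogate experiment: two super-Gaussian ("speech-like", here Laplacian) sources
# and one sub-Gaussian tone, mixed by a random full-rank 7 x 3 matrix, 2% noise.
T = 30000
s = np.vstack([rng.laplace(size=T),
               rng.laplace(size=T),
               np.sin(2 * np.pi * 0.005 * np.arange(T))])
H = rng.standard_normal((7, 3))
x = H @ s + 0.02 * rng.standard_normal((7, T))

m = x.shape[0]
W = np.eye(m)                            # square un-mixing matrix, number of sources unknown
m2, m4 = np.ones(m), 3.0 * np.ones(m)    # running moment estimates for Eq. (13)
eta, eta0, beta = 0.001, 0.1, 1.0        # illustrative constants, not the paper's exact values

for k in range(T):
    y = W @ x[:, k]
    m2 = (1 - eta0) * m2 + eta0 * y ** 2                 # Eq. (13), q = 2
    m4 = (1 - eta0) * m4 + eta0 * y ** 4                 # Eq. (13), q = 4
    kurt = m4 / np.maximum(m2 ** 2, 1e-12) - 3.0         # normalized kurtosis
    sup = kurt > 0.0
    f = np.where(sup, np.tanh(beta * y), y)              # switching nonlinearities,
    g = np.where(sup, y, np.tanh(beta * y))              # in the spirit of Eqs. (11)-(12)
    # Universal rule (7) with the simplest choice Lambda = I; the noise-compensated
    # Lambda of Eq. (10) would additionally keep the noise-only outputs small.
    W += eta * (np.eye(m) - np.outer(f, g)) @ W
```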

V. CONCLUSIONS

In this paper we have proposed on-line adaptive learning algorithms for independent component analysis in the general case when the sensor signals are noisy and the number of sources as well as their statistics are completely unknown. Locally optimal nonlinear activation functions are derived for various generalized distribution models. Extensive computer simulations have confirmed the validity and excellent performance of the developed learning algorithms.

References

[1] S. Amari, A. Cichocki and H.H. Yang, "A new learning algorithm for blind signal separation", Advances in Neural Information Processing Systems 1995, MIT Press, Boston, MA, 1996, pp. 752-763.
[2] S. Amari, "Natural gradient works efficiently in learning", Neural Computation, 1997 (in print).
[3] S. Amari, T.-P. Chen and A. Cichocki, "Stability analysis of adaptive blind source separation", Neural Networks, 1997 (in print).
[4] S. Amari, T.-P. Chen and A. Cichocki, "Nonholonomic orthogonal learning algorithms for blind separation", ICONIP-97 (in print).
[5] J.J. Atick and A.N. Redlich, "Convergent algorithm for sensory receptive fields development", Neural Computation, Vol. 5, 1993, pp. 45-60.
[6] A.J. Bell and T.J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution", Neural Computation, Vol. 7, 1995, pp. 1129-1159.
[7] J.-F. Cardoso and B. Laheld, "Equivariant adaptive source separation", IEEE Trans. Signal Processing, Vol. 44, No. 12, December 1996, pp. 3017-3030.
[8] A. Cichocki, R. Unbehauen, L. Moszczynski and E. Rummert, "A new on-line adaptive learning algorithm for blind separation of source signals", in Proc. ISANN-94, Taiwan, 1994, pp. 406-411.
[9] A. Cichocki, S. Amari, M. Adachi and W. Kasprzak, "Self-adaptive neural networks for blind separation of sources", in Proc. 1996 IEEE International Symposium on Circuits and Systems, IEEE, Piscataway, Vol. 2, 1996, pp. 157-160.
[10] A. Cichocki and R. Unbehauen, "Robust neural networks with on-line learning for blind identification and blind separation of sources", IEEE Trans. on Circuits and Systems - I, Vol. 43, Nov. 1996, pp. 894-906.
[11] A. Cichocki, "Blind separation and extraction of source signals - recent results and open problems", in Proc. 41st Annual Conference of the Institute of Systems, Control and Information Engineers (ISCIE-97), Osaka, Japan, May 21-23, 1997, pp. 43-48.
[12] P. Comon, "Independent component analysis: a new concept?", Signal Processing, Vol. 36, 1994, pp. 287-314.
[13] S.C. Douglas, A. Cichocki and S. Amari, "Multichannel blind separation and deconvolution of sources with arbitrary distributions", in Proc. IEEE Workshop on Neural Networks for Signal Processing, Sept. 1997, IEEE, New York, pp. 436-444.
[14] M. Girolami and C. Fyfe, "Extraction of independent signal sources using a deflationary exploratory projection pursuit network with lateral inhibition", IEE Proceedings on Vision, Image and Signal Processing, 1997 (in print).
[15] Y. Inouye and T. Sato, "On-line algorithms for blind deconvolution of multichannel linear time-invariant systems", in Proc. IEEE Signal Processing Workshop on Higher-Order Statistics, 1997, pp. 204-208.
[16] Z. Malouche and O. Macchi, "Adaptive separation of an unknown number of sources", in Proc. IEEE Workshop on Higher Order Statistics, IEEE Computer Society, Los Alamitos, 1997, pp. 295-299.
[17] K. Matsuoka, M. Ohya and M. Kawamoto, "A neural net for blind separation of non-stationary signals", Neural Networks, Vol. 8, No. 3, 1995, pp. 411-419.
