SELF-ADAPTIVE INDEPENDENT COMPONENT ANALYSIS FOR SUB-GAUSSIAN AND SUPER-GAUSSIAN MIXTURES WITH AN UNKNOWN NUMBER OF SOURCES AND ADDITIVE NOISE

Andrzej CICHOCKI 1, Ireneusz SABALA 2, Seungjin CHOI 3, Bruno ORSIER 1 and Ryszard SZUPILUK 2

1 Brain Science Institute RIKEN, Laboratory for Open Information Systems, JAPAN, [email protected]
2 Warsaw University of Technology, POLAND, [email protected]
3 Chungbuk National University, KOREA, [email protected]
Abstract: In this paper we derive and analyze unsupervised adaptive on-line algorithms for instantaneous blind separation of sources (BSS) in the case when the sensor signals are noisy and are mixtures of an unknown number of independent source signals with unknown statistics. Nonlinear activation functions are rigorously derived under the assumption that the sources have generalized Gaussian, Cauchy or Rayleigh distributions. Extensive computer simulations confirmed that the proposed family of learning algorithms is able to separate sources from mixtures of sub- and super-Gaussian sources.
I. INTRODUCTION

The problem of independent component analysis (ICA) and/or blind separation or extraction of source signals from their mixtures has become increasingly important, due both to still open theoretical problems and to many potential applications, e.g. in speech recognition and enhancement, telecommunications, and biomedical signal analysis and processing (EEG, MEG, ECG). While several recently developed algorithms have shown promise for solving practical tasks, they may fail to separate on-line (non-stationary) signal mixtures containing both sub- and super-Gaussian distributed source signals, especially when the number of sources is unknown and changes dynamically over time. The problem of on-line estimation of sources when their number is unknown is relevant in many practical applications, such as the analysis of EEG signals and the "cocktail party problem", where the number of source signals usually changes over time. In this paper we propose a solution to this problem under the assumption that the number of sources is less than or equal to the number of sensors.
Mathematically the problem is formulated as follows: the mixing model is described by the matrix equation [1]-[17]

x(k) = H(k) s(k) + \nu(k),    (1)

where x(k) is an m-dimensional observable sensor vector, s(k) is an n-dimensional vector of unknown source signals, H(k) is a full-rank m \times n mixing matrix (with m \geq n), and \nu(k) is an m-dimensional additive Gaussian noise vector, uncorrelated with (stochastically independent of) the source signals. It is assumed that only the sensor vector x = [x_1, ..., x_m]^T is available, and it is necessary to design a feed-forward neural network and an associated adaptive learning algorithm which enables estimation of the sources and/or identification of the mixing matrix H with good tracking abilities. The problem is often referred to as noisy ICA (independent component analysis): the ICA of a noisy random vector x is obtained by finding an n \times m, full-rank, linear transformation (un-mixing) matrix W such that the output signal vector y = [y_1, ..., y_n]^T, defined as y = W x, contains components that are as independent as possible, as measured by an information-theoretic cost function such as the minimum Kullback-Leibler divergence. In other words, it is required to adapt the synaptic weights w_{ij} of the linear system y(t) = W(t) x(t) (often referred to as a single-layer feed-forward neural network) so as to combine the observations x_i(t) into estimates of the source signals, \hat{s}_j(t) = y_j(t) = \sum_{i=1}^{m} w_{ji} x_i(t). The optimal weights correspond to statistical independence of the output signals y_j(t). In the special case when the number of sources is known, the un-mixing matrix W can be taken as n \times m; in the general case, when the number of sources is unknown, we assume that W is an m \times m square matrix.
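To fix notation, the following minimal numerical sketch sets up the mixing model (1) and the linear de-mixing stage. It is our own illustration, not part of the original paper: the dimensions, source densities, noise level and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 7, 10000       # illustrative: 3 sources, 7 sensors, 10^4 samples

# Synthetic sources: two super-Gaussian (Laplacian) and one sub-Gaussian (uniform)
s = np.vstack([rng.laplace(size=T),
               rng.laplace(size=T),
               rng.uniform(-1.0, 1.0, size=T)])

H = rng.standard_normal((m, n))              # unknown full-rank mixing matrix H
nu = 0.02 * rng.standard_normal((m, T))      # additive Gaussian noise nu(k)
x = H @ s + nu                               # sensor signals, model (1)

W = np.eye(m)       # m x m un-mixing matrix (number of sources unknown)
y = W @ x           # output signals y = W x, to be adapted toward independence
```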
II. DERIVATION OF BASIC ADAPTIVE ALGORITHMS
In order to solve the above formulated ICA problem, a key task is to formulate an appropriate loss (cost) function, which should be a function of the parameters of the specified neural network model. Minimization of such a loss function should lead to the desired conditions (stochastic independence and/or temporal and spatial mutual de-correlation) of the extracted output signals. Recently several researchers (see e.g. Amari [1, 2, 3], Inouye [15], Matsuoka [17], Cardoso [7], Bell and Sejnowski [6], Girolami and Fyfe [14]) have proposed useful criteria and loss functions (also called contrast, cost or energy functions) for BSS. Differential entropy maximization (DEM), independent component analysis (ICA) and maximization of likelihood (ML) lead to the same type of expected loss function, which is a measure of the mutual stochastic independence of the output signals [6, 12, 3, 7]. The natural gradient search method developed by Amari [1, 2, 3] has emerged as a particularly useful technique for solving such iterative optimization problems. A suitable expected loss or risk function can be defined as the Kullback-Leibler divergence [1, 3]

E\{\rho(y, W)\} = \int p_y(y) \log \frac{p_y(y)}{p_M(y)} \, dy,    (2)

where p_y(y) is the joint probability distribution of the output signals and p_M(y) = \prod_i p_i(y_i) is the product of the marginal distributions of y. Such a risk function, which is in fact equal to the mutual information among the output components of y, leads to a simple loss (cost) function

\rho(y, W) = -\log \det(W) - \sum_{i=1}^{m} \log p_i(y_i),    (3)

where the p_i(y_i) are probability density functions (p.d.f.) of the output signals, \det(W) denotes the determinant of the matrix W, and (\cdot)^T is the transpose operator. The gradient of the loss function can be expressed as

\nabla_W \rho(W) = \frac{\partial \rho(W)}{\partial W} = f(y) x^T - W (W^T W)^{-1}.    (4)
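As a quick numerical sanity check of (3) and (4), the following sketch (our own illustration, not from the paper) evaluates the sample-averaged loss and its gradient for a square W, assuming a Laplacian source model p_i(y_i) proportional to exp(-|y_i|), for which f_i(y_i) = sign(y_i):

```python
import numpy as np

def loss(W, x):
    """Sample average of (3) for a Laplacian model p_i(y) ~ exp(-|y|)."""
    y = W @ x
    _, logdet = np.linalg.slogdet(W)        # log det(W) for square W
    return -logdet + np.mean(np.sum(np.abs(y), axis=0))

def grad(W, x):
    """Gradient (4): f(y) x^T - W (W^T W)^{-1}, with f(y) = sign(y)."""
    y = W @ x
    return (np.sign(y) @ x.T) / x.shape[1] - W @ np.linalg.inv(W.T @ W)
```

A finite-difference comparison of `grad` against `loss` confirms the expression; for square W the last term reduces to the inverse-transpose W^{-T}.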
Applying the natural gradient approach developed by Amari to this gradient, we can derive the basic learning rule [1, 3, 4, 8, 10]

W(k+1) = W(k) - \eta(k) \frac{\partial \rho(W)}{\partial W} W^T W = W(k) + \eta(k) [\Lambda(k) - f(y(k)) y^T(k)] W(k),    (5)

where \Lambda(k) = diag\{\lambda_1(k), ..., \lambda_n(k)\} is any positive-definite diagonal scaling matrix, e.g. \Lambda(k) = diag\{f(y(k)) y^T(k)\} or \Lambda = (1 + \sigma^2) I (where \sigma^2 represents the variance of the additive Gaussian noise), and f(y) = [f_1(y_1), ..., f_n(y_n)]^T with nonlinearities f_i(y_i) = -d \log(p_i(y_i))/dy_i = -\dot{p}_i(y_i)/p_i(y_i). Alternatively, we can use the following (pre-conditioning) filtering gradient [5, 8, 10]:

W(k+1) = W(k) - \eta(k) W W^T \frac{\partial \rho(W)}{\partial W} = W(k) + \eta(k) [\Lambda(k) - y(k) f(y^T(k))] W(k).    (6)

It can be proved that for the specific nonlinearity f_i(y_i) = \alpha_i y_i + y_i^3 the algorithm (5) separates sub-Gaussian signals, while the algorithm (6) separates super-Gaussian signals. Analogously, for f_i(y_i) = \tanh(\beta_i y_i) the learning rule (5) is able to successfully separate the signals if all of them are super-Gaussian, while the learning rule (6) can separate them if all of them are sub-Gaussian. However, if the measured signals x_i(k) contain mixtures of both sub-Gaussian and super-Gaussian sources, then these algorithms may fail to separate the signals reliably (cf. [13, 14]). In this paper we propose a new rigorous strategy for designing flexible, near-optimal activation functions that enable source signals with arbitrary non-Gaussian distributions to be extracted from the measurements of the mixed signals.

The above two learning rules can be combined to build a more general and flexible (universal) learning rule (see Cichocki et al. [8, 10]):

\Delta W(k) = \eta(k) [\Lambda(k) - f(y(k)) g(y^T(k))] W(k),    (7)

where f(y(k)) and g(y(k)) are suitably designed nonlinear functions.
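A minimal on-line implementation of the universal rule (7) could look as follows. This is our sketch: \Lambda(k) = diag\{f(y) g(y)\} is chosen as one admissible scaling, and the step size is left as a free parameter.

```python
import numpy as np

def universal_update(W, x_k, eta, f, g):
    """One step of rule (7): W <- W + eta * [Lambda - f(y) g(y)^T] W."""
    y = W @ x_k                     # current output vector y(k) = W x(k)
    fy, gy = f(y), g(y)
    Lam = np.diag(fy * gy)          # positive-definite diagonal scaling Lambda(k)
    return W + eta * (Lam - np.outer(fy, gy)) @ W
```

With f(y) = tanh(beta*y) and g(y) = y this recovers the super-Gaussian form of rule (5); swapping the roles of f and g gives the sub-Gaussian form of rule (6).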
III. DESIGN OF ACTIVATION FUNCTIONS

The performance of the learning algorithms strongly depends on the shape of the activation functions, and the optimal selection of the nonlinearities depends on the p.d.f. of the source signals. Let us assume that the source signals have generalized Gaussian distributions of the form

p_i(y_i) = \exp(-\lambda_{0i} - 1) \exp(-\lambda_{1i} |y_i|^{q_i}),

where the Lagrange multipliers and the generalized variance are expressed as follows: \exp(\lambda_{0i} + 1) = 2 \sigma_i q_i^{1/q_i} \Gamma(1 + 1/q_i), \lambda_{1i} = 1/(q_i \sigma_i^{q_i}), and \sigma_i^{q_i} = \langle |y_i|^{q_i} \rangle. The locally optimal flexible normalized nonlinear activation functions can in this case be expressed as

f_i(y_i) = -\frac{d \log(p_i(y_i))}{dy_i} = \frac{|y_i|^{q_i - 1} \operatorname{sign}(y_i)}{\sigma_i^{q_i}}.    (8)

The parameter q_i can change from q_i = 1 (Laplace distribution), through q_i = 2 (standard Gaussian distribution), to q_i going to infinity (uniform distribution). In the general case, when we do not have a priori knowledge about the distributions of the sources, we can start from the standard Gaussian density (q_i = 2, i.e. linear activation functions) and adaptively change these parameters depending on the estimated distance of the density of the actual output signals y_i(k) from Gaussianity.

Analogous nonlinearities can be derived, for example, for the generalized Cauchy distribution as f_i(y_i) = [(\nu q_i + 1)/(\nu |A(p)|^{q_i} + |y_i|^{q_i})] |y_i|^{q_i - 1} \operatorname{sign}(y_i), and for the generalized Rayleigh distribution as f_i(y_i) = |y_i|^{q_i - 2} y_i for complex-valued signals and coefficients.

Alternatively, we can consider a novel 'robust' generalized Gaussian distribution

p_i(y_i) = e^{-\lambda_{0i} - 1} \exp(-\lambda_{1i} \tilde{\sigma}_i^{q_i}), \qquad \tilde{\sigma}_i = \log(\cosh(\beta_i y_i))/\beta_i,    (9)

with \beta_i \approx 1-2. Such a model of the p.d.f. (for q_i = 1) is especially useful for noisy natural speech signals.
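Both activation families are straightforward to code. The sketch below is ours; the generalized variance sigma_i^{q_i} is supplied as a running moment estimate rather than a known constant.

```python
import numpy as np

def f_generalized_gaussian(y, q, sigma_q):
    """Locally optimal activation (8): |y|^(q-1) sign(y) / sigma^q."""
    return np.abs(y) ** (q - 1.0) * np.sign(y) / sigma_q

def f_robust_logcosh(y, beta=2.0):
    """Activation from the robust model (9) with q = 1: the derivative
    of log(cosh(beta*y))/beta is tanh(beta*y)."""
    return np.tanh(beta * y)
```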
For a small variance of the signal y we have \log(\cosh(\beta y))/\beta \approx (\beta/2) y^2, so we can approximate the distribution (9) by the standard Gaussian distribution p_y(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp(-\frac{y^2}{2\sigma^2}), where \sigma^2 = \langle y^2 \rangle. For a large variance of the signal, \log(\cosh(\beta y))/\beta \approx |y|, so we can approximate the distribution (9) by the Laplace distribution p_y(y) = \frac{1}{2\sigma} \exp(-\frac{|y|}{\sigma}), where \sigma = \langle |y| \rangle.

Using this model and Amari's natural gradient approach for noisy data, we have derived a new algorithm of the form (7) with the entries of the diagonal matrix \Lambda(k) given by

\lambda_i(k) = \sigma^2 + \frac{1}{\beta_i} \log(\cosh(\beta_i y_i(k)))    (10)

and flexible (adaptive) nonlinear activation functions

f_i(y_i) = \tanh(\beta_i y_i) for \kappa_4(y_i) > \Delta, and f_i(y_i) = y_i for \kappa_4(y_i) < \Delta;    (11)

g_i(y_i) = y_i for \kappa_4(y_i) > \Delta, and g_i(y_i) = \tanh(\beta_i y_i) for \kappa_4(y_i) < \Delta,    (12)

where \kappa_4(y_i) = E\{y_i^4\}/E^2\{y_i^2\} - 3 is the normalized value of the kurtosis and \Delta is a threshold. The value of the kurtosis can be evaluated on-line as

\hat{E}\{y_i^q(k+1)\} = (1 - \mu) \hat{E}\{y_i^q(k)\} + \mu |y_i(k)|^q \quad (q = 2, 4).    (13)

The above learning algorithm (7), (10)-(13) monitors and estimates the statistics of each output signal and, depending on the sign or value of its normalized kurtosis, automatically selects (or switches) suitable nonlinear activation functions, such that successful (stable) separation of all non-Gaussian source signals is possible. In this approach the activation functions are adaptive, time-varying nonlinearities. It can be shown by mathematical analysis and computer simulation experiments that it is sufficient to use nonlinearities of the form f_i(y_i) = \tanh(\beta_i y_i) or g_i(y_i) = \tanh(\tilde{\beta}_i y_i), which are robust with respect to outliers and spiky noise, and that it is not necessary to use nonlinearities of the form f_i(y_i) = \operatorname{sign}(y_i) |y_i|^{q_i - 1}, which are rather sensitive to outliers for q_i \geq 3.
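The on-line moment tracker (13) and the kurtosis-driven switch (11)-(12) amount to a few lines. This is our sketch; the small constant guarding the division in the kurtosis estimate is an implementation assumption.

```python
import numpy as np

def track_moments(m2, m4, y, mu=0.025):
    """On-line estimator (13) of E{y^2} and E{y^4} with forgetting factor mu."""
    return (1 - mu) * m2 + mu * y**2, (1 - mu) * m4 + mu * y**4

def switched_nonlinearities(m2, m4, y, beta=10.0, delta=0.1):
    """Kurtosis-based switching (11)-(12); kappa4 = E{y^4}/E^2{y^2} - 3."""
    kappa4 = m4 / (m2**2 + 1e-12) - 3.0
    f = np.where(kappa4 > delta, np.tanh(beta * y), y)     # (11)
    g = np.where(kappa4 > delta, y, np.tanh(beta * y))     # (12)
    return f, g
```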
IV. COMPUTER SIMULATIONS
Extensive computer simulation experiments confirm the validity and high performance of the proposed algorithms. In the case where m > n and the mixture and sensors are noiseless, n of the output signals estimate all n unknown sources, while the additional (m - n) outputs decay quickly to zero.
[Figure 1 here: waveforms of the three source signals S1-S3 (including a silent period), the seven sensor signals X1-X7, and the seven output signals Y1-Y7, each plotted against the number of iterations.]
Fig. 1. Exemplary computer simulation results.

In the more realistic scenario of noisy sensor signals, the additional (m - n) outputs collect only the additive noise (under the condition that the noise is not too large, usually less than 3%), while the other outputs estimate the desired source signals with reduced noise. This feature is a very desirable property of the proposed learning algorithms. Fig. 1 illustrates typical simulation results. Three unknown acoustic signals (two natural speech signals with positive kurtosis and a single tone signal with negative kurtosis) were mixed using a randomly selected 7 x 3 full-rank mixing matrix H. To the sensor signals, 2% i.i.d. Gaussian noise was added. The mixing matrix, the number of sources, as well as their statistics were assumed to be completely unknown. The learning algorithm (7), (10)-(13) with the self-adaptive learning rate \eta(k) [9] and parameters \mu = 0.025, \Delta = 0.1 and fixed \beta_i = 10 \forall i was able to successfully estimate the number of active sources and their waveforms, and also to 'shift' the noise signals to the free channels.
V. CONCLUSIONS
In this paper we have proposed on-line adaptive learning algorithms for independent component analysis in the general case when the sensor signals are noisy and the number of sources as well as their statistics are completely unknown. Locally optimal nonlinear activation functions have been derived for various generalized distribution models. Extensive computer simulations have confirmed the validity and excellent performance of the developed learning algorithms.
References
[1] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation", Advances in Neural Information Processing Systems 1995 (Boston, MA: MIT Press, 1996), pp. 752-763.
[2] S. Amari, "Natural gradient works efficiently in learning", Neural Computation, 1997 (in print).
[3] S. Amari, T.-P. Chen, and A. Cichocki, "Stability analysis of adaptive blind source separation", Neural Networks, 1997 (in print).
[4] S. Amari, T.-P. Chen, and A. Cichocki, "Nonholonomic orthogonal learning algorithms for blind separation", ICONIP-97 (in print).
[5] J. J. Atick and A. N. Redlich, "Convergent algorithm for sensory receptive fields development", Neural Computation, Vol. 5, 1993, pp. 45-60.
[6] A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution", Neural Computation, Vol. 7, 1995, pp. 1129-1159.
[7] J.-F. Cardoso and B. Laheld, "Equivariant adaptive source separation", IEEE Trans. Signal Processing, Vol. SP-44, No. 12, December 1996, pp. 3017-3030.
[8] A. Cichocki, R. Unbehauen, L. Moszczynski and E. Rummert, "A new on-line adaptive learning algorithm for blind separation of source signals", in Proc. of ISANN-94, Taiwan, 1994, pp. 406-411.
[9] A. Cichocki, S. Amari, M. Adachi and W. Kasprzak, "Self-adaptive neural networks for blind separation of sources", 1996 IEEE International Symposium on Circuits and Systems, IEEE, Piscataway, Vol. 2, 1996, pp. 157-160.
[10] A. Cichocki and R. Unbehauen, "Robust neural networks with on-line learning for blind identification and blind separation of sources", IEEE Trans. on Circuits and Systems - I, Vol. 43, Nov. 1996 (submitted in June 1994), pp. 894-906.
[11] A. Cichocki, "Blind separation and extraction of source signals - recent results and open problems", Proc. of 41st Annual Conference of the Institute of Systems, Control and Information Engineers, ISCIE-97, Osaka, Japan, May 21-23, 1997, pp. 43-48.
[12] P. Comon, "Independent component analysis: a new concept?", Signal Processing, Vol. 36, 1994, pp. 287-314.
[13] S. C. Douglas, A. Cichocki and S. Amari, "Multichannel blind separation and deconvolution of sources with arbitrary distributions", in Proc. IEEE Workshop on Neural Networks for Signal Processing, Sept. 1997, IEEE, New York, pp. 436-444.
[14] M. Girolami and C. Fyfe, "Extraction of independent signal sources using a deflationary exploratory projection pursuit network with lateral inhibition", IEE Proceedings on Vision, Image and Signal Processing, 1997 (in print).
[15] Y. Inouye and T. Sato, "On-line algorithms for blind deconvolution of multichannel linear time-invariant systems", in Proc. IEEE Signal Processing Workshop on HOS, 1997, pp. 204-208.
[16] Z. Malouche and O. Macchi, "Adaptive separation of unknown number of sources", Proc. of IEEE Workshop on Higher Order Statistics, IEEE Computer Society, Los Alamitos, 1997, pp. 295-299.
[17] K. Matsuoka, M. Ohya and M. Kawamoto, "A neural net for blind separation of non-stationary signals", Neural Networks, Vol. 8, No. 3, 1995, pp. 411-419.