‘Mechanical’ Neural Learning and InfoMax Orthonormal Independent Component Analysis

Simone Fiori and Pietro Burrascano
Dept. of Industrial Engineering – University of Perugia (Italy)
E-mail: [email protected], [email protected]

Abstract

With this paper we aim to present a new class of learning models for linear as well as non-linear neural layers, deriving from the study of the dynamics of an abstract rigid mechanical system. The set of equations describing the motion of this system may be readily interpreted as a learning rule for orthogonal networks. As a simple example of how to use the new learning theory, a case of Orthonormal Independent Component Analysis based on the Bell-Sejnowski InfoMax principle is discussed through simulations.

* This research was supported by the Italian MURST.

1. Introduction

In this paper we present a new class of semi-general-purpose learning algorithms for neural networks whose derivation is based on the analysis of the dynamics of a rigid system of masses in an abstract space. We then show the connection with network theory by observing that the set of equations describing the dynamics of such a system may be directly interpreted as a learning algorithm for linear and non-linear neural layers. Orthonormal learning [5] makes it possible to solve Ortho-Normal Problems (ONPs). ONPs arise in several contexts, such as Principal Component/Subspace Analysis [7, 9, 12], Orthonormal Independent Component Analysis [3, 4, 6, 8], Direction Of Arrival (DOA) estimation and frequency tracking [3, 7], ‘best basis’ search and orthogonal transforms like wavelet packets or local cosines (see [11] and references therein), and various other signal processing applications. In an ONP the target of the adaptation rule for a neural network is to learn an orthonormal matrix (i.e. a matrix $W$ such that $W^T W = I$) related in some way to the input signal. Since it is known a priori that the final state must belong to the subset $H$ of the whole search space containing the orthonormal matrices, we can strongly bind the evolution of $W$ so that it always belongs to $H$. In this way

many wrong searching steps can be avoided. Furthermore, the presence of local extrema in the cost/objective function driving the network’s learning becomes less dangerous, since only the orthonormal local extrema may affect the learning. As we shall see, the equations describing the dynamics of the mechanical system possess a fixed structure, and the flexibility of the algorithm (which makes it a class) is due to the freedom in the choice of the ‘forcing terms’ which cause the global motion of the system. Since we suppose such forces to derive from a Potential Energy Function, we briefly discuss the relationship between the choice of that function and the tasks we can perform by means of the associated neural system. In practice, we solve the strong-binding problem by adopting as columns of the weight matrix $W$ the position vectors of some masses of a rigid system: because of the intrinsic rigidity of the system, the required constraint is always fulfilled.
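As a concrete reading of the constraint, the short numerical sketch below (ours, purely illustrative and not part of the original derivation) measures how far a candidate weight matrix is from the set $H$ of orthonormal matrices; the MEC rule derived in the next section keeps this error at zero by construction.

```python
import numpy as np

def orthonormality_error(W):
    """Distance of W from the ONP constraint set H = {W : W^T W = I}."""
    return np.linalg.norm(W.T @ W - np.eye(W.shape[1]))

# A matrix with orthonormal columns (e.g. the Q factor of a QR decomposition)
# satisfies the constraint up to machine precision:
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((5, 3)))
print(orthonormality_error(W))   # ~1e-16
```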

2. Dynamics of a rigid system in $R^p$

Let $S^\star = \{[2 m_i, w_i]\}$ be a rigid system of masses, where the $m$ vectors $w_i$ represent the instantaneous positions of the $m$ masses $m_i$ in a coordinate system. Such masses are positioned at constant (unitary) distances from the origin $O$, which is fixed in the space $R^p$, and along mutually orthogonal axes. The masses $m_i$ move in the space $R^p$, where a physical point with negligible mass moves too; its position with respect to $O$ is described by an independent vector $x$. The point exerts a force $2 f_i$ on each $i$-th mass, and the set of forces so generated causes the motion of the global system $S^\star$. Furthermore, the masses move in a homogeneous and isotropic fluid endowed with a non-negligible viscosity whose resistance, braking the system motion, makes the system dissipative and stabilizes its dynamics. Since by definition the system has been assumed rigid, with the axes origin $O$ fixed, the masses $m_i$ are only allowed to instantaneously rotate around that point, while they cannot

translate with respect to it. For this reason the massive system $S^\star$ is dynamically equivalent to the adjoint system, denoted in the following by $S$ and defined as $\{[m_i, w_i], [m_i, -w_i]\}$. By construction this system is symmetric: if a physical mass $m_i$ is positioned at $w_i$ and the force $f_i$ is applied to it, then a fictitious ghost-mass is positioned at $-w_i$ and the force $-f_i$ is applied to it. In this way the resultant of the active forces is null at any time, therefore the system cannot translate, coherently with what was specified above. On the other hand, the resultant momentum of the active forces applied on $S$ is equal to the resultant momentum of the forces applied on $S^\star$. For this reason in the following we shall refer to the adjoint system $S$ instead of $S^\star$. The dynamics of this system is described by the following Theorem:

Theorem 1 (Dynamics of the system $S$.) Let $S$ be the physical system described above; denote with $F$ the matrix of the active forces, with $P$ the matrix of the viscosity resistances, with $H$ the angular speed matrix, with $M$ the diagonal matrix of the masses and with $W$ the matrix of the instantaneous positions of the masses (see the proof). In the special case where $M = I$, the motion of the system obeys the following equations:

$$\frac{dW}{dt} = H W, \qquad P = -\eta\, H W, \tag{1}$$
$$\frac{dH}{dt} = \frac{1}{4}\left[(F + P)\, W^T - W\, (F + P)^T\right], \tag{2}$$

with $\eta$ being a positive parameter called the viscosity coefficient.

Sketch of proof. To ensure the rigidity of the system at any time, it is sufficient to give the equation describing the motion of any single vector $w_i$. In particular, we require that the evolution of each position vector is described by the equation:

$$\frac{d w_i(t)}{dt} = H(t)\, w_i(t), \tag{3}$$

for $i = 1, 2, \dots, m$, with $H^T = -H$. The condition imposed on the matrix $H(t)$ is sufficient. To assess this statement we need to show that $d[w_i^T(t) w_k(t)]/dt = 0$ at any time, provided that $w_i^T(0) w_k(0) = \delta_{ik}$, where $\delta_{ik}$ is the Kronecker delta: in this way $w_i^T(t) w_k(t) = \delta_{ik}$ will hold at any time. To this aim, let us evaluate the first derivative of a generic scalar product: $\frac{d(w_i^T w_k)}{dt} = \frac{dw_i^T}{dt} w_k + w_i^T \frac{dw_k}{dt}$. By replacing the derivatives with the right-hand side of (3) we obtain $\frac{d(w_i^T w_k)}{dt} = w_i^T (H^T + H) w_k = 0$, because of the skew-symmetry of $H$.
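A quick numerical illustration of this argument (our own sketch, taking a constant $H$ for simplicity) exploits the fact that the flow of (3) over a time interval $t$ is then the matrix exponential of $tH$, which is orthogonal whenever $H$ is skew-symmetric, so every scalar product $w_i^T w_k$ is preserved:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A - A.T                                         # skew-symmetric: H^T = -H
W0, _ = np.linalg.qr(rng.standard_normal((4, 3)))   # W0^T W0 = I
Wt = expm(0.3 * H) @ W0                             # positions after integrating (3) up to t = 0.3
print(np.allclose(Wt.T @ Wt, np.eye(3)))            # True: orthonormality is preserved
```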

Now, the instantaneous acceleration $a_i$ of each mass follows from the basic equations:

$$m_i a_i = f_i^{(+)} + r_i^{(+)} + p_i^{(+)}, \qquad -m_i a_i = f_i^{(-)} + r_i^{(-)} + p_i^{(-)}, \tag{4}$$

where $f_i^{(+)}$ is the active force exerted by the point on mass $m_i$, $r_i^{(+)}$ is the resultant of the rigidity constraints and of any internal forces acting on the mass $m_i$, and $p_i^{(+)}$ is the viscosity resistance exerted by the fluid on the same mass; the superscript $(+)$ denotes a force applied on $m_i$, while the superscript $(-)$ denotes the corresponding force applied on the ‘ghost’ of $m_i$ at $-w_i$. By subtracting the second equation from the first one, side by side, we obtain:

$$2 m_i a_i = f_i + r_i + p_i, \tag{5}$$

where $f_i := f_i^{(+)} - f_i^{(-)}$, $r_i := r_i^{(+)} - r_i^{(-)}$ and $p_i := p_i^{(+)} - p_i^{(-)}$. The instantaneous acceleration $a_i$ can be expressed in terms of $H$ and $w_i$ by differentiating equation (3):

$$a_i = \frac{d}{dt}\frac{d w_i}{dt} = \left(\frac{dH}{dt} + H^2\right) w_i. \tag{6}$$

Plugging equation (6) in (5) yields:





$$2\left(\frac{dH}{dt} + H^2\right) w_i\, m_i = f_i + r_i + p_i. \tag{7}$$

The set of equations (7) for $i \in \{1, \dots, m\}$ can be rewritten in a compact way. Introducing the matrices $F := [f_1\; f_2 \dots f_m]$, $R := [r_1\; r_2 \dots r_m]$, $P := [p_1\; p_2 \dots p_m]$ and $M := \mathrm{diag}(m_1, \dots, m_m)$, equations (7) become:

$$2\left(\frac{dH}{dt} + H^2\right) W M = F + R + P. \tag{8}$$

At any time the matrices $W$, $F$, $P$ and $H$ are known; the unknown is $dH/dt$. The matrix $R$ could be evaluated, but this is not needed since (as will be shown later) it does not affect the system motion. Now, let us look for a solution of the differential equation (8) of the form $\frac{dH}{dt} + H^2 = X W^T$, with $X$ unknown to be determined¹. Replacing this expression in (8) and recalling that $W^T W = I$ at any time instant, we find that $X$ satisfies $2 X M = F + R + P$, whereby the equation:

$$\frac{dH}{dt} + H^2 = \frac{1}{2}\,(F + R + P)\, M^{-1} W^T \tag{9}$$

follows. Let us now introduce the skew-symmetry constraint by transposing both sides of (9). In this way we obtain:

$$-\frac{dH}{dt} + H^2 = \frac{1}{2}\, W M^{-1} (F + R + P)^T. \tag{10}$$

¹ This expression intrinsically implies that $X W^T$ is a correct solution: this fact is not obvious, but it can be formally proven [5].

Subtracting equation (10) from equation (9) side by side, introducing the condition $M = I$, and observing that the matrix $H^2$ is symmetric, we obtain:

$$4\,\frac{dH}{dt} = (F + P)\, W^T - W\, (F + P)^T + R\, W^T - W\, R^T. \tag{11}$$

The rigidity of the system is ensured by the skew-symmetry of the matrix $H$, and the internal forces cannot produce motion; therefore $R W^T - W R^T = 0$ must hold², and formula (11) leads to equation (2). Finally, we need to determine the structure of $P$. By definition, $p_i^{(+)}$ is a viscosity resistance, therefore it can be assumed equal to $p_i^{(+)} = -\frac{1}{2}\eta\, \frac{d w_i}{dt}$, where $\eta$ is a positive parameter called the viscosity coefficient. Then it holds true that:

$$p_i^{(+)} = -\tfrac{1}{2}\eta\, H w_i, \qquad p_i^{(-)} = +\tfrac{1}{2}\eta\, H w_i, \qquad \text{hence} \quad p_i = -\eta\, H w_i \ \ \text{and} \ \ P = -\eta\, H W.$$

The theorem just proved ensures that if $W^T W = I$ at $t = 0$, the same property holds at any time $t > 0$, too.
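To make the theorem operational, the following discrete-time sketch of equations (1)-(2) may help. It is our own illustration rather than a discretization prescribed by the paper: $H$ is advanced with a plain Euler step, while $W$ is moved through the exponential map, so that the orthonormality guaranteed by the theorem is preserved at machine precision.

```python
import numpy as np
from scipy.linalg import expm

def mec_step(W, H, F, eta=0.5, dt=0.01):
    """One hypothetical discrete-time step of the MEC dynamics (1)-(2).

    W   : p x m matrix of mass positions (the network weights), W^T W = I
    H   : p x p skew-symmetric angular-speed matrix
    F   : p x m matrix of active forces (problem dependent, see Section 3)
    eta : viscosity coefficient; dt : integration step (both illustrative)
    """
    P = -eta * H @ W                        # viscosity resistance, eq. (1)
    G = F + P
    dH = 0.25 * (G @ W.T - W @ G.T)         # eq. (2); the result is skew-symmetric
    H = H + dt * dH
    # Moving the masses through the exponential map keeps W exactly orthonormal,
    # since the exponential of a skew-symmetric matrix is an orthogonal matrix.
    W = expm(dt * H) @ W
    return W, H
```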

3. The MEC engine and the potential energy function

It is now interesting to note that the set of equations (1)-(2) may be assumed as an adaptation rule (which we call MEC) for neural layers with weight matrix $W$. The MEC learning algorithm applies indistinctly to linear neural layers as well as to non-linear ones; the network's input-output transfer therefore writes $y = S[W^T x + w_0]$, where $x \in R^p$, $W$ is $p \times m$ with $m < p$, $w_0$ is a generic bias vector in $R^m$ and $S[\cdot]$ is an arbitrarily chosen $m \times m$ diagonal operator. The MEC learning rule possesses a fixed structure; the only modifiable part is the computation rule of the active forces applied to the masses. We suppose that the forcing terms derive from a Potential Energy Function (PEF) $U$, therefore we assume $f_i^{(+)} = -\partial U / \partial w_i$ and $f_i^{(-)} = -\partial U / \partial(-w_i)$, hence:

$$F := -2\, \frac{\partial U}{\partial W}. \tag{12}$$

In general we can suppose $U$ to depend upon $W$, $x$ and $y$: $U = U(W, x, y)$. Recalling that a (dissipative) mechanical system reaches equilibrium when its own potential energy $U$ is at a minimum (possibly a local one), we can assume $U := +J_c$, with $J_c$ a cost function to be minimized, or $U := -J_o$, with $J_o$ an objective function to be maximized, both under the constraint of orthonormality. The vector $w_0$ adapts by means of any arbitrary learning rule.

² This statement can be easily proven by observing that if the system is still (hence $H \equiv 0$ and $P \equiv 0$) and the active forces are null ($F \equiv 0$), then the system cannot start moving; thus the condition follows from equation (11), as in 3-dimensional mechanics.

From a neural point of view, the mechanical rigidity of the system may be interpreted as a kind of competition among the neuronal units.
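In code, the fixed part of the MEC engine and the problem-dependent part can be kept separate as in the sketch below (our own illustration; the helper names layer_output and mec_force are hypothetical). The only ingredient the user supplies is the gradient of the chosen potential energy function.

```python
import numpy as np

def layer_output(W, x, w0=None, S=np.tanh):
    """Layer transfer y = S[W^T x + w0]; here S is an illustrative
    component-wise tanh standing in for the diagonal operator S[.]."""
    w0 = np.zeros(W.shape[1]) if w0 is None else w0
    return S(W.T @ x + w0)

def mec_force(dU_dW, W, x, y):
    """Active-force matrix from a potential energy function U(W, x, y),
    according to eq. (12): F := -2 dU/dW."""
    return -2.0 * dU_dW(W, x, y)
```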

4. Computer simulations

The learning theory for neural layers developed in the previous sections applies to several concrete cases of current interest in the neural-network field. Here we aim to show a simple example in connection with the Independent Component Analysis by the InfoMax principle developed by Bell and Sejnowski in [2]. This example shows how the original learning rule deriving from InfoMax dramatically simplifies in the MEC context, much as it does in the Natural Gradient one [1]. To perform simulations, we use the well-known result stating that an ICA stage may be decomposed into two subsequent stages [3]: a pre-whitening and an orthonormal separation. The first operation can be performed by means of a PCA network [9, 10] (also realizable with a MEC structure [5]) followed by a simple scaler; therefore we can concern ourselves directly with the second operation, that is an Orthonormal Independent Component Analysis, studied independently by several authors [3, 8, 4, 11]. The aim is to separate 3 independent signals from 3 of their mixtures. In formulas we have $x = Q^T s$, where $Q$ is the orthonormal square matrix containing the mixing coefficients and $s$ is the vector containing the input signals. The separating linear neural network is described by $y = W^T x$, and a further matrix $B = W^T Q^T$ is defined. When separation is achieved, matrix $B$ has only one non-zero entry per column. In our simulations $Q$ was generated randomly and $s$ contains three statistically independent sub-Gaussian random processes defined as in [1]. Since the signals to be separated have the same kurtosis sign ($\kappa_4^{(1)} = -1.20$, $\kappa_4^{(2)} = -0.71$, $\kappa_4^{(3)} = -2.00$), the Bell-Sejnowski criterion [2], defined as:

"

#

3 Y Jbs (W) = 12 ln j det(WT )j tanh0 (wiT x)  i=1

(13)

may be used. It has to be minimized with respect to $W$; therefore we simply assume $U := k_{bs} J_{bs}$, where $k_{bs}$ is a positive scaling factor. By definition, the resulting active force (12) has the expression:

$$F = -k_{bs}\, x \tanh[x^T W], \tag{14}$$

where $\tanh[\cdot]$ acts component-wise. Notice that $U$ assumes a simplified expression with respect to $J_{bs}$, since the MEC property $\det(W^T W) = 1 \Rightarrow |\det(W^T)| = 1$ holds true. (Compare this expression with the learning term in [2].) Through several simulations we found suitable learning parameters to solve the above source-separation problem. Figure 1 shows the mixtures of the independent source signals, while the result of the separation process is depicted in Figure 2.
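The sketch below puts the pieces together for this experiment. It is our own reconstruction under stated assumptions: the three sub-Gaussian sources are illustrative stand-ins for those of [1], the scaling factor k_bs, the viscosity eta and the step dt are guesses rather than the values used for the figures, and mec_step is the routine sketched in Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Illustrative zero-mean, unit-variance sub-Gaussian sources (negative kurtosis).
s = np.vstack([
    rng.uniform(-1.0, 1.0, n),
    np.sin(2 * np.pi * 0.013 * np.arange(n)),
    rng.choice([-1.0, 1.0], n),
])
s = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)

# Random orthonormal mixing, x = Q^T s; with unit-variance independent sources
# and an orthonormal Q the mixtures are already (approximately) white, so the
# pre-whitening stage can be skipped, as assumed in the text.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
x = Q.T @ s

def infomax_force(W, x_t, k_bs=1.0):
    """Active force of eq. (14): F = -k_bs x tanh[x^T W], tanh component-wise."""
    return -k_bs * np.outer(x_t, np.tanh(x_t @ W))

W = np.eye(3)            # orthonormal initial weights
H = np.zeros((3, 3))     # system initially at rest
for t in range(n):
    F = infomax_force(W, x[:, t], k_bs=1.0)
    W, H = mec_step(W, H, F, eta=0.5, dt=0.01)

# When separation succeeds, B = W^T Q^T has one dominant entry per column.
B = W.T @ Q.T
print(np.round(B, 2))
```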


Figure 1. Three linear mixtures of three independent sources.




Figure 2. Recovered independent source signals.


5. Conclusions

In this paper a new class of learning rules, based on a mechanical paradigm, was presented. Some applications of the proposed approach were briefly suggested, and a simple case of Orthonormal Independent Component Analysis was tackled through simulations. Concerning the proposed learning theory, a wider study has been started by the authors, both in the theoretical direction, to discover relationships with other theories found in the literature and to make the algorithm more flexible, and in the applicative direction, to apply the obtained results to concrete cases. Some additional issues may be found in [5].

References

[1] S.-I. Amari, A. Cichocki, and H.H. Yang, A New Learning Algorithm for Blind Source Separation, Advances in NIPS 8, MIT Press, 1996
[2] A.J. Bell and T.J. Sejnowski, An Information Maximisation Approach to Blind Separation and Blind Deconvolution, Neural Computation, Vol. 7, No. 6, pp. 1129 – 1159, 1995
[3] P. Comon, Independent Component Analysis, A New Concept?, Signal Processing, Vol. 36, pp. 287 – 314, 1994
[4] P. Comon and E. Moreau, Improved Contrast Dedicated to Blind Separation in Communications, Proc. ICASSP, pp. 3453 – 3456, 1997
[5] S. Fiori, The MEC Neural Engine – Applications to PCA and ICA, Technical Report 1/SF/97, Dept. of Electronics and Automatics, Univ. of Ancona (Italy)
[6] J. Karhunen, Neural Approaches to Independent Component Analysis and Source Separation, Proc. of 4th European Symposium on Artificial Neural Networks (ESANN), pp. 249 – 266, 1996
[7] J. Karhunen and J. Joutsensalo, Representation and Separation of Signals Using Nonlinear PCA Type Learning, Neural Networks, Vol. 7, No. 1, pp. 113 – 127, 1994
[8] B. Laheld and J.F. Cardoso, Adaptive Source Separation with Uniform Performances, Signal Processing VII: Theories and Applications, Vol. 1, pp. 183 – 186, 1994
[9] E. Oja, Neural Networks, Principal Components, and Subspaces, International Journal of Neural Systems, Vol. 1, pp. 61 – 68, 1989
[10] A. Paraschiv-Ionescu, C. Jutten, and G. Bouvier, Neural Network Based Processing for Smart Sensor Array, Artificial Neural Networks (ICANN), pp. 565 – 570, Springer-Verlag, 1997
[11] J.-C. Pesquet and E. Moreau, Measures of Independence for Orthonormal Mixtures, Tech. Rep. LSS/Univ. Paris 11, Jul. 1996
[12] L. Xu, Theories for Unsupervised Learning: PCA and Its Nonlinear Extension, Proc. of International Joint Conference on Neural Networks (IJCNN), pp. 1252 – 1257, 1994
