Information Geometry of α-Projection in Mean Field Approximation

Shun-ichi Amari (RIKEN BSI), Shiro Ikeda (PRESTO, JST) and Hidetoshi Shimokawa (Univ. of Tokyo)

Abstract. Information geometry is applied to mean field approximation for elucidating its properties in the spin glass model or the Boltzmann machine. The α-divergence is used for approximation, where α-geodesic projection plays an important role. The naive mean field approximation and the TAP approximation are studied from the point of view of information geometry, which treats the intrinsic geometric structures of a family of probability distributions. The bifurcation of the α-projection is studied, at which the uniqueness of the α-approximation is broken.

1 Introduction

Mean field approximation uses a simple tractable family of probability distributions to calculate quantities related to a complex probability distribution including mutual interactions. Information geometry, on the other hand, studies the intrinsic geometrical structure existing in the manifold of probability distributions (Chentsov, 1982; Amari, 1985; Amari and Nagaoka, 2000). It was used for analyzing performances of learning in the Boltzmann machine (Amari, Kurata and Nagaoka, 1992), the EM algorithm (Amari, 1995), multilayer perceptrons (Amari, 1998), etc. A number of new works appeared recently which treated mean field approximation from the point of view of information geometry (Tanaka [this volume], Tanaka [2000], Kappen [this volume], and Bhattacharyya and Keerthi [1999, 2000]).

The present article studies the relation between mean field approximation and information geometry in more detail. We treat a simple spin model like the SK model or the Boltzmann machine, and study how the mean values of spins can be approximated. It is known (Tanaka, this volume) that, given a probability distribution including complex mutual interactions of spins or neurons, the mean value of each spin is kept constant when it is projected by the m-geodesic to the submanifold consisting of independent probability distributions. On the other hand, the m-projection is computationally intractable for a large system. Instead, the projection by the e-geodesic is easy to calculate. However, the mean value is changed by this projection, so that only an approximate value is obtained. This gives the naive mean field approximation.

A family of divergence measures named the α-divergence is defined invariantly in the manifold of probability distributions (Amari, 1985; Amari and Nagaoka, 2000). The α = -1 divergence is known as the Kullback-Leibler divergence or cross entropy, α = 1 as the reverse of the K-L divergence, and α = 0 as the Hellinger distance. This concept is closely related to the Rényi entropy (Rényi, 1961; see also Chernoff, 1952, and the f-divergence of Csiszár, 1975). These divergence functions give a unique Riemannian metric to the manifold of probability distributions. They moreover give a family of invariant affine connections named the α-connections, where the α- and -α-affine connections are dually coupled to each other with respect to the Riemannian metric (Nagaoka and Amari, 1982; Amari, 1985; Amari and Nagaoka, 2000). The α-geodesic is defined in this context. It should be remarked that the Tsallis entropy (Tsallis, 1988) is closely connected to the α-geometry.

We use the α-geodesic projection to elucidate various mean field approximations. The α-projection is the point in the tractable subspace consisting of independent distributions that minimizes the α-divergence from a given true distribution to the subspace. It is given by the α-geodesic that is orthogonal to the subspace at the α-projected point. This gives

a family of the α-approximations, where α = -1 gives the distribution with the true mean value and α = 1 gives the naive mean field approximation. Therefore, it is interesting to know how the α-approximation depends on α. We also study the TAP approximation in this framework. We can prove that the m-projection (α = -1) is unique, while the α-projection (α ≠ -1) is not necessarily so. The e-projection (α = 1), that is, the naive mean field approximation, is not necessarily uniquely solved. Therefore, it is interesting to see how the α-approximation bifurcates depending on α. We calculate the Hessian of the α-approximation, which shows the stability or the local minimality of the projection.

2 Geometry of Mean Field Approximation

Let us consider a system of spins or a Boltzmann machine, where $\mathbf{x} = (x_1, \cdots, x_n)$, $x_i = \pm 1$, denotes the values of $n$ spins. The equilibrium probability distribution is given by

$$q(\mathbf{x}; W, \mathbf{h}) = \exp\left\{W \cdot X + \mathbf{h} \cdot \mathbf{x} - \psi_q\right\}, \qquad (1)$$

where $W = (w_{ij})$ and $X = (x_i x_j)$ are symmetric matrices and

$$W \cdot X = \frac{1}{2}\sum_{i,j} w_{ij} x_i x_j \quad (w_{ii} = 0), \qquad (2)$$

$$\mathbf{h} \cdot \mathbf{x} = \sum_i h_i x_i. \qquad (3)$$

Here, $W$ denotes the mutual interactions of the spins, $\mathbf{h}$ the outer field, and $e^{-\psi_q}$ the normalization constant; $Z = e^{\psi_q}$ is the partition function, and $\psi_q = \psi_q(W, \mathbf{h})$ is called the free energy in physics or the cumulant generating function in statistics. Let $S$ be the family of probability distributions of the above form (1), where $(W, \mathbf{h})$ forms a coordinate system to specify each distribution in $S$. Let $E_q$ be the expectation operator with respect to $q$. Then, the expectations of $X$ and $\mathbf{x}$,

$$K_q = E_q[X] = (E_q[x_i x_j]), \qquad m[q] = E_q[\mathbf{x}] = (E_q[x_i]), \qquad (4)$$

form another coordinate system of $S$. Our theory is based on information geometry and is applicable to many other general cases, but for simplicity's sake we stick to this simple problem of obtaining a good approximation of $m[q]$ for a given $q$.
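For concreteness, these quantities can be computed by brute force for a small system. The following is a minimal sketch (the function name and random test instance are mine, not from the paper); the enumeration over all $2^n$ configurations is exactly what becomes intractable for large $n$:

```python
import itertools
import numpy as np

def exact_stats(W, h):
    """Enumerate all 2^n configurations x in {-1,+1}^n and return the
    free energy psi_q = log Z, m[q] = E_q[x], and K_q = E_q[X], eq. (4)."""
    n = len(h)
    xs = np.array(list(itertools.product([-1, 1], repeat=n)))  # (2^n, n)
    # W.X + h.x of eqs. (1)-(3), with w_ii = 0
    energy = 0.5 * np.einsum('ki,ij,kj->k', xs, W, xs) + xs @ h
    psi_q = np.log(np.exp(energy).sum())
    q = np.exp(energy - psi_q)                 # eq. (1), normalized
    m = q @ xs                                 # m[q]_i = E_q[x_i]
    K = np.einsum('k,ki,kj->ij', q, xs, xs)    # (K_q)_ij = E_q[x_i x_j]
    return psi_q, m, K

rng = np.random.default_rng(0)
n = 8
W = rng.normal(scale=0.3, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
h = rng.normal(scale=0.1, size=n)
psi_q, m_true, K = exact_stats(W, h)
```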

Let us consider the subspace $E$ of $S$ such that the $x_i$'s are independent, or $W = 0$. A distribution $p \in E$ is written as

$$p(\mathbf{x}; \mathbf{h}) = \exp\left\{\mathbf{h} \cdot \mathbf{x} - \psi_p\right\}. \qquad (5)$$

This is a submanifold of $S$ specified by $W = 0$, and $\mathbf{h}$ is its coordinate system. The expectation

$$\mathbf{m} = E_p[\mathbf{x}] \qquad (6)$$

is another coordinate system of $E$. Physicists know that it is computationally difficult to calculate $m[q]$ from $q(\mathbf{x}; W, \mathbf{h})$. It is given by

$$m[q] = \frac{\partial}{\partial \mathbf{h}}\psi_q(W, \mathbf{h}), \qquad (7)$$

but the partition function $Z_q = e^{\psi_q}$ is difficult to obtain when the system size is large. On the other hand, for $p \in E$, it is easy to obtain $\mathbf{m} = E_p[\mathbf{x}]$ because the $x_i$ are independent. Hence, mean field approximation tries to use quantities obtained in the form of expectations with respect to some relevant $p \in E$. Physicists established the method of approximation, called the mean field theory, including the TAP approximation. The problem is formulated more naturally in the framework of information geometry (Tanaka [this volume], Kappen [this volume], and Bhattacharyya and Keerthi [2000]). The present paper tries to give another way to elucidate this problem by the information geometry of the α-connections introduced by Nagaoka and Amari [1982], Amari [1985], and Amari and Nagaoka [2000].

3 Concepts from Information Geometry

Here, we introduce some concepts of information geometry without entering into details. Let $y$ be a discrete random variable taking values on a finite set $\{0, 1, \cdots, N-1\}$, and let $p(y)$ and $q(y)$ be two probability distributions. In the case of spins, $y$ represents the $2^n$ configurations of $\mathbf{x}$, where $N = 2^n$.


α-divergence

The α-divergence from $q$ to $p$ is defined by

$$D_\alpha[q : p] = \frac{4}{1 - \alpha^2}\left(1 - \sum_y q^{\frac{1-\alpha}{2}}\, p^{\frac{1+\alpha}{2}}\right), \quad \alpha \neq \pm 1, \qquad (8)$$

and for $\alpha = \pm 1$,

$$D_{-1}[q : p] = \sum q \log\frac{q}{p}, \qquad (9)$$

$$D_1[q : p] = \sum p \log\frac{p}{q}. \qquad (10)$$

The latter two are the Kullback-Leibler divergence and its reverse. When $\alpha = 0$, it is the Hellinger distance,

$$D_0[q : p] = 2\sum\left(\sqrt{p} - \sqrt{q}\right)^2. \qquad (11)$$

The divergence satisfies

$$D_\alpha[q : p] \geq 0, \qquad (12)$$

with equality when and only when $q = p$. However, it is not symmetric except for $\alpha = 0$, and it satisfies

$$D_\alpha[q : p] = D_{-\alpha}[p : q]. \qquad (13)$$

The α-divergence may be calculated in the following way. Let us consider a curve of probability distributions parameterized by $t$,

$$p(y; t) = e^{-\psi(t)}\, q^{\frac{1-t}{2}}\, p^{\frac{1+t}{2}} = \exp\left\{\frac{t}{2}\log\frac{p}{q} + \frac{1}{2}\log (pq) - \psi(t)\right\}, \qquad (14)$$

which is an exponential family connecting $q$ and $p$. Here, $e^{-\psi(t)}$ is the normalization constant. We then have

$$D_\alpha[q : p] = \frac{4}{1 - \alpha^2}\left(1 - e^{\psi(\alpha)}\right), \quad \alpha \neq \pm 1. \qquad (15)$$

We also have

$$\psi'(\alpha) = \frac{1}{2}\, E_\alpha\left[\log\frac{p}{q}\right], \qquad (16)$$

$$\psi''(\alpha) = \frac{1}{4}\, E_\alpha\left[\left(\log\frac{p}{q}\right)^2\right] - \left\{\psi'(\alpha)\right\}^2, \qquad (17)$$

$$\psi'''(\alpha) = \frac{1}{8}\, E_\alpha\left[\left(\log\frac{p}{q} - E_\alpha\left[\log\frac{p}{q}\right]\right)^3\right], \qquad (18)$$

where $E_\alpha$ denotes expectation with respect to $p(y; \alpha)$. For $\alpha = \pm 1$, $\psi(\pm 1) = 0$. By taking the limit, we have

$$D_{\pm 1} = \mp 2\,\psi'(\pm 1). \qquad (19)$$

The family of the α-divergences gives the invariant measures provided by information geometry.
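As a sketch (the function name is mine), definitions (8)-(11) and the duality (13) can be checked numerically for discrete distributions:

```python
import numpy as np

def alpha_divergence(q, p, alpha):
    """D_alpha[q : p] of eqs. (8)-(10) for discrete distributions q, p."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, -1.0):
        return float(np.sum(q * np.log(q / p)))   # eq. (9): Kullback-Leibler
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))   # eq. (10): its reverse
    s = np.sum(q ** ((1 - alpha) / 2) * p ** ((1 + alpha) / 2))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)   # eq. (8)

q = np.array([0.1, 0.2, 0.3, 0.4])
p = np.array([0.25, 0.25, 0.25, 0.25])
# eq. (11): alpha = 0 gives the Hellinger distance
assert np.isclose(alpha_divergence(q, p, 0.0),
                  2 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
# eq. (13): D_alpha[q : p] = D_{-alpha}[p : q]
assert np.isclose(alpha_divergence(q, p, 0.5),
                  alpha_divergence(p, q, -0.5))
```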

4 Parametric model and Fisher information

When $p(y)$ is specified by a coordinate system $\theta$, it is written as $p(y; \theta)$. The $N - 1$ quantities

$$p_i = \mathrm{Prob}\{y = i\}, \quad i = 1, \cdots, N-1, \qquad (20)$$

form coordinates of $p(y)$. There are many other coordinate systems. For example,

$$\theta_i = \log\frac{p_i}{p_0} \qquad (21)$$

is another coordinate system. Let

$$\delta_i(y) = \begin{cases} 1, & y = i, \\ 0, & \text{otherwise}. \end{cases} \qquad (22)$$

Then, we have

$$p(y; \theta) = \sum_i p_i\, \delta_i(y) + p_0\, \delta_0(y), \qquad (23)$$

$$p(y; \theta) = \exp\left\{\sum_i \theta_i\, \delta_i(y) + \log p_0\right\}. \qquad (24)$$

The α-divergence for two nearby distributions $p(y; \theta)$ and $p(y; \theta + d\theta)$ is expanded as

$$D_\alpha\left[p(y; \theta) : p(y; \theta + d\theta)\right] = \frac{1}{2}\sum_{i,j} g_{ij}(\theta)\, d\theta_i\, d\theta_j, \qquad (25)$$

where the right-hand side does not depend on α. The matrix $G(\theta) = (g_{ij}(\theta))$ is positive-definite and symmetric, given by

$$g_{ij}(\theta) = E\left[\frac{\partial \log p(y; \theta)}{\partial \theta_i}\,\frac{\partial \log p(y; \theta)}{\partial \theta_j}\right]. \qquad (26)$$

This is called the Fisher information. It gives the unique invariant Riemannian metric to the manifold of probability distributions.
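A small numerical check of (25)-(26) in the coordinates (21), as a sketch reusing `alpha_divergence` from above; for this family the score is $\partial \log p/\partial\theta_i = \delta_i(y) - p_i$, so $G = \mathrm{diag}(p_i) - \mathbf{p}\mathbf{p}^T$ over $i, j = 1, \cdots, N-1$:

```python
import numpy as np

def fisher_information(p):
    """Fisher information (26) in the coordinates theta_i = log(p_i/p_0)."""
    pi = p[1:]                                   # i = 1, ..., N-1
    return np.diag(pi) - np.outer(pi, pi)

def p_of_theta(theta):
    w = np.concatenate(([1.0], np.exp(theta)))   # eq. (24), up to normalization
    return w / w.sum()

p = np.array([0.4, 0.3, 0.2, 0.1])
theta = np.log(p[1:] / p[0])                     # eq. (21)
G = fisher_information(p)

# eq. (25): the leading term of D_alpha does not depend on alpha
dtheta = 1e-3 * np.array([1.0, -2.0, 0.5])
q = p_of_theta(theta + dtheta)
for a in (-1.0, 0.0, 0.7):
    assert abs(alpha_divergence(p, q, a) - 0.5 * dtheta @ G @ dtheta) < 1e-8
```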

α-projection

Let $M$ be a submanifold in $S$. Given $q \in S$, the point $p^* \in M$ is called the α-projection of $q$ to $M$ when the function $D_\alpha[q : p]$, $p \in M$, takes a critical value at $p^*$, that is,

$$\frac{\partial}{\partial \xi} D_\alpha[q : p(y; \xi)] = 0 \qquad (27)$$

at $p^*$, where $\xi$ is a coordinate system of $M$. The minimizer of $D_\alpha[q : p]$, $p \in M$, is the α-projection of $q$ to $M$. We denote it by

$$p^* = \prod_\alpha q. \qquad (28)$$

[Figure 1: α-projection. The α-geodesic from $q \in S$ meets the submanifold $M$ orthogonally at the projected point $p^*$.]

In order to characterize the α-projection, we need to define the α-affine connection and the α-geodesic derived therefrom. We do not explain them here (see Amari, 1985; Amari and Nagaoka, 2000). We show the following fact. See Fig. 1.

Theorem 1. A point $p \in M$ is the α-projection of $q$ to $M$ when and only when the α-geodesic connecting $q$ and $p$ is orthogonal to $M$ in the sense of the Fisher Riemannian metric $G$.

[Figure 2: Pythagoras' theorem. The α-geodesic connecting $p$ and $q$ meets the −α-geodesic connecting $q$ and $r$ orthogonally at $q$.]

Exponential family

A family of distributions is called an exponential family when its probability distributions are written as

$$p(y; \theta) = \exp\left\{\sum_i \theta_i\, k_i(y) - \psi(\theta)\right\} \qquad (29)$$

by using an appropriate coordinate system $\theta$, where the $k_i(y)$ are adequate functions of $y$. The spin system or Boltzmann machine (1) is an exponential family, where

$$\theta = (W, \mathbf{h}) \qquad (30)$$

and $k$ consists of

$$k = (X, \mathbf{x}). \qquad (31)$$

The exponential family forms an $\alpha = \pm 1$ flat manifold, that is, the $\alpha = \pm 1$ Riemann-Christoffel curvatures vanish identically, but it is a non-Euclidean space. There exist $\alpha = \pm 1$ affine coordinate systems in such a manifold. The above $\theta$ is an $\alpha = 1$ affine coordinate system, called e-affine (exponential-affine), because the log probability is

linear in $\theta$. An e-geodesic ($\alpha = 1$ geodesic) is linear in $\theta$. More generally, for any two distributions $p(y)$ and $q(y)$, the e-geodesic connecting them is given by

$$\log p(y; t) = (1 - t) \log p(y) + t \log q(y) - \psi(t). \qquad (32)$$

Let us denote the expectation of $k$ by $\eta$,

$$\eta = E[k]. \qquad (33)$$

It is known that this $\eta$ forms another coordinate system of an exponential family. This is an $\alpha = -1$ affine coordinate system, or m-affine (mixture-affine) coordinate system. The two coordinate systems are connected by the Legendre transformation,

$$\eta = \frac{\partial}{\partial \theta}\,\psi(\theta), \qquad (34)$$

$$\theta = \frac{\partial}{\partial \eta}\,\varphi(\eta), \qquad (35)$$

where $\varphi(\eta)$ is the negative of the entropy function, and

$$\psi(\theta) + \varphi(\eta) - \theta \cdot \eta = 0 \qquad (36)$$

holds. Any linear curve in $\eta$ is an m-geodesic. An important property is given by the following theorem. See Fig. 2.

Theorem 2. Let $p, q, r$ be three probability distributions in a $\pm\alpha$-flat manifold $S$. When the α-geodesic connecting $p$ and $q$ is orthogonal at $q$, with respect to the Riemannian metric, to the −α-geodesic connecting $q$ and $r$, then

$$D_\alpha[p : q] + D_\alpha[q : r] = D_\alpha[p : r]. \qquad (37)$$

From this follows

Theorem 3. Let $M$ be a smooth submanifold in a $\pm\alpha$-flat manifold $S$, and let $p$ be the α-projection from $q$ to $M$. Then, the α-geodesic connecting $q$ and $p$ is orthogonal to $M$, and vice versa.

5 Geometry of E

Since $E$ consists of all the independent distributions, it is easy to show the geometry of $E$. Moreover, $E$ itself is an exponential family,

$$p(\mathbf{x}; \mathbf{h}) = \exp\left\{\mathbf{h} \cdot \mathbf{x} - \psi_0(\mathbf{h})\right\}, \qquad (38)$$

where

$$e^{\psi_0(\mathbf{h})} = \prod_i \left(e^{h_i} + e^{-h_i}\right) \qquad (39)$$

or

$$\psi_0(\mathbf{h}) = \sum_i \log\left(e^{h_i} + e^{-h_i}\right). \qquad (40)$$

This $\mathbf{h} = (h_1, \cdots, h_n)$ is the e-affine coordinate system of $E$. Its m-affine coordinates are given by

$$\mathbf{m} = E_p[\mathbf{x}] = \frac{\partial}{\partial \mathbf{h}}\,\psi_0(\mathbf{h}), \qquad (41)$$

which is easily calculated as

$$m_i = E_p[x_i] = \frac{\partial}{\partial h_i}\,\psi_0(\mathbf{h}) = \frac{e^{h_i} - e^{-h_i}}{e^{h_i} + e^{-h_i}} = \tanh h_i. \qquad (42)$$

This is solved as

$$e^{h_i} = \sqrt{\frac{1 + m_i}{1 - m_i}}. \qquad (43)$$

In terms of $\mathbf{m}$, the probability is written as

$$p(\mathbf{x}; \mathbf{m}) = \prod_i \frac{1 + m_i x_i}{2}, \quad x_i = \pm 1. \qquad (44)$$

The Riemannian metric or the Fisher information $G = (g_{ij})$ is

$$g_{ij} = \frac{\partial m_i}{\partial h_j} = \left(1 - m_i^2\right)\delta_{ij}. \qquad (45)$$

Its inverse $G^* = G^{-1} = (g^{ij})$ is

$$g^{ij} = \frac{1}{1 - m_i^2}\,\delta_{ij}. \qquad (46)$$

Let $l(\mathbf{x}; \mathbf{m}) = \log p(\mathbf{x}; \mathbf{m})$. We then have

$$\partial_{m_i} l = \frac{x_i - m_i}{1 - m_i^2}, \qquad (47)$$

$$\partial_{m_i}^2 l = -\left(\partial_{m_i} l\right)^2. \qquad (48)$$
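These coordinate changes are immediate to compute; a minimal sketch (function names mine):

```python
import numpy as np

def m_of_h(h):
    return np.tanh(h)               # eq. (42)

def h_of_m(m):
    return np.arctanh(m)            # log of eq. (43): h_i = (1/2) log((1+m_i)/(1-m_i))

def fisher_metric_E(m):
    return np.diag(1.0 - m ** 2)    # eq. (45): diagonal, g_ii = 1 - m_i^2

h = np.array([0.3, -1.0, 0.5])
assert np.allclose(h_of_m(m_of_h(h)), h)
```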

6 The α-projection and mean field approximation

Given $q \in S$, its α-projection to $E$ is given by

$$p_\alpha = \prod_\alpha q = \arg\min_{p \in E} D_\alpha[q : p]. \qquad (49)$$

We denote by $\mathbf{m}_\alpha[q]$ the expectation of $\mathbf{x}$ with respect to $p_\alpha$, that is, $E_{p_\alpha}[\mathbf{x}]$. Then it is given by

$$\frac{\partial}{\partial \mathbf{m}} D_\alpha[q : p(\mathbf{x}; \mathbf{m}_\alpha)] = 0. \qquad (50)$$

From the point of view of information geometry, $\prod_\alpha q = p(\mathbf{x}; \mathbf{m}_\alpha) \in E$ is the α-geodesic projection of $q$ to $E$ in the sense that the α-geodesic connecting $q$ and $p_\alpha$ is orthogonal to $E$ at $p_\alpha = p(\mathbf{x}; \mathbf{m}_\alpha)$. When $\alpha = -1$, $p_{-1}$ is the m-projection ($\alpha = -1$ projection) of $q$ to $E$. We have

$$\mathbf{m}_{-1} = m[q], \qquad (51)$$

which is the quantity we want to obtain. This relation is directly calculated by solving

$$\frac{\partial}{\partial \mathbf{m}} D_{-1}[q : p(\mathbf{m})] = 0, \qquad (52)$$

because this is equivalent to

$$\frac{\partial}{\partial \mathbf{m}} \int q \log p(\mathbf{x}; \mathbf{m})\, d\mathbf{x} = 0, \qquad (53)$$

or

$$\frac{\partial}{\partial \mathbf{m}} E_q[l(\mathbf{x}; \mathbf{m})] = 0 = \mathrm{const}\cdot E_q[\mathbf{x} - \mathbf{m}]. \qquad (54)$$

Hence,

$$\mathbf{m}_{-1} = E_q[\mathbf{x}], \qquad (55)$$

which is the quantity we have searched for. But we cannot calculate $E_q[\mathbf{x}]$ explicitly, due to the difficulty of calculating $Z$ or $\psi_q$ for $q$. Physicists tried to obtain $\mathbf{m}_{-1}$ by mean field approximation in an intuitive way. If we use the e-projection of $q$ to $E$ instead of the m-projection, we have the naive mean field

approximation (Tanaka, 2000). To show this, we calculate the e-projection (1-projection) of $q$ to $E$. For $\alpha = 1$,

$$D_1[q : p] = D_{-1}[p : q] = E_p\left[\bar{\mathbf{h}} \cdot \mathbf{x} - \psi_p - (W \cdot X + \mathbf{h} \cdot \mathbf{x} - \psi_q)\right] = \bar{\mathbf{h}} \cdot \mathbf{m} - \psi_p - W \cdot M - \mathbf{h} \cdot \mathbf{m} + \psi_q \qquad (56)$$

because of $M = E_p[X] = \mathbf{m}\mathbf{m}^T$, where $\bar{\mathbf{h}}$ is the e-coordinate of $p \in E$. Hence, using the Legendre relation $\partial(\bar{\mathbf{h}} \cdot \mathbf{m} - \psi_p)/\partial \mathbf{m} = \bar{\mathbf{h}}$,

$$\frac{\partial D_1}{\partial \mathbf{m}} = \bar{\mathbf{h}} - W\mathbf{m} - \mathbf{h} = \tanh^{-1}\mathbf{m} - W\mathbf{m} - \mathbf{h}. \qquad (57)$$

This gives

$$\mathbf{m}_1[q] = \tanh\left[W \mathbf{m}_1[q] + \mathbf{h}\right], \qquad (58)$$

known as the "naive" mean-field approximation. In the component form, this is

$$m_i = \tanh\left(\sum_j w_{ij} m_j + h_i\right). \qquad (59)$$

This equation can have a number of solutions. It is necessary to check which solution minimizes $D_1[q : p]$. The minimization may be attained at the boundary $m_i = \pm 1$. Similarly, we have the α-projection $\mathbf{m}_\alpha[q]$ by solving

$$\frac{\partial}{\partial \mathbf{m}} D_\alpha[q : p(\mathbf{x}; \mathbf{m})] = 0. \qquad (60)$$

However, it is not tractable to obtain $\mathbf{m}_\alpha$ explicitly except for $\alpha = \pm 1$.
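A minimal sketch of solving (59) by damped fixed-point iteration (the damping and the zero initialization are my choices; `W`, `h`, and the exact `m_true` are those of the enumeration sketch in Section 2, so the final line measures the error of the naive approximation):

```python
import numpy as np

def naive_mean_field(W, h, m0=None, damping=0.5, tol=1e-12, max_iter=10_000):
    """Damped iteration of m <- tanh(W m + h), eq. (59).  Different
    initializations m0 may converge to different solutions."""
    m = np.zeros_like(h) if m0 is None else np.array(m0, dtype=float)
    for _ in range(max_iter):
        m_new = (1 - damping) * np.tanh(W @ m + h) + damping * m
        if np.max(np.abs(m_new - m)) < tol:
            break
        m = m_new
    return m

m1 = naive_mean_field(W, h)
print(np.max(np.abs(m1 - m_true)))   # gap between naive and true means
```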

7 α-trajectory

For a fixed $q$, its α-projection $\mathbf{m}_\alpha[q]$ is considered as a path in $E$ connecting the true $\mathbf{m}_{-1} = m[q]$ and the mean-field approximation $\mathbf{m}_1[q]$. This is called the α-trajectory of $q$ in $E$. See Fig. 3. The tangent direction of the trajectory is given by $d\mathbf{m}_\alpha/d\alpha$. This is obtained from

$$\partial_{\mathbf{m}} D_\alpha[q : p(\mathbf{m}_\alpha)] = 0, \qquad (61)$$

$$\partial_{\mathbf{m}} D_{\alpha + d\alpha}[q : p(\mathbf{m}_\alpha + d\mathbf{m}_\alpha)] = 0, \qquad (62)$$

[Figure 3: α-trajectory. The curve $\mathbf{m}_\alpha[q]$ in $E$ runs from the m-projection ($\alpha = -1$) to the e-projection ($\alpha = 1$) of $q$.]

so that, by Taylor expansion,

$$\partial_{\mathbf{m}} \partial_\alpha D_\alpha\, d\alpha + \partial_{\mathbf{m}}^2 D_\alpha\, d\mathbf{m}_\alpha = 0. \qquad (63)$$

Here, $\partial_{\mathbf{m}} = \partial/\partial \mathbf{m}$ and $\partial_\alpha = d/d\alpha$. We then have

$$\frac{d\mathbf{m}_\alpha}{d\alpha} = -\left\{\partial_{\mathbf{m}}^2 D_\alpha\right\}^{-1} \partial_{\mathbf{m}} \partial_\alpha D_\alpha. \qquad (64)$$

Starting from the naive approximation, we may improve it by the expansion

$$\mathbf{m}_\alpha[q] = \mathbf{m}_1[q] + \frac{d\mathbf{m}_\alpha}{d\alpha}(\alpha - 1) + \frac{1}{2}\frac{d^2\mathbf{m}_\alpha}{d\alpha^2}(\alpha - 1)^2 + \cdots, \qquad (65)$$

where the derivatives are evaluated at $\alpha = 1$, provided they can be calculated easily. Another idea is to integrate $d\mathbf{m}_\alpha/d\alpha$, provided the derivative at $\alpha$ can be calculated. We cannot yet carry out these methods. In order to study the α-trajectory, we show some preliminary calculations. The divergence from $q$ to $p(\mathbf{x}; \mathbf{m}) \in E$ is written as

$$D_\alpha[q : p(\mathbf{m})] = \frac{4}{1 - \alpha^2}\left(1 - e^{\psi(\alpha, \mathbf{m})}\right), \qquad (66)$$

where

$$\psi(\alpha, \mathbf{m}) = \log \sum q^{\frac{1-\alpha}{2}}\, p^{\frac{1+\alpha}{2}} = \log E_p\left[\left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}}\right]. \qquad (67)$$

We first calculate $\partial_{\mathbf{m}} \psi$, since $\mathbf{m}_\alpha$ is given by

$$\partial_{\mathbf{m}} \psi(\alpha, \mathbf{m}_\alpha) = 0. \qquad (68)$$

We have

$$\partial_{\mathbf{m}} \psi = \frac{1+\alpha}{2}\, e^{-\psi(\alpha, \mathbf{m})} \sum q^{\frac{1-\alpha}{2}}\, p^{-\frac{1-\alpha}{2}}\, \partial_{\mathbf{m}} p = \frac{1+\alpha}{2}\, e^{-\psi}\, E_p\left[\partial_{\mathbf{m}} l \left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}}\right]. \qquad (69)$$

We then have

$$\partial_{\mathbf{m}}^2 \psi = -\left(\partial_{\mathbf{m}} \psi\right)^2 + \frac{1+\alpha}{2}\, e^{-\psi}\, \partial_{\mathbf{m}} E_p\left[\partial_{\mathbf{m}} l \left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}}\right], \qquad (70)$$

where $\partial_{\mathbf{m}}^2 \psi$ is a matrix and $(\partial_{\mathbf{m}} \psi)^2$ implies $(\partial_{\mathbf{m}} \psi)(\partial_{\mathbf{m}} \psi)^T$. At $\mathbf{m} = \mathbf{m}_\alpha$, the first term vanishes and

$$\partial_{\mathbf{m}}^2 \psi = \frac{1+\alpha}{2}\, e^{-\psi}\, E_p\left[\left\{\left(\partial_{\mathbf{m}} l\right)^2 + \partial_{\mathbf{m}}^2 l\right\}\left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}} - \frac{1-\alpha}{2}\left(\partial_{\mathbf{m}} l\right)^2 \left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}}\right]. \qquad (71)$$

We also have

$$\partial_{\mathbf{m}} \partial_\alpha \psi(\alpha, \mathbf{m}) = \frac{1+\alpha}{4}\, e^{-\psi}\, E_p\left[\partial_{\mathbf{m}} l\, \log\frac{p}{q} \left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}}\right]. \qquad (72)$$

From this we have

$$\frac{d\mathbf{m}_\alpha}{d\alpha} = -\frac{1}{2}\, A^{-1} E_p\left[\partial_{\mathbf{m}} l\, \log\frac{p}{q} \left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}}\right], \qquad (73)$$

$$A = E_p\left[\left\{\left(\partial_{\mathbf{m}} l\right)^2 + \partial_{\mathbf{m}}^2 l\right\} \left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}} - \frac{1-\alpha}{2}\left(\partial_{\mathbf{m}} l\right)^2 \left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}}\right]. \qquad (74)$$

For $\alpha = 1$, we have

$$\frac{d\mathbf{m}_\alpha}{d\alpha} = -\frac{1}{2}\left\{E_p\left[\left\{\left(\partial_{\mathbf{m}} l\right)^2 + \partial_{\mathbf{m}}^2 l\right\} \log(p/q) + \left(\partial_{\mathbf{m}} l\right)^2\right]\right\}^{-1} E_p\left[\partial_{\mathbf{m}} l\, \left\{\log(p/q)\right\}^2\right]. \qquad (75)$$
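Although $\mathbf{m}_\alpha$ has no closed form for general α, for a small system the α-trajectory can be traced numerically by minimizing $D_\alpha[q : p(\mathbf{m})]$ of (66) over the product family (44) directly. A brute-force sketch (the optimizer choice and warm-starting are mine; only a local minimum near the starting point is found, which is precisely the non-uniqueness issue analyzed in the next section):

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def alpha_projection(q_probs, xs, alpha, m0):
    """Minimize D_alpha[q : p(m)], eq. (49), over E for a fully
    enumerated system: q_probs are the probabilities of the rows xs."""
    def D(m):
        if np.any(np.abs(m) >= 1.0):
            return np.inf
        p_probs = np.prod((1.0 + xs * m) / 2.0, axis=1)   # eq. (44)
        if np.isclose(alpha, -1.0):
            return np.sum(q_probs * np.log(q_probs / p_probs))
        if np.isclose(alpha, 1.0):
            return np.sum(p_probs * np.log(p_probs / q_probs))
        s = np.sum(q_probs ** ((1 - alpha) / 2) * p_probs ** ((1 + alpha) / 2))
        return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)
    return minimize(D, m0, method='Nelder-Mead').x

# enumerate q for the W, h of the Section 2 sketch
xs = np.array(list(itertools.product([-1, 1], repeat=len(h))))
energy = 0.5 * np.einsum('ki,ij,kj->k', xs, W, xs) + xs @ h
q_probs = np.exp(energy - np.log(np.exp(energy).sum()))

# trace the trajectory, warm-starting each alpha from the previous point
m_alpha, trajectory = m_true.copy(), []
for a in np.linspace(-1.0, 1.0, 21):
    m_alpha = alpha_projection(q_probs, xs, a, m_alpha)
    trajectory.append(m_alpha)
```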

8 α-Hessian

The α-projection $\mathbf{m}_\alpha[q]$ is given by the point $p$ in $E$ that is the orthogonal projection of $q$ to $E$ by the α-geodesic. Such a projection is not necessarily unique. The projection is not necessarily the minimizer of $D_\alpha[q : p]$ but may be a saddle or even a maximizer of $D_\alpha$. To elucidate this we calculate the Hessian $H^\alpha = \left(H_{ij}^\alpha\right)$,

$$H_{ij}^\alpha = \frac{\partial^2}{\partial m_i\, \partial m_j}\, D_\alpha[q : p(\mathbf{m})], \qquad (76)$$

at the α-projection $\mathbf{m}_\alpha[q]$. When $H^\alpha$ is positive-definite, the α-projection $\mathbf{m}_\alpha$ gives a local minimum, but otherwise it is a saddle or a local maximum. For $\alpha \neq 1$ we have

$$H_{ij}^\alpha = -\frac{2}{1-\alpha}\, E_p\left[\left\{\frac{1+\alpha}{2}\,\partial_i l\, \partial_j l + \partial_i \partial_j l\right\} f\right], \qquad (77)$$

where $\partial_i = \partial/\partial m_i$ and $f = e^{-\psi(\alpha, \mathbf{m})}\left(\frac{q}{p}\right)^{\frac{1-\alpha}{2}}$, normalized so that $E_p[f] = 1$ at $\mathbf{m}_\alpha$. From

$$\partial_i l = \frac{x_i - m_i}{1 - m_i^2}, \qquad (78)$$

$$\partial_i \partial_j l = -\delta_{ij}\, \frac{(x_i - m_i)^2}{\left(1 - m_i^2\right)^2}, \qquad (79)$$

we finally have, for $i \neq j$ (noting that $\partial_i \partial_j l = 0$ for $i \neq j$),

$$H_{ij}^\alpha = -\frac{1+\alpha}{1-\alpha}\cdot\frac{E_p\left[(x_i - m_i)(x_j - m_j)\, f\right]}{\left(1 - m_i^2\right)\left(1 - m_j^2\right)} = -\frac{1+\alpha}{1-\alpha}\, g^{ii} g^{jj}\left\{E_p[x_i x_j f] - m_i m_j\right\}, \qquad (80)$$

because of

$$m_{\alpha, i}[q] = E_p[x_i f], \qquad (81)$$

and for $i = j$,

$$H_{ii}^\alpha = \left(g^{ii}\right)^2 E_p\left[(x_i - m_i)^2 f\right]. \qquad (82)$$

We calculate the two special cases $\alpha = \pm 1$. For $\alpha = -1$,

$$H_{ij}^{-1} = \partial_i \partial_j \int q \log\frac{q}{p}\, d\mathbf{x} = \left(g^{ii}\right)^2 \delta_{ij}\, E_q\left[(x_i - m_i)^2\right] = g^{ii}\,\delta_{ij} = G^*, \qquad (83)$$

where $G^*$ is the inverse of the Fisher information matrix of $E$. This is diagonal and positive-definite. Because $E$ is e-flat, we know that the $\alpha = -1$ projection gives the global minimum, is unique (that is, there are no other critical points), and gives the true solution $\mathbf{m}_{-1}[q] = E_q[\mathbf{x}]$. For $\alpha = 1$, we have

$$H_{ij}^1 = \partial_i \partial_j \int p \log\frac{p}{q}\, d\mathbf{x} = g^{ii}\,\delta_{ij} - E_p\left[(\partial_i \partial_j l + \partial_i l\, \partial_j l) \log q\right]. \qquad (84)$$

Hence,

$$H_{ii}^1 = g^{ii} = \frac{1}{1 - m_i^2}, \qquad (85)$$

and for $i \neq j$,

$$H_{ij}^1 = -g^{ii} g^{jj}\, E_p\left[(x_i - m_i)(x_j - m_j) \log q\right] = -g^{ii} g^{jj}\, E_p\left[(x_i - m_i)(x_j - m_j)\left\{\sum_k h_k x_k + \frac{1}{2}\sum_{k,l} w_{kl} x_k x_l - \psi_q\right\}\right] = -w_{ij}. \qquad (86)$$

This shows that $H^1$ is not necessarily positive-definite. This fact is related to the curvature of $E$. When $n = 2$ (two neurons), it is positive-definite when and only when $w = w_{12}$ satisfies

$$w^2 < \frac{1}{1 - m_1^2}\cdot\frac{1}{1 - m_2^2}. \qquad (87)$$

Otherwise, it is a saddle. When $h_1 = h_2$, there exist one or two local minima other than this. This fact implies that the naive mean field approximation might give a pathological solution in some cases. When $|w|$ is large, the above two-spin system is dynamically bistable, having two stable solutions $x_1 = x_2 = 1$, $x_1 = x_2 = -1$ ($w > 0$) or $x_1 = -x_2$ ($w < 0$). When $\alpha = -1$, the α-projection $\mathbf{m}_{-1}$ is unique. Starting from $\alpha = -1$, the trajectory $\mathbf{m}_\alpha$ bifurcates at some $\alpha$, and then bifurcates further, depending on $W$. See Fig. 4. When $W$ is large, the naive mean field approximation (59) may have an exponentially large number of solutions. It is interesting to study the bifurcation diagram of the α-trajectory.
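A quick numerical check of (85)-(87) for the two-spin case (a sketch; `m1`, `m2` stand for the components of the projected point $\mathbf{m}_1[q]$):

```python
import numpy as np

def naive_hessian_posdef(w, m1, m2):
    """Positive-definiteness of H^1 for n = 2, eqs. (85)-(86):
    H^1 = [[1/(1-m1^2), -w], [-w, 1/(1-m2^2)]]."""
    H = np.array([[1 / (1 - m1 ** 2), -w],
                  [-w, 1 / (1 - m2 ** 2)]])
    return bool(np.all(np.linalg.eigvalsh(H) > 0))   # equivalent to eq. (87)

print(naive_hessian_posdef(0.5, 0.2, 0.2))   # True: local minimum
print(naive_hessian_posdef(2.0, 0.2, 0.2))   # False: saddle, eq. (87) violated
```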

[Figure 4: Bifurcation. Starting from the unique m-projection at $\alpha = -1$, the α-trajectory $\mathbf{m}_\alpha[q]$ in $E$ splits into several branches as α increases toward $\alpha = 1$.]

9 Small w approximation of the α-trajectory

We give an explicit formula for the α-projection $\mathbf{m}_\alpha$ assuming that the $w_{ij}$ are small. The TAP solution corresponds to $\mathbf{m}_{-1}[q]$ under this approximation. Let us consider an exponential family $\{p(\mathbf{x}; \theta)\}$. For two nearby distributions $q = p(\mathbf{x}; \theta + d\theta)$ and $p = p(\mathbf{x}; \theta)$, we have

$$D_\alpha(q, p) = D_\alpha(\theta + d\theta, \theta) = \frac{1}{2}\sum_{i,j} g_{ij}\, d\theta_i\, d\theta_j + \frac{3-\alpha}{12}\sum_{i,j,k} T_{ijk}\, d\theta_i\, d\theta_j\, d\theta_k, \qquad (88)$$

where

$$g_{ij} = \frac{\partial^2}{\partial \theta_i\, \partial \theta_j}\,\psi(\theta), \qquad T_{ijk} = \frac{\partial^3}{\partial \theta_i\, \partial \theta_j\, \partial \theta_k}\,\psi(\theta) = \partial_k g_{ij}. \qquad (89)$$

In our case, $\theta = (W, \mathbf{h})$, and for $q = q(\mathbf{x}; dW, \mathbf{h})$ and $p = p(\mathbf{x}; 0, \bar{\mathbf{h}})$, we have

$$d\theta = (dW, d\mathbf{h}), \qquad (90)$$

where $W = dW$ is assumed to be small and $d\mathbf{h} = \mathbf{h} - \bar{\mathbf{h}}$. We can calculate $g_{ij}$ and $T_{ijk}$ at $p \in E$; for example, for $I = (i, j)$ and $J = (k, l)$,

$$g_{IJ} = E_p\left[(x_i x_j - m_i m_j)(x_k x_l - m_k m_l)\right]. \qquad (91)$$

The metric $G$ consists of the three parts $g_{IJ}$, $g_{Ik}$, $g_{kl}$, where $I, J$, etc. are index pairs corresponding to $dW_I = w_{ij}$, and small-letter indices $i, j$, etc. refer to $dh_i = h_i - \bar{h}_i$. Note that

$$g_{ij} = E\left[(x_i - m_i)(x_j - m_j)\right], \qquad (92)$$

$$T_{ijk} = E\left[(x_i - m_i)(x_j - m_j)(x_k - m_k)\right], \qquad (93)$$

etc. By using this, we have the following expansion,

$$D_\alpha[q, p] = \frac{1}{2}\, g_{IJ}\, dw^I dw^J + \frac{1}{2}\, g_{ij}\, dh^i dh^j + g_{Ik}\, dw^I dh^k + \frac{3-\alpha}{12}\left(T_{IJK}\, dw^I dw^J dw^K + 3 T_{Ijk}\, dw^I dh^j dh^k + 3 T_{IJk}\, dw^I dw^J dh^k + T_{ijk}\, dh^i dh^j dh^k\right), \qquad (94)$$

where the summation convention is used for repeated indices. In order to obtain the α-projection, we solve

$$\frac{\partial D_\alpha}{\partial \bar{h}_i} = 0, \qquad (95)$$

where the indices $i$ of $d\theta_i$ are decomposed into index pairs $I = (i, j)$, etc. for $W_I = w_{ij}$ and single indices $i, j, \cdots$ for $h_i$. For example, we have

$$0 = -g_{il}\, dh^i - g_{Il}\, dw^I - \frac{1-\alpha}{4}\left(T_{IJl}\, dw^I dw^J + T_{ijl}\, dh^i dh^j + 2 T_{Ikl}\, dw^I dh^k\right) + O\left(w^3\right), \qquad (96)$$

where we used

$$\frac{\partial\, dh^i}{\partial \bar{h}_l} = -\delta_{il}. \qquad (97)$$

The first-order solution to (96) does not depend on α,

$$\bar{h}_i - h_i = g^{il}\, g_{Il}\, dw^I, \qquad (98)$$

or

$$\bar{h}_i = h_i + \sum_j w_{ij} m_j, \qquad (99)$$

which is the naive mean field approximation.

In order to calculate higher-order corrections, we note

$$g_{ij} = \delta_{ij}\left(1 - m_i^2\right), \qquad (100)$$

$$g_{Ik} = \left(1 - m_k^2\right)\left\{\delta_{ki}\, m_j + \delta_{kj}\, m_i\right\}, \quad I = (i, j), \qquad (101)$$

$$g_{IJ} = E\left[(x_i x_j - m_i m_j)(x_k x_l - m_k m_l)\right], \quad I = (i, j),\ J = (k, l), \qquad (102)$$

etc. The quantities $T$ can be calculated similarly. The second-order correction term $A_l$, which is given by the second term of (96), is obtained after painful calculations as

$$A_l = m_l \sum_k \left(w_{lk}\right)^2 \left(1 - m_k^2\right). \qquad (103)$$

Some easy terms are

$$T_{ijl}\, dh^i dh^j = -2 m_l\left(1 - m_l^2\right)\left(dw^{lk} m_k\right)^2, \qquad (104)$$

$$T_{Ikl}\, dw^I dh^k = 2 m_l\left(1 - m_l^2\right)\left(dw^{lk} m_k\right)^2 - \left(1 - m_l^2\right)\left(1 - m_k^2\right) dw^{lk}\, dw^{ks}\, m_s. \qquad (105)$$

After all, we have

$$\bar{h}_l = h_l + \sum_k w_{lk} m_k - \frac{1-\alpha}{2}\, m_l \sum_k \left(w_{lk}\right)^2\left(1 - m_k^2\right), \qquad (106)$$

or

$$\bar{m}_l = \tanh\left[h_l + \sum_k w_{lk} \bar{m}_k - \frac{1-\alpha}{2}\, \bar{m}_l \sum_k \left(w_{lk}\right)^2\left(1 - \bar{m}_k^2\right)\right]. \qquad (107)$$

This gives the α-projections in terms of the parameter α, where $\alpha = 1$ gives the naive approximation and $\alpha = -1$ gives the TAP approximation. This is the small-$w$ approximation of the α-trajectory and is valid for small $w$.
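A minimal sketch iterating (107) (damping and initialization are my choices; $\alpha = 1$ reproduces the naive iteration of Section 6, while $\alpha = -1$ adds the Onsager reaction term of TAP):

```python
import numpy as np

def alpha_mean_field(W, h, alpha, damping=0.5, tol=1e-12, max_iter=10_000):
    """Damped fixed-point iteration of eq. (107); valid for small w."""
    m = np.zeros_like(h)
    W2 = W ** 2
    for _ in range(max_iter):
        # -(1 - alpha)/2 * m_l * sum_k w_lk^2 (1 - m_k^2), the correction in (107)
        reaction = -0.5 * (1 - alpha) * m * (W2 @ (1 - m ** 2))
        m_new = (1 - damping) * np.tanh(h + W @ m + reaction) + damping * m
        if np.max(np.abs(m_new - m)) < tol:
            break
        m = m_new
    return m

m_tap = alpha_mean_field(W, h, alpha=-1.0)    # TAP solution
m_naive = alpha_mean_field(W, h, alpha=1.0)   # matches naive_mean_field(W, h)
```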

Conclusions

The present article studies the geometrical structure underlying mean field approximation. Information geometry, which provides the Riemannian metric together with dual pairs of affine connections, is used for this purpose. Information geometry gives the α-structure to the manifold of probability distributions of the SK spin glass system or the Boltzmann

machine. The α-divergence, which is invariant under a certain criterion, is defined in this manifold. The mean field approximation is a method of calculating quantities related to a complex probability distribution by using a simple tractable model such as the family of independent distributions. The α = -1 projection of the distribution to the submanifold consisting of independent distributions is known to give the correct answer, but it is intractable. The α = 1 approximation is tractable, but it gives only an approximation. We search for the possibility of using the α-approximation that minimizes the α-divergence. It is unfortunately difficult to calculate, but we have studied its properties for future work. We have also shown the information-geometric meaning of the TAP approximation. We have elucidated the fact that the α = -1 projection is unique, giving the true solution, but the α-approximation (α ≠ -1) is not necessarily unique. When we study the trajectory consisting of the α-projections, there are a number of bifurcations where the α-projection bifurcates. It is an interesting problem to study the properties of such bifurcations and their implications. The present article is preliminary to further studies on interesting problems connecting information geometry and statistical physics.

References

[1] S. Amari, Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics 28, Springer-Verlag, 1985.

[2] S. Amari, Dualistic geometry of the manifold of higher-order neurons, Neural Networks, 4:443-451, 1991.

[3] S. Amari, Information geometry of EM and em algorithms for neural networks, Neural Networks, 8:1379-1408, 1995.

[4] S. Amari, Natural gradient works efficiently in learning, Neural Computation, 10:251-276, 1998.

[5] S. Amari, K. Kurata, and H. Nagaoka, Information geometry of Boltzmann machines, IEEE Transactions on Neural Networks, 3:260-271, 1992.

[6] S. Amari and H. Nagaoka, Methods of Information Geometry, AMS and Oxford University Press, 2000.

[7] C. Bhattacharyya and S. S. Keerthi, Mean-field methods for stochastic connectionist networks, Technical Report IISc-CSA-00-03.

[8] C. Bhattacharyya and S. S. Keerthi, Plefka's mean-field theory from a variational viewpoint, Technical Report IISc-CSA-00-02.

[9] H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations, Annals of Mathematical Statistics, 23:493-507, 1952.

[10] N. N. Chentsov (Čencov), Statistical Decision Rules and Optimal Inference, American Mathematical Society, Rhode Island, U.S.A., 1982 (originally published in Russian, Nauka, Moscow, 1972).

[11] I. Csiszár, I-divergence geometry of probability distributions and minimization problems, The Annals of Probability, 3:146-158, 1975.

[12] H. Nagaoka and S. Amari, Differential geometry of smooth families of probability distributions, Technical Report METR 82-7, Dept. of Math. Eng. and Instr. Phys., Univ. of Tokyo, 1982.

[13] A. Rényi, On measures of entropy and information, in Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547-561, University of California Press, 1961.

[14] T. Tanaka, Information geometry of mean field approximation, Neural Computation, to appear.

[15] C. Tsallis, Possible generalization of Boltzmann-Gibbs statistics, Journal of Statistical Physics, 52:479-487, 1988.
