Information Geometry of Neural Networks: New Bayesian Duality Theory
Shun-ichi Amari
RIKEN Frontier Research Program, RIKEN, Wako-shi, Hirosawa 2-1, Saitama 351-01, Japan
[email protected]
Abstract
Information geometry is a method of analyzing the geometrical structure of a family of information systems. A family of neural networks forms a neuromanifold, and it is important to study its geometrical structure in order to elucidate its capabilities of information processing. The present paper proposes a new mathematical theory of the dynamic interactions of lower and higher neural systems connected by feedback and feedforward connections. A new duality structure is introduced into the Bayesian framework from the point of view of information geometry.
1 Introduction
Theoreticians have so far speculated about some basic but primitive mechanisms of information processing in the brain. We can mention, among others, the autocorrelation associative memory model and the Boltzmann machine (its stochastic version), the formation of topological maps by self-organization, and so on. However, it is believed that a fundamental role is played by a much more complex hierarchical structure: for example, the dynamic interactions of lower and higher systems connected by feedforward and feedback connections. Recent physiological findings by Miyashita's group (Miyashita [1988]) and by Tanaka's group (Ito et al. [1994]) also suggest the importance of such a hierarchical structure. This was also emphasized theoretically by Kawato et al. [1991], by Mumford [1991, 1992], and by many others. On the other hand, there are new structured theoretical models such as the mixture of expert nets, the Helmholtz machine (Dayan, Hinton and Neal [1995]) and the Ying-Yang machine (Xu [1996]). These were proposed in relation to learning or self-organization, and we still need to construct models responsible for the dynamic interactions of activation. The present paper is the first attempt towards the construction of a new theoretical framework that accounts for such dynamic interactions of lower and higher systems. It is based on Bayesian statistics and information geometry. To this end, a new dualistic framework is given to Bayesian statistics, and EM- or em-type dynamical interactions are introduced in the manifold of neural activities. This is only a first step, and much work in this direction remains to be done.
2 Bayesian Duality in Exponential Families
Let us consider an exponential family of probability distributions of a vector random variable x = (x_i), parameterized by the natural vector parameter \theta = (\theta_i):

p(x \mid \theta) = \exp\{\theta \cdot x - \tilde{k}(x) - \psi(\theta)\}.   (1)

The set of all such distributions forms a manifold S, which is equipped with the Riemannian metric g_{ij}(\theta) given by the Fisher information matrix and, moreover, with dually flat affine connections (Amari [1985, 1995], Amari et al. [1992]). There are two fundamental coordinate systems \theta and \eta in it. Here the natural parameter \theta is e-affine, and the expectation parameter

\eta = E_\theta[x]   (2)

is m-affine, where E_\theta denotes the expectation with respect to p(x \mid \theta). Bayesian statistics introduces a prior distribution p(\theta) on the parameter space S by considering \theta as a random variable chosen subject to p(\theta). We can then consider the joint probability distribution

p(x, \theta) = \exp\{\theta \cdot x - \psi(\theta) - \tilde{k}(x) + \pi(\theta)\},   (3)

where \pi(\theta) = \log p(\theta). The previous p(x \mid \theta) is the conditional distribution derived therefrom. The marginal distribution of x is given by

p(x) = \int p(x, \theta)\, d\theta.   (4)
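As a minimal numerical illustration of Eqs. (1)-(4) (not part of the original development), consider a one-dimensional Bernoulli family with natural parameter \theta, \psi(\theta) = \log(1 + e^\theta) and \tilde{k}(x) = 0, together with an arbitrarily chosen Gaussian prior p(\theta). The sketch below, which assumes NumPy and a simple grid quadrature, computes the joint distribution (3) and the marginal (4):

```python
import numpy as np

# Minimal sketch of Eqs. (1)-(4) for a one-dimensional Bernoulli exponential
# family: p(x|theta) = exp{theta*x - psi(theta)}, with psi(theta) = log(1+e^theta)
# and k_tilde(x) = 0.  The Gaussian prior and the grid are illustrative choices.

def psi(theta):
    return np.log1p(np.exp(theta))            # cumulant function psi(theta)

def p_x_given_theta(x, theta):
    return np.exp(theta * x - psi(theta))     # Eq. (1)

thetas = np.linspace(-6.0, 6.0, 2001)                       # grid over theta
prior = np.exp(-0.5 * thetas ** 2) / np.sqrt(2 * np.pi)     # prior p(theta): N(0, 1)

for x in (0, 1):
    joint = p_x_given_theta(x, thetas) * prior    # Eq. (3): p(x, theta)
    marginal = np.trapz(joint, thetas)            # Eq. (4): p(x) = integral of p(x, theta) d theta
    print(f"p(x={x}) = {marginal:.4f}")

# Eq. (2): the expectation parameter eta = E_theta[x] is the logistic function of theta.
eta = 1.0 / (1.0 + np.exp(-thetas))
```

The two marginals p(x = 0) and p(x = 1) sum to one up to the truncation of the grid.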
We then have the conditional distribution p(\theta \mid x) of \theta conditioned on the observed x, which is also called the posterior distribution. A slight abuse of notation, such as p(x), p(\theta), p(\theta \mid x), etc., is used throughout the paper where the meaning is clear from the variables. The conditional distribution is

p(\theta \mid x) = \exp\{x \cdot \theta - k(\theta) - \tilde{\psi}(x)\},   (5)

where

k(\theta) = \psi(\theta) - \pi(\theta),  \tilde{\psi}(x) = \tilde{k}(x) + \tilde{\pi}(x),  \tilde{\pi}(x) = \log p(x).

By considering x as a parameter vector specifying the distributions of \theta, the set of conditional probabilities \{p(\theta \mid x)\} can again be regarded as an exponential family, where the natural parameter is given by \tilde{\theta} = x, and the expectation parameter is given by
\tilde{\eta} = E_x[\theta],   (6)

the conditional expectation of \theta with respect to the posterior distribution corresponding to an observed x. The set of such distributions forms another dually flat manifold \tilde{S}, provided x is a continuous vector variable. It is our intuitive idea that \theta and x represent the neural activities of a higher-order system and a lower-order system, respectively. Hence, p(x \mid \theta) and p(\theta \mid x) represent, respectively, the dynamic interactions caused by feedback and feedforward connections. Before explaining the idea in more detail, we give a duality between S and \tilde{S}. There is a natural bijective mapping between them that maps a point \theta = (\theta_i) in S to \tilde{\eta} = (\tilde{\eta}_i) in \tilde{S} by \tilde{\eta} = \theta, or, equivalently, maps \eta = (\eta_i) to x = (x_i) in \tilde{S} by x = \eta. Here, it should be noted that the natural and expectation parameters are interchanged by the mapping. This implies that an e-geodesic (m-geodesic) is mapped to an m-geodesic (e-geodesic).

We summarize the roles of e- and m-projections in S. Let us consider a submanifold M in S. It is written in the form M = \{\theta(u)\}, where u is the parameter (coordinate) system inside M, whose dimensionality is smaller than that of S. Given a point \eta in S, let \hat{u} be such that \theta(\hat{u}) is the point in M that minimizes the KL-divergence,

D[\eta : \theta(\hat{u})] = \min_{\theta(u) \in M} D[\eta : \theta(u)],  where  D[\theta : \theta'] = E_\theta\left[\log \frac{p(x \mid \theta)}{p(x \mid \theta')}\right].   (7)

Then, the m-geodesic connecting \eta and \theta(\hat{u}) is orthogonal to M at \theta(\hat{u}). This implies that \theta(\hat{u}) is the m-geodesic projection of \eta to M. In terms of statistics, \hat{u} gives the maximum likelihood estimator when the observed data point is \eta. On the contrary, let D be another submanifold of S defined by D = \{\eta(v)\} in terms of the \eta-coordinates. Here, v is the inner coordinate system in D. Given a point \theta in S, let \hat{v} be the point in D that minimizes D[\eta(v) : \theta], that is,

\min_{\eta(v) \in D} D[\eta(v) : \theta] = D[\eta(\hat{v}) : \theta].

Then, \eta(\hat{v}) is given by the conditional expectation of x with respect to \theta,

\eta(\hat{v}) = E_\theta[x \mid x \in D],

under a certain condition (Amari [1995]). Similar discussions hold in \tilde{S}, where the roles of m and e are interchanged.
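The following sketch (again an illustrative toy, not from the paper) makes the m-projection of Eq. (7) concrete. Here S is the set of distributions over two binary variables, and M is taken to be the independence model, which is e-flat because it is the linear subspace \theta_{12} = 0 in the natural coordinates. The m-projection of a correlated distribution q onto M, obtained here by numerically minimizing the KL-divergence with SciPy, coincides with the product of the marginals of q, i.e., with matching of the expectation parameters, as the theory predicts:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the m-projection of Eq. (7).  S is the set of distributions over
# (x1, x2) in {0,1}^2; the submanifold M is the independence model, which is
# e-flat (theta_12 = 0 in natural coordinates).  All numbers are illustrative.

states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

def model(theta1, theta2):
    """Member of M with natural parameters (theta1, theta2, 0)."""
    p = np.exp(theta1 * states[:, 0] + theta2 * states[:, 1])
    return p / p.sum()

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.40, 0.10, 0.15, 0.35])        # a correlated distribution to project

# m-projection: minimize D[q : theta(u)] over u, cf. Eq. (7).
res = minimize(lambda u: kl(q, model(u[0], u[1])), x0=np.zeros(2))
p_star = model(*res.x)

# The m-projection onto the independence model equals the product of the
# marginals of q (moment matching, i.e., the maximum likelihood member of M).
m1, m2 = q[2] + q[3], q[1] + q[3]             # P(x1 = 1), P(x2 = 1)
p_check = np.array([(1 - m1) * (1 - m2), (1 - m1) * m2, m1 * (1 - m2), m1 * m2])

print("m-projection     :", np.round(p_star, 4))
print("product marginals:", np.round(p_check, 4))
```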
3 Dual structure in layered Boltzmann machines
Let us consider a Boltzmann machine consisting of mutually connected lower and higher networks. Let x be an activity pattern of the lower machine (sensory machine) and y be an activity pattern of the higher machine (concept machine). Here, the components x_i and y_i take on the values 0 and 1, but we consider their activation rates (or averages over ensembles of neurons playing the same role), so that they take real values between 0 and 1. We first consider the Boltzmann machine with symmetric connections, where the matrix W^X = (W^X_{ij}) denotes the connections between neurons i and j in the lower machine, W^Y = (W^Y_{ij}) those in the higher machine, and W^{XY} = (W^{XY}_{ij}) those between the two machines. Then, the stationary distribution of activities in the machine is given by

p(x, y) = \exp\{y' W^{XY} x + W^X \cdot X + W^Y \cdot Y - \psi(W)\},   (8)

where we used the notation

W^X \cdot X = \frac{1}{2} x' W^X x   (9)

and so on, ' denoting transposition. By putting \theta(y) = (W^{XY})' y, we define the joint probability of x and \theta(y) given by

p(x, \theta) = \exp\{\theta \cdot x - k(\theta) - \tilde{k}(x)\},   (10)

where

k(\theta) = -W^Y \cdot Y + \frac{1}{2}\psi(W),   (11)

\tilde{k}(x) = -W^X \cdot X + \frac{1}{2}\psi(W).   (12)

This distribution is derived from the Boltzmann machine. Moreover, it defines the two conditional distributions p(x \mid \theta) and p(\theta \mid x) of the exponential type. They represent the dynamic interactions of neural activities between the two systems.

Let us consider the case where the number of neurons is smaller in the higher-order system. Then, the set of \theta determined by the neural activities y of the higher system forms an e-flat submanifold M in S,

M = \{\theta(y) \mid \theta(y) = (W^{XY})' y\}.   (13)

The parameter \theta(y), which represents the activations of the higher system, specifies the probabilities of activation of x through the conditional probability p(x \mid \theta). This represents the effects of the feedback connections. The lower system receives sensory inputs (or inputs from a further lower system). Let s be the sensory input. The lower system analyzes it, but the sensory inputs are usually imperfect and ambiguous. Let us consider the set of all possible excitations x in the system compatible with s. This is the data set D_s specified by s,

D_s = \{x \mid x is compatible with s\}.   (14)

D_s is a submanifold of S. Let h denote the activities of the hidden neurons in the lower system, which are not uniquely determined by s. More precisely, s and h can be any parameters that determine x = x(s, h), where s is observable but h is hidden and is to be determined by the dynamics. When s is observed, dynamical interactions take place in the lower system to obtain a plausible x \in D_s. On the other hand, the first candidate x stimulates the higher system by the stochastic dynamics p\{\theta(y) \mid x\} to generate the higher-order interpretation of s, giving a concept corresponding to it. The higher-order system has its own inner connections forming an associative memory system. There are a number of peaks of the stationary probability in it, each corresponding to a concept or a memorized pattern. This is the stochastic version of the autoassociative memory model. The activities y thus aroused from x in turn stimulate the lower system from the higher-order point of view to give a new x. Thus the dynamical interactions take place, where the feedforward flow quickly generates a hypothetical conceptual recognition and the feedback flow checks the concept thus recognized to complete the lower data. This is the interpretation given by many researchers (for example, Kawato and Inui, Mumford). Now we give its mathematical formulation in terms of information geometry.
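To make the construction of Eqs. (8)-(13) concrete, the following sketch (toy layer sizes and random symmetric weights, chosen only for illustration) enumerates all states of a small layered Boltzmann machine, computes the stationary joint distribution (8) exactly, and extracts the feedback and feedforward conditionals p(x \mid \theta(y)) and p(y \mid x) as well as the parameters \theta(y) spanning the submanifold M:

```python
import itertools
import numpy as np

# Sketch of the layered Boltzmann machine of Eqs. (8)-(13) with toy layer sizes
# and random symmetric weights (illustrative assumptions).  All 2^(nx+ny) states
# are enumerated, so psi(W) and the conditionals can be computed exactly.

rng = np.random.default_rng(0)
nx, ny = 3, 2

WX = rng.normal(size=(nx, nx)); WX = (WX + WX.T) / 2; np.fill_diagonal(WX, 0.0)
WY = rng.normal(size=(ny, ny)); WY = (WY + WY.T) / 2; np.fill_diagonal(WY, 0.0)
WXY = rng.normal(size=(ny, nx))                 # connections between the two layers

xs = np.array(list(itertools.product([0, 1], repeat=nx)))
ys = np.array(list(itertools.product([0, 1], repeat=ny)))

def exponent(x, y):
    # Exponent of Eq. (8): y' WXY x + WX.X + WY.Y, with WX.X = (1/2) x' WX x (Eq. (9)).
    return y @ WXY @ x + 0.5 * x @ WX @ x + 0.5 * y @ WY @ y

unnorm = np.array([[np.exp(exponent(x, y)) for x in xs] for y in ys])
p_xy = unnorm / unnorm.sum()                    # joint p(x, y); log of the normalizer is psi(W)

p_y = p_xy.sum(axis=1, keepdims=True)           # marginal of y
p_x = p_xy.sum(axis=0, keepdims=True)           # marginal of x
p_x_given_y = p_xy / p_y                        # feedback distribution   p(x | theta(y))
p_y_given_x = p_xy / p_x                        # feedforward distribution p(y | x)

# theta(y) = (WXY)' y: the natural parameters imposed on the lower layer by y,
# i.e., the points spanning the e-flat submanifold M of Eq. (13).
theta_of_y = ys @ WXY

assert np.allclose(p_x_given_y.sum(axis=1), 1.0)
assert np.allclose(p_y_given_x.sum(axis=0), 1.0)
```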
4 Dynamical procedures of e- and m-projections
A. Feedforward estimation. Given sensory information s, the lower net generates the first rough pattern x \in D_s. From this, the first candidate \theta(y) \in M should be found quickly. The maximum posterior estimate \theta(\hat{y}) is the one that maximizes the posterior distribution p(\theta \mid x) = p(\theta) p(x \mid \theta) / p(x), or equivalently the one that maximizes

\log p(\theta) + \log p(x \mid \theta)   (15)

under the constraint \theta \in M. The maximum likelihood estimator \hat{y}_{mle} is the one that maximizes the second term, and is given by m-projecting x to M in the space S. The first term transforms \hat{y}_{mle} further into the maximum posterior estimate \hat{y} in M by the dynamical flow \nabla \log p(y) in M. This second transformation depends only on \hat{y}_{mle} when M is e-flat. Since the concepts are highly stable equilibrium points, \hat{y} is expected to have a sharp distribution. Hence, \hat{y} is very close to the conditional expectation of y with respect to the posterior p(\theta \mid x). The latter is given by e-projecting x to M in \tilde{S}.

B. Feedback estimation. Given a candidate \theta(y), p\{x \mid \theta(y)\} is rather broadly distributed. Since there are many neurons playing similar roles in the lower system, it is natural to think that the conditional expectation

E_{\theta(y)}[x \mid x \in D_s]   (16)

is realized by the feedback interactions. It is the minimizer

\arg\min_{x \in D_s} D[x : \theta(y)],   (17)

and is given by e-projecting \theta(y) to D_s.

This dynamical process of feedforward and feedback interactions is sketched in S as follows.
1) Given s, generate one x \in D_s.
2) m-project x to M, and make the necessary corrections by the associative dynamics of the higher system to give \theta(y).
3) e-project this \theta(y) to D_s to give a new x \in D_s.
4) Repeat 2) and 3) until the process converges.

The convergence of the process will be analyzed geometrically. We can prove that the above procedure minimizes the function

F(x, \theta) = D\{x : \theta(y)\} - \log p(\theta)   (18)

under the condition that x \in D_s and \theta \in M.
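A toy numerical version of the alternating procedure 1)-4) is sketched below (the layer sizes, random weights and the particular constraint defining D_s are all illustrative assumptions, not taken from the paper). The feedforward step picks the concept y maximizing the posterior, cf. Eq. (15), and the feedback step replaces x by the conditional expectation of Eq. (16) over the states compatible with s; the alternation repeats until the activation pattern stops changing:

```python
import itertools
import numpy as np

# Toy sketch of the alternating procedure 1)-4): sizes, weights and the
# constraint defining D_s are illustrative assumptions, not taken from the paper.
# x is treated as an activation-rate vector in [0,1]^nx, as in Section 3.

rng = np.random.default_rng(1)
nx, ny = 4, 2
WX = rng.normal(size=(nx, nx)); WX = (WX + WX.T) / 2; np.fill_diagonal(WX, 0.0)
WY = rng.normal(size=(ny, ny)); WY = (WY + WY.T) / 2; np.fill_diagonal(WY, 0.0)
WXY = rng.normal(size=(ny, nx))

xs = np.array(list(itertools.product([0, 1], repeat=nx)), dtype=float)
ys = np.array(list(itertools.product([0, 1], repeat=ny)), dtype=float)

def exponent(x, y):
    # Exponent of Eq. (8), up to the constant psi(W).
    return y @ WXY @ x + 0.5 * x @ WX @ x + 0.5 * y @ WY @ y

# Sensory constraint: D_s = { x | x_0 = 1 }, an illustrative instance of Eq. (14).
Ds = xs[xs[:, 0] == 1.0]

x = Ds[0].copy()                                   # 1) generate one x in D_s
y = ys[0]
for step in range(20):
    # 2) feedforward: the concept y maximizing the posterior p(y | x), cf. Eq. (15)
    y = ys[np.argmax([exponent(x, yc) for yc in ys])]

    # 3) feedback: conditional expectation E_{theta(y)}[x | x in D_s], cf. Eq. (16),
    #    computed exactly by enumerating the members of D_s
    w = np.exp(np.array([exponent(xc, y) for xc in Ds]))
    x_new = (w[:, None] * Ds).sum(axis=0) / w.sum()

    if np.allclose(x_new, x):                      # 4) repeat until convergence
        break
    x = x_new

print("activation rates x =", np.round(x, 3), "  concept y =", y)
```

In this toy setting the alternation settles after a few steps, since y ranges over a finite set and the feedback update is deterministic given y.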
5 Conclusions
A primitive but promising new idea has been given for the dynamic interactions of lower and higher systems in terms of information geometry. The Boltzmann machine is used for defining the conditional probabilities p(x \mid \theta) and p(\theta \mid x) of their interactions. However, it is not plausible to assume symmetric connections W^{XY}_{ij} = W^{YX}_{ji} between the lower and higher systems. It is possible to define the two conditional distributions even when this symmetry does not hold, which gives a more realistic framework. Much research remains to be done along this line. One task is to give more rigorous mathematical descriptions. Another is to show interesting properties of this system in more detail. It is also important to apply the framework to give plausible models for physiological findings such as Miyashita's and Tanaka's.

References
[1] S. Amari. Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics, vol. 28, Springer, 1985.
[2] S. Amari. Information geometry of the EM and em algorithms for neural networks, Neural Networks, 8, No. 9, 1379-1408, 1995.
[3] S. Amari, K. Kurata, H. Nagaoka. Information geometry of Boltzmann machines, IEEE Trans. on Neural Networks, 3, 260-271, 1992.
[4] P. Dayan, G. E. Hinton, R. M. Neal and R. S. Zemel. The Helmholtz machine, Neural Computation, 7, 889-904, 1995.
[5] M. Ito, I. Fujita, H. Tamura, K. Tanaka. Processing of contrast polarity of visual images in inferotemporal cortex of the Macaque monkey, Cerebral Cortex, 5, 499-508, 1994.
[6] M. Kawato, T. Inui, S. Hongo and H. Hayakawa. Computational theory and neural network models of interaction between visual cortical areas, ATR Technical Report, TR-A-0105, ATR, Kyoto, 1991.
[7] Y. Miyashita. Neuronal correlate of visual associative long-term memory in the primate temporal cortex, Nature, 335, 817-820, 1988.
[8] D. Mumford. On the computational architecture of the neocortex, I: The role of the thalamo-cortical loop, Biological Cybernetics, 65, 135-145, 1991.
[9] D. Mumford. On the computational architecture of the neocortex, II: The role of cortico-cortical loops, Biological Cybernetics, 66, 241-251, 1992.
[10] L. Xu. A unified learning scheme: Bayesian-Kullback Ying-Yang machine, Advances in Neural Information Processing Systems, 8, eds. David S. Touretzky, Michael Mozer, Michael Hasselmo, MIT Press, Cambridge, MA, 1996.