A GENERALIZED THEORY OF MAXIMUM ENTROPY PREDICTION BY NEURONS

William B. Levy and Hakan Delic
Department of Neurological Surgery
Health Sciences Center Box 420
University of Virginia
Charlottesville, Virginia 22908
Abstract: The maximum entropy procedure is an axiomatically derived inference technique, producing density functions from moment constraints. In this paper we consider the implications of a hypothesized neuronal prediction based on the maximum entropy principle and investigate the computational and biological consequences of this hypothesis. The ultimate goal of the paper is to translate, as generally as possible, the maximum entropy inference technique into the context of neural computation.

I. INTRODUCTION

In 1989 and 1990 [2, 5], we introduced the hypothesis that certain neurons could perform probability inference via the maximum entropy (M.E.) technique, an axiomatically derived inference procedure [1, 6]. As introduced, this neural maximum entropy inference (N.M.E.) hypothesis concentrated its exposition on a specific example. This example was partially inspired by: (1) a limited number of physiological studies of long-term potentiation/depression from the hippocampus; (2) a specific, but arguably arbitrary, set of neuronal properties.

As a follow-up to these initial reports, the purpose of the present article is two-fold. First, the N.M.E. hypothesis is a much more general idea than just the specific example that was developed. Moreover, it is quite possible that the general hypothesis is correct, but the specific example given was false. That is, because of our biological naivety and ignorance, we failed to seize upon the critical specifics of biological plausibility for items (1) and (2) above. Second, there are many restrictions and requirements associated with the application of the M.E. method to neurons which have not been presented. These restrictions and requirements are, by implication, predictions by the N.M.E. hypothesis as to what neurons need to do in order to implement the M.E. inference procedure.
The ultimate goal of this paper is to elaborate on the specific example of the hippocampus in [2], generalizing the results to a generic neuron, and to translate the maximum entropy inference procedure, which was axiomatically derived by Shore and Johnson (1980), into the neural computation context as generally as possible.

II. THE MAXIMUM ENTROPY INFERENCE

Consider a set of expectations, E[·], over arbitrary functions gi(X) that are not themselves expectations or inferable, as expectations, from the others. Specify the set of available expectations as M = {E[gi(X)], i = 1, 2, ..., m > 0}. The maximum entropy inference procedure states that, of all the densities that satisfy the constraints in the set M, one should choose the density with the largest entropy [6]. Then, as given by Jaynes (1957),

$$ \text{M.E.}:\; M \to f^*(X) = \exp\Big\{ -\lambda_0 - \sum_{i=1}^{m} \lambda_i\, g_i(X) \Big\}, \tag{1} $$

where f*(X) is a density function dependent on the constraint set M and

$$ \int_{\{X:\, f^*(X) > 0\}} dX\, f^*(X) = 1. $$

The values in the constraint set and the last condition imply the values of the variables λ0 through λm. This inferred density function, f*(X), is the unique density that fits all the information in the constraint set M while implying no more information than what is in this constraint set. Considering f*(X) as the result, any other density function either fails to satisfy the statistics in the constraint set or implies a larger set of moment constraints than are actually in (or implied by) the given constraint set.
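As an illustration of equation (1), the following minimal sketch computes the M.E. distribution on a finite support under a single mean constraint, that is, with m = 1 and g1(X) = X. The function names (`maxent_pmf`, `entropy`), the die-face support, and the bisection bounds are assumptions made for this example only; they do not appear in the paper, and the target mean is assumed to lie strictly between the smallest and largest support values.

```python
import math

def maxent_pmf(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """M.E. pmf on a finite support subject to the single constraint E[X] = target_mean.

    Uses the form f*(x) = exp(-lambda_0 - lambda_1 * x), i.e. equation (1) with g_1(x) = x.
    Returns (lambda_0, lambda_1, pmf).
    """
    def mean_for(lam):
        weights = [math.exp(-lam * v) for v in values]
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)

    # The constrained mean is strictly decreasing in lambda_1, so bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > target_mean:
            lo = mid
        else:
            hi = mid
    lam1 = 0.5 * (lo + hi)
    weights = [math.exp(-lam1 * v) for v in values]
    lam0 = math.log(sum(weights))  # the normalization condition fixes lambda_0
    pmf = [w * math.exp(-lam0) for w in weights]
    return lam0, lam1, pmf

def entropy(pmf):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in pmf if p > 0.0)

if __name__ == "__main__":
    faces = [1, 2, 3, 4, 5, 6]
    lam0, lam1, pmf = maxent_pmf(faces, 4.5)
    print("lambda_0 =", lam0, "lambda_1 =", lam1)
    print("pmf =", [round(p, 4) for p in pmf])
    print("entropy =", entropy(pmf))
```

The constraint value and the normalization condition together determine λ0 and λ1, as stated above; with a target mean equal to the unconstrained mean the multiplier λ1 vanishes and the uniform distribution is recovered.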
This paper appeared in the Proceedings of the World Congress on Neural Networks, Vol. II, Portland, Oregon, July 11-15, 1993, pp. 131-134.
III. THE NEURONAL MAXIMUM ENTROPY INFERENCE HYPOTHESIS

Associated with each postsynaptic neuron k is a scalar variable Zk. This postsynaptic neuron receives a set of inputs whose values Xi, i = 1, ..., m, will be used to predict the value of Zk that will occur a little bit into the future. An important point is that the Xi's do not generate the Zk that they predict. (For further details of the prediction problem and the required neurobiology, see [3].) Upper case letters are random variables or vectors; lower case ones are specific realizations.

In order to apply the M.E. inference to neural computation, we must explicitly identify the members of the constraint set as storable quantities and the integration (summation) space as a legitimate, biologically representable state space. A crucial assumption is that Zk and X are joint random variables that have been sampled enough to create reliable statistics on these variables (e.g. E[g(Zk)h(Xi)]/E[g(Zk)], ∀ i, where g and h are bounded functions) and that these statistics are stored via unsupervised associative synaptic modification (see e.g. [5]). In many of the developments, we consider the neurally plausible idea that an individual neuron can store and use its average activity E[Zk].

A maximum entropy inferred prediction formula depends on the elements of the constraint set and the support intervals of the random variables. That is, it is possible to infer as many density functions as there are combinations of constraint sets and support intervals that imply a density. However, not all constraint sets imply a density. For instance, as mentioned in [4], the constraint set M = {E[Zk], E[Zk Xi], ∀ i} fails to produce a legitimate density function for Zk ∈ {0, 1} and Xi ∈ [0, ∞), ∀ i. Thus, certain sets are inherently unsuitable for probabilistic inference and nature should not use them.

Most superficially, the N.M.E. hypothesis goes like this: synapses store statistical correlations; maximum entropy uniquely infers a probability density function from this set of statistics; neurons generate decisions based on these densities. Thus, where knowledge from biological experiments implies the stored statistic at a synapse, the N.M.E. hypothesis predicts the appropriate M.E. density function from the inputs. The N.M.E. hypothesis also predicts the computational form, i.e. the neuronal input-to-output (I-O) function that goes with the stored statistic.

The maximum entropy inference based theory of prediction by neurons is flexible in that there are a variety of suitable moment constraints that could be stored at synapses or anywhere else in an individual neuron. For {0, 1} binary Xi, i = 1, ..., m, and Zk, the inference using {E[Xi | Zk = 1], ∀ i}, or using the combination {E[Xi | Zk = 1], E[Xi | Zk = 0], ∀ i}, are equally valid (but not identical) so long as the indicated statistics are appropriately accessible.

In an idealized situation, neuron k would generate the conditional probability that k will be in the one-state at some time in the future given that its inputs X = [X1 ... Xm] are in a given state now. Because the dimension of X (that is, the set of input lines to neuron k) is quite large, there will be many configurations of X never experienced before the prediction time t. Moreover, even with sufficient sampling, it is impossible to store the exponentially many statistics, for example the 2^m expectations that might be desired. The solution to this problem is to use the M.E. inference on a smaller set of lower-order moments. In accord with the definition of conditional probability and leaving the time dependence implicit,

$$ P^*(Z_k = 1 \mid X) = \frac{f^*(Z_k = 1, X)}{f^*(X)} \tag{2} $$

where P*(Zk | X) is an M.E. inferred probability, f*(Zk, X) is a joint density function inferred from correlations between Zk and Xi, i = 1, ..., m, and f*(X) is the M.E. inferred marginal density of the random vector X. All quantities are computed from sample averages that have nearly converged to the population based expectations. The most likely value of Zk can be decided by finding the largest posterior probability in equation (2), that is, by the maximum a posteriori decision rule. In fact, since the denominator in equation (2) does not depend on Zk, marginalization of f*(Zk, X) is not necessary. The M.E. inferred joint density function f*(Zk, X) provides the sufficient statistics for neuronal decision making.

IV. M.E. INFERRED PREDICTION FORMULAS FOR SPECIFIC CONSTRAINT SETS

In this section, we present some of the possible M.E. formulas that use the variables Xi, i = 1, ..., m, to predict the variable Zk. In each case, we give the constraining state space and the available averages. We shall restrict ourselves to binary-valued predictive output problems; that is, Zk ∈ {0, 1}. However, from the mathematical viewpoint, Zk can take values in other spaces, including continuous intervals such as [0, 1], or even (−∞, ∞), although we seriously question the validity of an infinite range for a biological system.
1) The first example uses a smaller set of statistics than the one presented in reference [2]: Zk ∈ {0, 1}, Xi ∈ {0, 1}, ∀ i; M1 = {E[Zk], E[Zk Xi], ∀ i}. The maximum entropy inferred posterior probability is then

$$ P^*(Z_k = 1 \mid X) = \frac{1}{1 + e^{\alpha(X)}} \tag{3} $$

where

$$ \alpha(X) \triangleq \log\left( \frac{1 - E[Z_k]}{2^m\, E[Z_k]} \right) + \sum_{i=1}^{m} \log\left( \frac{E[Z_k]}{E[Z_k] - E[Z_k X_i]} \right) + \sum_{i=1}^{m} X_i \log\left( \frac{E[Z_k]}{E[Z_k X_i]} - 1 \right). $$
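The sigmoidal formula of equation (3) can be checked against the joint density it comes from. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the statistics are assumed to satisfy 0 < E[Zk Xi] < E[Zk] < 1, and the factorized joint in `joint` (uniform X given Zk = 0, independent Bernoulli Xi with parameter E[Zk Xi]/E[Zk] given Zk = 1) is what the exponential form of equation (1) appears to imply for M1 rather than something stated in the text.

```python
import math
from itertools import product

def alpha(x, ez, ezx):
    """alpha(X) of equation (3) for binary inputs x, given E[Zk] and the list of E[Zk Xi]."""
    m = len(x)
    a = math.log((1.0 - ez) / (2 ** m * ez))
    for xi, ezxi in zip(x, ezx):
        a += math.log(ez / (ez - ezxi))
        a += xi * math.log(ez / ezxi - 1.0)  # each active input adds its own term
    return a

def posterior(x, ez, ezx):
    """P*(Zk = 1 | X = x) from equation (3)."""
    return 1.0 / (1.0 + math.exp(alpha(x, ez, ezx)))

def joint(z, x, ez, ezx):
    """Assumed M.E. joint for M1: uniform X when z = 0, independent Bernoulli X when z = 1."""
    if z == 0:
        return (1.0 - ez) / 2 ** len(x)
    p = ez
    for xi, ezxi in zip(x, ezx):
        q = ezxi / ez  # probability that Xi = 1 given Zk = 1
        p *= q if xi else (1.0 - q)
    return p

if __name__ == "__main__":
    ez, ezx = 0.3, [0.2, 0.1, 0.25]  # hypothetical stored statistics
    for x in product([0, 1], repeat=len(ezx)):
        direct = joint(1, x, ez, ezx) / (joint(0, x, ez, ezx) + joint(1, x, ez, ezx))
        print(x, round(posterior(x, ez, ezx), 6), round(direct, 6))
```

Each input contributes one additive term to α(X), so the computation needs only the stored E[Zk] and the per-synapse statistics E[Zk Xi].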
The prediction formula in equation (3) is the neurally desirable sigmoidal I-O function, and α depends essentially on the sum of the individual synaptic activations. Indeed, many of the M.E. results can be reduced to this form.

2) The prediction generating variables may be continuous: Zk ∈ {0, 1}, Xi ∈ [0, 1]; M2 = {E[Zk], E[Zk Xi], ∀ i}. In this case, Xi, i = 1, ..., m, act as probabilities. Unfortunately, only an implicit solution appears to be available because the M.E. principle produces the following joint density:

$$ f^*(X, Z_k) = \exp\Big\{ -\lambda_0 - \lambda_Z Z_k - \sum_{i=1}^{m} \lambda_i Z_k X_i \Big\}, \tag{4} $$

where the Lagrange multipliers are solutions to

$$ e^{\lambda_0} = 1 + e^{-\lambda_Z} \prod_{i=1}^{m} \frac{1 - e^{-\lambda_i}}{\lambda_i}; \qquad \frac{E[Z_k X_i]}{E[Z_k]} = \frac{1}{\lambda_i} - \frac{e^{-\lambda_i}}{1 - e^{-\lambda_i}}, \quad i = 1, \ldots, m; \qquad e^{-\lambda_0} = 1 - E[Z_k]. \tag{5} $$
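Although equations (4) and (5) have no closed-form solution, each λi is fixed by a single monotone scalar equation and can be found numerically. The following sketch uses plain bisection; the function names, the search interval [−500, 500], and the small-λ series expansion are choices made for this illustration and are not taken from the paper. The ratio E[Zk Xi]/E[Zk] is assumed to lie strictly inside (0, 1) and not so close to either end that the root falls outside the search interval.

```python
import math

def conditional_mean(lam):
    """Mean of a density proportional to exp(-lam * x) on [0, 1]; equals E[Zk Xi]/E[Zk] in (5)."""
    if abs(lam) < 1e-6:
        return 0.5 - lam / 12.0  # series expansion around lam = 0 avoids cancellation
    return 1.0 / lam - 1.0 / math.expm1(lam)

def solve_lambda(ratio, lo=-500.0, hi=500.0, iters=200):
    """Solve conditional_mean(lam) = ratio by bisection (conditional_mean is decreasing)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if conditional_mean(mid) > ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def posterior(x, ez, ezx):
    """P*(Zk = 1 | X = x) implied by the joint density (4) with multipliers from (5)."""
    lams = [solve_lambda(e / ez) for e in ezx]
    # log of prod_i (1 - exp(-lam_i)) / lam_i, with the lam -> 0 limit equal to log 1 = 0
    log_prod = sum(math.log(-math.expm1(-l) / l) if abs(l) > 1e-12 else 0.0 for l in lams)
    lam_z = math.log((1.0 - ez) / ez) + log_prod
    a = lam_z + sum(l * xi for l, xi in zip(lams, x))
    return 1.0 / (1.0 + math.exp(a))

if __name__ == "__main__":
    ez, ezx = 0.4, [0.3, 0.1]  # hypothetical stored statistics
    print([solve_lambda(e / ez) for e in ezx])
    print(posterior([0.9, 0.2], ez, ezx))
```

Tabulating `posterior` over a grid of input values is one way to picture the look-up table interpretation of the I-O function.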
Our failure to produce an explicit solution does not imply that a neuron cannot generate the density represented by equations (4) and (5). The neuron only needs a mapping that is the solution to the equation. Indeed, a look-up table can be used to solve prediction formulas which are represented by transcendental or special functions. This look-up table is just the I-O function for a neuron, and thus, this I-O function is required by the N.M.E. hypothesis for a given type of constraint set.

3) Some constraint sets are a waste of resources: Zk ∈ {0, 1}, Xi ∈ Ii, ∀ i; M3 = {E[Zk], E[gi(Xi)], ∀ i}. In this case, note that the moment constraints are not overlapping. Ii is the support of the random variable Xi, for i = 1, ..., m, and it can be a finite set or a continuous interval. If any one of the Ii's is continuous, however, it must satisfy the following condition:

$$ \int_{I_i} dX_i\, e^{-\lambda_i g_i(X_i)} < \infty. \tag{6} $$

The posterior probability can be shown to be

$$ P^*(Z_k = 1 \mid X) = E[Z_k] \tag{7} $$
for any form of gi as long as equation (6) holds. The conclusion is that if the moment constraints are not overlapping and one of the moments is the expected value of Zk, then the resulting prediction formula is independent of the inputs, as demonstrated by equation (7). Since, for Zk ∈ {0, 1}, P(Zk = 1) = E[Zk], this result is intuitively pleasing; but it shows such constraint sets to be essentially worthless.

4) Here is an alternative to (4) and (5) that has some appeal for its tractability and information: Zk ∈ {0, 1}, Xi ∈ [0, 1], ∀ i; M4 = {E[Zk], E[Zk log Xi], ∀ i}. Provided that the condition λi < 1, i = 1, ..., m, holds, the M.E. inferred prediction formula is sigmoidal as in equation (3) with

$$ \alpha(X) = \log\left( \frac{1 - E[Z_k]}{E[Z_k]} \right) + \sum_{i=1}^{m} \log\left( -\frac{E[Z_k \log X_i]}{E[Z_k]} \right) + \sum_{i=1}^{m} \left( 1 + \frac{E[Z_k]}{E[Z_k \log X_i]} \right) \log X_i. $$

Note that λi = 1 + E[Zk]/E[Zk log Xi], ∀ i, so that the last sum is Σ λi log Xi. Hence, the condition λi < 1 implies that E[Zk log Xi] < 0, which is correct since log Xi < 0 for 0 < Xi < 1.
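The tractability of the fourth constraint set can be seen in a short sketch. It assumes the sign convention of equation (3), P* = 1/(1 + e^{α(X)}), with α(X) = λZ + Σ λi log Xi, where λZ = log((1 − E[Zk])/E[Zk]) + Σ log(−E[Zk log Xi]/E[Zk]) follows from normalizing the density exp{−λ0 − λZ Zk − Σ λi Zk log Xi} on [0, 1]^m. The function names and the numerical values are hypothetical, and inputs are assumed to satisfy 0 < Xi ≤ 1 with E[Zk log Xi] < 0.

```python
import math

def lagrange_multipliers(ez, ezlog):
    """lambda_i = 1 + E[Zk] / E[Zk log Xi]; each is below 1 because E[Zk log Xi] < 0."""
    return [1.0 + ez / e for e in ezlog]

def alpha(x, ez, ezlog):
    """alpha(X) under the equation (3) convention P* = 1 / (1 + exp(alpha))."""
    a = math.log((1.0 - ez) / ez)
    for xi, e in zip(x, ezlog):
        a += math.log(-e / ez)                  # normalization term, equals -log(1 - lambda_i)
        a += (1.0 + ez / e) * math.log(xi)      # lambda_i * log Xi
    return a

def posterior(x, ez, ezlog):
    """P*(Zk = 1 | X = x) for the constraint set M4."""
    return 1.0 / (1.0 + math.exp(alpha(x, ez, ezlog)))

def joint_density(z, x, ez, ezlog):
    """Assumed M.E. joint: uniform X when z = 0, product of power-law densities when z = 1."""
    if z == 0:
        return 1.0 - ez
    d = ez
    for xi, lam in zip(x, lagrange_multipliers(ez, ezlog)):
        d *= (1.0 - lam) * xi ** (-lam)
    return d

if __name__ == "__main__":
    ez, ezlog = 0.4, [-0.2, -0.8]  # hypothetical stored statistics
    x = [0.3, 0.6]
    print(lagrange_multipliers(ez, ezlog))
    print(posterior(x, ez, ezlog))
```

Because every multiplier is an explicit function of the stored averages, no equation solving is needed here, in contrast to the second example.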
V. EXPANDING THE CONSTRAINT SET - MORE INFORMATION

Consider two constraint sets Ma and Mb such that Ma ⊂ Mb. Then, Mb contains more information than Ma does in the sense that the density function inferred from Mb has entropy no larger, and generally smaller, than the one inferred from Ma. Thus, it is possible to do better prediction by increasing the information content in the constraint set. The main concern with an enlarged constraint set is the mathematical manageability of the ensuing probability calculations. Consequently, at times when the existing constraint set requires intractable computations, it is possible to use a workable subset of the constraint set by discarding the incompatible constraints that cause intractability.

Caution is also required to avoid introducing redundancy into the constraint set by adding new members which supply no new information. For instance, the two constraint sets Ma = {E[Zk], E[Zk X], E[X]} and Mb = {E[Zk], E[Zk X], E[Z̄k X], E[X]}, where Z̄k ≜ 1 − Zk, are obviously equivalent since E[Z̄k X] = E[X] − E[Zk X] can be inferred from the members of Ma. The trouble with redundant constraints is that they produce under-determined systems of equations. This is because there are more Lagrange multipliers than equations in such an event, since the redundant constraint can be derived from the rest and yet it is assigned a distinct Lagrange multiplier. Under-determined systems of equations have infinitely many solutions, and hence, in the presence of redundancy in the constraint set, it is impossible to infer a unique prediction formula. However, a neural network could blunder into such a situation and still provide a solution to the M.E. problem.

It is certainly desirable to extend the contents of the constraint set to include information that compensates for the statistical dependence between and among the inputs to a neuron. This can be accomplished by adding higher order correlations, such as E[Zk Xi Xj], ∀ i, j, i ≠ j, into the constraint set.
The resulting computations may turn out to be hard to implement for a neuron. For example, the constraint set {E[Zk], E[Zk Xi], ∀ i, E[Zk Xi Xj], ∀ i, j, i ≠ j}, for Zk ∈ {0, 1} and Xi ∈ [0, ∞), ∀ i, produces a legitimate density function which takes into account the interactions between the two inputs and the output; however, evaluation of the implied density function requires special functions and power series such that it cannot be implemented by an I-O mapping as was suggested for solving the transcendental equation in an earlier example. In general, a simple look-up table is not adequate, and any recursive algorithm between a set of look-up tables implies a tremendous computational burden for a neuron to handle. Thus, any theory suggesting that neurons make use of triple correlations must also explain how a neuron can solve equations that require complex numerical procedures with a biologically plausible I-O function.

VI. CONCLUDING REMARKS

In this paper, we presented a neural network theory of the maximum entropy based prediction problem. To recapitulate, the issues that biology must address are:

1) What can be stored? For instance, the storage of higher-order correlations by synapses is highly unlikely.
2) What can be represented? That is, discrete versus continuous support sets.
3) What can be computed? That is, the computational complexity involving the solution of simultaneous equations for determining the Lagrange multipliers.

A further prediction of the theory is the matching of stored statistics with the appropriate I-O function of the cell. A burden upon neuroscience for investigating this theory is to understand the message passed by an individual presynaptic cell to an individual postsynaptic cell. That is, is the postsynaptic interpretation of the input just binary or is it a continuously valued function; is it continuously valued on a finite space or an infinite space, and so on.

REFERENCES

[1] E. T. Jaynes (1957): "Information Theory and Statistical Mechanics", Physical Review, vol. 106, pp. 620-630.
[2] W. B. Levy (1989): "A Computational Approach to the Hippocampal Function", in Computational Models of Learning in Simple Neural Systems, R. D. Hawkins and G. H. Bower, Editors, Orlando, Florida: Academic Press.
[3] W. B. Levy (1990): "Maximum Entropy Prediction in Neural Networks", Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. I.7-I.10.
[4] W. B. Levy and H. Delic (1992): "Minimum Relative Entropy Aggregation of Individual Opinions", IEEE Transactions on Systems, Man, and Cybernetics; submitted.
[5] W. B. Levy, C. M. Colbert and N. L. Desmond (1990): "Elemental Adaptive Processes of Neurons and Synapses: A Statistical/Computational Perspective", in Neuroscience and Connectionist Theory, M. Gluck and D. Rumelhart, Editors, Hillsdale, New Jersey: Lawrence Erlbaum Associates.
[6] J. E. Shore and R. W. Johnson (1980): "Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy", IEEE Transactions on Information Theory, vol. 26, pp. 26-37.
ACKNOWLEDGMENTS

This research was supported by NIH grants NS15488, MH00622 and MH48161. The authors would also like to thank Dr. John Jane, Chairman, Department of Neurosurgery, University of Virginia, for his support.