Associative Arithmetic with Boltzmann Machines: the Role of Number Representations
Ivilin Stoianov1, Marco Zorzi1, Suzanna Becker2, and Carlo Umilta1
1 University of Padova (Italy) and 2 McMaster University (Canada)
Abstract This paper presents a study on associative mental arithmetic with mean-field Boltzmann Machines. We examined the role of number representations, showing theoretically and experimentally that cardinal number representations (e.g., numerosity) are superior to symbolic and ordinal representations w.r.t. learnability and cognitive plausibility. Only the network trained on numerosities exhibited the problem-size effect, the core phenomenon in human behavioral studies. These results urge a reevaluation of current cognitive models of mental arithmetic.
1 Simple Mental Arithmetic
Research on mental number processing has revealed a specific substrate for number representations in the inferior parietal cortex of the human brain, where numbers are thought to be encoded in a number-line (NL) format (Fig. 2a,b) [1]. Simple mental arithmetic, however, is thought to be based on an associative network storing arithmetic facts in a verbal (symbolic) format [2]. Psychometric studies show that in the production or verification of single-digit arithmetic problems (e.g., addition or multiplication), reaction times (RTs) and errors increase as a function of the size of the problem (the problem-size effect) [3]. Problem size can be indexed by some function of the two operands, such as their sum or the square of their sum; the latter is the best predictor of the reaction time data for simple addition problems [4].

The first connectionist attempt to model simple multiplication was based on the autoassociative Brain-State-in-a-Box (BSB) network [5]. Numbers were represented jointly as NL and symbolic codes. Learning performance was far from optimal, in spite of the fact that the problem was simplified to computing "approximate" multiplication (to reduce the computational load). A later model of simple multiplication was MATHNET [6]. This model used NL representations and was implemented with a Boltzmann Machine (BM) [7]. The network was exposed to the arithmetic problems according to a schedule that roughly followed the experience of children when learning arithmetic, i.e., facts with small operands came before larger facts. However, fact frequency was manipulated in a way that did not reflect the real distribution: small facts were presented up to seven times as often as large problems. Thus, MATHNET exhibited a (weak) problem-size effect, but it was entirely produced by the specific training schedule and by the implausible frequency manipulation.

In the present study, we used mean-field BMs trained with the contrastive divergence learning algorithm [8] to model the simple addition task and to contrast three different hypotheses about the representation of numbers. We found that numerosity-based representations facilitate learning and provide the best match to human reaction times. We conclude that the traditional view of symbolic mental arithmetic should be reevaluated and that number representations for arithmetic should incorporate the basic property of cardinal meaning.
[Figure 1 layout: a hidden layer above a visible layer; the visible layer is divided into three fields encoding the 1st operand, the 2nd operand, and the result, shown with example numerosity patterns 1110000000000000, 1111110000000000, and 1111111110000000.]
Figure 1. BMs for Mental Arithmetic. Patterns are encoded at the visible layer. To recall a fact, its two arguments are fixed at the visible layer and the network iterates until convergence.
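As a concrete illustration of the encoding sketched in Figure 1, the Python snippet below builds a visible-layer pattern for an addition fact by concatenating three number fields (first operand, second operand, result), here using numerosity codes as in the figure. The helper names and the field widths (16 units per operand and 18 for the result, enough for the largest sum) are our own choices for illustration, not taken from the paper.

```python
import numpy as np

def numerosity(n, width=16):
    """Numerosity (thermometer) code: the first n units are on, the rest are off."""
    code = np.zeros(width)
    code[:n] = 1.0
    return code

def addition_fact(n1, n2, op_width=16, res_width=18):
    """Visible-layer pattern for n1 + n2: [1st operand | 2nd operand | result]."""
    return np.concatenate([numerosity(n1, op_width),
                           numerosity(n2, op_width),
                           numerosity(n1 + n2, res_width)])

# e.g. the fact 3 + 6 = 9, as illustrated in Figure 1
v = addition_fact(3, 6)
```

To recall a fact, only the first two fields would be clamped at the visible layer, leaving the result field free to settle (see the recall sketch in the next section).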
2 Mean-Field BMs with Contrastive Divergence Learning
In line with previous connectionist attempts to model mental arithmetic, we assume that arithmetic facts are learned and stored in an associative NN. One typical associative network is the Boltzmann Machine, consisting of binary neurons with stochastic dynamics, fully connected with symmetric weights that store correlations of activations between connected units [7]. Data are encoded by visible neurons, whereas hidden neurons capture high-order statistics (Fig. 1). BMs recall stored patterns by synchronous or asynchronous iterations, starting from initial activations representing partial patterns.

Learning derives from the probabilistic law governing the free-running state of the network and targets a weight set W* for which the free-running distribution P_BM^∞(s) matches the data distribution P_BM^0(s) = Q(s). The learning procedure minimizes the Kullback-Leibler divergence between the distributions at time zero and at equilibrium by computing derivatives w.r.t. the weights and applying gradient descent:

Δw_ij = η ( ⟨s_i s_j⟩^0 − ⟨s_i s_j⟩^∞ )    (1)

The update of a weight connecting two units is thus proportional to the difference between the average correlation of these two units computed at time 0 (positive, or fixed phase) and after reconstructing the pattern (negative, or free-running phase). Since the stochastic BM is computationally intractable, [9] replaced the correlations with a mean-field approximation, ⟨s_i s_j⟩ ≈ m_i m_j, where m_i is the mean-field activity of neuron i, given by the solution of a set of n coupled mean-field equations (2). This approximation turns the stochastic BM into a deterministic NN, since we can operate entirely with mean-field values, which also allows graded values.

m_i = σ ( Σ_j w_ij m_j + θ_i )    (2)

Hinton [10] replaced the correlations computed in the free-running phase with correlations computed after a one-step reconstruction of the data (contrastive divergence learning), which was also shown to drive the weights toward a state in which the data are reproduced according to their distribution. This was followed by the fast contrastive divergence mean-field learning rule (3) [8], which we use in our simulations.

Δw_ij = η ( m_i^0 m_j^0 − m_i^1 m_j^1 )    (3)

To learn a set of patterns, the learning algorithm presents them to the network in batches. For each pattern, the network performs a positive (wake) phase, in which only the hidden layer settles, and a negative (sleep) phase, in which the network first reconstructs the visible pattern and then once again settles the hidden layer. After each phase, statistics on the correlations between the activations of every pair of connected neurons are collected. The weights can be updated either after each pattern or at the end of a batch; here we used batch learning. The network recalls patterns by initializing the visible layer with part of a pattern and iterating, alternately updating the hidden and the visible layer, until convergence; the number of steps to converge corresponds to the RT. We used the unsupervised learning mode, which in the sleep phase allows the network to reconstruct the entire input pattern. To improve the stability of learning, the weights were decayed toward zero after each learning iteration with a decay of 0.001. Every 20 epochs the network performance was tested.
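To make the training and recall procedure concrete, here is a minimal Python/NumPy sketch of mean-field settling (Eq. 2), one batch of contrastive-divergence mean-field learning (Eq. 3), and recall with the step count taken as a model RT. It assumes a single fully connected symmetric weight matrix whose first n_vis units are visible, synchronous mean-field updates, and placeholder values for the learning rate, the number of settling steps, and the convergence tolerance; it illustrates the algorithm as described above, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def settle(m, W, theta, clamped, n_steps=20):
    """Iterate the coupled mean-field equations (Eq. 2),
    m_i = sigma(sum_j w_ij m_j + theta_i), updating only the unclamped units."""
    free = ~clamped
    for _ in range(n_steps):
        m[free] = sigmoid(W[free] @ m + theta[free])
    return m

def cd1_batch(batch, W, theta, n_vis, lr=0.01, decay=0.001):
    """One batch of contrastive-divergence mean-field learning (Eq. 3)."""
    n = W.shape[0]
    vis = np.zeros(n, dtype=bool)
    vis[:n_vis] = True
    dW = np.zeros_like(W)
    dtheta = np.zeros_like(theta)
    for v in batch:
        # positive (wake) phase: clamp the visible pattern, settle the hidden layer
        m0 = np.full(n, 0.5)
        m0[:n_vis] = v
        m0 = settle(m0, W, theta, clamped=vis)
        # negative (sleep) phase: reconstruct the visible pattern in one step,
        # then settle the hidden layer once more
        m1 = m0.copy()
        m1[:n_vis] = sigmoid(W[:n_vis] @ m0 + theta[:n_vis])
        m1 = settle(m1, W, theta, clamped=vis)
        # accumulate the difference of mean-field correlations (Eq. 3)
        dW += np.outer(m0, m0) - np.outer(m1, m1)
        dtheta += m0 - m1
    dW *= lr / len(batch)
    np.fill_diagonal(dW, 0.0)            # no self-connections
    W += dW - decay * W                  # decay weights toward zero after each update
    theta += lr * dtheta / len(batch)
    return W, theta

def recall(operand_code, W, theta, n_vis, max_steps=200, tol=1e-4):
    """Recall a fact: clamp the operand fields of the visible layer and iterate
    until convergence; the number of steps is taken as a model RT."""
    n = W.shape[0]
    clamped = np.zeros(n, dtype=bool)
    clamped[:len(operand_code)] = True
    m = np.full(n, 0.5)
    m[:len(operand_code)] = operand_code
    for step in range(1, max_steps + 1):
        m_new = m.copy()
        m_new[~clamped] = sigmoid(W[~clamped] @ m + theta[~clamped])
        if np.max(np.abs(m_new - m)) < tol:
            return m_new[:n_vis], step
        m = m_new
    return m[:n_vis], max_steps
```

In training, each row of batch would be a complete fact pattern such as addition_fact(3, 6) from the earlier sketch; at recall, only the two operand fields are supplied, while the result field and the hidden layer are left free to settle.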
3 Properties of Cardinal, Ordinal and Symbolic Representations
This section studies cardinal and ordinal number codes, showing that from a statistical point of view cardinal representations inherently bias associative architectures toward longer processing times for larger numbers. The analysis is based on the fact that BMs learn to match the probability distribution of the data. We examine two factors that can affect network RTs.

The first is the empirical distribution P_i^C of the activations of an individual neuron i, for a number encoding C. This is a "static" factor causing a neuronal bias b_i toward the mean E(P_i^C) of that distribution. The time to activate a neuron i to a target state t_i, starting from a rest state r_i (zero), is proportional to (t_i − r_i) and inversely proportional to its bias b_i. Since activating the representation R^{C,n} of a number n requires the convergence of all bits R_i^{C,n} of this code, the time to activate R^{C,n} depends on the time to activate its "slowest" bit. In addition, for the result part of addition facts, the neuron activation distribution will also reflect the frequency of a given sum (e.g., for the table [1…9]+[1…9], the most frequent sum is 10).

The second factor is pattern overlap. Addition facts consist of the combination of two numbers (n1, n2), n1, n2 = 1…9, appended with their sum (n1+n2). There is a great deal of pattern overlap among these facts; hence, some interference (i.e., cross-talk) will arise at the time of pattern recall. To account for this, we measure the mean pattern overlap D_{n1,n2} of every addition fact with a normalized vector dot-product and then project it onto the sum (n1+n2), obtaining D(n1+n2). The network RT for a given sum is expected to be inversely proportional to the overlap D(n1+n2).

In cardinal number codes N (e.g., numerosity [11], Fig. 2a), the representation of a number n embodies the patterns of all smaller numbers m < n.
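The overlap measure can be illustrated with a short sketch that reuses the numerosity and addition_fact helpers from the Section 1 example. The exact normalization and the way D_{n1,n2} is projected onto the sum are our reading of the description above (mean normalized dot-product of each fact with all other facts, then averaged over facts sharing the same sum), so this should be taken as an approximation of the analysis rather than the authors' exact procedure.

```python
import numpy as np
from collections import defaultdict

# assumes numerosity() and addition_fact() from the Section 1 sketch are in scope

def overlap(a, b):
    """Normalized vector dot-product between two fact patterns."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# all addition facts [1..9] + [1..9], encoded with numerosity codes
facts = {(n1, n2): addition_fact(n1, n2)
         for n1 in range(1, 10) for n2 in range(1, 10)}

# D_{n1,n2}: mean overlap of each fact with every other fact,
# then projected onto the sum n1 + n2
by_sum = defaultdict(list)
for (n1, n2), v in facts.items():
    others = [overlap(v, w) for key, w in facts.items() if key != (n1, n2)]
    by_sum[n1 + n2].append(np.mean(others))

D = {s: float(np.mean(vals)) for s, vals in sorted(by_sum.items())}
# the network RT for a given sum is expected to vary inversely with D[s]
```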