Computing with Uncertainty in Probabilistic Neural Networks on Silicon

Robin Woodburn, Alex Astaras, Ryan Dalzell, Alan F. Murray and Dean K. McNeill

October 30, 1999

Abstract

This paper describes two stochastic neural networks and how they might be instantiated as analogue VLSI circuits. We explain why stochastic neural networks have not been well represented in the hardware field in the past, and define a new approach.

1 Overview

1.1 A novel computational style

It is well known that biological neurons exhibit and use stochastic behaviour. Spontaneous activity, for example, is not accounted for by deterministic models, and noise is known to be essential to the stability of spiking neural systems [1]. Hence stochastic models, though difficult to design and manipulate, can provide insights into real-world processes and offer new computational opportunities.

Engineers with an interest in learning from biology see potential parallels between the two disciplines, such as: tolerance of imprecision; the performance of enormously complex computational tasks in real time and in a robust manner; the use of internal mechanisms for dealing with uncertainty, sparse data and noise; and adaptation over short and long time scales. Engineering solutions must work in the real world, where tractability is often more important than precision. This is certainly true of VLSI circuits, and we can expect, as devices become smaller and smaller, that noise problems will become a major concern. Circuits that allow for stochasticity, or that even exploit stochastic processes, are therefore an exciting alternative to traditional techniques.

1.2 Applications

We are developing probabilistic VLSI architectures to perform robust data-fusion in systems comprising many, possibly diverse, sensors, with the aim of providing both reliable unsupervised feature extraction and inherent, autonomous compensation of sensor drift. This paper explains our approach to this important task, which has to date concentrated primarily on the components necessary for the implementation of the Helmholtz machine [5], and investigates the possibility of enhancing the architecture for other applications.

2 Opportunities and obstacles

There are distinct opportunities in this approach, and there is a comprehensive body of research on stochastic processes on which to depend. Furthermore, the benefits would be threefold:

- instantiating a probabilistic model in hardware offers true parallelism;
- insights will be gained into natural probabilistic processes, most notably in biology, which behave in ways that are not entirely predictable; and
- computing with uncertainty offers new approaches to dealing with events in submicron designs that inevitably produce 'noisier' and less predictable results.

While the general corpus of research in probabilistic modelling is strong, and considerable work has been carried out on computation using stochastic neural networks, few hardware implementations exist. The two prime reasons for this are that:

Robin Woodburn, [email protected], is a Research Fellow in the University of Edinburgh's Electrical Engineering Department.


- stochastic networks have never demonstrated the power of the more familiar, deterministic networks; and
- providing truly stochastic noise sources with which to drive the neurons is difficult, as the noise sources must be replicated at every neuron and must be uncorrelated.

We consider that we have solutions to each of these objections that are, at this stage, acceptable enough to make our investigations well warranted, and that might ultimately demonstrate a new approach to computing on silicon. Furthermore, probabilistic models offer a new benefit to the analogue-VLSI design community, in that they may well represent a class of neural computational paradigms that demands analogue hardware. It is now well known that, while deterministic, supervised networks such as the MLP and RBF can be implemented as analogue hardware, it is not clear that real benefits accrue from doing so [2, 3]. The computational demands of probabilistic hardware point to analogue VLSI, while the potential 'front-end' applications in integrated sensor fusion show a need for a compact, low-cost, analogue solution.

3 Useful stochastic networks

3.1 The Helmholtz machine

3.1.1 Description of the machine

The search for a practical approach to representing uncertainty in neural networks has led from Bayesian inference to progressive simplifications, notably stochastic sampling methods such as Markov-chain Monte Carlo sampling [4] and the Boltzmann machine. However, these methods require very long settling times in the sampling, which cannot be compensated for by a hardware implementation because of the complexity of the algorithms. The Helmholtz machine uses a variant of the Expectation-Maximisation algorithm, along with simple binary stochastic units and a learning rule that is simple and purely local. Hence the Helmholtz machine is very amenable to implementation in hardware.

The machine [5, 6] is an unsupervised, stochastic neural network which attempts to build a probabilistic, hierarchical model of data generated by a hierarchically structured set of physical mechanisms. By a probabilistic model, we mean that the hidden units of the network choose states according to a probability distribution rather than deterministically (as in an MLP). The Helmholtz machine comprises two complementary networks: a bottom-up recognition network and a top-down generative network (see Figure 1). Training minimises the Kullback-Leibler divergence, or cross-entropy, between data drawn from the generative model and data drawn from measurements (training and test data). This process also minimises an information-theoretic measure of performance, the Helmholtz Free Energy (HFE) [7], which can be used as a measure of training error.

The Helmholtz machine is trained in a two-stage process via the wake-sleep algorithm, whereby the two complementary untrained networks each provide the target vectors necessary for the other to learn. Over time, the recognition network learns to extract latent variables from the data using the hidden units, and the generative network learns to reproduce the data distribution from these variables. A sensor-fusion application would apply a simple classifier to the output of the recognition network to extract interesting information from the system. Furthermore, post-training examination of the conditional probabilities in the generative model may provide insight into how the data-fusion task is being performed and expose any sensor drift as it occurs over time. Wake-sleep offers a rather crude approximation to minimisation of the Kullback-Leibler divergence: while it does not fully satisfy statisticians, it has been shown to work in many circumstances, which is the more important criterion in the context of our work.

3.1.2 The application of the Helmholtz machine to sensor-tracking

To test the hypothesis that the Helmholtz machine can be used to track non-stationarities in an environment without serious alteration to its original internal model, some simulations were performed. The data set used was a mixture model with four mixtures, each of which specified a probability distribution over nine independent binary variables. The details of the distributions used are the same as in [4]. A two-layer Helmholtz machine, with nine input units and six hidden units, was trained on the data. The performance measure used was the Helmholtz Free Energy, the cost function minimised by the Helmholtz machine. Wake-sleep learning was performed until the HFE stabilised, at which point one of the input units' distributions was allowed to drift slowly. Two experiments were performed, one with learning during drift and another without. The results are presented in Figure 2; the drift occurred between epochs 120,000 and 150,000. Figure 2(a) plots the HFE during the simulation run

[Figure 1 diagram: (a) a recognition network (bottom-up weights q_ji) and a generative network (top-down weights f_ji), with the input data and wake-phase target on the recognition side, and the fantasy vector and sleep-phase target on the generative side; (b) discrete units i and j with multi-valued states and weights f_icjd and q_jdic.]

Figure 1: (a) A binary-valued Helmholtz machine. (b) A two-layered, discrete Helmholtz machine, where the input-layer units have four states and the hidden-layer units have two.

when training was stopped once the HFE had stabilised on the static data (epoch 100,000), clearly illustrating the detrimental effect of a single drifting input on the performance of the network. Figure 2(b) is identical except that training continues throughout the simulation. The onset of drift is clear, but the wake-sleep algorithm quickly begins to account for it, adapting the internal model to match the data, keeping the HFE low and eventually returning it to its original value.
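For readers who wish to reproduce the flavour of this experiment, the sketch below generates data from a four-component mixture over nine binary variables and applies a slow drift to one input's ON-probability. The mixture parameters here are illustrative placeholders (the actual distributions follow [4]), and the drift schedule and `max_drift` value are assumptions of ours, not figures from the experiment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in for the four-mixture data set over nine independent
# binary variables; the actual distributions (taken from [4]) are not
# reproduced here.
mix_weights = np.array([0.25, 0.25, 0.25, 0.25])
bernoulli_p = rng.random((4, 9))   # P(variable = 1) for each mixture component

def sample_pattern(drift=0.0):
    """Draw one nine-bit training vector; `drift` shifts input unit 0's
    ON-probability, emulating slow sensor drift."""
    k = rng.choice(4, p=mix_weights)
    p = bernoulli_p[k].copy()
    p[0] = np.clip(p[0] + drift, 0.0, 1.0)
    return (rng.random(9) < p).astype(int)

def drift_at(epoch, max_drift=0.3):
    # Drift schedule mirroring Figure 2: static until epoch 120,000, then a
    # slow linear drift until 150,000, constant thereafter (max_drift assumed).
    if epoch < 120_000:
        return 0.0
    return max_drift * min(1.0, (epoch - 120_000) / 30_000)

pat = sample_pattern(drift_at(135_000))
```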

Figure 2: Simulations without and with continuous wake-sleep learning. Both panels plot the Helmholtz Free Energy against the number of epochs (0 to 200,000), with the period of drift marked.

3.1.3 Deficiencies of the binary Helmholtz machine

Although the above results are encouraging, the data set is constructed to be suitable for a binary-valued network. The distributions modelled by the stochastic units within a layer are independent of one another, and the input data is inherently binary. Although it may be possible to have an array of binary-valued sensors, each acting effectively as a feature detector, indicating whether a value is above or below a particular threshold, or indicating the presence or absence of a particular substance, this is likely to be impractical. Hence some thought has been given to expanding the representational power of the Helmholtz machine and, in particular, to using continuous data, as commonly available from a sensor, rather than binary input data.

In addition, we are investigating the use of Gibbs sampling as an alternative to wake-sleep, and are addressing the initial difficulties with this approach.

3.2 The Product-of-Experts (PoE) machine

The Product-of-Experts architecture [8] describes a generic family of probabilistic generative models that aim to model real data as a conjunctive, as opposed to an additive, mixture of the individual models. In other words, the probability of a data element is the normalised product, rather than the sum, of a set of individual probabilities drawn from a set of models or 'experts'. This brings about two key advantages: (1) sharp distributions can result from products of broad distributions, whereas in a conventional additive mixture model sharp distributions can only be generated by including narrow models in the mixture; and (2) individual models can be allowed to generate improbable data outside their range of validity, provided that other models 'veto' the spurious data. This permits the individual expert models to develop less complex explanations of the underlying structure of the data, even if this occasionally results in incorrect responses in regions where other models have higher, i.e. dominant, probability.

In addition, a simple instance of the Product-of-Experts model that uses the same unit as the Helmholtz machine has been shown to train successfully using an algorithm drawing upon 'brief Gibbs sampling'. The PoE architecture can be viewed usefully, in our context, as a generalisation of the Helmholtz machine. It uses the same computational elements, and has a more reliable and principled training algorithm that is also more amenable to on-chip implementation. The proof-of-circuit work reported in this paper is therefore applicable directly to the PoE architecture, and we are building a small VLSI PoE machine using these techniques.
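For binary units the 'normalised product' is easy to state concretely. The sketch below, with illustrative numbers of our own choosing, shows both advantages: two broad Bernoulli experts combine into a sharper distribution, and one sceptical expert can veto an otherwise confident one.

```python
import numpy as np

def product_of_experts(p_list):
    """Normalised product of Bernoulli experts' ON-probabilities: the
    probability of ON is the product of the experts' ON terms, divided by
    the sum of the products over both states (the normalisation)."""
    on = np.prod(p_list)
    off = np.prod([1.0 - p for p in p_list])
    return on / (on + off)

# (1) Two broad experts combine into a sharper distribution:
sharp = product_of_experts([0.8, 0.8])     # 0.64 / 0.68, greater than 0.8
# (2) A single sceptical expert can veto an otherwise confident one:
veto = product_of_experts([0.9, 0.05])     # well below 0.5
```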

4 VLSI circuit techniques for probabilistic computation

This section considers how the binary Helmholtz machine might be implemented in hardware using pulse-based neural techniques, where a neural state is represented by a stream of pulses of variable frequency, width and/or amplitude.

4.1 Choice of approach to stochasticity

Some form of stochastic element, with useful and controllable probabilistic behaviour, must be introduced. The probability that the neuron turns ON must be determined by the squashed sum of its bias and weighted inputs. This probability determines the neuron's state, and is thus used during training to calculate the changes to all the weighted connections to that neuron.

The method used to introduce a stochastic element to the signal affects the design choices involved in storing the probability as an analogue quantity. One proposed method is the introduction of noise to create a predictable margin of uncertainty around a sampled value of the modified signal. In the case of current-mode circuits this is a straightforward process, but the noise sources supplying each neuron must be: uncorrelated (otherwise the overall propagation of the signal through the network becomes biased); substantial (to imbue the system with non-deterministic behaviour); and zero-mean (the noise has to sum to zero over a period of time much longer than its characteristic timescale).

The implementation of multiple, uncorrelated noise sources on a chip is not a trivial problem. Possible solutions are an amplified analogue noise source [9], asynchronous sampling of an analogue signal [10] and, more commonly, pseudo-random bit-stream generators [9, 11, 12]. The problem in the analogue domain is that the 'normal' noise sources in a MOSFET (flicker, thermal and shot noise) are of low amplitude in current processes. When amplified, the noise becomes bandwidth-limited and thus correlated. This situation will change in deep-sub-micron circuits, where the level of noise is likely to be much higher. While this is a source of concern for conventional computational paradigms, it may prove advantageous to probabilistic techniques. The work reported in this paper is therefore also a step towards a new computational style for noisy, deep-sub-micron VLSI devices, where the coding and computational style works with the noise rather than against it.

4.2 Towards a stochastic binary neuron

For now, we propose a radically different approach to extracting the probability from the total neuron-activation signal. By using the activation signal as an input to a current-controlled oscillator, we can translate it to a pulse-width-modulated form which, if randomly sampled, will give a logical HIGH with a probability that is linearly related to the initial input. This serves both to extract the state of the neuron and to squash the input function within upper and lower limits that are controlled by the designer.

While additional circuitry is still needed to derive the random sampling time, a single random source is enough, as opposed to individual random sources local to each neuron. This reduces network size and allows a larger network implementation.

Figure 3(a) depicts such an oscillator, designed to perform the current-to-pulse-width transformation. Transistors M6, M7, MP5, MP6 and MP7 form a hysteretic comparator, or Schmitt trigger. A rising voltage on capacitor C1 eventually causes the Schmitt trigger to change state, turning on transistor M5 and causing the capacitor to start discharging. The switching threshold depends entirely on the design of the Schmitt trigger and can be altered to a certain extent using the reference voltage supplied to the gate of M7. The oscillator produces a fixed, linear relationship between the magnitude of the input current and the fraction of the period that the output pulse spends at a logical HIGH, as shown in Figure 3(c).

Figure 3: (a) Schematic diagram of the full pulse-width probabilistic oscillator circuit. (b) HSPICE plot showing ideal output characteristics of the oscillator (mark-to-period ratio against input current, in uA). (c) A similar plot showing results from a test chip (mark-to-period ratio against the ratio of Iin to Iref). The reasons for the poor dynamic range, compared to the HSPICE simulations, are being investigated.

Stochastic, uncorrelated behaviour is achieved if (and only if):
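The claim that random sampling of a pulse-width-modulated output yields a logical HIGH with probability equal to the mark-to-period ratio can be checked with an idealised numerical sketch (not a circuit simulation); the sampling phase is assumed uniform over one period, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_pwm(mark_to_period, n_samples=100_000):
    """Sample an idealised pulse-width-modulated waveform at uniformly random
    instants. A sample reads HIGH whenever its phase falls inside the mark,
    so P(HIGH) equals the mark-to-period ratio."""
    phase = rng.random(n_samples)   # sampling phase, uniform over one period
    return (phase < mark_to_period).astype(int)

highs = sample_pwm(0.3)
# The empirical frequency of HIGH converges on the mark-to-period ratio (0.3).
```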

- sampling times are independent of oscillator signals; and
- oscillators are independent of one another.

'Locking' of the oscillators on the chip is extremely undesirable, as it would translate into biased sampling from the conditional probabilities. We have seen no evidence of such locking in previous pulse-stream chips, but we have taken particular measures to reduce the oscillators' propensity to lock. The locations of the two current mirrors on the chip have been chosen to minimise the effect of the current spikes induced when the Schmitt trigger changes state. The digital and analogue power supplies are separate and on opposite sides of the chip. To minimise the risk of noise affecting other components through the substrate, and so eventually synchronising the oscillators, guard rings surround the circuits, with an extra guard ring around each oscillator comprising an n-well moat covered with well-taps tied to the analogue power rail. Finally, the unit's activation modulates the pulse period as well as the mark-to-period ratio, minimising the probability of coincident edges from different oscillators, and thus the probability of locking.

4.3 Non-linearity in the neuron

Both the Helmholtz and PoE machines require the neuron outputs to be a stochastic, non-linear function of the activations. If we drive the oscillator with a non-linear current source, then we can make the oscillator output a non-linear function of its input current. To illustrate the point, Figure 4 shows the mark-to-period ratio of the oscillator when driven by ideal, voltage-controlled current sources.

4.4 Stochasticity using sampling

Provided each oscillator, in a set of neuron oscillators, operates independently of the others, a simple sampling procedure provides a truly stochastic, binary measure of the oscillators' outputs. Figure 5 illustrates this point. The choice of sampling time must also, of course, be uncorrelated with any of the oscillators. We have thereby transformed an ordinary, linear, deterministic oscillator into a non-linear, stochastic oscillator.
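An idealised software sketch of this sampling system, under the stated independence assumptions: one shared sampling instant reads a set of free-running oscillators whose mark-to-period ratios are set by a sigmoidal current source (as in Figure 4(a)). The function names are ours, and independent random phases stand in for independent oscillators.

```python
import numpy as np

rng = np.random.default_rng(3)

def mark_to_period(activation):
    # A sigmoidal current source sets each oscillator's mark-to-period
    # ratio, giving the non-linearity of Figure 4(a).
    return 1.0 / (1.0 + np.exp(-np.asarray(activation, dtype=float)))

def stochastic_binary_neurons(activations):
    """Read a set of independent free-running oscillators at one shared,
    random sampling instant; each output bit is HIGH with probability equal
    to its oscillator's mark-to-period ratio."""
    ratios = mark_to_period(activations)
    # Independent random phases model mutually independent oscillators.
    phases = rng.random(ratios.shape)
    return (phases < ratios).astype(int)

bits = stochastic_binary_neurons([-2.0, 0.0, 2.0])   # three neuron activations
```

A single shared sampling instant suffices precisely because the oscillators are uncorrelated, which is the hardware saving noted in Section 4.2.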


Figure 4: Illustration of how the oscillator's mark-to-period ratio can be varied by stepping an ideal, voltage-controlled current source through 11 voltage steps to achieve (a) a sigmoidal non-linearity and (b) a Gaussian non-linearity.


Figure 5: The sampling system that converts a current, representing a value between zero and one, into a binary output with probability given by that value. Provided the oscillators operate in an uncorrelated way, the binary outputs are truly stochastic, and can be made non-linear depending on how the current is controlled.

5 Conclusions and future work

Obstacles to practical applications of stochastic neurons remain. The potential of such systems is, however, high in terms of modelling and recognition applications, and as a new approach to performing useful computation with noisy elements. The binary Helmholtz machine will train successfully with 'binary-friendly' data, for example data that describe the probabilities that a set of sensors is OFF or ON. Such a machine, once trained, will track slow drift in the sensors, that is, OFF-ON probabilities that change over time. The discrete Helmholtz machine is intended to overcome the limitations of binary data. However, in learning, as the network attempts to 'explain' the 'causes' of the data, it in fact arrives at many explanations, with no indication that any one of them is to be preferred. Comparatively speaking, the binary version performs well in restricted circumstances, relying on our ability to frame the problem in a particular way; the discrete version opens up a means of learning solutions to a potentially exciting range of problems, but we have as yet no reliable technique for training such a machine. Our aim is further investigation of the characteristics of both the binary and the discrete Helmholtz machines, using Gibbs sampling and other alternatives to wake-sleep.

For hardware, the Helmholtz machine is an attractive candidate for long-term monitoring and automatic-calibration applications. We have demonstrated a preliminary version of true stochastic behaviour in analogue VLSI. Using elements developed for MLP and RBF circuits [13, 14], we can implement synapse and weight-changing circuitry, and build integrated Helmholtz machine and PoE devices with on-chip training driven by the hardware-friendly wake-sleep and brief Gibbs sampling algorithms respectively. We will report this next stage of the work in a future paper.

Acknowledgements

The authors wish gratefully to acknowledge the following: support from the EPSRC, through grant no. GR/M40288; support for Alex Astaras from Applied Materials plc; and sponsorship of D. K. McNeill through the Natural Sciences and Engineering Research Council of Canada.

Note for reviewers

Additional results from actual silicon will be available for the conference, and additional simulation results will also be presented.

References

[1] W. Gerstner, "Rapid signal transmission by populations of spiking neurons," in Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN99), Edinburgh, Scotland, pp. 7-12, IEE, 1999.
[2] T. Lehmann and R. Woodburn, "Biologically-inspired on-chip learning in pulsed silicon neural networks," Analog Integrated Circuits and Signal Processing, vol. 18, no. 2-3, pp. 117-131, 1998.
[3] A. F. Murray and R. Woodburn, "The prospects for analogue neural VLSI," International Journal of Neural Systems, vol. 8, no. 5-6, pp. 559-579, 1998.
[4] R. M. Neal, "Connectionist learning of belief networks," Artificial Intelligence, vol. 56, pp. 71-113, 1992.
[5] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, "The Helmholtz machine," Neural Computation, vol. 7, no. 5, pp. 889-904, September 1995.
[6] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, "The 'wake-sleep' algorithm for unsupervised neural networks," Science, vol. 268, no. 5214, pp. 1158-1161, 1995.
[7] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, "The Helmholtz machine," Neural Computation, vol. 7, no. 5, pp. 889-904, 1995.
[8] G. E. Hinton, "Products of experts," in Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN99), Edinburgh, Scotland, pp. 1-6, IEE, 1999.
[9] J. Alspector, J. W. Gannett, S. Haber, M. B. Parker, and R. Chu, "A VLSI-efficient technique for generating multiple uncorrelated noise sources and its application to stochastic neural networks," IEEE Transactions on Circuits and Systems, vol. 38, pp. 109-123, Jan. 1991.
[10] T. G. Clarkson, C. K. Ng, and J. Bean, "Review of hardware pRAMs," in Proceedings of the Weightless Neural Network Workshop '93, University of York, UK, pp. 18-23, 1993.
[11] Y. Kondo and Y. Sawada, "Functional abilities of a stochastic logic neural network," IEEE Transactions on Neural Networks, vol. 3, no. 3, pp. 434-443, 1992.
[12] M. van Daalen, P. Jeavons, J. Shawe-Taylor, and D. Cohen, "Device for generating binary sequences for stochastic computing," Electronics Letters, vol. 29, no. 1, pp. 80-81, 1993.
[13] A. Hamilton, A. F. Murray, D. J. Baxter, S. Churcher, H. M. Reekie, and L. Tarassenko, "Integrated pulse stream neural networks: results, issues and pointers," IEEE Transactions on Neural Networks, vol. 3, no. 3, pp. 385-393, 1992.
[14] D. Mayes, A. F. Murray, and H. M. Reekie, "Non-Gaussian kernel circuits in analogue VLSI," IEE Proceedings: Circuits, Devices and Systems, vol. 146, no. 4, pp. 169-175, 1999.
