Probabilistic Learning Models

Peter M Williams
School of Cognitive and Computing Sciences
University of Sussex, Falmer, Brighton BN1 9QH, UK
email: [email protected]

Published in: Foundations of Bayesianism, David Corfield and Jon Williamson (eds), Kluwer Academic Publishers, 2001, pp. 117–134.

Abstract. The paper reviews some uses of Bayesian methods in supervised and unsupervised learning in artificial intelligence. A key issue is the need to find a balance between model complexity and information content of the data. Examples of the Bayesian approach to this problem are discussed.

1 Introduction

The purpose of this review is to provide a brief outline of some uses of Bayesian methods in artificial intelligence, specifically in the area of neural computation. Prior to the 1980s, much of knowledge representation in Artificial Intelligence was concerned with rule-based or expert systems. The aim was to write rules and facts, expressing human knowledge in some domain, in a quasi-logical language. Although it achieved some successes, this approach came to be seen by the 1980s as cumbersome in adapting to changing circumstances, partly owing to the explosion in the list of rules and exceptions needed to cover novel cases.

In the mid 1980s a new paradigm emerged, referred to variously as parallel distributed processing, connectionism or neural computing, largely through the publication of Rumelhart and McClelland [39]. One of the motivating ideas was that living creatures are programmed by experience, rather than by spelling out every step in a process, and that representations of human knowledge and learning need not be restricted to areas in which rule-based algorithms can be found. An important tool in implementing this programme was the neural network which, in a very rudimentary way, might be seen as sharing some of the characteristics of the brain.

Although the field has its origins in the biological and cognitive sciences, significant contributions to neural computation were soon made by physicists and statisticians. Furthermore, applications were made to practical problems outside the field of the biosciences. The list includes prediction (weather and utility demand forecasting, medical diagnosis, derivative and term-structure models in finance), navigation and control (aircraft landing, plant monitoring, autonomous robots), pattern recognition (speech and signal processing, hand-written character recognition, finger-print identification, mineral exploration) and so on. Possibly the reason for this widespread interest is that neural networks can be used wherever a linear model is used, that they can model non-linear relationships, and that they require no detailed model of the underlying process.

During the ten years or so following the publication of [39] much attention was given to improving the algorithms used for fitting neural network models. At the same time, it was appreciated that there were close links with established statistical machine learning and pattern recognition methods. As the subject has developed, it has grown closer to information theory, statistics, and image and signal processing, as regards both its engineering and neuroscience concerns. Inevitably its maturing expression in statistical form has led to an application of the Bayesian approach.

Neural computing methods have been applied to both supervised and unsupervised learning. The practical applications mentioned above have been largely in the area of supervised learning, which includes classification, regression and interpolation. In the past five years, however, much work at the forefront of neural computation has been in the area of unsupervised learning, which traditionally includes clustering and density estimation. Both of these areas will be touched on briefly.

2 Neural networks

The simplest form of neural network provides a way of modelling the relationship between an independent variable x and a dependent variable y. For example, x could be financial data up to a certain time and y could be a future stock index, exchange rate, option price etc. Or x could represent geophysical features of a mining prospect and y could represent mineralization at a certain depth. In general x and y can be any vectors of continuous or discrete numerical quantities.

Such a network implements an input-output mapping y = f(x, w) from x = (x_1, …, x_m) to y = (y_1, …, y_n). The mapping depends on the values of the connection strengths or weights w in the network. The connections between processing elements can be arbitrary but a layered (non-cyclic) architecture, as shown on the left of Figure 1, is commonest. Each non-input unit in the network has input weights w_1, …, w_n and a bias w_0 as shown on the right in Figure 1. For hidden units, namely those that are neither input nor output units,

y = \tanh(w_0 + w_1 x_1 + \cdots + w_n x_n)

where the transfer function, in this case the hyperbolic tangent, squashes the output into the interval (−1, 1) as shown in Figure 2. For output units, we assume a direct linear relationship:

y = w_0 + w_1 x_1 + \cdots + w_n x_n


Figure 1: The diagram on the left shows a layered feed-forward network with inputs x1 , . . . , xm and outputs y1 , . . . , yn separated by a single layer of hidden units. In general there can be any number of hidden layers. The diagram on the right shows an individual processing unit, with input weights w1 , . . . , wn and bias w0 .


Figure 2: Diagram showing the hyperbolic tangent squashing function. Other functions of similar form may be used. Typically such functions are monotonic, approximately linear in the central range and saturate to asymptotic values at the extremes.


A network of the type shown in Figure 1 can have several internal layers of hidden units connecting input units and output units. A feedforward neural network is therefore a composition of linear and non-linear transformations, like a multi-layer sandwich

linear . . . squash . . . linear . . . squash . . . linear

in this case having two hidden layers. Without non-linear squashing, the sandwich would collapse, by composition, to a single linear transformation. With the interposed non-linearities a neural network becomes a universal approximator, capable of modelling an arbitrary continuous function [21, 4, 37].
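To make the composition concrete, the following sketch (in Python with numpy, not part of the original paper) computes f(x, w) for a network with one tanh hidden layer and linear output units; the layer sizes and random weights are purely illustrative.

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        # Hidden units: tanh squashing of a linear combination of the inputs.
        hidden = np.tanh(b1 + W1 @ x)
        # Output units: a direct linear combination of the hidden activities.
        return b2 + W2 @ hidden

    # Example: 3 inputs, 4 hidden units, 2 outputs, random weights.
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
    y = forward(np.array([0.1, -0.5, 2.0]), W1, b1, W2, b2)

Omitting the tanh would collapse the two layers into a single linear map, which is the point made above.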

2.1 Model fitting

Suppose we have a training set of pairs of observations (x_1, y_1), …, (x_N, y_N) where each x_i is a vector of inputs and each y_i is now a single scalar target output (i = 1, …, N). Then least squares fitting consists of choosing weights w to minimise the data misfit

E(w) = \frac{1}{2} \sum_{i=1}^{N} \bigl( y_i - f(x_i, w) \bigr)^2    (1)

where f(x, w) is the function computed by the network for a given set of weights w. Since the gradient ∇E(w) is easily computed using the so-called backpropagation algorithm, standard optimisation techniques such as conjugate gradient or quasi-Newton methods [53, 7] can be applied to minimise (1).

To make the link with Bayesian methods, recall first that least squares fitting is equivalent to maximum likelihood estimation, assuming Gaussian noise. To see this, suppose the target variable has a conditional normal distribution

p(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \left( \frac{y - \mu(x)}{\sigma} \right)^2 \right\}

where μ(x) is the conditional mean, for a given input x, and where the variance σ² is assumed, for the present, to be constant. Then, assuming independence, the negative log likelihood of the data (x_1, y_1), …, (x_N, y_N) is proportional to

\frac{1}{2} \sum_{i=1}^{N} \bigl( y_i - \mu(x_i, w) \bigr)^2    (2)

up to an additive constant. Comparison of (1) and (2) shows that least squares fitting is equivalent to maximum likelihood estimation of the weights, if the network output f(x, w) is understood to compute the conditional mean μ(x). More generally, there is no need to assume that σ is constant for all inputs x. If we allow the network to have two outputs, we can allow them to compute the two input-dependent parameters μ(x) and σ(x) of the predictive distribution for y.


Figure 3: Schematic representation of a neural network with input x = (x1 , . . . , xm ) and weights w, whose output is interpreted as computing the mean and log variance of the target variable.

It is more convenient, however, in order to have an unconstrained parametrisation, to model log σ(x) rather than σ(x), so that the network can be visualised schematically as in Figure 3. The negative log likelihood of the data can now be written as

L(w) = \frac{1}{2} \sum_{i=1}^{N} \left\{ \log \sigma(x_i, w)^2 + \left( \frac{y_i - \mu(x_i, w)}{\sigma(x_i, w)} \right)^2 \right\}    (3)

and this can be considered as the generalised error function. Maximum likelihood fitting is obtained by minimising L(w), where ∇L(w) can also be computed by backpropagation.¹

¹ In the general multivariate case the conditional density p(y | x) for the target variable y = (y_1, …, y_n) is proportional to |Σ|^{-1/2} exp{−½ (y − μ)^T Σ^{-1} (y − μ)}, where μ is the vector of conditional means and Σ is the conditional covariance matrix. We can then model μ(x) and log Σ(x) as functions of x in ways that depend on the outputs of a neural network when x is given as input [57]. This permits modelling of the full conditional correlations in multivariate data. Applications to heteroskedastic (time-dependent) volatility in financial time series are given in [55].
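For concreteness, a minimal sketch (not from the paper) of the generalised error function (3), assuming the network supplies arrays mu and log_sigma of the conditional means μ(x_i, w) and log standard deviations log σ(x_i, w) at the training inputs:

    import numpy as np

    def generalised_error(mu, log_sigma, y):
        # Equation (3), up to an additive constant:
        # L(w) = 0.5 * sum_i [ log sigma_i^2 + ((y_i - mu_i)/sigma_i)^2 ]
        sigma2 = np.exp(2.0 * log_sigma)
        return 0.5 * np.sum(np.log(sigma2) + (y - mu) ** 2 / sigma2)

Minimising this with respect to the weights (through mu and log_sigma) gives the maximum likelihood fit described above.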

2.2 Need for regularisation

Neural networks, of the type considered here, differ from linear statistical models in being universal approximators, capable of fitting arbitrary continuous non-linear functions. This means (i) that there is greater potential for overfitting the data, leading to poor generalisation outside the training sample, and (ii) that the error surface can be of great complexity. Neural network modelling therefore calls for special techniques, in particular (i) for some form of stabilisation, or regularisation, and (ii) for some form of integration over multiple local minima. The second of these is discussed in Section 2.6. Possible solutions to the overfitting problem include

(a) limiting the complexity of the network;
(b) stopping training early before overfitting begins;


(c) adding extra terms to the cost function to penalise complex models.

The first of these, (a), amounts to a form of hard structural stabilisation. For example, the number of weights might be limited to a certain proportion of the number of training items, using various rules of thumb for the exact proportion. A deeper treatment of this approach, including analysis of error bounds, forms part of statistical learning theory [46].

The second approach, (b), observes that, at early stages of training, the network rapidly fits the broad features of the training set. For small initial values of the weights, the neural network is in fact close to being a linear model (see the central linear segment of Figure 2). As training progresses the network uses more resources to fit details of the training set. The aim is to stop training before the model begins to fit the noise. There are several ways of achieving this, including monitoring performance on a test set (but see [13]).

The third approach, (c), is a form of Tikhonov regularisation [41, 9]. In the case of neural networks, (c) often takes the form of weight decay [19]. This adds an extra term to the cost function

E(w) = L(w) + \lambda R(w)    (4)

where L(w) expresses the data misfit and R(w) is a regularising term that penalises large weights. The aim becomes minimisation of the overall objective function (4) where λ is a regularising parameter that determines a balance between the two terms, the first expressing misfit and the second complexity. There remains the problem of locating this balance by fixing an appropriate value for λ. This is often chosen by some form of cross-validation. But performance on a test set can be noisy. Different test sets may lead to different values of λ. We therefore examine various Bayesian solutions to this problem.
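As a rough sketch of this non-Bayesian baseline (illustrative code only, with fit and validation_error standing for hypothetical training and scoring routines), λ would be chosen by trying a grid of values and keeping the one with the smallest validation error:

    import numpy as np

    def objective(data_misfit, w, lam):
        # Equation (4) with the common weight-decay choice R(w) = ||w||^2 / 2.
        return data_misfit(w) + lam * 0.5 * np.sum(w ** 2)

    def choose_lambda(fit, validation_error, grid=(1e-3, 1e-2, 1e-1, 1.0)):
        # fit(lam) returns trained weights; validation_error(w) scores them.
        scores = {lam: validation_error(fit(lam)) for lam in grid}
        return min(scores, key=scores.get)

As the text notes, the selected value depends on the particular validation set, which motivates the Bayesian treatments that follow.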

2.3 Bayesian approach

Consider the case of Section 2.1 where we have training data D corresponding to observed pairs (x_1, y_1), …, (x_N, y_N). Suppose the aim is to choose the most probable network weights w given the data, in other words to maximise the posterior probability density p(w|D). Using Bayes' theorem we have

p(w|D) \propto p(D|w)\, p(w)    (5)

where p(D|w) is the likelihood of the data and p(w) is the prior over weights. Maximising (5) is the same as minimising its negative logarithm

E(w) = L(w) - \log p(w) + \text{constant}    (6)

where the negative log likelihood L(w) = −log p(D|w) is given by (3) in the case of regression with Gaussian noise.

Now suppose that w has a Laplace prior p(w|\lambda) \propto \prod_i \exp(-\lambda |w_i|) where λ > 0 is a scale parameter [54]. Then, ignoring terms not depending on w, (6) becomes

E_\lambda(w) = L(w) + \lambda \|w\|_1    (7)

where \|w\|_p = \bigl( \sum_i |w_i|^p \bigr)^{1/p} is the L_p norm (p ≥ 1) of the weight vector w. Comparison of (7) with (4) shows that the regularising term λR(w) corresponds to the negative logarithm of the prior. The same is true assuming a Gaussian prior for weights p(w|\lambda) \propto \prod_i \exp(-(\lambda/2) |w_i|^2), when (7) becomes

E_\lambda(w) = L(w) + (\lambda/2) \|w\|_2^2    (8)

The difficulty remains, however, that λ is still unknown.²

2.4 Eliminating λ

One approach is to eliminate λ by means of integration [12, 54]. Consider λ to be a hyperparameter with prior p(λ) so that

p(w) = \int p(w|\lambda)\, p(\lambda)\, d\lambda    (9)

which no longer depends on λ and can be substituted directly into (6). Since λ > 0 is a scale parameter, it is natural to use a non-informative Jeffreys' prior [23] for which p(λ) is proportional to 1/λ. It is then straightforward to integrate (9) and substitute into (6) to give the objective function

E(w) = L(w) + W \log \|w\|_p    (10)

where W is the total number of weights and p = 1 for the Laplace prior or p = 2 for the Gaussian prior. (10) involves no adjustable parameters so that regularisation of network training, conceived as an optimisation problem, is automatic.³ This approach to the elimination of λ is independent of the form taken by the likelihood term L(w), which will depend on the appropriate statistical model for the data. Discrete models, for example, correspond to classification. Non-Gaussian continuous models are discussed in [8, 56] for example.
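A small sketch (not the author's code) of the resulting parameter-free regulariser in (10), for either choice of prior:

    import numpy as np

    def log_norm_regulariser(w, p=1):
        # The W log ||w||_p term of equation (10);
        # p = 1 for the Laplace prior, p = 2 for the Gaussian prior.
        W = w.size
        norm = np.sum(np.abs(w) ** p) ** (1.0 / p)
        return W * np.log(norm)

The total objective E(w) = L(w) + log_norm_regulariser(w, p) then contains no adjustable regularisation parameter.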

2.5 The evidence approach to λ

An alternative approach to determining λ is to use its most probable value given the data. This is the value of λ which maximises p(λ|D). Since p(λ|D) is proportional to p(D|λ)p(λ), this is the same as choosing λ to maximise p(D|λ), assuming that p(λ) is relatively insensitive to λ. In this approach p(D|λ) is called the evidence for λ. p(D|λ) can be expressed as an integral over weight space

p(D|\lambda) = \int p(D|w, \lambda)\, p(w|\lambda)\, dw    (11)

² In practice it may be assumed that different classes of weights in the network have different typical, if unknown, scales. The classes are chosen to ensure invariance under suitable transformations of input and output variables. For simplicity we deal here with a single class.

³ Instead of using the improper Jeffreys prior, we could use a proper conjugate prior. This is the gamma distribution, for either the Laplace or Gaussian weight priors, with shape and scale parameters α, β > 0 say. The regularising term is then (W + α) log(‖w‖₁ + β) for the Laplace weight prior and (W/2 + α) log(‖w‖₂² + 2β) for the Gaussian prior. Both reduce to W log ‖w‖_p (p = 1, 2) as α, β approach zero.


and, under suitable simplifying assumptions, this integration can be performed analytically. Specifically, assume that the integrand p(D|w, λ), which is proportional to the posterior distribution of weights p(w|D, λ), can be approximated by a Gaussian in the neighbourhood of a maximum w = w_MP of the posterior density. In the case of a Gaussian weight prior, it can then be shown [29] that, at a maximum of p(D|λ), λ must satisfy

\lambda \|w_{\mathrm{MP}}\|_2^2 = \sum_i \frac{\nu_i}{\lambda + \nu_i}    (12)

where ν_i are the eigenvalues of the Hessian of L(w) = −log p(D|w, λ) evaluated at w = w_MP. (12) can be used as a re-estimation formula for λ, using λ_old on the right and λ_new on the left, in iterative optimisation of (8).⁴ In the case of a Laplace prior, the evidence approach is essentially equivalent to the integration approach of the previous section [54, Appendix]. MacKay [30] argues that, in the case of a Gaussian prior, the evidence approach provides better results in practice. [30] also discusses the range of validity of the various approximations involved in this approach.
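A sketch (illustrative only) of one iteration of the re-estimation formula (12), given the eigenvalues nu of the Hessian at the current w_MP:

    import numpy as np

    def reestimate_lambda(lam_old, w_mp, nu):
        # gamma = sum_i nu_i / (lambda + nu_i); (12) then gives
        # lambda_new = gamma / ||w_MP||_2^2.
        gamma = np.sum(nu / (lam_old + nu))
        return gamma / np.sum(w_mp ** 2)

In practice this update is interleaved with re-optimisation of (8) at the current value of λ.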

2.6 Integration methods

The preceding discussion has assumed that the aim is to minimise E(w) in (6). This is the same as finding a maximum, at w = w_MP say, of the posterior distribution p(w|D). Predictions for the target value y, given inputs x, could then be made using the distribution

p(y|x, D) = p(y|x, w_{\mathrm{MP}}).    (13)

This can be unsatisfactory, however, since it amounts to assuming that the posterior distribution for w can be adequately approximated by a delta function at w = w_MP. In practice, for a general neural network model, there may be several non-equivalent modes of the posterior distribution, i.e. local minima of E(w) as was noted in Section 2.2, and each may extend over a significant region. The correct procedure, from a Bayesian point of view, is to make predictions using an integral over weight space

p(y|x, D) = \int p(y|x, w)\, p(w|D)\, dw    (14)

where the predictive distribution p(y|x, w), corresponding to a particular value of w, is weighted by the posterior probability of w.⁵

In some cases (14) can be integrated analytically if suitable assumptions are made. For example, suppose the statistical model p(y|x, w) is Gaussian and that the posterior distribution p(w|D) can be adequately approximated by a Gaussian centered at w = w_MP. Then the predictive distribution is again Gaussian, with a variance in which the intrinsic process noise is augmented by an amount, corresponding to model uncertainty, which increases with the dispersion of p(w|D) around w_MP.

An interesting alternative is to approximate p(w|D) by a more tractable distribution q(w) obtained by minimising the Kullback-Leibler divergence

\int q(w) \log \frac{q(w)}{p(w|D)}\, dw.

This is known as ensemble learning [20, 27, 3]. Typically q takes the form of a Gaussian whose parameters may be fitted using Bayesian methods [27, 3]. The advantage of this approach is that the approximating Gaussian is fitted to p(w|D) globally rather than locally at w = w_MP. More generally, q may be assumed to have free form, or to be a product of tractable distributions of fixed form. It is normally still implied, however, that there is essentially only one significant mode for the posterior distribution.

⁴ Note that since the Hessian is evaluated at a maximum of p(w|D, λ), rather than at a maximum of p(D|w, λ), the eigenvalues may not all be positive. Furthermore, since w_MP depends on λ, the derivation of (12) strictly speaking ignores terms involving dν_i/dλ [28].

⁵ To emphasise a danger in (13) note that (10) will have a local minimum at w = 0, even for sufficiently small β > 0 as defined in footnote 3. This mode of p(w|D), however, will normally only have very local extent, so that it will contribute little to the integral in (14), except in cases where there is little detectable coupling between any y_i and x_i in D, when the opinion implied by w = 0 that p(y|x, D) ≡ p(y|D) would deserve some weight.

In some cases it may be necessary to attempt the integral in (14) using numerical methods. In general (14) can be approximated by

p(y|x, D) \approx \frac{1}{M} \sum_{i=1}^{M} p(y|x, w_i)    (15)

provided {w_1, …, w_M} is a sample of weight vectors which is representative of the posterior distribution p(w|D). The problem is to generate the {w_i} by searching those regions of the high-dimensional weight space where p(w|D) is large and extends over a non-negligible region. This problem has been studied by Neal [34, 36] who has developed extensions of the Metropolis Monte Carlo method specifically adapted to neural networks. This method involves a large number of successive steps through weight space but, to achieve a given error bound, only a much smaller number of visited locations need be retained for predictive purposes. A less efficient method, though one that is simple and often effective, is to choose the {w_i} as local minima obtained by some standard optimisation method from sufficiently many independently chosen starting points.

Note that (15) expresses the resulting predictive distribution as a finite mixture of distributions. The variation between these distributions expresses the extent of model uncertainty. For example, the variance of the mixture distribution for y will be the mean of the predicted variances plus the variance of the predicted means. Specifically, if μ_i and σ_i² are the mean and variance according to w_i, the predicted mean is ⟨μ_i⟩ and the predicted variance is ⟨σ_i²⟩ + {⟨μ_i²⟩ − ⟨μ_i⟩²}, where ⟨μ_i⟩ is the average of μ_1, …, μ_M etc. The first term ⟨σ_i²⟩ represents modelled noise and the second term ⟨μ_i²⟩ − ⟨μ_i⟩² represents model uncertainty.
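The mean-and-variance decomposition just described is easy to state in code. A minimal sketch (not from the paper), given arrays mu and sigma2 of the per-model predictive means μ_i and variances σ_i²:

    import numpy as np

    def mixture_mean_variance(mu, sigma2):
        # Predictive mean <mu_i> and variance <sigma_i^2> + (<mu_i^2> - <mu_i>^2)
        # of the finite mixture (15).
        mean = mu.mean()
        noise = sigma2.mean()           # modelled noise
        model_uncertainty = mu.var()    # spread of the predicted means
        return mean, noise + model_uncertainty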


3 Kernel-based methods

3.1 Gaussian processes

The previous discussion took as its basis the idea of a prior p(w) over network weights w. Now, for a given input vector x, the prior p(w) over weights will induce a prior p(y) over the output y = f(x, w) computed by the network, assuming for simplicity that the network has a single output unit. This is because y depends on the network weights w as well as on the input x, so that uncertainty in the weights induces uncertainty in the output, for given x. More generally, if we consider the outputs y_1, …, y_N computed for different inputs x_1, …, x_N, the prior distribution over weights p(w) determines a joint prior p(y_1, …, y_N) over the outputs calculated for these inputs. Given the sometimes opaque role of individual weights in a network, it can be argued that it may be more natural to specify such a prior directly. Predictions can then be made using conditionalisation. Writing y_N = (y_1, …, y_N) for the observed values at x_1, …, x_N, the predictive conditional distribution for y_{N+1} at a new point x_{N+1} is given by

p(y_{N+1} \mid y_1, \ldots, y_N) = \frac{p(y_1, \ldots, y_{N+1})}{p(y_1, \ldots, y_N)}    (16)

where, by assumption, the numerator and denominator on the right are known.

Neal [35, 36] has shown that, for neural networks with independent and identically distributed priors over weights, the prior p(y_1, …, y_N) converges, for any N, to a multivariate Gaussian as the number of hidden units tends to infinity.⁶ Such a family of variables y(x) is called a Gaussian process.⁷ For a Gaussian process, the conditional predictive distribution (16) for y_{N+1} is also Gaussian, with mean and variance which can be expressed as follows. If y_N has covariance matrix Σ_N and if the covariance matrix for y_{N+1} is written in the form

\Sigma_{N+1} = \begin{pmatrix} \Sigma_N & \boldsymbol{\sigma} \\ \boldsymbol{\sigma}^T & \sigma \end{pmatrix}

then the predictive distribution for y_{N+1} has mean \langle y_{N+1} \rangle = \boldsymbol{\sigma}^T \Sigma_N^{-1} y_N and variance \sigma - \boldsymbol{\sigma}^T \Sigma_N^{-1} \boldsymbol{\sigma}. Notice that \langle y_{N+1} \rangle takes the form of a linear combination \langle y_{N+1} \rangle = \alpha_1 y_1 + \cdots + \alpha_N y_N of observed values. The weighting coefficients α_i automatically take account of the correlations between values of y(x) at different input locations x_1, …, x_{N+1}. As might be expected, the predicted variance \sigma - \boldsymbol{\sigma}^T \Sigma_N^{-1} \boldsymbol{\sigma} is always less than the prior variance σ.

⁶ Explicit forms for the resulting covariance functions are derived in [50].
⁷ Gaussian processes are already used in Wiener-Kolmogorov time series prediction [49] and in Matheron's approach to geostatistics [31].

To implement this model it is necessary to model the covariance matrix Σ_N. Decisions must also be made about modelling the mean, if trends or drifts are permitted. Writing

\Sigma_N = \bigl\{ K(x_i, x_j, \lambda) \bigr\}_{i,j=1}^{N}

where λ is the vector of parameters of the model, the process is said to be stationary if K(x, x′, λ) depends only on the separation x − x′ for any x, x′. For example, Gibbs and MacKay [18] consider a stationary process

K(x, x', \lambda) = \lambda_1 \exp\left\{ -\frac{1}{2} \sum_{k=1}^{m} \left( \frac{x_k - x'_k}{\alpha_k} \right)^2 \right\} + \lambda_2 + \lambda_3\, \delta(x, x')

where λ = (λ_1, α_1, …, α_m, λ_2, λ_3) is the vector of adjustable parameters and m is the dimension of x. Williams and Rasmussen [51] include a further linear regression term λ_4 x^T x′. The Bayesian predictive distribution now becomes

p(y_{N+1}|y) = \int p(y_{N+1}|y, \lambda)\, p(\lambda|y)\, d\lambda

which must be integrated by Monte Carlo methods or by searching for maxima of p(λ|y) ∝ p(y|λ) p(λ) and using the most probable values of the hyperparameters [51, 33]. Considerable attention has been paid recently to reducing the computational complexity of Gaussian process modelling [18, 43, 52] and much of this work is also applicable to other kernel-based methods. It should be noted that in geostatistics the “covariogram” for stationary processes, for which K(x, x′) = k(x − x′), is frequently estimated empirically from the data. If a parametric model is used, the form is chosen to reflect prior geological or geochemical knowledge [16]. Here too, Bayesian methods have been applied to the problem of estimating spatial covariance structures [24, 26].
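A compact numpy sketch (illustrative, not from the paper) of prediction with a covariance of the form used by Gibbs and MacKay, assuming a single shared length-scale alpha and arbitrary hyperparameter values:

    import numpy as np

    def cov(x1, x2, lam1=1.0, alpha=1.0, lam2=0.1, lam3=1e-3, same=False):
        # K(x, x') = lam1 exp{-0.5 sum_k ((x_k - x'_k)/alpha)^2} + lam2 + lam3 delta(x, x')
        d2 = np.sum(((x1[:, None, :] - x2[None, :, :]) / alpha) ** 2, axis=-1)
        K = lam1 * np.exp(-0.5 * d2) + lam2
        if same:
            K += lam3 * np.eye(len(x1))   # the delta term acts on the diagonal
        return K

    def gp_predict(X, y, x_new):
        xs = x_new[None, :]
        Sigma_N = cov(X, X, same=True)
        k = cov(X, xs)[:, 0]                  # the vector written sigma above
        kappa = cov(xs, xs, same=True)[0, 0]  # the scalar prior variance
        coef = np.linalg.solve(Sigma_N, y)
        mean = k @ coef                       # sigma^T Sigma_N^{-1} y_N
        var = kappa - k @ np.linalg.solve(Sigma_N, k)
        return mean, var

Fitting the hyperparameters by maximising p(y|λ), or integrating over them, then proceeds as described above.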

3.2 Support Vector Machines

A recent significant development in classification and pattern recognition has been the introduction of the Support Vector Machine (SVM) [46, 17]. SVM methods have been applied to regression [47] and density estimation [48] but we shall concentrate on binary classification. Suppose a set of training examples (x_1, y_1), …, (x_N, y_N) is given where each x_i is a vector of real-valued inputs and each y_i ∈ {−1, +1} is the corresponding class label (i = 1, …, N).

Assume initially that the two classes are linearly separable, in other words there exists a linear functional f such that f(x_i) < 0 whenever y_i = −1 and f(x_i) > 0 whenever y_i = +1. The class label of a new item x can then be predicted by the sign of f(x). Where such a separating functional f exists, it will not be unique, if only because it is undetermined to within a positive scalar multiple. More importantly, the separating hyperplane {x : f(x) = 0} is generally not unique. The central tenet of the SVM approach is that the optimal hyperplane is one that maximises the minimum distance between the hyperplane and any example in the training data. The optimal hyperplane can then be found by convex optimisation methods.

For data that is not directly linearly separable, an embedding of the input features x ↦ Φ(x) into some inner-product space H may allow the resulting training examples {(Φ(x_i), y_i)} to be linearly separated in H. In that case it

turns out that detailed information about Φ and its range is not needed. This is because the optimal hyperplane in H, where distance is defined by the inner product, depends on Φ only through inner products K(x, x′) = Φ(x) · Φ(x′). In fact the classifying functional, corresponding to the optimal separating hyperplane, can be expressed as

y(x) = w_0 + \sum_{i=1}^{N} w_i K(x, x_i)    (17)

where w_0, w_1, …, w_N are the model parameters and K is the kernel function. The practical problem therefore becomes one of choosing a suitable kernel K rather than an embedding Φ. Suitable kernels, namely those deriving from some implicit embedding Φ, include K(x, y) = (x · y + 1)^p and K(x, y) = exp(−λ‖x − y‖²). To avoid overfitting, or in case the dataset is still not linearly separable, the extra constraint that |w_i| < C for i = 1, …, N is imposed. C corresponds to a penalty for misclassification and is chosen by the user. The SVM typically leads to a sparse model in the sense that most of the coefficients w_i vanish. The training items x_i for which the corresponding w_i is non-zero are called support vectors and, typically, lie close to the decision boundary.

Support Vector Machines have their origins in the principle of structural risk minimisation [44, 22]. The aim is to place distribution-independent bounds on expected generalisation error [5, 45]. The motivation and results are somewhat different from those associated with Bayesian analysis. In the case of classification, for example, the optimal hyperplane depends only on the support vectors, which are extreme values of the dataset, whereas likelihood-based methods depend on all the data. A Bayesian approach always aims to conclude with a probability distribution over unknown quantities, whereas an SVM typically offers point estimates for regression or hard binary decisions for classification.

Bayesian methods have nonetheless been applied to the problem of kernel selection in [40] where support vector classification is interpreted as efficient approximation to Gaussian process classification. By contrast, the problem of model selection based on error bounds derived from cross-validation is discussed in [14, 45]. A Bayesian treatment of a generalised linear model of identical functional form to the SVM is introduced in [42]. This approach, using ideas of automatic relevance determination [36], provides probabilistic predictions. It also yields a sparse representation, but one using relevance vectors, which are prototypical examples of classes, rather than support vectors, which are close to the decision boundary or, in cases of misclassification, on the wrong side of it.

The SVM performs well in practice and is becoming a popular method. This is partly due to the fact that, once parameters of the model are somehow fixed, SVM training finds a global solution, whereas neural network training, when considered as an optimisation process, may find multiple local minima, as discussed previously.
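A short sketch (illustrative; the weights would come from the convex optimisation described above, which is not shown) of evaluating the classifying functional (17) with the Gaussian kernel:

    import numpy as np

    def rbf_kernel(x, x_i, lam=1.0):
        # K(x, y) = exp(-lambda ||x - y||^2)
        return np.exp(-lam * np.sum((x - x_i) ** 2))

    def svm_decision(x, support_vectors, weights, w0, lam=1.0):
        # Equation (17): y(x) = w0 + sum_i w_i K(x, x_i); in a trained SVM the
        # sum effectively runs only over the support vectors (non-zero w_i).
        return w0 + sum(w * rbf_kernel(x, sv, lam)
                        for w, sv in zip(weights, support_vectors))

    # Predicted class label: +1 or -1 according to the sign of the decision value.
    # label = np.sign(svm_decision(x, support_vectors, weights, w0))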


4 Unsupervised learning

So far we have dealt with labelled data. We assumed that we were given a collection of observations {(x_1, y_1), …, (x_N, y_N)} where each x_n represents some known features of the nth item, and y_n is a class label, or value, associated with x_n. The presence of the labels makes this a case of supervised learning. The aim is to use the examples to predict the value of y on the basis of x for a general out-of-sample case. More generally, the problem is to model the conditional probability distribution p(y|x).

Now suppose we are only given the features {x_1, …, x_N} without corresponding labels. What can be learned about the x_n? This question, which at first sight seems perplexing, can be given a sense if it is interpreted as a search for interesting features of the data. For example, are there clusters in the data? Are there outliers which appear novel or unusual? What is the intrinsic dimensionality of the data? Can the data be visualised, or otherwise analysed, in a lower dimensional space? Do the data exhibit latent structure, which would help to explain how they were generated?

Some aspects of unsupervised learning relate to qualitative ideas of concept formation studied in cognitive science and the philosophy of science. In quantitative terms, however, unsupervised learning often corresponds to some form of probability density estimation; and special interest attaches to the structures that particular density estimators might exhibit. When considered as a statistical problem, this makes possible the application of Bayesian methods. We consider two examples.

4.1 The Generative Topographic Mapping

The first example concerns a case where density estimation is useful for clustering and for providing low-dimensional representations enabling visualisation of high dimensional data. Suppose the data lives in a possibly high dimensional space D. If the dataset is {x_1, …, x_N}, where each x_n is a d-dimensional real vector, then D = R^d. A common form of density estimation uses a mixture of Gaussians. For a finite mixture, the density at x ∈ D may be estimated by

p(x) = \sum_{i \in I} p(x|i)\, P(i)    (18)

where I is some finite index set, P(i) is the weight of the ith component and p(x|i) is a Gaussian density. In the simplest case, each Gaussian will be spherical with the same dispersion, so that each covariance matrix is the same scalar multiple, β^{-1} say, of the identity. Then

p(x|i) = \left( \frac{\beta}{2\pi} \right)^{d/2} \exp\left\{ -\frac{\beta}{2} \|x - m_i\|^2 \right\}

where m_i is the centre of the ith mixture component.

The idea of the Generative Topographic Mapping [11, 10] is to embed the index set I in a linear topological space L, so that each i ∈ I becomes the

index of an element u_i ∈ L, and to require that the mapping u_i ↦ m_i is the restriction of a continuous map u ↦ m_u from L to D. L is now referred to as the latent space. For computational reasons, the mixture is still considered to be finite, so that we can write (18) as

p(x) = \sum_{i \in I} p(x|u_i)\, P(u_i)    (19)

where the weights of the mixture are interpreted as a prior distribution over L concentrated on a certain finite subset. The purpose of the continuity of the mapping u ↦ m_u is seen when the mapping from L to D is, in a sense, inverted using Bayes' theorem:

P(u_i|x) = \frac{p(x|u_i)\, P(u_i)}{p(x)}    (20)

The significance of (20) is that, for any possible data point x ∈ D, there is a discrete probability distribution P(u_i|x) over L concentrated on those elements of latent space which correspond to components of the Gaussian mixture. Typically P(u_i|x) will be large when x is close to the centre of the mixture component generated by u_i. In that case u_i is said to have large responsibility for x. If we write m_x for the mean in L of the posterior distribution (20), the mapping x ↦ m_x maps data points in D to elements of the latent space L. Normally the mapping is continuous, so that points that are close in D should map to points that are close in L.

An important application of the GTM is to data visualisation. Suppose that L has low dimension, specifically L = R², and that the elements of L having positive prior probability lie on a regular two-dimensional grid. Then each element x_n of the dataset {x_1, …, x_N} will determine, by the mapping x ↦ m_x, an element u_n in the convex hull of the regular grid in L. Each data point is therefore visualisable by its representative in a rectangular region of the two dimensional latent space L. Furthermore, it might be expected that distinct populations in data space would be represented by distinct clusters in latent space.

This account has said nothing of the way in which the mapping ψ : u ↦ m_u is characterised. Typically ψ is taken to be a generalised linear regression model fitted, together with the noise parameter β, using the EM algorithm [11]. A more flexible manifold-aligned noise model is described in [10] together with methods of Bayesian inference for hyperparameters. [11] also provides a discussion of the relationship between the GTM and Kohonen's earlier well-known self-organising feature map [25].
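A minimal sketch (not the authors' implementation) of the responsibilities (20) and the posterior-mean projection into a two-dimensional latent grid, for equally weighted spherical components with inverse variance beta:

    import numpy as np

    def responsibilities(x, centres, beta):
        # P(u_i | x) from (20); the (beta/2pi)^(d/2) factor and the equal
        # priors P(u_i) cancel in the normalisation.
        log_p = -0.5 * beta * np.sum((centres - x) ** 2, axis=1)
        log_p -= log_p.max()              # guard against underflow
        p = np.exp(log_p)
        return p / p.sum()

    def latent_projection(x, centres, grid, beta):
        # Posterior mean m_x in latent space: a point in the convex hull of
        # the regular grid of latent points u_i.
        return responsibilities(x, centres, beta) @ grid

Here centres holds the mixture centres m_i in data space and grid the corresponding latent points u_i.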

4.2 Blind source separation

The generative topographic map can be viewed as an instance of a generative model in which observations are modelled as noisy expressions of the state of underlying latent variables [38]. Several techniques for modelling multivariate datasets can be viewed similarly, including principal component analysis (PCA) and independent component analysis (ICA). Whereas conventional PCA

resolves a data item linearly into uncorrelated components, ICA attempts to resolve it into fully statistically independent components [15]. ICA has been applied to the problem of blind source separation. Suppose there are M statistically independent sources and N sensors. The sources might be auditory, e.g. speakers in a room, and the sensors microphones. Each sensor receives a mixture of the sources. The task is to recover the hidden sources from the observed sensors. This problem arises in many areas, for example medical signal and image processing, speech processing, target tracking etc. Separation is said to be blind since, in the most general form of the problem, nothing is assumed known about the mixing process or the sources, apart from their mutual independence. In this sense ICA can be considered to be an example of unsupervised learning.

In the simple linear case, it is assumed that the sensor outputs x can be expressed as x = As where s are the unknown source signals and A is an unknown mixing matrix. One approach to the problem is to recover the source signals by a linear transformation y = Wx where W is the separating matrix [1]. The intention is that y should coincide with the original s up to scalar multiplication and permutation of channels. The idea is to fit W by minimising the Kullback-Leibler divergence

\int p(y) \log \frac{p(y)}{\prod_{m=1}^{M} p(y_m)}\, dy

between the joint distribution for y and the product of its marginals over the source channels m = 1, . . . , M . The minimum value of zero is achieved only if W can be chosen so that the resulting distribution p(y) factorises over channels, in which case the components y1 , . . . , yM of y = Wx are independent. Various algorithms have been proposed for solving this problem at varying degrees of generality [6, 1, 2]. A recent Bayesian approach uses the ideas of ensemble learning mentioned previously [32]. The idea is that the separating matrix W may be ill-defined if the data is noisy. The approach instead is to provide a method for approximating the posterior distribution over all possible sources. Significantly, the method allows Bayesian model selection techniques to determine the number of sources M which is generally unknown in advance.
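As a toy illustration (using the FastICA algorithm from scikit-learn as a stand-in for the separation algorithms cited above), two independent sources are mixed and then recovered up to scaling and permutation:

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 8, 2000)
    s = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # independent sources
    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])                         # unknown mixing matrix
    x = s @ A.T                                        # sensor signals, x = A s

    ica = FastICA(n_components=2, random_state=0)
    y = ica.fit_transform(x)                           # estimated sources y = W x

The Bayesian ensemble-learning approach of [32] instead places a posterior distribution over the sources rather than committing to a single separating matrix W.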

5 Conclusion

Statistical methods are increasingly guiding principled research in neural information processing, in both its engineering and neuroscientific forms. Improved algorithms and computational resources are making possible the analysis of high dimensional datasets using powerful non-linear models. This is appropriate in view of the complexity of the systems with which the field is now dealing. Inevitably there is a danger of using over-complex models which fail to distinguish between signal and noise. Bayesian methods are proving invaluable in providing model selection techniques for matching complexity of the model to information content of the data in both supervised and unsupervised learning.


References

[1] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 757–763. MIT Press, 1996.

[2] H. Attias and C. E. Schreiner. Blind source separation and deconvolution: The dynamic component analysis algorithm. Neural Computation, 10(6):1373–1424, 1998.

[3] David Barber and Christopher M. Bishop. Ensemble learning for multilayer networks. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 395–401. MIT Press, 1998.

[4] A. R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133, 1994.

[5] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods—Support Vector Learning, pages 43–54. MIT Press, 1999.

[6] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

[7] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[8] C. M. Bishop and C. Legleye. Estimating conditional probability densities for periodic variables. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 641–648. MIT Press, 1995.

[9] Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.

[10] Christopher M. Bishop, Markus Svensén, and Christopher K. I. Williams. Developments of the generative topographic mapping. Neurocomputing, 21:203–224, 1998.

[11] Christopher M. Bishop, Markus Svensén, and Christopher K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215–234, 1998.

[12] Wray L. Buntine and Andreas S. Weigend. Bayesian back-propagation. Complex Systems, 5:603–643, 1991.

[13] Zehra Cataltepe, Yaser S. Abu-Mostafa, and Malik Magdon-Ismail. No free lunch for early stopping. Neural Computation, 11(4):995–1009, 1999.

[14] Olivier Chapelle and Vladimir Vapnik. Model selection for support vector machines. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 230–236. MIT Press, 2000.

[15] P. Comon. Independent component analysis: A new concept. Signal Processing, 36:287–314, 1994.

[16] Noel A. C. Cressie. Statistics for Spatial Data. Wiley, revised edition, 1993.

[17] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[18] Mark Gibbs and David J. C. MacKay. Efficient implementation of Gaussian processes. Technical report, Cavendish Laboratory, Cambridge, 1997.

[19] G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (Amherst, 1986), pages 1–12. Hillsdale: Erlbaum, 1986.

[20] G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13, 1993.

[21] K. Hornik. Some new results on neural network approximation. Neural Computation, 6(8):1069–1072, 1993.

[22] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

[23] H. Jeffreys. Theory of Probability. Oxford, third edition, 1961.

[24] Peter K. Kitanidis. Parameter uncertainty in estimation of spatial functions: Bayesian analysis. Water Resources Research, 22(4):499–507, 1986.

[25] T. Kohonen. Self-Organizing Maps. Springer, 1995.

[26] Nhu D. Le and James V. Zidek. Interpolation with uncertain spatial covariances: A Bayesian alternative to Kriging. Journal of Multivariate Analysis, 43:351–374, 1992.

[27] D. J. C. MacKay. Developments in probabilistic modelling with neural networks—ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications, pages 191–198. Springer, 1995.

[28] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.

[29] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.


[30] David J. C. MacKay. In G. Heidbreder, editor, Maximum Entropy and Bayesian Methods, Santa Barbara 1993, Dordrecht, 1994. Kluwer.

[31] G. Matheron. La Théorie des Variables Regionalisées et ses Applications. Masson, 1965.

[32] J. W. Miskin and D. J. C. MacKay. Ensemble learning for blind source separation. In S. Roberts and R. Everson, editors, Independent Component Analysis: Principles and Practice. Cambridge University Press, 2001.

[33] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo et al., editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.

[34] Radford M. Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical Report CRG-TR-92-1, Department of Computer Science, University of Toronto, April 1992.

[35] Radford M. Neal. Priors for infinite networks. Technical Report CRG-TR-94-1, Department of Computer Science, University of Toronto, 1994.

[36] Radford M. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics No. 118. Springer-Verlag, 1996.

[37] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

[38] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.

[39] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.

[40] Matthias Seeger. Bayesian model selection for Support Vector machines, Gaussian processes and other kernel classifiers. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 603–609. MIT Press, 2000.

[41] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. John Wiley & Sons, 1977.

[42] Michael E. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 652–658. MIT Press, 2000.

[43] Giancarlo Ferrari Trecate, C. K. I. Williams, and M. Opper. Finite-dimensional approximation of Gaussian processes. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 218–224. MIT Press, 1999.

[44] V. Vapnik. Estimation of Dependences Based on Empirical Data. Nauka, Moscow, 1979. English translation: Springer Verlag, New York, 1982.

[45] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9):2013–2036, 2000.

[46] Vladimir Vapnik. Statistical Learning Theory. John Wiley, 1998.

[47] Vladimir Vapnik, Steven E. Golowich, and Alex Smola. Support vector method for function approximation, regression estimation, and signal processing. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems 9, pages 281–287. MIT Press, 1997.

[48] Vladimir N. Vapnik and Sayan Mukherjee. Support vector method for multivariate density estimation. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 659–664. MIT Press, 2000.

[49] N. Wiener. Extrapolation, Interpolation, and Smoothing of Time Series. MIT Press, 1949.

[50] Christopher K. I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.

[51] Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520. The MIT Press, 1996.

[52] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. The MIT Press, 2001.

[53] P. M. Williams. A Marquardt algorithm for choosing the step-size in backpropagation learning with conjugate gradients. Cognitive Science Research Paper CSRP 229, University of Sussex, February 1991.

[54] P. M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117–143, 1995.

[55] P. M. Williams. Using neural networks to model conditional multivariate densities. Neural Computation, 8(4):843–854, 1996.

[56] P. M. Williams. Modelling seasonality and trends in daily rainfall data. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 985–991. The MIT Press, 1998.

[57] P. M. Williams. Matrix logarithm parametrizations for neural network covariance models. Neural Networks, 12(2):299–308, 1999.
