Optimal Bayesian Online Learning

Ole Winther and Sara A. Solla

connect, The Niels Bohr Institute, Blegdamsvej 17, 2100 Copenhagen, Denmark
[email protected] and [email protected]

Abstract. In a Bayesian approach to online learning a simple parametric approximate posterior over rules is updated in each online learning step. Predictions on new data are derived from averages over this posterior. This should be compared to the Bayes optimal batch (or offline) approach, for which the posterior is calculated from the prior and the likelihood of the whole training set. We suggest that minimizing the difference between the batch and the approximate posterior will optimize the performance of the Bayes online algorithm. This general principle is demonstrated for three scenarios: learning a linear perceptron rule, and learning a binary classification rule in the simple perceptron with a binary or continuous weight prior.

1 Introduction

The study of optimal online learning schemes within the statistical mechanics formalism has attracted considerable recent interest [1]-[5]. In online learning each training example is used only once: the estimate of the target rule is updated using the current example and the current estimate. This method is to be contrasted with the Bayes optimal batch approach, for which predictions on new examples are derived from the posterior probability over rules, which depends on all training examples. Surprisingly, in some cases the optimal online algorithm can achieve the same asymptotic generalization error as the best batch algorithm [3]. This effect has been explained recently [5] by the fact that for likelihoods that are smooth functions of the rule parameters the posterior distribution will be Gaussian asymptotically [6]. This observation has led [5] to suggest a Bayesian online algorithm in which an approximate Gaussian posterior is updated in each and every online learning step. In [7] we have shown that the idea of an approximate posterior may be extended to a binary posterior for a rule with binary parameters. In this paper we suggest minimizing the relative entropy of the approximate posterior with respect to the exact posterior (for which the current example is not discarded). This is optimal because the loss of information in the approximation step will be minimized.

The paper is organized as follows. In sections 2 and 3, respectively, Bayesian inference and Bayes optimal online learning are discussed. In the following two sections we study Bayesian online learning in a statistical mechanics analysis of learning a regression rule, i.e. the linear perceptron (section 4), and a binary classification rule, i.e. the simple perceptron (section 5). For the linear perceptron the exact posterior is Gaussian and the problem may therefore be solved exactly using online learning. For the perceptron classification rule we consider both a binary and a Gaussian weight prior/approximate posterior.

2 Bayesian inference

To explain the Bayesian approach to inference [8], consider a data set (training set) of size m, $D_m = \{y^\mu;\ \mu = 1,\ldots,m\}$, where the data $y^\mu$ are generated by a distribution $p(y|\mathbf{w})$ characterized by unknown parameters $\mathbf{w}$. Our total knowledge about the rule after observing m independent examples is expressed by the posterior, which can be written through Bayes rule as

$$p(\mathbf{w}|D_m) = \frac{1}{Z}\, p(\mathbf{w}) \prod_{\mu=1}^{m} p(y^\mu|\mathbf{w}) \;, \qquad (1)$$

where $p(\mathbf{w})$ is the prior over rule parameters $\mathbf{w}$ and $Z = \int d\mathbf{w}\, p(\mathbf{w}) \prod_{\mu=1}^{m} p(y^\mu|\mathbf{w})$ is a normalization constant. Bayes optimal predictions are based on the predictive probability [9]

$$p(y|D_m) = \int d\mathbf{w}\, p(\mathbf{w}|D_m)\, p(y|\mathbf{w}) \;. \qquad (2)$$

Thus one needs the information about the whole data set $D_m$ in order to make predictions. For unsupervised learning $y$ is simply a point in an N-dimensional space. For supervised learning, which we consider in the following, $y$ is an input-output pair $y = (\mathbf{s}, \tau)$, where the input $\mathbf{s}$ is an N-dimensional vector and the output $\tau$ is a scalar in the case of regression and a label in the case of classification. We will assume that the input is independent of the input-output rule. We may therefore make the decomposition $p(y|\mathbf{w}) = p(\mathbf{s})\, p(\tau|\mathbf{w}, \mathbf{s})$. We are interested in modelling the output. For that we need the predictive probability

$$p(\tau|\mathbf{s}) \equiv p(\tau|\mathbf{s}, D_m) = \int d\mathbf{w}\, p(\mathbf{w}|D_m)\, p(\tau|\mathbf{w}, \mathbf{s}) \;. \qquad (3)$$

In the following we discuss how to use this probability to make optimal predictions.
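As a concrete illustration of eqs. (2)-(3), the predictive probability is simply a posterior average of the likelihood. The minimal sketch below estimates it by Monte Carlo for a toy Gaussian posterior and a noisy linear rule; the sampling model, noise level, and all names here are illustrative assumptions, not part of the analysis above.

```python
# Minimal sketch: Bayes prediction as a posterior average, eqs. (2)-(3).
# The Gaussian "posterior samples" and the noise level are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 10
w_samples = rng.normal(size=(5000, N))     # stand-in for samples from p(w|D_m)

def likelihood(tau, w, s, sigma2=0.25):
    """p(tau|w,s) for the noisy linear rule tau = w.s/sqrt(N) + noise."""
    mean = w @ s / np.sqrt(len(s))
    return np.exp(-(tau - mean) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def predictive(tau, s):
    """Monte Carlo estimate of p(tau|s, D_m) = <p(tau|w,s)>_posterior."""
    return likelihood(tau, w_samples, s).mean()

print(predictive(0.3, rng.normal(size=N)))
```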

3 Optimal online learning

In online learning each example is presented only once, and the parameters are updated using the information provided by the current example and the current value of the parameters. The goal is to work with a small number of parameters that provide a representation of the best estimate of the rule given the already presented examples. In a Bayesian approach to online learning the parameters, rather than just representing an estimate of the rule, parametrize a simple approximate posterior. The effect of adding an example is represented by a simple recursion for the posterior distribution of the parameters given the presented data:

$$p(\mathbf{w}|D_m, y^{m+1}) = \frac{p(\mathbf{w}|D_m)\, p(y^{m+1}|\mathbf{w})}{\int d\mathbf{w}\, p(\mathbf{w}|D_m)\, p(y^{m+1}|\mathbf{w})} \;. \qquad (4)$$

In order to get rid of the explicit dependence on the previous examples, $p(\mathbf{w}|D_m)$ is approximated by a simpler distribution $p(\mathbf{w}|A_m)$, where $A_m$ refers to the parameters that characterize the posterior distribution (such as the first two moments of $\mathbf{w}$ in the Gaussian case). Adding a new example, we may regard the update of $p(\mathbf{w}|A_m)$ as consisting of two steps:

1. Add example: $p(\mathbf{w}|A_m, y^{m+1}) \propto p(\mathbf{w}|A_m)\, p(y^{m+1}|\mathbf{w})$
2. Get rid of explicit example dependence: $p(\mathbf{w}|A_m, y^{m+1}) \to p(\mathbf{w}|A_{m+1})$

In the second step some of the information contained in the new example is discarded. The best one can do is therefore to minimize the information loss. The absolute value of this information loss is quantified by the relative entropy, or Kullback-Leibler distance, between the two probability distributions,

$$D(p(\mathbf{w}|A_m, y^{m+1})\,\|\,p(\mathbf{w}|A_{m+1})) = \int d\mathbf{w}\, p(\mathbf{w}|A_m, y^{m+1}) \log \frac{p(\mathbf{w}|A_m, y^{m+1})}{p(\mathbf{w}|A_{m+1})} \;. \qquad (5)$$

This principle of minimal information loss may also be used to select between different approximations for the posterior distribution. It does not, however, tell us how to invent approximate posteriors. In the following we choose approximate posteriors which allow for analytical calculation of $A_{m+1}$. The framework needs to be compatible with the prior $p(\mathbf{w})$ from which the parameters of the target rule are picked at random. This suggests that we should choose $p(\mathbf{w}|A_0) = p(\mathbf{w})$. In the case of a Gaussian approximation for the posterior, it is straightforward to show that minimizing the relative entropy amounts to choosing $A_{m+1}$ such that $p(\mathbf{w}|A_{m+1})$ and $p(\mathbf{w}|A_m, y^{m+1})$ have the same mean and covariance.
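As a sketch of the two-step update, the snippet below carries out steps 1 and 2 for a one-dimensional Gaussian approximate posterior: the tilted distribution $p(w|A_m)\,p(y^{m+1}|w)$ is formed, and the KL-minimizing Gaussian is found by matching its first two moments. The probit likelihood is an assumption chosen only to make the example concrete.

```python
# Minimal sketch of the two-step Bayesian online update with a Gaussian
# approximating family: moment matching minimizes the KL divergence (5).
import numpy as np
from scipy.stats import norm

def project_update(mu, var, lik):
    """Step 1: tilt p(w|A_m) by the likelihood; step 2: moment-match a Gaussian."""
    w = np.linspace(mu - 10 * np.sqrt(var), mu + 10 * np.sqrt(var), 20001)
    p = norm.pdf(w, mu, np.sqrt(var)) * lik(w)     # unnormalized p(w|A_m, y^{m+1})
    p /= np.trapz(p, w)                            # normalize by 1D quadrature
    new_mu = np.trapz(w * p, w)                    # first moment
    new_var = np.trapz((w - new_mu) ** 2 * p, w)   # second central moment
    return new_mu, new_var

mu, var = 0.0, 1.0                                 # p(w|A_0) = p(w), the prior
for tau in [+1, +1, -1]:                           # toy sequence of labels
    mu, var = project_update(mu, var, lambda w, t=tau: norm.cdf(t * w))
    print(f"A_m: mu={mu:.3f}, var={var:.3f}")
```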

4 Linear perceptron

To illustrate the Bayesian approach to learning we study a scenario in which the posterior is in fact Gaussian. Bayesian online learning with a Gaussian posterior and Bayesian offline learning will therefore be equivalent in this case. The rule is taken to be a linear perceptron, $\tau = \frac{1}{\sqrt{N}}\mathbf{w}\cdot\mathbf{s} + \eta$, with additive Gaussian output noise $\eta$ of variance $\sigma^2$. The likelihood is therefore Gaussian, $p(\tau|\mathbf{w},\mathbf{s}) = e^{-(\tau - \mathbf{w}\cdot\mathbf{s}/\sqrt{N})^2/2\sigma^2}/\sqrt{2\pi\sigma^2}$. For a spherical prior, $p(\mathbf{w}) = e^{-\mathbf{w}\cdot\mathbf{w}/2}/(2\pi)^{N/2}$, it is straightforward to show that the posterior is Gaussian with mean $\langle\mathbf{w}\rangle = \frac{1}{\sqrt{N}\sigma^2}\,\mathbf{C}\sum_\mu \tau^\mu \mathbf{s}^\mu$ and covariance $\mathbf{C} = \left(\mathbf{I} + \frac{1}{N\sigma^2}\sum_\mu \mathbf{s}^\mu (\mathbf{s}^\mu)^T\right)^{-1}$. The predictive probability of the output $p(\tau|\mathbf{s})$ is also Gaussian, with mean $\frac{1}{\sqrt{N}}\langle\mathbf{w}\rangle\cdot\mathbf{s}$ and variance $\hat{\sigma}^2 = \sigma^2/\left(1 - \frac{1}{N\sigma^2}\mathbf{s}^T\mathbf{Q}^{-1}\mathbf{s}\right)$, with $\mathbf{Q} = \mathbf{C}^{-1} + \frac{1}{N\sigma^2}\mathbf{s}\mathbf{s}^T$. The variance may be rewritten as $\hat{\sigma}^2 = \sigma^2 + \frac{1}{N}\mathbf{s}^T\mathbf{C}\mathbf{s}$ using the identity $\mathbf{s}^T\mathbf{Q}^{-1}\mathbf{s} = \frac{\mathbf{s}^T\mathbf{C}\mathbf{s}}{1 + \frac{1}{N\sigma^2}\mathbf{s}^T\mathbf{C}\mathbf{s}}$ [10].
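Because adding an example changes the inverse covariance by a rank-one term, the Gaussian posterior can be tracked exactly online. The following sketch (our illustration, not code from the paper) updates $\langle\mathbf{w}\rangle$ and $\mathbf{C}$ with the standard Sherman-Morrison / recursive-least-squares step implied by the formulas above.

```python
# Sketch of exact online learning for the linear perceptron (section 4):
# with Gaussian prior and likelihood the posterior stays Gaussian, so this
# rank-one (Kalman/RLS-style) update is exact and online = batch learning.
import numpy as np

rng = np.random.default_rng(1)
N, m, sigma2 = 50, 200, 0.25
w_true = rng.normal(size=N)              # teacher drawn from the prior p(w)

mean = np.zeros(N)                       # <w> under p(w|A_0) = p(w)
C = np.eye(N)                            # prior covariance

for _ in range(m):
    s = rng.normal(size=N)
    tau = w_true @ s / np.sqrt(N) + np.sqrt(sigma2) * rng.normal()
    x = s / np.sqrt(N)
    Cx = C @ x
    denom = sigma2 + x @ Cx              # predictive variance of tau given s
    mean = mean + Cx * (tau - x @ mean) / denom
    C = C - np.outer(Cx, Cx) / denom

print("overlap with teacher:", mean @ w_true / N)
```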


The squared error $(\tau - \frac{1}{\sqrt{N}}\mathbf{w}\cdot\mathbf{s})^2$ is the natural error measure for linear regression. The average of the squared error over the predictive probability gives the generalization error. The generalization error is minimized by choosing $\mathbf{w} = \langle\mathbf{w}\rangle$, yielding the Bayes error

$$\epsilon_{\rm Bayes} = \int d\mathbf{s}\, d\tau\, p(\tau, \mathbf{s}) \left(\tau - \frac{\langle\mathbf{w}\rangle\cdot\mathbf{s}}{\sqrt{N}}\right)^2 = \int d\mathbf{s}\, p(\mathbf{s})\, \hat{\sigma}^2 = \sigma^2 + \frac{1}{N}\int d\mathbf{s}\, p(\mathbf{s})\, \mathbf{s}^T\mathbf{C}\mathbf{s} \;. \qquad (6)$$

For unbiased inputs with covariance matrix $\mathbf{B}$ the last term becomes $\frac{1}{N}{\rm tr}\,\mathbf{B}\mathbf{C}$. For simplicity we study uncorrelated inputs, $\mathbf{B} = \mathbf{I}$. In the thermodynamic limit $N \to \infty$ the trace of $\mathbf{C}$ is self-averaging, i.e. independent of the realization of the training set, depending only on $\alpha \equiv \frac{m}{N}$ and $\sigma^2$. Inserting the mean value of the trace calculated in [11], the Bayes error is

$$\epsilon_{\rm Bayes} = \sigma^2 + \frac{1}{2}\left(1 - \alpha - \sigma^2 + \sqrt{(\sigma^2 - z_+)(\sigma^2 - z_-)}\right) \;, \qquad (7)$$

with $z_\pm = -(1 \pm \sqrt{\alpha})^2$. This result may be compared to gradient descent learning, i.e. minimizing the cost function $E = \frac{1}{2}\sum_\mu (\tau^\mu - \mathbf{w}\cdot\mathbf{s}^\mu/\sqrt{N})^2 + \frac{\lambda}{2}\mathbf{w}\cdot\mathbf{w}$, studied in [11, 12, 10]. For the optimal choice of weight decay parameter, $\lambda = \sigma^2$, gradient descent learning coincides with the Bayes result [12]. It should be noted that there is a difference between the definition of the generalization error used here and that of [12]. In [12] the rule is considered to be noiseless and the training set is corrupted by noise, whereas in the Bayesian framework the rule itself is considered to be noisy. This gives the additional term $\sigma^2$ in the error in eqs. (6) and (7), which is the minimal error of Bayes learning.
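A short numerical sketch of eq. (7), useful for plotting the theoretical learning curve (the function name is ours):

```python
# Evaluate the Bayes learning curve (7) for the linear perceptron.
import numpy as np

def eps_bayes_linear(alpha, sigma2):
    """Bayes generalization error, eq. (7), for uncorrelated inputs."""
    z_plus = -(1 + np.sqrt(alpha)) ** 2
    z_minus = -(1 - np.sqrt(alpha)) ** 2
    return sigma2 + 0.5 * (1 - alpha - sigma2
                           + np.sqrt((sigma2 - z_plus) * (sigma2 - z_minus)))

# Sanity check: at alpha = 0 the error is sigma^2 + 1 (prior variance + noise).
for alpha in [0.0, 0.5, 1.0, 2.0, 5.0]:
    print(f"alpha={alpha}: eps_Bayes={eps_bayes_linear(alpha, 0.25):.4f}")
```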

5 Classification in the simple perceptron

In this section we analyse information extraction in the well studied online learning scenario of a binary classification rule given by a perceptron, $\tau = {\rm sign}(\mathbf{w}\cdot\mathbf{s})$, for which the output is flipped with probability $\kappa$. The likelihood is therefore

$$p(\tau|\mathbf{w},\mathbf{s}) = (1-2\kappa)\,\Theta\!\left(\tau\,\frac{\mathbf{w}\cdot\mathbf{s}}{\sqrt{N}}\right) + \kappa \;, \qquad (8)$$

where $\Theta(x) = 1$ for $x > 0$ and $0$ otherwise. We will use a cavity argument [13] to derive the predictive probability. The argument is expected to be valid in the thermodynamic limit within one pure state, i.e. one local minimum. Rather than averaging over the potentially complicated distribution of $\mathbf{w}$, we only need the distribution of the field $\frac{1}{\sqrt{N}}\mathbf{w}\cdot\mathbf{s}$, i.e. the projection of $\mathbf{w}$ on a new random direction. We can apply the Central Limit Theorem and conclude that the field is a Gaussian variable with mean $\frac{1}{\sqrt{N}}\langle\mathbf{w}\rangle\cdot\mathbf{s}$ and variance $\frac{1}{N}\sum_{ij} s_i s_j (\langle w_i w_j\rangle - \langle w_i\rangle\langle w_j\rangle) \approx \frac{1}{N}(\langle\mathbf{w}\cdot\mathbf{w}\rangle - \langle\mathbf{w}\rangle\cdot\langle\mathbf{w}\rangle)$. The last step is valid only for unbiased spherical inputs, $\overline{s_i} = 0$ and $\overline{s_i s_j} = \delta_{ij}$. Exchanging $s_i s_j$ by its mean value


above is expected to give rise to an $O(1/N)$ error. It is now straightforward to write down the predictive probability [14]:

$$p(\tau|\mathbf{s}) = (1-2\kappa)\,\Phi\!\left(\frac{\tau\,\langle\mathbf{w}\rangle\cdot\mathbf{s}}{\sqrt{\langle\mathbf{w}\cdot\mathbf{w}\rangle(1-q)}}\right) + \kappa \;, \qquad (9)$$

where $\Phi(x) = \int_{-\infty}^{x} Dt$, $Dt = e^{-t^2/2}\,dt/\sqrt{2\pi}$, and $q = \frac{\langle\mathbf{w}\rangle\cdot\langle\mathbf{w}\rangle}{\langle\mathbf{w}\cdot\mathbf{w}\rangle}$. The predictive probability $p(\tau|\mathbf{s})$ gives the probability that $\tau$ is the correct output label.

Clearly the best we can do, i.e. the Bayes optimal algorithm, is to choose the output label with highest probability, $\tau^{\rm Bayes} = {\rm argmax}_\tau\, p(\tau|\mathbf{s})$. This allows us to immediately write down the well known result for the binary Bayes classifier,

$$\tau^{\rm Bayes} = {\rm sign}(\langle\mathbf{w}\rangle\cdot\mathbf{s}) \;, \qquad (10)$$

first derived by [15]. For classification the generalization error is naturally defined as the probability of error on a new input. The average error on a specific input $\mathbf{s}$ is $1 - p(\tau^{\rm Bayes}|\mathbf{s})$, or $\sum_{\tau=\pm 1} p(\tau|\mathbf{s})\,\Theta(1 - 2p(\tau|\mathbf{s}))$ in the binary case. The average generalization error of the Bayes algorithm is thus $\epsilon_{\rm Bayes} = \int d\mathbf{s}\, p(\mathbf{s}) \sum_{\tau=\pm 1} p(\tau|\mathbf{s})\,\Theta(1 - 2p(\tau|\mathbf{s}))$. Calculating this for the predictive probability (9) we recover the well known result [16]:

$$\epsilon_{\rm Bayes} = (1-2\kappa)\,\frac{1}{\pi}\arccos(\sqrt{q}) + \kappa \;. \qquad (11)$$

Note that these results are independent of the type of prior, which may be binary or continuous. This means, for example, that optimal predictions for a binary rule prior are not necessarily implemented by a binary weight vector. In the following we derive Bayesian online learning algorithms for the cases of binary and Gaussian weight priors.
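Eqs. (9)-(11) are straightforward to evaluate numerically; a minimal sketch (the function names are ours):

```python
# Sketch of the predictive probability (9), the Bayes classifier (10),
# and its generalization error (11).
import numpy as np
from scipy.stats import norm

def predictive(tau, s, w_mean, ww, q, kappa):
    """Eq. (9): probability that tau is the correct label; ww = <w.w>."""
    field = w_mean @ s / np.sqrt(ww * (1.0 - q))
    return (1.0 - 2.0 * kappa) * norm.cdf(tau * field) + kappa

def tau_bayes(s, w_mean):
    """Eq. (10): Bayes-optimal label."""
    return np.sign(w_mean @ s)

def eps_bayes(q, kappa):
    """Eq. (11): average generalization error of the Bayes classifier."""
    return (1.0 - 2.0 * kappa) * np.arccos(np.sqrt(q)) / np.pi + kappa

print(eps_bayes(0.0, 0.05), eps_bayes(0.9, 0.05))   # 0.5 at q = 0, smaller at q = 0.9
```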

5.1 Binary weight prior

For a perceptron rule with binary weights, $p(\mathbf{w}) = \prod_i \left[\frac{1}{2}\delta(w_i - 1) + \frac{1}{2}\delta(w_i + 1)\right]$, we choose the approximate posterior to be a biased binary distribution,

$$p(\mathbf{w}|A_m) = \prod_i \left[\frac{1 + \langle w_i\rangle}{2}\,\delta(w_i - 1) + \frac{1 - \langle w_i\rangle}{2}\,\delta(w_i + 1)\right] \;, \qquad (12)$$

i.e. $A_m = \langle\mathbf{w}\rangle_m \equiv \langle\mathbf{w}\rangle$. Adding a new example $(\mathbf{s}, \tau)$, we update the posterior by calculating $\langle\mathbf{w}\rangle_{m+1}$. Introducing the field $\frac{1}{\sqrt{N}}\mathbf{w}\cdot\mathbf{s}$ as a variable through the integral representation $1 = \int dh\, \delta(h - \frac{1}{\sqrt{N}}\mathbf{w}\cdot\mathbf{s}) = \int \frac{dh\,dx}{2\pi i}\, e^{x(h - \frac{1}{\sqrt{N}}\mathbf{w}\cdot\mathbf{s})}$ and summing over the $i$'th weight, we obtain

$$\langle w_i\rangle_{m+1} = \tanh\!\left(\tanh^{-1}\langle w_i\rangle + \frac{1}{\sqrt{N}}\, s_i\, \langle x\rangle_{{\rm without}\ i}\right) \;, \qquad (13)$$

where $\langle x\rangle_{{\rm without}\ i}$ denotes an average over the auxiliary variable $x$ in a posterior without the $i$'th weight. It is straightforward to calculate the average of $x$ for


the full posterior:

$$\langle x\rangle = \frac{\partial \log p(\tau|\mathbf{s})}{\partial\langle h\rangle} = \frac{(1-2\kappa)\,\tau\, D\!\left(\langle h\rangle/\sqrt{1-q}\right)}{\sqrt{1-q}\; p(\tau|\mathbf{s})} \;, \qquad (14)$$

where $\langle h\rangle = \frac{1}{\sqrt{N}}\langle\mathbf{w}\rangle\cdot\mathbf{s}$ and $D(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. We can use a cavity argument (expected to hold for $N \to \infty$), this time adding a new weight [17, 13], to write $\langle x\rangle_{{\rm without}\ i}$ in terms of known quantities:

$$\langle x\rangle_{{\rm without}\ i} = \langle x\rangle + \frac{1}{\sqrt{N}}\, s_i\, \langle w_i\rangle_{m+1}\, \frac{\partial^2 \log p(\tau|\mathbf{s})}{\partial\langle h\rangle^2} \;. \qquad (15)$$

We have now obtained a Bayesian online algorithm for binary weights. Note that $\langle w_i\rangle_{m+1}$ is present on the rhs of eq. (13) through eq. (15). This means that in principle we should solve eq. (13) with respect to $\langle w_i\rangle_{m+1}$. However, we simply set $\langle w_i\rangle_{m+1} = \langle w_i\rangle$ on the rhs. We expect to make an error of $O(1/N)$ by doing this, because the difference between $\langle w_i\rangle$ and $\langle w_i\rangle_{m+1}$ is expected to be $O(1/\sqrt{N})$.
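A minimal simulation sketch of the resulting algorithm, eqs. (13)-(15), follows. Since the equations were reconstructed from a degraded scan, the sign conventions in the cavity correction should be treated as our reading of the text, and the numerical clipping is an implementation convenience.

```python
# Sketch of the Bayesian online algorithm for binary weights, eqs. (13)-(15).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N, kappa = 500, 0.02
w_true = rng.choice([-1.0, 1.0], size=N)   # teacher drawn from the binary prior
w_mean = np.zeros(N)                       # <w>_0 = 0 under the unbiased prior
D = lambda x: np.exp(-x * x / 2) / np.sqrt(2 * np.pi)

for _ in range(5 * N):                     # alpha = m/N runs up to 5
    s = rng.normal(size=N)
    tau = np.sign(w_true @ s)
    if rng.random() < kappa:               # output noise: flip the label
        tau = -tau
    q = w_mean @ w_mean / N                # <w.w> = N for binary weights
    u = tau * (w_mean @ s) / np.sqrt(N * (1 - q))
    p = (1 - 2 * kappa) * norm.cdf(u) + kappa
    x = (1 - 2 * kappa) * tau * D(u) / (np.sqrt(1 - q) * p)      # eq. (14)
    x2 = -(1 - 2 * kappa) * u * D(u) / ((1 - q) * p) - x ** 2    # d2 log p / d<h>2
    x_wo = x + s * w_mean * x2 / np.sqrt(N)                      # eq. (15)
    w_mean = np.tanh(np.arctanh(w_mean) + s * x_wo / np.sqrt(N)) # eq. (13)
    w_mean = np.clip(w_mean, -1 + 1e-12, 1 - 1e-12)              # numerical guard

q = w_mean @ w_mean / N
print("q =", q, " eps =", (1 - 2 * kappa) * np.arccos(np.sqrt(q)) / np.pi + kappa)
```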

Next we derive the expected generalization error of the algorithm, i.e. the learning curve $\epsilon_{\rm Bayes}$ versus $\alpha = \frac{m}{N}$. The Bayes error (11) depends on $\alpha$ through the self-averaging order parameter

$$q_{m+1} = \frac{1}{N}\langle\mathbf{w}\rangle_{m+1}\cdot\langle\mathbf{w}\rangle_{m+1} = \frac{1}{N}\int d\tau\, d\mathbf{s}\, p(\tau, \mathbf{s})\, \langle\mathbf{w}\rangle_{m+1}\cdot\langle\mathbf{w}\rangle_{m+1} \;. \qquad (16)$$

To derive the average we observe that the field $\tanh^{-1}\langle w_i\rangle_{m+1} = \tanh^{-1}\langle w_i\rangle + \frac{1}{\sqrt{N}}\, s_i\, \langle x\rangle_{{\rm without}\ i}$ has a bias $A_{m+1}$ in the direction of the teacher weight $w_i$ and Gaussian fluctuations due to the random inputs. We find the following recursive relation for the bias:

$$A_{m+1} = A_m + \frac{1}{N}\,\frac{1-2\kappa}{\pi\sqrt{1-q}} \int Dt\, \frac{e^{-\frac{1}{2}qt^2}}{\Phi(\sqrt{q}\,t) + \tilde{\kappa}} \;, \qquad (17)$$

with $\tilde{\kappa} = \kappa/(1-2\kappa)$. The variance increment, $V_{m+1} - V_m = \frac{1}{N}\int d\tau\, d\mathbf{s}\, p(\tau,\mathbf{s})\, \langle x\rangle^2$, turns out to be identical to that of the bias. Introducing a "continuous" timestep $d\alpha = \frac{1}{N}$ we arrive at the following equations for the order parameter:

$$q = \int Dz\, \tanh^2\!\left(\sqrt{A}\,z + A\right) \;, \qquad \frac{dA}{d\alpha} = \frac{1}{\pi}\,\frac{1-2\kappa}{\sqrt{1-q}} \int Dt\, \frac{e^{-\frac{1}{2}qt^2}}{\Phi(\sqrt{q}\,t) + \tilde{\kappa}} \;. \qquad (18)$$

We compare the performance of the optimal algorithm with the theoretical prediction above and with the following algorithms:

- ${\rm sign}\langle\mathbf{w}\rangle$, i.e. the most probable solution.
- The optimal continuous weights algorithm (see section 5.2).
- The sign of the weights found by the optimal continuous weights algorithm.


Simulation results with $N = 1000$, averaged over 100 runs, are shown in figure 1. For large $\alpha$ the optimal weights tend towards $\pm 1$. That is the reason for the narrowing difference between ${\rm sign}\langle\mathbf{w}\rangle$ and $\langle\mathbf{w}\rangle$ with increasing $\alpha$. For small $\alpha$ the continuous weights algorithm performs as well as the optimal one. Thus, due to insufficient training data, the optimal algorithm cannot make use of the additional prior information.

Fig. 1. Learning curves for binary weights. The top figure is for $\kappa = 0$ and the lower for $\kappa = 0.02$. The full lines are the theoretical predictions. The lower dotted line is the simulation result for the optimal algorithm. The upper dotted line is obtained by taking the sign of the optimal weights. The upper dashed line (to the right) is the optimal algorithm for continuous weights. The lower dashed line is the result of taking the sign of the weights found by the continuous weights algorithm.

5.2 Gaussian weight prior

We now consider a spherical Gaussian prior for the weights. Due to the scale invariance of the likelihood $p(\tau|\mathbf{w},\mathbf{s})$, the prior fixes the variance $\langle\mathbf{w}\cdot\mathbf{w}\rangle = N$. In [5] the case of a general Gaussian approximation to the posterior is discussed. In the thermodynamic limit with a spherical input distribution the approximate Gaussian posterior becomes diagonal, with all variances equal to $1-q$. The update rule for the average weights becomes

$$\langle\mathbf{w}\rangle_{m+1} = \langle\mathbf{w}\rangle + \frac{1}{\sqrt{N}}\,(1-q)\,\mathbf{s}\, \frac{\partial \log p(\tau|\mathbf{s})}{\partial\langle h\rangle} \;. \qquad (19)$$

The differential equation for the order parameter becomes

$$\frac{dq}{d\alpha} = \frac{1}{\pi}\,(1-q)^{3/2} \int Dt\, \frac{(1-2\kappa)\, e^{-\frac{1}{2}qt^2}}{\Phi(\sqrt{q}\,t) + \tilde{\kappa}} \;. \qquad (20)$$

As discussed in [7], the update rules for the average weights and the order parameter $q$ become identical to those for the weights and the corresponding order parameter in the optimal Hebb-type algorithm found in [1]. The optimal Hebb algorithm is therefore the best possible algorithm using only first and second order information. An analysis of the optimal Hebb-type algorithm for the tree committee machine has been carried out in [18]. Our preliminary results show that the Bayesian update rule for the average weights becomes identical to that for the weights in the optimal Hebb-type algorithm. However, the generalization error is not the same for the two algorithms, because [18] uses the optimal weights in the original committee machine architecture to predict on new examples, whereas it has been shown in [14] that the Bayes classifier is not implemented by the committee machine itself, but by a modified architecture.

In the rest of this section we calculate explicitly the average relative entropy for the different posterior distributions entering the approximate Bayes scheme, namely $p(\mathbf{w}|A_m, y^{m+1})$, $p(\mathbf{w}|A_{m+1})$ and $p(\mathbf{w}|A_m)$. Since $p(\mathbf{w}|A_m, y^{m+1})$ and $p(\mathbf{w}|A_{m+1})$ have identical first two moments and the approximate posterior is Gaussian, it follows that

$$D(p(\mathbf{w}|A_m, y^{m+1})\,\|\,p(\mathbf{w}|A_m)) = D(p(\mathbf{w}|A_m, y^{m+1})\,\|\,p(\mathbf{w}|A_{m+1})) + D(p(\mathbf{w}|A_{m+1})\,\|\,p(\mathbf{w}|A_m)) \;. \qquad (21)$$

This relation says that $D(p(\mathbf{w}|A_m, y^{m+1})\,\|\,p(\mathbf{w}|A_m))$, the information gained from example $y^{m+1}$ without approximating, is equal to the information lost in the approximation plus the information gained in the approximate scheme.
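The decomposition (21) is easy to verify numerically in one dimension: the log-ratio of two Gaussians is quadratic in $w$, so distributions sharing the same first two moments give it the same average. A sketch follows; the probit likelihood is an arbitrary stand-in used only to produce a non-Gaussian tilted posterior.

```python
# Numerical check of the decomposition (21) in one dimension.
import numpy as np
from scipy.stats import norm

w = np.linspace(-8, 8, 20001)
prior = norm.pdf(w)                            # p(w|A_m), a unit Gaussian
tilted = prior * norm.cdf(2.0 * w)             # step 1: include example y^{m+1}
tilted /= np.trapz(tilted, w)
mu = np.trapz(w * tilted, w)
var = np.trapz((w - mu) ** 2 * tilted, w)
approx = norm.pdf(w, mu, np.sqrt(var))         # p(w|A_{m+1}), moment-matched

def kl(p, q):
    """Relative entropy (5) by quadrature."""
    return np.trapz(p * np.log(p / q), w)

print(kl(tilted, prior))                       # total information gain
print(kl(tilted, approx) + kl(approx, prior))  # lost + kept: the same number
```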

Using the predictive probability (9) we may also calculate the averages of the relative entropies, e.g.

$$I(A_m, y^{m+1}\,\|\,A_{m+1}) = \sum_{\tau=\pm 1} \int d\mathbf{s}\, p(\mathbf{s})\, p(\tau|\mathbf{s})\, D(p(\mathbf{w}|A_m, y^{m+1})\,\|\,p(\mathbf{w}|A_{m+1})) \;. \qquad (22)$$


The actual expressions are $I(A_m, y^{m+1}\,\|\,A_{m+1}) = I(A_m, y^{m+1}\,\|\,A_m) - I(A_{m+1}\,\|\,A_m)$ with

$$I(A_m, y^{m+1}\,\|\,A_m) = -2\int Dt\, \left(\kappa + (1-2\kappa)H(zt)\right) \log\left(\kappa + (1-2\kappa)H(zt)\right) + s(\kappa) \;,$$
$$I(A_{m+1}\,\|\,A_m) = \int Dt\, \frac{(1-2\kappa)^2\, D^2(zt)}{\kappa + (1-2\kappa)H(zt)} \;, \qquad (23)$$

where $s(\kappa) = \kappa\log\kappa + (1-\kappa)\log(1-\kappa)$ and $z = \sqrt{q/(1-q)}$. The relative entropies are functions of the (normalized) number of examples $\alpha \equiv \frac{m}{N}$ through the order parameter $q(\alpha)$. In figure 2 the relative entropy curves and the corresponding learning curves are plotted.
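A sketch evaluating the averages (23); here $H(x) = 1 - \Phi(x)$ is assumed for the notation, and the formulas follow our reconstruction above.

```python
# Evaluate the average relative entropies (23) by quadrature over Dt.
import numpy as np
from scipy.stats import norm

def entropies(q, kappa):
    t = np.linspace(-10, 10, 4001)
    Dt = norm.pdf(t)
    z = np.sqrt(q / (1 - q))
    pk = kappa + (1 - 2 * kappa) * norm.sf(z * t)    # kappa + (1-2k) H(zt)
    s_k = kappa * np.log(kappa) + (1 - kappa) * np.log(1 - kappa)
    I_total = -2 * np.trapz(Dt * pk * np.log(pk), t) + s_k
    I_kept = np.trapz(Dt * (1 - 2 * kappa) ** 2 * norm.pdf(z * t) ** 2 / pk, t)
    return I_total, I_kept, I_total - I_kept         # total, kept, discarded

for q in [0.5, 0.9, 0.99, 0.999]:
    total, kept, discarded = entropies(q, 0.1)
    print(f"q={q}: discarded/kept = {discarded/kept:.3f}")  # -> 1, cf. eq. (24)
```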

m m +1 I (Am+1 jjAm ) = I (Am ; y jjAm+1 ) = 1

;

(24)

independent of the noise level $\kappa$; i.e., asymptotically half of the information in the example is discarded. This is very different from the case of $p(y|\mathbf{w})$ being a smooth function of $\mathbf{w}$ (e.g. for weight or input noise), for which the posterior asymptotically becomes Gaussian [5] and thus no information is discarded.

Fig. 2. Relative entropy curves and learning curves. The left figure is for $\kappa = 0$ and the right for $\kappa = 0.1$. The dashed lines are $I(A_{m+1}\,\|\,A_m)/\log 2$ as a function of $\alpha = \frac{m}{N}$. The dotted lines show $I(A_m, y^{m+1}\,\|\,A_m)/\log 2$. The discarded information, $I(A_m, y^{m+1}\,\|\,A_{m+1})/\log 2$, is the difference between the two. The full lines show $\epsilon_{\rm Bayes}$.

6 Conclusion

We have presented a Bayesian approach to online learning. The approximate online posterior distribution over rules is updated in each online learning step such that the difference with the non-approximate posterior is minimized. We have studied three online learning scenarios using the statistical mechanics approach: 1. For linear regression the posterior is Gaussian and online learning is equivalent to offline learning. The Bayesian result coincides with the best possible gradient descent strategy. For a simple perceptron classification task with output noise we have considered both: 2. a binary, and 3. a Gaussian prior. In the binary case the approximate posterior is chosen to be a biased binary distribution. The performance of the Bayesian algorithm shows good agreement with the theoretical prediction and is superior to the three other strategies considered: taking the sign of the optimal weights, the optimal weights for a Gaussian posterior, and the sign of the optimal weights for a Gaussian posterior. In the Gaussian case the approximate posterior is chosen to be Gaussian. The Bayesian result coincides with the optimal Hebb-type algorithm. Asymptotically, the Bayesian online algorithm uses only half the information in each training example.

Acknowledgements

This research is supported by the Danish Research Councils for the Natural and Technical Sciences through the Danish Computational Neural Network Center (connect).

References

1. O. Kinouchi and N. Caticha, J. Phys. A: Math. Gen. 25, 6243 (1992).
2. M. Biehl and P. Riegler, Europhys. Lett. 28, 525 (1994).
3. C. Van den Broeck and P. Reimann, Phys. Rev. Lett. 76, 2188 (1996).
4. J. W. Kim and H. Sompolinsky, Phys. Rev. Lett. 76, 3021 (1996).
5. M. Opper, Phys. Rev. Lett. 77, 4671 (1996).
6. M. J. Schervish, Theory of Statistics, Springer-Verlag, New York (1995).
7. O. Winther and S. A. Solla, in European Symposium on Artificial Neural Networks (ESANN'97), 169, D facto, Brussels (1997).
8. J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, New York (1985).
9. V. N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag (1982).
10. P. Sollich, in Advances in Neural Information Processing Systems (NIPS) 7, 207, MIT Press (1996).
11. J. A. Hertz, A. Krogh and G. I. Thorbergsson, J. Phys. A: Math. Gen. 22, 2133 (1989).
12. A. Krogh and J. A. Hertz, J. Phys. A: Math. Gen. 25, 1135 (1992).
13. M. Mezard, G. Parisi and M. A. Virasoro, Spin Glass Theory and Beyond, Lecture Notes in Physics 9, World Scientific (1987).
14. M. Opper and O. Winther, Phys. Rev. Lett. 76, 1964 (1996).
15. T. L. H. Watkin, Europhys. Lett. 21, 871 (1993).
16. M. Opper and D. Haussler, in IVth Annual Workshop on Computational Learning Theory (COLT91), Morgan Kaufmann (1991).
17. M. Mezard, J. Phys. A 22, 2181 (1989).
18. M. Copelli and N. Caticha, J. Phys. A: Math. Gen. 28, 1615 (1995).