Bayesian Linear Classifier

James Barrett
Institute of Mathematical and Molecular Biomedicine, King's College London
November 24, 2013

Abstract

This document contains the mathematical theory behind the Bayesian Linear Classifier Matlab function. The likelihood function and its partial derivatives are derived. Some numerical stability issues are also discussed.

Model Definition

We observe covariates x_i ∈ ℝ^d (also called input variables) for each sample i, where i = 1, . . . , N. We also observe a binary label σ_i ∈ {−1, +1}. We specify that the class membership probabilities are given by
\[
p(\sigma_i = -1 \mid x_i, w) = \tfrac{1}{2}\,\mathrm{erfc}(w \cdot x_i + w_0)
\quad\text{and}\quad
p(\sigma_i = +1 \mid x_i, w) = 1 - \tfrac{1}{2}\,\mathrm{erfc}(w \cdot x_i + w_0),
\tag{1}
\]
respectively, where w ∈ ℝ^d is a vector of regression coefficients and w_0 is a bias term. We can include the bias term in the weight vector so that w = (w_0, w_1, . . . , w_d) ∈ ℝ^{d+1}. The complementary error function is defined as
\[
\mathrm{erfc}(h) = \frac{2}{\sqrt{\pi}} \int_h^{\infty} e^{-s^2}\, ds.
\]
We will use Bayes' theorem to infer the most probable weights,
\[
p(w \mid D, \alpha, \beta) = \frac{p(D \mid w)\, p(w \mid \alpha, \beta)}{\int p(D \mid w')\, p(w' \mid \alpha, \beta)\, dw'},
\tag{2}
\]
where the data D = {(σ_1, x_1), . . . , (σ_N, x_N)}. The likelihood factorises over samples,
\[
p(D \mid w) = \prod_{i=1}^{N} p(\sigma_i \mid x_i, w).
\tag{3}
\]

We will assume Gaussian priors for the weights,
\[
p(w \mid \alpha) = \frac{e^{-\frac{1}{2\alpha^2}\, w \cdot w}}{(2\pi)^{d/2}\, \alpha^{d}},
\qquad
p(w_0 \mid \beta) = \frac{e^{-\frac{1}{2\beta^2}\, w_0^2}}{(2\pi)^{1/2}\, \beta}.
\tag{4}
\]
The hyperparameters α and β control the variance of the weight components.
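For illustration, the class probabilities in (1) can be evaluated with any numerical erfc routine. The original implementation is a Matlab function; the sketch below uses Python/NumPy instead, and the function and variable names are illustrative rather than taken from that code.

```python
import numpy as np
from scipy.special import erfc

def class_probabilities(X, w, w0):
    """Class membership probabilities from Eq. (1).

    X  : (N, d) array of covariates x_i
    w  : (d,) array of regression coefficients
    w0 : scalar bias term
    Returns arrays p_minus and p_plus of shape (N,).
    """
    h = X @ w + w0                   # h_i = w . x_i + w0
    p_minus = 0.5 * erfc(h)          # p(sigma_i = -1 | x_i, w)
    p_plus = 1.0 - p_minus           # p(sigma_i = +1 | x_i, w) = 0.5 * erfc(-h)
    return p_minus, p_plus
```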

Inferring the weights

We will numerically minimise the negative log of the posterior (2), dropping the w-independent normalisation constant, using a gradient-based optimisation algorithm:
\[
\begin{aligned}
L_w(w) = -\frac{1}{N} \log p(w \mid D, \alpha, \beta)
= {}& -\frac{1}{N} \sum_{\sigma_i = -1} \log\!\left( \tfrac{1}{2}\,\mathrm{erfc}(w \cdot x_i + w_0) \right)
 - \frac{1}{N} \sum_{\sigma_i = +1} \log\!\left( 1 - \tfrac{1}{2}\,\mathrm{erfc}(w \cdot x_i + w_0) \right) \\
& + \frac{1}{2\alpha^2 N}\, w \cdot w + \frac{d}{2N} \log 2\pi + \frac{d}{N} \log \alpha
 + \frac{1}{2\beta^2 N}\, w_0^2 + \frac{1}{2N} \log 2\pi + \frac{1}{N} \log \beta.
\end{aligned}
\tag{5}
\]
The hyperparameters are fixed in the current implementation.
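A minimal sketch of the objective (5), again in Python/NumPy rather than the original Matlab (all names are illustrative). The constant terms are kept for completeness even though they do not affect the minimiser; the logarithms are evaluated directly here, with the large-|h_i| regime handled in the Numerical Instability section below.

```python
import numpy as np
from scipy.special import erfc

def neg_log_posterior(theta, X, sigma, alpha, beta):
    """Objective L_w of Eq. (5); theta = (w0, w_1, ..., w_d)."""
    N, d = X.shape
    w0, w = theta[0], theta[1:]
    h = X @ w + w0
    # Log-likelihood contributions of the two classes (direct evaluation).
    L = -np.sum(np.log(0.5 * erfc(h[sigma == -1]))) / N
    L -= np.sum(np.log(0.5 * erfc(-h[sigma == +1]))) / N
    # Gaussian prior terms for w and w0, including the normalisation constants.
    L += w @ w / (2 * alpha**2 * N) + d * np.log(2 * np.pi) / (2 * N) + d * np.log(alpha) / N
    L += w0**2 / (2 * beta**2 * N) + np.log(2 * np.pi) / (2 * N) + np.log(beta) / N
    return L
```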

Error bars

We will approximate the posterior distribution over w with a Gaussian centred on w* = argmin_w L_w(w). The curvature matrix can then be used to estimate the variance of the weights. Specifically,
\[
p(w \mid D, \alpha, \beta) \approx e^{-N L_w(w^*) - \frac{N}{2} (w - w^*) \cdot A\, (w - w^*)},
\]
where A is the matrix of second-order partial derivatives of L_w(w) given below. We can interpret (N A)^{-1} as a covariance matrix. Inferred regression weights are then given by
\[
w_\mu^* \pm \sqrt{\left[(N A)^{-1}\right]_{\mu\mu}} \qquad \text{for } \mu = 0, \ldots, d.
\]
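Given the curvature matrix A (derived in the next section) and the minimiser w*, the error bars follow from the diagonal of (N A)^{-1}. A brief sketch, assuming A is supplied as a (d+1) × (d+1) NumPy array ordered as (w_0, w_1, . . . , w_d):

```python
import numpy as np

def weight_error_bars(w_star, A, N):
    """One-standard-deviation error bars on the inferred weights.

    w_star : (d+1,) posterior mode (w0, w1, ..., wd)
    A      : (d+1, d+1) second-derivative matrix of L_w at w_star
    N      : number of samples
    """
    cov = np.linalg.inv(N * A)       # interpret (N A)^{-1} as a covariance matrix
    return w_star, np.sqrt(np.diag(cov))
```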

Partial Derivatives

Define h_i = w · x_i + w_0. For the −1 class the first-order derivatives for μ = 1, . . . , d are
\[
\frac{\partial}{\partial w_\mu} \log\!\left( \tfrac{1}{2}\,\mathrm{erfc}(h_i) \right)
= -\frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)}\, x_{i\mu}.
\]
We note that due to symmetry, since 1 − ½ erfc(h) = ½ erfc(−h),
\[
\frac{d}{dh} \log\!\left( 1 - \tfrac{1}{2}\,\mathrm{erfc}(h) \right)
= -\left[\frac{d}{dh} \log\!\left( \tfrac{1}{2}\,\mathrm{erfc}(h) \right)\right]_{h \to -h}.
\]
Combining these results we may write
\[
\frac{\partial}{\partial w_\mu} L_w(w)
= \frac{1}{N} \sum_{\sigma_i = -1} \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)}\, x_{i\mu}
- \frac{1}{N} \sum_{\sigma_i = +1} \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(-h_i)}\, x_{i\mu}
+ \frac{w_\mu}{\alpha^2 N}.
\]
The derivative with respect to the bias term is
\[
\frac{\partial}{\partial w_0} L_w(w)
= \frac{1}{N} \sum_{\sigma_i = -1} \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)}
- \frac{1}{N} \sum_{\sigma_i = +1} \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(-h_i)}
+ \frac{w_0}{\beta^2 N}.
\tag{6}
\]
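These derivatives translate directly into code. The sketch below (illustrative Python names again) returns the full gradient with respect to (w_0, w); together with the objective sketched earlier it could be handed to a gradient-based optimiser such as scipy.optimize.minimize with jac=grad_neg_log_posterior.

```python
import numpy as np
from scipy.special import erfc

def grad_neg_log_posterior(theta, X, sigma, alpha, beta):
    """Gradient of L_w with respect to theta = (w0, w_1, ..., w_d)."""
    N, d = X.shape
    w0, w = theta[0], theta[1:]
    h = X @ w + w0
    # Per-sample factor: +(2/sqrt(pi)) e^{-h^2}/erfc(h) for the -1 class,
    # -(2/sqrt(pi)) e^{-h^2}/erfc(-h) for the +1 class (direct evaluation;
    # see the Numerical Instability section for large |h_i|).
    r = np.where(sigma == -1,
                 2 * np.exp(-h**2) / (np.sqrt(np.pi) * erfc(h)),
                 -2 * np.exp(-h**2) / (np.sqrt(np.pi) * erfc(-h)))
    grad_w = X.T @ r / N + w / (alpha**2 * N)
    grad_w0 = np.sum(r) / N + w0 / (beta**2 * N)
    return np.concatenate(([grad_w0], grad_w))
```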

Second-order partial derivatives for the −1 class, for μ, ν = 1, . . . , d, are
\[
-\frac{\partial^2}{\partial w_\nu \partial w_\mu} \log\!\left( \tfrac{1}{2}\,\mathrm{erfc}(h_i) \right)
= \left[ \left( \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)} \right)^{\!2}
- 2 h_i\, \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)} \right] x_{i\nu} x_{i\mu}.
\]
We observe that
\[
\frac{d^2}{dh^2} \log\!\left( 1 - \tfrac{1}{2}\,\mathrm{erfc}(h) \right)
= \left[\frac{d^2}{dh^2} \log\!\left( \tfrac{1}{2}\,\mathrm{erfc}(h) \right)\right]_{h \to -h}.
\]
The second-order partials are given by
\[
\begin{aligned}
\frac{\partial^2}{\partial w_\nu \partial w_\mu} L_w(w)
= {}& \frac{1}{N} \sum_{\sigma_i = -1} \left[ \left( \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)} \right)^{\!2}
- 2 h_i\, \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)} \right] x_{i\nu} x_{i\mu} \\
& + \frac{1}{N} \sum_{\sigma_i = +1} \left[ \left( \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(-h_i)} \right)^{\!2}
+ 2 h_i\, \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(-h_i)} \right] x_{i\nu} x_{i\mu}
+ \frac{\delta_{\mu\nu}}{\alpha^2 N} \\
= {}& A_{\mu\nu}.
\end{aligned}
\]
Furthermore,
\[
\begin{aligned}
\frac{\partial^2}{\partial w_0^2} L_w(w)
= {}& \frac{1}{N} \sum_{\sigma_i = -1} \left[ \left( \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)} \right)^{\!2}
- 2 h_i\, \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)} \right] \\
& + \frac{1}{N} \sum_{\sigma_i = +1} \left[ \left( \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(-h_i)} \right)^{\!2}
+ 2 h_i\, \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(-h_i)} \right]
+ \frac{1}{\beta^2 N},
\end{aligned}
\]
and finally,
\[
\begin{aligned}
\frac{\partial^2}{\partial w_\mu \partial w_0} L_w(w)
= {}& \frac{1}{N} \sum_{\sigma_i = -1} \left[ \left( \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)} \right)^{\!2}
- 2 h_i\, \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(h_i)} \right] x_{i\mu} \\
& + \frac{1}{N} \sum_{\sigma_i = +1} \left[ \left( \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(-h_i)} \right)^{\!2}
+ 2 h_i\, \frac{2}{\sqrt{\pi}} \frac{e^{-h_i^2}}{\mathrm{erfc}(-h_i)} \right] x_{i\mu}.
\end{aligned}
\]
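Collecting the three blocks of second derivatives gives the full matrix A. A sketch under the same illustrative naming, with the bias handled by prepending a column of ones:

```python
import numpy as np
from scipy.special import erfc

def hessian_A(theta, X, sigma, alpha, beta):
    """Matrix A of second derivatives of L_w, ordered as (w0, w_1, ..., w_d)."""
    N, d = X.shape
    w0, w = theta[0], theta[1:]
    h = X @ w + w0
    u = 2 * np.exp(-h**2) / (np.sqrt(np.pi) * erfc(h))      # -1 class ratio
    v = 2 * np.exp(-h**2) / (np.sqrt(np.pi) * erfc(-h))     # +1 class ratio
    c = np.where(sigma == -1, u**2 - 2 * h * u, v**2 + 2 * h * v)
    Xb = np.hstack([np.ones((N, 1)), X])                    # prepend the bias column
    A = (Xb * c[:, None]).T @ Xb / N                        # (1/N) sum_i c_i x_i x_i^T
    A[0, 0] += 1 / (beta**2 * N)                            # prior term for w0
    A[1:, 1:] += np.eye(d) / (alpha**2 * N)                 # delta_{mu nu} / (alpha^2 N)
    return A
```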

Numerical Instability

If h ≫ 0 then e^{-h²} ≈ 0 and erfc(h) ≈ 0, and the numerical value of their ratio in the derivatives above becomes inaccurate as the limit of machine precision is reached. In this regime we will use the asymptotic expansion (Menzel, 1960) of erfc(h), which for h ≫ 0 is given by
\[
\mathrm{erfc}(h) = \frac{e^{-h^2}}{h \sqrt{\pi}}
\left( 1 - \frac{1}{2h^2} + \frac{3}{(2h^2)^2} - \frac{15}{(2h^2)^3} + \cdots \right).
\]
This means that for the σ_i = −1 class, when h ≫ 0 (a cutoff of h ≥ 20 is suitable in Matlab),
\[
\log\!\left( \tfrac{1}{2}\,\mathrm{erfc}(h) \right)
= -h^2 - \log h - \log\!\left( 2\sqrt{\pi} \right)
+ \log\!\left( 1 - \frac{1}{2h^2} + \frac{3}{(2h^2)^2} - \frac{15}{(2h^2)^3} + \cdots \right),
\]
\[
\frac{e^{-h^2}}{\mathrm{erfc}(h)}
= h \sqrt{\pi} \left( 1 - \frac{1}{2h^2} + \frac{3}{(2h^2)^2} - \frac{15}{(2h^2)^3} + \cdots \right)^{-1}.
\]
For the σ_i = +1 class, when h ≤ −20, define g = −h and use 1 − ½ erfc(h) = ½ erfc(g), so that
\[
\log\!\left( \tfrac{1}{2}\,\mathrm{erfc}(g) \right)
= -g^2 - \log g - \log\!\left( 2\sqrt{\pi} \right)
+ \log\!\left( 1 - \frac{1}{2g^2} + \frac{3}{(2g^2)^2} - \frac{15}{(2g^2)^3} + \cdots \right),
\]
\[
\frac{e^{-g^2}}{\mathrm{erfc}(g)}
= g \sqrt{\pi} \left( 1 - \frac{1}{2g^2} + \frac{3}{(2g^2)^2} - \frac{15}{(2g^2)^3} + \cdots \right)^{-1}.
\]
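A sketch of the corresponding switch between direct evaluation and the asymptotic expansion (illustrative Python; the cutoff of 20 is the one quoted above, and only the first few terms of the series are kept). For the σ_i = +1 class the same functions are called with g = −h.

```python
import numpy as np
from scipy.special import erfc

H_CUT = 20.0  # cutoff above which the asymptotic expansion is used

def _tail(h):
    """Leading terms of the bracketed asymptotic series for erfc(h), h >> 0."""
    t = 1.0 / (2.0 * h**2)
    return 1.0 - t + 3.0 * t**2 - 15.0 * t**3

def log_half_erfc(h):
    """log(0.5 * erfc(h)) for an array h, stable for large positive h."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.empty_like(h)
    small = h < H_CUT
    out[small] = np.log(0.5 * erfc(h[small]))
    hl = h[~small]
    out[~small] = -hl**2 - np.log(hl) - np.log(2 * np.sqrt(np.pi)) + np.log(_tail(hl))
    return out

def exp_mh2_over_erfc(h):
    """exp(-h^2) / erfc(h) for an array h, stable for large positive h."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.empty_like(h)
    small = h < H_CUT
    out[small] = np.exp(-h[small]**2) / erfc(h[small])
    hl = h[~small]
    out[~small] = hl * np.sqrt(np.pi) / _tail(hl)
    return out
```

Alternatively, the scaled complementary error function erfcx(h) = e^{h²} erfc(h) (available as erfcx in Matlab and as scipy.special.erfcx) gives the ratio directly as e^{-h²}/erfc(h) = 1/erfcx(h) and log(½ erfc(h)) = −h² + log(½ erfcx(h)), with no explicit series required.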

References

Donald H. Menzel. Fundamental Formulas of Physics, Volume One. Dover Publications, Inc., New York, 1960.
