Model Selection Tutorial #1: Akaike's Information Criterion
Daniel F. Schmidt and Enes Makalic
Melbourne, November 22, 2008
Content

1. Motivation
2. Estimation
3. AIC
4. Derivation
5. References
Problem

We have observed $n$ data points $y^n = (y_1, \ldots, y_n)$ from some unknown probabilistic source $p^*$, i.e. $y^n \sim p^*$ where $y^n \in \mathcal{Y}^n$. We wish to learn about $p^*$ from $y^n$. More precisely, we would like to discover the generating source $p^*$, or at least a good approximation of it, from nothing but $y^n$.
Statistical Models

To approximate $p^*$ we will restrict ourselves to a set of potential statistical models. Informally, a statistical model can be viewed as a conditional probability distribution over the potential dataspace $\mathcal{Y}^n$:
$$p(y^n \,|\, \theta), \quad \theta \in \Theta$$
where $\theta = (\theta_1, \ldots, \theta_p)$ is a parameter vector that indexes the particular model. Such models satisfy
$$\int_{y^n \in \mathcal{Y}^n} p(y^n \,|\, \theta)\, dy^n = 1$$
for a fixed $\theta$.
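As a quick sanity check of this normalisation property, here is a small Python sketch (illustrative, not part of the original slides) that numerically integrates a univariate normal density over $\mathbb{R}$ for one fixed $\theta$:

```python
import numpy as np
from scipy.integrate import quad

# Univariate normal density with mean mu and variance tau (the n = 1 case)
mu, tau = 1.0, 2.0
density = lambda y: np.exp(-(y - mu) ** 2 / (2 * tau)) / np.sqrt(2 * np.pi * tau)

# Integrates to 1 for any fixed theta = (mu, tau)
total, _ = quad(density, -np.inf, np.inf)
print(total)  # ~1.0
```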
Statistical Models ...

An example would be the univariate normal distribution:
$$p(y^n \,|\, \theta) = \left(\frac{1}{2\pi\tau}\right)^{n/2} \exp\left(-\frac{1}{2\tau} \sum_{i=1}^{n} (y_i - \mu)^2\right)$$
where
- $p = 2$
- $\theta = (\mu, \tau)$ are the parameters
- $\mathcal{Y}^n = \mathbb{R}^n$
- $\Theta = \mathbb{R} \times \mathbb{R}^+$
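To make the model concrete, here is a minimal Python sketch (with made-up data, not from the slides) that evaluates the negative log-likelihood $-\log p(y^n \,|\, \theta)$ of this normal model at two candidate parameter values:

```python
import numpy as np

def normal_nll(y, mu, tau):
    """Negative log-likelihood of y under a univariate normal
    with mean mu and variance tau (the slides' parameterisation)."""
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * tau) + np.sum((y - mu) ** 2) / (2 * tau)

y = np.array([1.2, -0.3, 0.8, 2.1, 0.5])       # made-up data
print(normal_nll(y, mu=0.0, tau=1.0))           # ~8.01
print(normal_nll(y, mu=y.mean(), tau=y.var()))  # ~5.93: a better fit
```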
Parameter Estimation

Given a statistical model and data $y^n$, we would like to take a guess at a plausible value of $\theta$. The guess should be 'good' in some sense. There are many ways to approach this problem; we shall discuss one particularly relevant and important method: Maximum Likelihood.
Method of Maximum Likelihood (ML), Part 1

- A heuristic procedure introduced by R. A. Fisher
- Possesses good properties in many cases
- Is very general and easy to understand

To estimate the parameters $\theta$ of a statistical model from $y^n$, solve
$$\hat\theta(y^n) = \arg\max_{\theta \in \Theta}\, p(y^n \,|\, \theta)$$
or, more conveniently,
$$\hat\theta(y^n) = \arg\min_{\theta \in \Theta}\, \left\{ -\log p(y^n \,|\, \theta) \right\}$$
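A hedged sketch of this recipe in code: minimise the negative log-likelihood numerically. The choice of a normal model, the reparameterisation via $\log \tau$, and the synthetic data are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def nll(theta, y):
    """Negative log-likelihood of a normal model; theta = (mu, log_tau).
    Optimising log(tau) keeps the variance positive."""
    mu, log_tau = theta
    tau = np.exp(log_tau)
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * tau) + np.sum((y - mu) ** 2) / (2 * tau)

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=100)   # synthetic data

result = minimize(nll, x0=np.array([0.0, 0.0]), args=(y,))
mu_hat, tau_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, tau_hat)  # close to the sample mean and (biased) sample variance
```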
Method of Maximum Likelihood (ML), Part 2

Example: estimating the mean parameter $\mu$ of a univariate normal distribution. The negative log-likelihood function is
$$L(\mu, \tau) = \frac{n}{2}\log(2\pi\tau) + \frac{1}{2\tau}\sum_{i=1}^{n}(y_i - \mu)^2$$
Differentiating $L(\cdot)$ with respect to $\mu$ yields
$$\frac{\partial L(\mu, \tau)}{\partial \mu} = \frac{1}{2\tau}\left(2n\mu - 2\sum_{i=1}^{n} y_i\right)$$
Setting this to zero and solving for $\mu$ yields
$$\hat\mu(y^n) = \frac{1}{n}\sum_{i=1}^{n} y_i$$
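A quick numerical check of this closed-form result (with made-up data): over a fine grid of candidate values of $\mu$, the negative log-likelihood is smallest at the sample mean.

```python
import numpy as np

y = np.array([3.1, 2.4, 4.0, 2.8, 3.6])  # made-up data
tau = 1.0  # any fixed variance; the minimiser in mu does not depend on it

def nll_mu(mu):
    return 0.5 * len(y) * np.log(2 * np.pi * tau) + np.sum((y - mu) ** 2) / (2 * tau)

grid = np.linspace(0, 6, 6001)
best = grid[np.argmin([nll_mu(m) for m in grid])]
print(best, y.mean())  # both 3.18
```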
Univariate Polynomial Regression

A more complex model: $k$-th order polynomial regression. Let each $y(x)$ be distributed as per a univariate normal with variance $\tau$ and mean
$$\mu(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$$
The parameters of this model are $\theta^{(k)} = (\tau, \beta_0, \ldots, \beta_k)$. In this model the data $y^n$ is associated with covariates $x^n$, which are known. Given an order $k$, maximum likelihood can be used to estimate $\theta^{(k)}$. But it cannot be used to provide a suitable estimate of the order $k$!
Univariate Polynomial Regression

If we let
$$\hat\mu^{(k)}(x) = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2 + \cdots + \hat\beta_k x^k$$
then Maximum Likelihood chooses $\hat\beta^{(k)}(y^n)$ to minimise
$$\hat\tau^{(k)}(y^n) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat\mu^{(k)}(x_i)\right)^2$$
This is called the residual variance. The negative log-likelihood $L(y^n \,|\, \hat\theta^{(k)}(y^n))$ obtained by plugging in the Maximum Likelihood estimates is
$$L(y^n \,|\, \hat\theta^{(k)}(y^n)) = \frac{n}{2}\log\left(2\pi\hat\tau^{(k)}(y^n)\right) + \frac{n}{2}$$
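In code, the ML fit for this model reduces to ordinary least squares. A sketch (using the slides' true curve below, but otherwise assumed settings) that computes $\hat\tau^{(k)}$ and the minimised negative log-likelihood:

```python
import numpy as np

def fit_polynomial(x, y, k):
    """ML fit of a k-th order polynomial with normal noise: returns the
    coefficient estimates, residual variance and minimised NLL."""
    n = len(y)
    beta_hat = np.polyfit(x, y, deg=k)    # least squares == ML here
    mu_hat = np.polyval(beta_hat, x)      # fitted means
    tau_hat = np.mean((y - mu_hat) ** 2)  # residual variance
    nll = 0.5 * n * np.log(2 * np.pi * tau_hat) + 0.5 * n
    return beta_hat, tau_hat, nll

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
y = 9.7*x**5 + 0.8*x**3 + 9.4*x**2 - 5.7*x - 2 + rng.normal(size=50)

for k in (2, 5, 10):
    _, tau_hat, nll = fit_polynomial(x, y, k)
    print(k, tau_hat, nll)  # tau_hat (and the NLL) shrink as k grows
```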
Method of Maximum Likelihood (ML), Part 4

'Truth': $\mu(x) = 9.7x^5 + 0.8x^3 + 9.4x^2 - 5.7x - 2$, $\tau = 1$
[Figure: data sampled from the true curve, $y$ versus $x$ on $[-1, 1]$]
[Figure: polynomial fit, $k = 2$, $\hat\tau^{(2)}(y^n) = 4.6919$]
[Figure: polynomial fit, $k = 5$, $\hat\tau^{(5)}(y^n) = 1.1388$]
[Figure: polynomial fit, $k = 10$, $\hat\tau^{(10)}(y^n) = 1.0038$]
[Figure: polynomial fit, $k = 20$, $\hat\tau^{(20)}(y^n) = 0.1612$]
A problem with Maximum Likelihood

It is not difficult to show that
$$\hat\tau^{(0)} > \hat\tau^{(1)} > \hat\tau^{(2)} > \cdots > \hat\tau^{(n-1)}$$
and furthermore that $\hat\tau^{(n-1)} = 0$. From this it is obvious that attempting to estimate $k$ using Maximum Likelihood will fail, i.e. the solution of
$$\hat k = \arg\min_{k \in \{0, \ldots, n-1\}} \left\{ \frac{n}{2}\log\left(2\pi\hat\tau^{(k)}(y^n)\right) + \frac{n}{2} \right\}$$
is simply $\hat k = n - 1$, irrespective of $y^n$.
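A sketch demonstrating this failure mode on a toy sample: the residual variance decreases monotonically in $k$ and reaches (numerical) zero at $k = n - 1$, so minimising the fitted negative log-likelihood always selects the saturated model. The pure-noise data are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
x = np.linspace(-1, 1, n)
y = rng.normal(size=n)  # pure noise: the 'right' order is k = 0

for k in range(n):
    beta = np.polyfit(x, y, deg=k)
    tau = np.mean((y - np.polyval(beta, x)) ** 2)
    print(k, tau)  # decreasing in k; ~0 at k = n - 1 (interpolation)
```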
Some solutions ...

- The minimum encoding approach, pioneered by C. S. Wallace, D. Boulton and J. J. Rissanen
- The minimum discrepancy estimation approach, pioneered by H. Akaike
Kullback-Leibler Divergence

AIC is based on estimating the Kullback-Leibler (KL) divergence:
$$\mathrm{KL}(f \,\|\, g) = \underbrace{-\int_{\mathcal{Y}^n} f(y^n) \log g(y^n)\, dy^n}_{\text{Cross-entropy}} \;+\; \underbrace{\int_{\mathcal{Y}^n} f(y^n) \log f(y^n)\, dy^n}_{\text{Negative entropy}}$$
The cross-entropy, $\Delta(f \,\|\, g)$, is the 'expected negative log-likelihood' of data coming from $f$ under $g$.
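For two univariate normals the KL divergence is available in closed form, which makes the definition concrete. The sketch below (an illustration, not from the slides) checks the closed form against a Monte Carlo estimate of $E_f[\log f(y) - \log g(y)]$:

```python
import numpy as np

def kl_normal(mu_f, tau_f, mu_g, tau_g):
    """KL(f||g) between N(mu_f, tau_f) and N(mu_g, tau_g); tau = variance."""
    return 0.5 * (np.log(tau_g / tau_f)
                  + (tau_f + (mu_f - mu_g) ** 2) / tau_g - 1.0)

# Monte Carlo check: KL(f||g) = E_f[log f(y) - log g(y)]
rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=200_000)   # samples from f = N(0, 1)
log_f = -0.5 * np.log(2 * np.pi) - 0.5 * y ** 2
log_g = -0.5 * np.log(2 * np.pi * 2.0) - (y - 1.0) ** 2 / (2 * 2.0)
print(np.mean(log_f - log_g), kl_normal(0.0, 1.0, 1.0, 2.0))  # both ~0.347
```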
Kullback-Leibler Divergence

[Figure: cross-entropy for polynomial fits of order $k \in \{0, \ldots, 20\}$, plotted against $k$]
Akaike's Information Criterion

Problem: the KL divergence depends on knowing the truth (our $p^*$). Akaike's solution: estimate it!
Akaike's Information Criterion

The AIC score for a model is
$$\mathrm{AIC}(\hat\theta(y^n)) = -\log p(y^n \,|\, \hat\theta(y^n)) + p$$
where $p$ is the number of free model parameters. Using AIC one chooses the model that solves
$$\hat k = \arg\min_{k \in \{0, 1, \ldots\}} \left\{ \mathrm{AIC}(\hat\theta^{(k)}(y^n)) \right\}$$
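Combining this with the polynomial example gives a complete order-selection procedure. A hedged Python sketch under the slides' convention (negative log-likelihood plus $p$, with $p = k + 2$ free parameters for an order-$k$ polynomial):

```python
import numpy as np

def aic_polynomial(x, y, k):
    """AIC score for a k-th order polynomial: minimised NLL + (k + 2).
    The penalty counts beta_0, ..., beta_k plus the variance tau."""
    n = len(y)
    beta = np.polyfit(x, y, deg=k)
    tau = np.mean((y - np.polyval(beta, x)) ** 2)
    return 0.5 * n * np.log(2 * np.pi * tau) + 0.5 * n + (k + 2)

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 100)
y = 9.7*x**5 + 0.8*x**3 + 9.4*x**2 - 5.7*x - 2 + rng.normal(size=100)

scores = {k: aic_polynomial(x, y, k) for k in range(16)}
k_hat = min(scores, key=scores.get)
print(k_hat)  # a low order, unlike the ML choice of k = n - 1
```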
Properties of AIC

Under certain conditions the AIC score satisfies
$$E_{\theta^*}\left[\mathrm{AIC}(\hat\theta(y^n))\right] = E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta(y^n))\right] + o_n(1)$$
where $o_n(1) \to 0$ as $n \to \infty$. In words, the AIC score is an asymptotically unbiased estimate of the cross-entropy risk. This means it is only valid if $n$ is 'large'.
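This unbiasedness claim can be checked by simulation in a simple case. A sketch (assumed setup: fitting a univariate normal, where the cross-entropy $\Delta(\theta^* \,\|\, \hat\theta)$ has a closed form) comparing the average AIC score with the average cross-entropy over many replicate datasets:

```python
import numpy as np

rng = np.random.default_rng(6)
mu_s, tau_s, n = 0.0, 1.0, 200        # true parameters theta* and sample size
aic, delta = [], []
for _ in range(2000):
    y = rng.normal(mu_s, np.sqrt(tau_s), size=n)
    mu_h, tau_h = y.mean(), y.var()   # ML estimates
    nll = 0.5 * n * np.log(2 * np.pi * tau_h) + 0.5 * n
    aic.append(nll + 2)               # p = 2 free parameters
    # closed-form cross-entropy Delta(theta*||theta_hat) for the normal model
    delta.append(0.5 * n * np.log(2 * np.pi * tau_h)
                 + n * (tau_s + (mu_s - mu_h) ** 2) / (2 * tau_h))
print(np.mean(aic), np.mean(delta))   # nearly equal for large n
```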
Properties of AIC

- AIC is good for prediction
- AIC is an asymptotically efficient model selection criterion: as $n \to \infty$, with probability approaching one, the model with the minimum AIC score will also possess the smallest Kullback-Leibler divergence
- It is not necessarily the best choice for induction
Conditions for AIC to apply

AIC is an asymptotic approximation; one should consider whether it applies before using it. For AIC to be valid:
- $n$ must be large compared to $p$
- The true model $\theta^*$ must lie in $\Theta$
- Every $\theta \in \Theta$ must map to a unique distribution $p(\cdot \,|\, \theta)$
- The Maximum Likelihood estimates must be consistent and approximately normally distributed for large $n$
- $L(\theta)$ must be twice differentiable with respect to $\theta$ for all $\theta \in \Theta$
Some models to which AIC can be applied include ...

- Linear regression models, function approximation
- Generalised linear models
- Autoregressive Moving Average models, spectral estimation
- Constant bin-width histogram estimation
- Some forms of hypothesis testing
When not to use AIC

- Multilayer Perceptron Neural Networks: many different $\theta$ map to the same distribution
- Neyman-Scott Problem, Mixture Modelling: the Maximum Likelihood estimates are not consistent
- The Uniform Distribution: $L(\theta)$ is not twice differentiable

The AIC approach may still be applied to these problems, but the derivations need to be different.
Application to polynomials

The AIC criterion for polynomials is
$$\mathrm{AIC}(k) = \frac{n}{2}\log\left(2\pi\hat\tau^{(k)}(y^n)\right) + \frac{n}{2} + (k + 2)$$
[Figure: $\mathrm{AIC}(k)$ plotted against $k \in \{0, \ldots, 20\}$]
Application to polynomials

AIC selects $\hat k = 3$
[Figure: the selected cubic fit overlaid on the data, $y$ versus $x$]
Improvements to AIC

For some model types it is possible to derive improved estimates of the cross-entropy. Under certain conditions, the 'corrected' AIC (AICc) criterion
$$\mathrm{AICc}(\hat\theta(y^n)) = -\log p(y^n \,|\, \hat\theta(y^n)) + \frac{n(p + 1)}{n - p - 2}$$
satisfies
$$E_{\theta^*}\left[\mathrm{AICc}(\hat\theta(y^n))\right] = E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta(y^n))\right]$$
In words, it is an exactly unbiased estimator of the cross-entropy, even for finite $n$.
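A sketch of AICc for the polynomial setting, using the slides' polynomial formula directly (the small sample size is an assumption chosen to make the correction visible), with an AIC comparison:

```python
import numpy as np

def nll_polynomial(x, y, k):
    n = len(y)
    beta = np.polyfit(x, y, deg=k)
    tau = np.mean((y - np.polyval(beta, x)) ** 2)
    return 0.5 * n * np.log(2 * np.pi * tau) + 0.5 * n

def aicc_polynomial(x, y, k):
    """AICc for a k-th order polynomial: NLL + n(k + 2)/(n - k - 3)."""
    n = len(y)
    return nll_polynomial(x, y, k) + n * (k + 2) / (n - k - 3)

rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 25)   # small n, where the correction matters most
y = 9.7*x**5 + 0.8*x**3 + 9.4*x**2 - 5.7*x - 2 + rng.normal(size=25)

k_aic = min(range(10), key=lambda k: nll_polynomial(x, y, k) + (k + 2))
k_aicc = min(range(10), key=lambda k: aicc_polynomial(x, y, k))
print(k_aic, k_aicc)  # AICc tends to pick the smaller order
```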
Application to polynomials

The AICc criterion for polynomials is
$$\mathrm{AICc}(k) = \frac{n}{2}\log\left(2\pi\hat\tau^{(k)}(y^n)\right) + \frac{n}{2} + \frac{n(k + 2)}{n - k - 3}$$
[Figure: AIC and AICc criterion scores plotted against $k \in \{0, \ldots, 20\}$]
Using AICc

- Tends to perform better than AIC, especially when $n/p$ is small
- Theoretically only valid for homoskedastic linear models; these include:
  - Linear regression models, including linear function approximation
  - Autoregressive Moving Average (ARMA) models
  - Linear smoothers (kernel, local regression, etc.)
- Practically, tends to perform well as long as the model class is suitably regular
Some theory

Let $k^*$ be the true number of parameters, and assume that the model space is nested. There are two sources of error/discrepancy in model selection:
- Discrepancy due to approximation: the main source of error when underfitting, i.e. when $\hat k < k^*$
- Discrepancy due to estimation: the source of error when exactly fitting or overfitting, i.e. when $\hat k \geq k^*$
Discrepancy due to Approximation

[Figure: the true curve and the best fitting cubic, $y$ versus $x$]
Discrepancy due to Estimation

[Figure: the true curve with lower and upper confidence intervals for the fitted curve, $y$ versus $x$]
Derivation

The aim is to show that
$$E_{\theta^*}\left[L(y^n \,|\, \hat\theta)\right] + p = E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta)\right] + o_n(1)$$
Note that (under certain conditions)
$$E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta)\right] = \Delta(\theta^* \,\|\, \theta_0) + E_{\theta^*}\left[\frac{1}{2}(\hat\theta - \theta_0)' J(\theta_0)(\hat\theta - \theta_0)\right] + o_n(1)$$
... and
$$\Delta(\theta^* \,\|\, \theta_0) = E_{\theta^*}\left[L(y^n \,|\, \hat\theta)\right] + E_{\theta^*}\left[\frac{1}{2}(\hat\theta - \theta_0)' H(\hat\theta)(\hat\theta - \theta_0)\right] + o_n(1)$$
where
$$J(\theta_0) = \left[\frac{\partial^2 \Delta(\theta^* \,\|\, \theta)}{\partial\theta\, \partial\theta'}\right]_{\theta = \theta_0}, \qquad H(\hat\theta) = \left[\frac{\partial^2 L(y^n \,|\, \theta)}{\partial\theta\, \partial\theta'}\right]_{\theta = \hat\theta}$$
Derivation

Since
$$E_{\theta^*}\left[\frac{1}{2}(\hat\theta - \theta_0)' J(\theta_0)(\hat\theta - \theta_0)\right] = \frac{p}{2} + o_n(1)$$
$$E_{\theta^*}\left[\frac{1}{2}(\hat\theta - \theta_0)' H(\hat\theta)(\hat\theta - \theta_0)\right] = \frac{p}{2} + o_n(1)$$
then, substituting,
$$E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta)\right] = E_{\theta^*}\left[L(y^n \,|\, \hat\theta)\right] + \frac{p}{2} + \frac{p}{2} + o_n(1) = E_{\theta^*}\big[\underbrace{L(y^n \,|\, \hat\theta) + p}_{\mathrm{AIC}(\hat\theta)}\big] + o_n(1)$$
References

S. Kullback and R. A. Leibler, 'On Information and Sufficiency', The Annals of Mathematical Statistics, Vol. 22, No. 1, pp. 79–86, 1951.

H. Akaike, 'A new look at the statistical model identification', IEEE Transactions on Automatic Control, Vol. 19, No. 6, pp. 716–723, 1974.

H. Linhart and W. Zucchini, Model Selection, John Wiley and Sons, 1986.

C. M. Hurvich and C.-L. Tsai, 'Regression and Time Series Model Selection in Small Samples', Biometrika, Vol. 76, pp. 297–307, 1989.

J. E. Cavanaugh, 'Unifying the Derivations for the Akaike and Corrected Akaike Information Criteria', Statistics & Probability Letters, Vol. 33, pp. 201–208, 1997.