Model Selection Tutorial #1: Akaike's Information Criterion
Daniel F. Schmidt and Enes Makalic
Melbourne, November 22, 2008
Content

1. Motivation
2. Estimation
3. AIC
4. Derivation
5. References
Problem

We have observed $n$ data points $y^n = (y_1, \ldots, y_n)$ from some unknown probabilistic source $p^*$, i.e. $y^n \sim p^*$ where $y^n \in \mathcal{Y}^n$. We wish to learn about $p^*$ from $y^n$. More precisely, we would like to discover the generating source $p^*$, or at least a good approximation of it, from nothing but $y^n$.
Statistical Models

To approximate $p^*$ we will restrict ourselves to a set of potential statistical models. Informally, a statistical model can be viewed as a conditional probability distribution over the potential dataspace $\mathcal{Y}^n$:
$$p(y^n \,|\, \theta), \quad \theta \in \Theta$$
where $\theta = (\theta_1, \ldots, \theta_p)$ is a parameter vector that indexes the particular model. Such models satisfy
$$\int_{y^n \in \mathcal{Y}^n} p(y^n \,|\, \theta)\, dy^n = 1$$
for a fixed $\theta$.
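As a quick sanity check of this normalisation property, here is a small Python sketch (illustrative, not part of the original slides) that numerically integrates a univariate normal density over $\mathbb{R}$ for one fixed $\theta$:

```python
import numpy as np
from scipy.integrate import quad

# Univariate normal density with mean mu and variance tau (the n = 1 case)
mu, tau = 1.0, 2.0
density = lambda y: np.exp(-(y - mu) ** 2 / (2 * tau)) / np.sqrt(2 * np.pi * tau)

# Integrates to 1 for any fixed theta = (mu, tau)
total, _ = quad(density, -np.inf, np.inf)
print(total)  # ~1.0
```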
Statistical Models ...

An example would be the univariate normal distribution:
$$p(y^n \,|\, \theta) = \left(\frac{1}{2\pi\tau}\right)^{n/2} \exp\left(-\frac{1}{2\tau} \sum_{i=1}^{n} (y_i - \mu)^2\right)$$
where
- $p = 2$
- $\theta = (\mu, \tau)$ are the parameters
- $\mathcal{Y}^n = \mathbb{R}^n$
- $\Theta = \mathbb{R} \times \mathbb{R}^+$
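To make the model concrete, here is a minimal Python sketch (with made-up data, not from the slides) that evaluates the negative log-likelihood $-\log p(y^n \,|\, \theta)$ of this normal model at two candidate parameter values:

```python
import numpy as np

def normal_nll(y, mu, tau):
    """Negative log-likelihood of y under a univariate normal
    with mean mu and variance tau (the slides' parameterisation)."""
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * tau) + np.sum((y - mu) ** 2) / (2 * tau)

y = np.array([1.2, -0.3, 0.8, 2.1, 0.5])       # made-up data
print(normal_nll(y, mu=0.0, tau=1.0))           # ~8.01
print(normal_nll(y, mu=y.mean(), tau=y.var()))  # ~5.93: a better fit
```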
Parameter Estimation

Given a statistical model and data $y^n$, we would like to take a guess at a plausible value of $\theta$. The guess should be 'good' in some sense. There are many ways to approach this problem; we shall discuss one particularly relevant and important method: Maximum Likelihood.
Method of Maximum Likelihood (ML), Part 1

- A heuristic procedure introduced by R. A. Fisher
- Possesses good properties in many cases
- Is very general and easy to understand

To estimate the parameters $\theta$ of a statistical model from $y^n$, solve
$$\hat\theta(y^n) = \arg\max_{\theta \in \Theta}\, p(y^n \,|\, \theta)$$
or, more conveniently,
$$\hat\theta(y^n) = \arg\min_{\theta \in \Theta}\, \left\{ -\log p(y^n \,|\, \theta) \right\}$$
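A hedged sketch of this recipe in code: minimise the negative log-likelihood numerically. The choice of a normal model, the reparameterisation via $\log \tau$, and the synthetic data are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def nll(theta, y):
    """Negative log-likelihood of a normal model; theta = (mu, log_tau).
    Optimising log(tau) keeps the variance positive."""
    mu, log_tau = theta
    tau = np.exp(log_tau)
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * tau) + np.sum((y - mu) ** 2) / (2 * tau)

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=100)   # synthetic data

result = minimize(nll, x0=np.array([0.0, 0.0]), args=(y,))
mu_hat, tau_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, tau_hat)  # close to the sample mean and (biased) sample variance
```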
Method of Maximum Likelihood (ML), Part 2

Example: estimating the mean parameter $\mu$ of a univariate normal distribution. The negative log-likelihood function is
$$L(\mu, \tau) = \frac{n}{2}\log(2\pi\tau) + \frac{1}{2\tau}\sum_{i=1}^{n}(y_i - \mu)^2$$
Differentiating $L(\cdot)$ with respect to $\mu$ yields
$$\frac{\partial L(\mu, \tau)}{\partial \mu} = \frac{1}{2\tau}\left(2n\mu - 2\sum_{i=1}^{n} y_i\right)$$
Setting this to zero and solving for $\mu$ yields
$$\hat\mu(y^n) = \frac{1}{n}\sum_{i=1}^{n} y_i$$
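A quick numerical check of this closed-form result (with made-up data): over a fine grid of candidate values of $\mu$, the negative log-likelihood is smallest at the sample mean.

```python
import numpy as np

y = np.array([3.1, 2.4, 4.0, 2.8, 3.6])  # made-up data
tau = 1.0  # any fixed variance; the minimiser in mu does not depend on it

def nll_mu(mu):
    return 0.5 * len(y) * np.log(2 * np.pi * tau) + np.sum((y - mu) ** 2) / (2 * tau)

grid = np.linspace(0, 6, 6001)
best = grid[np.argmin([nll_mu(m) for m in grid])]
print(best, y.mean())  # both 3.18
```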
Univariate Polynomial Regression

A more complex model: $k$-th order polynomial regression. Let each $y(x)$ be distributed as per a univariate normal with variance $\tau$ and mean
$$\mu(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$$
The parameters of this model are $\theta^{(k)} = (\tau, \beta_0, \ldots, \beta_k)$. In this model the data $y^n$ is associated with covariates $x^n$, which are known. Given an order $k$, maximum likelihood can be used to estimate $\theta^{(k)}$. But it cannot be used to provide a suitable estimate of the order $k$!
Univariate Polynomial Regression

If we let
$$\hat\mu^{(k)}(x) = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2 + \cdots + \hat\beta_k x^k$$
then Maximum Likelihood chooses $\hat\beta^{(k)}(y^n)$ to minimise
$$\hat\tau^{(k)}(y^n) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat\mu^{(k)}(x_i)\right)^2$$
This is called the residual variance. The negative log-likelihood $L(y^n \,|\, \hat\theta^{(k)}(y^n))$ obtained by plugging in the Maximum Likelihood estimates is
$$L(y^n \,|\, \hat\theta^{(k)}(y^n)) = \frac{n}{2}\log\left(2\pi\hat\tau^{(k)}(y^n)\right) + \frac{n}{2}$$
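In code, the ML fit for this model reduces to ordinary least squares. A sketch (using the slides' true curve below, but otherwise assumed settings) that computes $\hat\tau^{(k)}$ and the minimised negative log-likelihood:

```python
import numpy as np

def fit_polynomial(x, y, k):
    """ML fit of a k-th order polynomial with normal noise: returns the
    coefficient estimates, residual variance and minimised NLL."""
    n = len(y)
    beta_hat = np.polyfit(x, y, deg=k)    # least squares == ML here
    mu_hat = np.polyval(beta_hat, x)      # fitted means
    tau_hat = np.mean((y - mu_hat) ** 2)  # residual variance
    nll = 0.5 * n * np.log(2 * np.pi * tau_hat) + 0.5 * n
    return beta_hat, tau_hat, nll

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
y = 9.7*x**5 + 0.8*x**3 + 9.4*x**2 - 5.7*x - 2 + rng.normal(size=50)

for k in (2, 5, 10):
    _, tau_hat, nll = fit_polynomial(x, y, k)
    print(k, tau_hat, nll)  # tau_hat (and the NLL) shrink as k grows
```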
Method of Maximum Likelihood (ML), Part 4

'Truth': $\mu(x) = 9.7x^5 + 0.8x^3 + 9.4x^2 - 5.7x - 2$, $\tau = 1$
[Figure: data sampled from the true curve, $y$ versus $x$ on $[-1, 1]$]
[Figure: polynomial fit, $k = 2$, $\hat\tau^{(2)}(y^n) = 4.6919$]
[Figure: polynomial fit, $k = 5$, $\hat\tau^{(5)}(y^n) = 1.1388$]
[Figure: polynomial fit, $k = 10$, $\hat\tau^{(10)}(y^n) = 1.0038$]
[Figure: polynomial fit, $k = 20$, $\hat\tau^{(20)}(y^n) = 0.1612$]
A problem with Maximum Likelihood

It is not difficult to show that
$$\hat\tau^{(0)} > \hat\tau^{(1)} > \hat\tau^{(2)} > \cdots > \hat\tau^{(n-1)}$$
and furthermore that $\hat\tau^{(n-1)} = 0$. From this it is obvious that attempting to estimate $k$ using Maximum Likelihood will fail, i.e. the solution of
$$\hat k = \arg\min_{k \in \{0, \ldots, n-1\}} \left\{ \frac{n}{2}\log\left(2\pi\hat\tau^{(k)}(y^n)\right) + \frac{n}{2} \right\}$$
is simply $\hat k = n - 1$, irrespective of $y^n$.
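A sketch demonstrating this failure mode on a toy sample: the residual variance decreases monotonically in $k$ and reaches (numerical) zero at $k = n - 1$, so minimising the fitted negative log-likelihood always selects the saturated model. The pure-noise data are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
x = np.linspace(-1, 1, n)
y = rng.normal(size=n)  # pure noise: the 'right' order is k = 0

for k in range(n):
    beta = np.polyfit(x, y, deg=k)
    tau = np.mean((y - np.polyval(beta, x)) ** 2)
    print(k, tau)  # decreasing in k; ~0 at k = n - 1 (interpolation)
```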
Some solutions ...

- The minimum encoding approach, pioneered by C. S. Wallace, D. Boulton and J. J. Rissanen
- The minimum discrepancy estimation approach, pioneered by H. Akaike
Kullback-Leibler Divergence

AIC is based on estimating the Kullback-Leibler (KL) divergence:
$$\mathrm{KL}(f \,\|\, g) = \underbrace{-\int_{\mathcal{Y}^n} f(y^n) \log g(y^n)\, dy^n}_{\text{Cross-entropy}} \;+\; \underbrace{\int_{\mathcal{Y}^n} f(y^n) \log f(y^n)\, dy^n}_{\text{Negative entropy}}$$
The cross-entropy, $\Delta(f \,\|\, g)$, is the 'expected negative log-likelihood' of data coming from $f$ under $g$.
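For two univariate normals the KL divergence is available in closed form, which makes the definition concrete. The sketch below (an illustration, not from the slides) checks the closed form against a Monte Carlo estimate of $E_f[\log f(y) - \log g(y)]$:

```python
import numpy as np

def kl_normal(mu_f, tau_f, mu_g, tau_g):
    """KL(f||g) between N(mu_f, tau_f) and N(mu_g, tau_g); tau = variance."""
    return 0.5 * (np.log(tau_g / tau_f)
                  + (tau_f + (mu_f - mu_g) ** 2) / tau_g - 1.0)

# Monte Carlo check: KL(f||g) = E_f[log f(y) - log g(y)]
rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=200_000)   # samples from f = N(0, 1)
log_f = -0.5 * np.log(2 * np.pi) - 0.5 * y ** 2
log_g = -0.5 * np.log(2 * np.pi * 2.0) - (y - 1.0) ** 2 / (2 * 2.0)
print(np.mean(log_f - log_g), kl_normal(0.0, 1.0, 1.0, 2.0))  # both ~0.347
```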
Kullback-Leibler Divergence

[Figure: cross-entropy for polynomial fits of order $k \in \{0, \ldots, 20\}$, plotted against $k$]
Akaike's Information Criterion

Problem: the KL divergence depends on knowing the truth (our $p^*$). Akaike's solution: estimate it!
Akaike's Information Criterion

The AIC score for a model is
$$\mathrm{AIC}(\hat\theta(y^n)) = -\log p(y^n \,|\, \hat\theta(y^n)) + p$$
where $p$ is the number of free model parameters. Using AIC one chooses the model that solves
$$\hat k = \arg\min_{k \in \{0, 1, \ldots\}} \left\{ \mathrm{AIC}(\hat\theta^{(k)}(y^n)) \right\}$$
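Combining this with the polynomial example gives a complete order-selection procedure. A hedged Python sketch under the slides' convention (negative log-likelihood plus $p$, with $p = k + 2$ free parameters for an order-$k$ polynomial):

```python
import numpy as np

def aic_polynomial(x, y, k):
    """AIC score for a k-th order polynomial: minimised NLL + (k + 2).
    The penalty counts beta_0, ..., beta_k plus the variance tau."""
    n = len(y)
    beta = np.polyfit(x, y, deg=k)
    tau = np.mean((y - np.polyval(beta, x)) ** 2)
    return 0.5 * n * np.log(2 * np.pi * tau) + 0.5 * n + (k + 2)

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 100)
y = 9.7*x**5 + 0.8*x**3 + 9.4*x**2 - 5.7*x - 2 + rng.normal(size=100)

scores = {k: aic_polynomial(x, y, k) for k in range(16)}
k_hat = min(scores, key=scores.get)
print(k_hat)  # a low order, unlike the ML choice of k = n - 1
```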
Properties of AIC

Under certain conditions the AIC score satisfies
$$E_{\theta^*}\left[\mathrm{AIC}(\hat\theta(y^n))\right] = E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta(y^n))\right] + o_n(1)$$
where $o_n(1) \to 0$ as $n \to \infty$. In words, the AIC score is an asymptotically unbiased estimate of the cross-entropy risk. This means it is only valid if $n$ is 'large'.
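This unbiasedness claim can be checked by simulation in a simple case. A sketch (assumed setup: fitting a univariate normal, where the cross-entropy $\Delta(\theta^* \,\|\, \hat\theta)$ has a closed form) comparing the average AIC score with the average cross-entropy over many replicate datasets:

```python
import numpy as np

rng = np.random.default_rng(6)
mu_s, tau_s, n = 0.0, 1.0, 200        # true parameters theta* and sample size
aic, delta = [], []
for _ in range(2000):
    y = rng.normal(mu_s, np.sqrt(tau_s), size=n)
    mu_h, tau_h = y.mean(), y.var()   # ML estimates
    nll = 0.5 * n * np.log(2 * np.pi * tau_h) + 0.5 * n
    aic.append(nll + 2)               # p = 2 free parameters
    # closed-form cross-entropy Delta(theta*||theta_hat) for the normal model
    delta.append(0.5 * n * np.log(2 * np.pi * tau_h)
                 + n * (tau_s + (mu_s - mu_h) ** 2) / (2 * tau_h))
print(np.mean(aic), np.mean(delta))   # nearly equal for large n
```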
Properties of AIC

- AIC is good for prediction
- AIC is an asymptotically efficient model selection criterion: as $n \to \infty$, with probability approaching one, the model with the minimum AIC score will also possess the smallest Kullback-Leibler divergence
- It is not necessarily the best choice for induction
Conditions for AIC to apply

AIC is an asymptotic approximation; one should consider whether it applies before using it. For AIC to be valid:
- $n$ must be large compared to $p$
- The true model $\theta^*$ must lie in $\Theta$
- Every $\theta \in \Theta$ must map to a unique distribution $p(\cdot \,|\, \theta)$
- The Maximum Likelihood estimates must be consistent and approximately normally distributed for large $n$
- $L(\theta)$ must be twice differentiable with respect to $\theta$ for all $\theta \in \Theta$
Some models to which AIC can be applied include ...

- Linear regression models, function approximation
- Generalised linear models
- Autoregressive Moving Average models, spectral estimation
- Constant bin-width histogram estimation
- Some forms of hypothesis testing
When not to use AIC

- Multilayer Perceptron Neural Networks: many different $\theta$ map to the same distribution
- Neyman-Scott Problem, Mixture Modelling: the Maximum Likelihood estimates are not consistent
- The Uniform Distribution: $L(\theta)$ is not twice differentiable

The AIC approach may still be applied to these problems, but the derivations need to be different.
Application to polynomials

The AIC criterion for polynomials is
$$\mathrm{AIC}(k) = \frac{n}{2}\log\left(2\pi\hat\tau^{(k)}(y^n)\right) + \frac{n}{2} + (k + 2)$$
[Figure: $\mathrm{AIC}(k)$ plotted against $k \in \{0, \ldots, 20\}$]
Application to polynomials

AIC selects $\hat k = 3$
[Figure: the selected cubic fit overlaid on the data, $y$ versus $x$]
Improvements to AIC

For some model types it is possible to derive improved estimates of the cross-entropy. Under certain conditions, the 'corrected' AIC (AICc) criterion
$$\mathrm{AICc}(\hat\theta(y^n)) = -\log p(y^n \,|\, \hat\theta(y^n)) + \frac{n(p + 1)}{n - p - 2}$$
satisfies
$$E_{\theta^*}\left[\mathrm{AICc}(\hat\theta(y^n))\right] = E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta(y^n))\right]$$
In words, it is an exactly unbiased estimator of the cross-entropy, even for finite $n$.
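A sketch of AICc for the polynomial setting, using the slides' polynomial formula directly (the small sample size is an assumption chosen to make the correction visible), with an AIC comparison:

```python
import numpy as np

def nll_polynomial(x, y, k):
    n = len(y)
    beta = np.polyfit(x, y, deg=k)
    tau = np.mean((y - np.polyval(beta, x)) ** 2)
    return 0.5 * n * np.log(2 * np.pi * tau) + 0.5 * n

def aicc_polynomial(x, y, k):
    """AICc for a k-th order polynomial: NLL + n(k + 2)/(n - k - 3)."""
    n = len(y)
    return nll_polynomial(x, y, k) + n * (k + 2) / (n - k - 3)

rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 25)   # small n, where the correction matters most
y = 9.7*x**5 + 0.8*x**3 + 9.4*x**2 - 5.7*x - 2 + rng.normal(size=25)

k_aic = min(range(10), key=lambda k: nll_polynomial(x, y, k) + (k + 2))
k_aicc = min(range(10), key=lambda k: aicc_polynomial(x, y, k))
print(k_aic, k_aicc)  # AICc tends to pick the smaller order
```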
Application to polynomials

The AICc criterion for polynomials is
$$\mathrm{AICc}(k) = \frac{n}{2}\log\left(2\pi\hat\tau^{(k)}(y^n)\right) + \frac{n}{2} + \frac{n(k + 2)}{n - k - 3}$$
[Figure: AIC and AICc criterion scores plotted against $k \in \{0, \ldots, 20\}$]
Using AICc

- Tends to perform better than AIC, especially when $n/p$ is small
- Theoretically only valid for homoskedastic linear models; these include:
  - Linear regression models, including linear function approximation
  - Autoregressive Moving Average (ARMA) models
  - Linear smoothers (kernel, local regression, etc.)
- Practically, tends to perform well as long as the model class is suitably regular
Some theory

Let $k^*$ be the true number of parameters, and assume that the model space is nested. There are two sources of error/discrepancy in model selection:
- Discrepancy due to approximation: the main source of error when underfitting, i.e. when $\hat k < k^*$
- Discrepancy due to estimation: the source of error when exactly fitting or overfitting, i.e. when $\hat k \geq k^*$
Discrepancy due to Approximation

[Figure: the true curve and the best fitting cubic, $y$ versus $x$]
Discrepancy due to Estimation

[Figure: the true curve with lower and upper confidence intervals for the fitted curve, $y$ versus $x$]
Derivation

The aim is to show that
$$E_{\theta^*}\left[L(y^n \,|\, \hat\theta)\right] + p = E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta)\right] + o_n(1)$$
Note that (under certain conditions)
$$E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta)\right] = \Delta(\theta^* \,\|\, \theta_0) + E_{\theta^*}\left[\frac{1}{2}(\hat\theta - \theta_0)' J(\theta_0)(\hat\theta - \theta_0)\right] + o_n(1)$$
... and
$$\Delta(\theta^* \,\|\, \theta_0) = E_{\theta^*}\left[L(y^n \,|\, \hat\theta)\right] + E_{\theta^*}\left[\frac{1}{2}(\hat\theta - \theta_0)' H(\hat\theta)(\hat\theta - \theta_0)\right] + o_n(1)$$
where
$$J(\theta_0) = \left[\frac{\partial^2 \Delta(\theta^* \,\|\, \theta)}{\partial\theta\, \partial\theta'}\right]_{\theta = \theta_0}, \qquad H(\hat\theta) = \left[\frac{\partial^2 L(y^n \,|\, \theta)}{\partial\theta\, \partial\theta'}\right]_{\theta = \hat\theta}$$
Derivation

Since
$$E_{\theta^*}\left[\frac{1}{2}(\hat\theta - \theta_0)' J(\theta_0)(\hat\theta - \theta_0)\right] = \frac{p}{2} + o_n(1)$$
$$E_{\theta^*}\left[\frac{1}{2}(\hat\theta - \theta_0)' H(\hat\theta)(\hat\theta - \theta_0)\right] = \frac{p}{2} + o_n(1)$$
then, substituting,
$$E_{\theta^*}\left[\Delta(\theta^* \,\|\, \hat\theta)\right] = E_{\theta^*}\left[L(y^n \,|\, \hat\theta)\right] + \frac{p}{2} + \frac{p}{2} + o_n(1) = E_{\theta^*}\big[\underbrace{L(y^n \,|\, \hat\theta) + p}_{\mathrm{AIC}(\hat\theta)}\big] + o_n(1)$$
References

S. Kullback and R. A. Leibler, 'On Information and Sufficiency', The Annals of Mathematical Statistics, Vol. 22, No. 1, pp. 79–86, 1951.

H. Akaike, 'A new look at the statistical model identification', IEEE Transactions on Automatic Control, Vol. 19, No. 6, pp. 716–723, 1974.

H. Linhart and W. Zucchini, Model Selection, John Wiley and Sons, 1986.

C. M. Hurvich and C.-L. Tsai, 'Regression and Time Series Model Selection in Small Samples', Biometrika, Vol. 76, pp. 297–307, 1989.

J. E. Cavanaugh, 'Unifying the Derivations for the Akaike and Corrected Akaike Information Criteria', Statistics & Probability Letters, Vol. 33, pp. 201–208, 1997.