Order Selection of Autoregressive Processes using Bridge Criterion
Jie Ding, Mohammad Noshad and Vahid Tarokh
Harvard University, Cambridge, Massachusetts 02138, U.S.A. Email:
[email protected],
[email protected],
[email protected]
Abstract—A new criterion is introduced for determining the order of an autoregressive model fitted to time series data. The proposed technique is shown to yield consistent and asymptotically efficient order estimates. It has the benefits of the two well-known model selection techniques, the Akaike information criterion and the Bayesian information criterion. When the true order of the autoregression is relatively large compared with the sample size, the Akaike information criterion is known to be efficient, and the new criterion behaves in a similar manner. When the true order is finite and small compared with the sample size, the Bayesian information criterion is known to be consistent, and so is the new criterion. Thus the new criterion automatically builds a bridge between the two classical criteria. In practice, where the observed time series is given without any prior information about the autoregression, the proposed order selection criterion is more flexible and robust than the classical approaches. Numerical results are presented demonstrating the robustness of the proposed technique when applied to various datasets.
I. INTRODUCTION

Autoregressive models for univariate time series are Markov processes with dependence on past observations. Linear Gaussian models represent an important class of such models for time series analysis. Given an observed time series, an important practical issue is to select the number of lags of the autoregressive model so that there is no overfitting while maintaining a low modeling error. Lag order selection, as a model selection problem, can be addressed using information criterion methods. There are many such methods, each with its own advantages and shortcomings. Akaike [1], [2] proposed the final prediction error (FPE) criterion, which aims to minimize the one-step prediction error when the estimates are applied to another independently generated dataset. The FPE criterion is a forerunner of the well-known Akaike information criterion (AIC) [3], also proposed by Akaike. The Bayesian information criterion (BIC) [4] is another popular method that is similar to AIC. Both AIC and BIC have a penalty term that is a linear function of the number of model parameters. Some variants of AIC have also been considered, for example the modified AIC in which the constant 2 is replaced by a constant α [5]. For the α-AIC, Bhansali and Downham [6] proved that the asymptotic probability of choosing the correct lag (if it exists) increases as the constant α increases. Akaike [7] argued in a Bayesian setting that α = 2 is the best choice in practical situations. Hurvich and Tsai [8] proposed the corrected AIC for the small-sample case. There are many other approaches proposed for AR order selection. One simple way is to sequentially test when the partial autocorrelation of the time series becomes zero; this approach is elaborated in Section II. Newey and West [9] proposed a method of lag selection for computing heteroskedasticity- and autocorrelation-consistent covariance matrix estimates, which is shown to be more accurate if applied after prewhitening of the data; however, the method suffers from size distortions in its estimate. Morales [10] proposed a criterion derived from minimizing the mean absolute error between the true and estimated spectral density at frequency zero. Nevertheless, it seems that the most popular approaches still use AIC and BIC. In fact, BIC is shown to be consistent. In addition, Hannan and Quinn [11] proposed a criterion that replaces the log N term in BIC by c log log N (c > 1) in order to retain consistency with the smallest possible penalty. AIC, in contrast, is not consistent as the sample size tends to infinity, and it tends to overfit compared with BIC. However, AIC is shown to be asymptotically efficient in the case of infinite order [12]; in other words, AIC produces less modeling error than BIC when the data is not generated from a finite-order AR process.

In this paper, a new order selection criterion, referred to as the bridge criterion, for autoregressive models is introduced. The proposed technique is initially motivated by postulating that the underlying autoregressive filter is randomly drawn according to a non-informative uniform prior distribution and by fixing the type I error in a sequence of hypothesis tests. Although the technique is motivated by assuming a prior distribution on autoregressive filters, it can be applied to any scenario where the prior is unknown or there is no probabilistic structure on the filters at all. In fact, the new criterion turns out to be of a form that combines the merits of AIC and BIC: it is consistent for finite autoregression order, while asymptotically efficient for infinite order. Furthermore, in contrast to AIC and BIC, it does not need prior knowledge as to whether the underlying data is generated from a finite-order autoregression or not, nor does it need the candidate orders to be specified. In other words, it can automatically identify the appropriate order of autoregression. Since the new criterion has the advantages of both AIC and BIC, it is more flexible and adaptive to practical datasets where no prior knowledge is available.

The outline of this paper is as follows. In Sec. II, we formulate the problem and briefly review the background. The new order selection criterion is introduced in Sec. III. In Sec. IV, we establish the consistency of the proposed criterion, and provide analytic bounds for the error of the estimates
given finite sample size. The asymptotic efficiency property is discussed in Sec. V. Numerical results are given in Sec. VI comparing the performance of our approach to those of other techniques. Finally, we make our conclusions in Sec. VII.

II. BACKGROUND
A. Notations

Given two functions f(N, ...) and g(N, ...) where N ∈ N, if lim_{N→∞} f(N)/g(N) = 0, we write f = o_N(g). If there exists a constant c ≠ 0 such that c < lim_{N→∞} f(N)/g(N) < 1/c, we write f = O_N(g). Let f ≤ O_N(g) denote either f = O_N(g) or f = o_N(g). Let ⌊x⌋ denote the largest integer less than or equal to x. Let N(μ, σ²), B(a, b), and χ²_1 respectively denote the normal distribution with density function f(x) = exp{−(x − μ)²/(2σ²)}/(√(2π) σ), the beta distribution with density function f(x) = x^{a−1}(1 − x)^{b−1}/B(a, b) (where B(·,·) is the beta function), and the chi-square distribution with one degree of freedom.

B. Formulation

We assume that the data {x_n}_{n=1}^N is generated by an AR filter ψ_{L_0} = [ψ_{L_0,1}, ..., ψ_{L_0,L_0}]^T of length L_0, i.e.,

  x_n + ψ_{L_0,1} x_{n−1} + ··· + ψ_{L_0,L_0} x_{n−L_0} = ε_n,   (1)

where ψ_i ∈ R for i = 1, 2, ..., L_0, and ε_n is i.i.d. Gaussian noise with zero mean and variance σ², i.e., ε_n ∼ N(0, σ²). The sample autocovariances are defined by

  γˆ_{−i} = γˆ_i = (1/N) Σ_{n=L_max+1}^{N} x_n x_{n−i},   0 ≤ i ≤ L_max.   (2)

The sample autocovariance vector and matrix are respectively defined by γˆ_L = [γˆ_1, ..., γˆ_L]^T and Γˆ_L = [γˆ_{i−j}]_{i,j=1}^{L}. The AR filter ψˆ_L can be estimated by the Yule-Walker equations, which yield consistent estimates [13]. Using these equations, the estimated filter for an AR model of order L is given by ψˆ_L = −Γˆ_L^{−1} γˆ_L, and the one-step prediction error eˆ_L is

  eˆ_L = (1/N) Σ_{n=L_max+1}^{N} (x_n + ψˆ_{L,1} x_{n−1} + ··· + ψˆ_{L,L} x_{n−L})².

The filter coefficients of the optimal AR(L) model can be estimated efficiently using the Levinson-Durbin recursion [14]: first fit the AR(1) model by ψˆ_{1,1} = −γˆ_1/γˆ_0, eˆ_1 = γˆ_0(1 − ψˆ_{1,1}²), and then recursively calculate the AR(k) model parameters for k = 2, 3, ..., L using

  ψˆ_{k,k} = −(γˆ_k + ψˆ_{k−1,1} γˆ_{k−1} + ··· + ψˆ_{k−1,k−1} γˆ_1) / (γˆ_0 + ψˆ_{k−1,1} γˆ_1 + ··· + ψˆ_{k−1,k−1} γˆ_{k−1}),   (3)
  ψˆ_{k,ℓ} = ψˆ_{k−1,ℓ} + ψˆ_{k,k} ψˆ_{k−1,k−ℓ},   ℓ = 1, ..., k − 1.   (4)

The error term of the AR(k) model can be calculated as

  eˆ_k = eˆ_{k−1}(1 − ψˆ_{k,k}²).   (5)

We note that ψˆ_{k,k} is also referred to as the partial autocorrelation of lag k. Let γ_i = E{x_n x_{n−i}} be the autocovariances and ψ_L = [ψ_{L,1}, ..., ψ_{L,L}]^T denote a linear predictor of length L. The best linear predictor of length L is ψ*_L = arg min_{ψ_L} E{(x_n + ψ_{L,1} x_{n−1} + ··· + ψ_{L,L} x_{n−L})²}, and the corresponding one-step prediction error is

  e_L = E{(x_n + ψ*_{L,1} x_{n−1} + ··· + ψ*_{L,L} x_{n−L})²},   (6)

which can be calculated from a set of equations similar to (2), (4), and (5), by removing the hats (ˆ) from all parameters. For convenience, we also define eˆ_0 = (1/N) Σ_{n=L_max+1}^{N} x_n² and e_0 = E(x_n²).

Given an observed time series, it is important to identify the unknown order of the AR process. Throughout the paper, we assume that {1, ..., L_max} is the candidate set of orders that contains the “correct” order L_0. The AIC and BIC criteria for AR order selection select the Lˆ (1 ≤ Lˆ ≤ L_max) that minimizes

  AIC(N, L) = log eˆ_L + (2/N) L   and   BIC(N, L) = log eˆ_L + (log(N)/N) L,

respectively.

C. Motivation

Assume that we generate a time series to simulate an AR(L_0) process using (1). Clearly,

  eˆ_1 ≥ ··· ≥ eˆ_{L_0−1} ≥ eˆ_{L_0} ≥ eˆ_{L_0+1} ≥ ··· ≥ eˆ_{L_max}.

Because of Equation (5) and the consistency of ψˆ_L, the decrease in eˆ_L is generally large for L ≤ L_0 and close to 0 for L > L_0. If we plot eˆ_L versus L for L = 1, 2, ..., L_max, the curve is usually strictly decreasing for L < L_0 and becomes almost flat for L > L_0. Intuitively, the order Lˆ is selected such that the gain eˆ_{L−1}/eˆ_L for L > Lˆ becomes “less significant” than its preceding ones. We define the empirical and theoretical gain of using AR(L) over AR(L − 1) respectively as

  gˆ_L = log(eˆ_{L−1}/eˆ_L) = log(1/(1 − ψˆ_{L,L}²)),   (7)

and

  g_L = log(e_{L−1}/e_L) = log(1/(1 − ψ_{L,L}²)),   (8)
where the log is used primarily to cancel out the influence of the noise variance σ².

Lemma 1. ([15], Theorems 5.6.2 and 5.6.3) Assume that the data is generated by a stable filter ψ_{L_0} of length L_0. For any positive integer L greater than L_0, √N [ψˆ_{L_0+1,L_0+1}, ..., ψˆ_{L,L}]^T has a limiting joint-normal distribution N(0, I) as N tends to infinity, where I denotes the identity matrix.
From the lemma, N gˆ_L = −N log(1 − ψˆ_{L,L}²) = N ψˆ_{L,L}² + o(1), for L = L_0 + 1, ..., L_max, are asymptotically independent and each distributed according to χ²_1, the chi-square distribution with one degree of freedom. Furthermore, the following sequence of hypothesis tests is applicable. For L = 1, 2, ... (L ≤ L_max), test

  H_0: order is L − 1   versus   H_1: order is L.

We choose a small positive number q as the significance level and thresholds s_L such that q = P(Z/N > s_L), where Z ∼ χ²_1. If gˆ_L > s_L, reject H_0, replace L − 1 by L, and continue until H_0 is not rejected.
A problem with the sequential hypothesis testing technique is that it may get stuck at a local point. An alternative would be to find the global minimum of the goodness of fit eˆ_L plus a penalty term that is a sum of thresholds s_k:

  Lˆ = arg min_L Σ_{k=1}^{L} (−gˆ_k + s_k) = log eˆ_L + Σ_{k=1}^{L} s_k,   (9)
where the equal sign is up to a shift of a constant. AIC has a penalty term 2L/N. For the case when N is large, it corresponds to the forward procedure described in (9) with q = 0.1573. BIC has a penalty term L log(N)/N; it corresponds to the forward procedure with a varying q. For example, the confidence level q of BIC w.r.t. different sample sizes N is tabulated in Table I.

TABLE I
CONFIDENCE LEVEL q OF BIC W.R.T. DIFFERENT SAMPLE SIZE N

N    100      500      1000     2000     10000
q    0.0319   0.0127   0.0086   0.0058   0.0024
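As a quick numerical check of this correspondence (our own illustration, not part of the paper), the implied level q is the χ²_1 tail probability of N times the per-parameter penalty increment; a few lines of Python reproduce the AIC value q = 0.1573 and the entries of Table I:

```python
from math import log
from scipy.stats import chi2

# AIC adds 2/N per parameter, so q = P(chi2_1 > N * (2/N)) = P(chi2_1 > 2).
print(round(chi2.sf(2.0, df=1), 4))            # 0.1573

# BIC adds log(N)/N per parameter, so q = P(chi2_1 > log N); this reproduces Table I.
for N in (100, 500, 1000, 2000, 10000):
    print(N, round(chi2.sf(log(N), df=1), 4))  # 0.0319, 0.0127, 0.0086, 0.0058, 0.0024
```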
Now we assume that the data is generated from an AR process whose coefficients are drawn from a certain distribution, e.g., a uniform distribution if no prior is given. We can choose a small positive number p as the confidence level, and the associated thresholds t_L such that

  p = P(g_L < t_L).   (10)
Similarly, we can run the hypothesis tests in the opposite direction (for a given L_max):

  H_0: order is L   versus   H_1: order is L − 1,

where L = L_max, L_max − 1, ..., 1, and the hypotheses are tested until H_0 is not rejected. Likewise, the L that minimizes the following function is chosen as the optimal AR order:

  Lˆ = arg min_L Σ_{k=L+1}^{L_max} (gˆ_k − t_k) = log eˆ_L − Σ_{k=L+1}^{L_max} t_k = log eˆ_L + Σ_{k=1}^{L} t_k,   (11)
where the equal signs are up to a shift of constants that depend only on L_max. The next subsection introduces the uniform distribution over stable AR filters, which, together with (11), motivates the criterion proposed in Section III.
D. Uniform Distribution of Stable AR Filters

Consider the space of all the stable AR filters whose roots are bounded by a given positive radius r ∈ (0, 1]:

  S_L(r) = { [ψ_{L,1}, ..., ψ_{L,L}]^T : z^L + Σ_{ℓ=1}^{L} ψ_{L,ℓ} z^{L−ℓ} = Π_{ℓ=1}^{L} (z − a_ℓ), ψ_{L,ℓ} ∈ R, |a_ℓ| ≤ r, ℓ = 1, ..., L }.
A technique was proposed in [16] to generate filters ψ_L = [ψ_{L,1}, ..., ψ_{L,L}]^T that are uniformly distributed in S_L(r), denoted by ψ_L ∼ U(S_L(r)). Considering the case r = 1, the procedure can be written as:
1. Randomly draw ψ_{1,1} from the uniform distribution on [−1, 1]. Then, for k = 2, ..., L, do:
2. Randomly draw β_k from the beta distribution β_k ∼ Beta(⌊k/2 + 1⌋, ⌊(k + 1)/2⌋);
3. Let ψ_{k,k} = 2β_k − 1 and

  ψ_{k,ℓ} = ψ_{k−1,ℓ} + ψ_{k,k} ψ_{k−1,k−ℓ},   ℓ = 1, ..., k − 1.   (12)
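The following Python sketch illustrates the sampling procedure above (the function name and the NumPy-based implementation are ours, not code from [16]):

```python
import numpy as np

def sample_uniform_stable_ar(L, rng=None):
    """Draw psi_L = [psi_{L,1}, ..., psi_{L,L}] uniformly over S_L(1),
    following steps 1-3 above (a sketch, not the authors' code)."""
    rng = np.random.default_rng() if rng is None else rng
    psi = np.array([rng.uniform(-1.0, 1.0)])          # step 1: psi_{1,1} ~ U[-1, 1]
    for k in range(2, L + 1):
        beta_k = rng.beta(k // 2 + 1, (k + 1) // 2)   # step 2: Beta(floor(k/2)+1, floor((k+1)/2))
        psi_kk = 2.0 * beta_k - 1.0                   # step 3: psi_{k,k} = 2 beta_k - 1
        # psi_{k,l} = psi_{k-1,l} + psi_{k,k} * psi_{k-1,k-l}, l = 1,...,k-1  (Equation (12))
        psi = np.concatenate([psi + psi_kk * psi[::-1], [psi_kk]])
    return psi
```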
The update in step 3 is the same as that in Equation (4), except in the way that ψ_{k,k} is obtained. From the above procedure, the generation of a uniformly distributed random filter of length L reduces to the generation of L independent but not identically distributed random variables. In fact, a stable filter can be uniquely identified by its partial autocorrelations ψ_{k,k} [17]. The following results are straightforward.

Theorem 1. Assume that ψ_{L_0} ∼ U(S_{L_0}(1)). For any positive integer L < L_0, log e_L can be represented as

  log e_L = −Σ_{k=L+1}^{L_0} log(1 − ψ_{k,k}²),

where (ψ_{k,k} + 1)/2 ∼ Beta(⌊k/2 + 1⌋, ⌊(k + 1)/2⌋). Besides this, the probability density functions of g_L and exp(g_L) for any positive integer L ≤ L_0 can be computed as

  p_{g_L}(x) ∝ (exp(x) − 1)^{−1/2} (exp(x))^{−(⌊(L+1)/2⌋ − 1/2)} 1_{x>0},
  p_{exp(g_L)}(x) ∝ (x − 1)^{−1/2} x^{−(⌊(L+1)/2⌋ + 1/2)} 1_{x>1},

where 1_{(·)} denotes the indicator function. Moreover, L ψ_{L,L}² and L g_L converge in distribution to χ²_1 as L tends to infinity.
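The limiting χ²_1 behaviour at the end of Theorem 1 can be checked numerically with the sampler sketched above (again our own illustration; the lag L = 50 and the seed are arbitrary choices):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
L = 50                                     # a moderately large lag
psi_LL = np.array([sample_uniform_stable_ar(L, rng)[-1] for _ in range(20000)])
stat = L * psi_LL ** 2                     # L * psi_{L,L}^2 should be roughly chi2_1
for p in (0.5, 0.9, 0.99):
    print(p, round(np.quantile(stat, p), 3), round(chi2.ppf(p, df=1), 3))
```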
III. PROPOSED ORDER SELECTION CRITERION

In the rest of the paper, we let the largest candidate order grow with the sample size N, and use the notation L^{(N)}_max instead of L_max to emphasize this dependency. Building on the idea of (11), we adopt the penalty term Σ_{k=1}^{L} t_k(p), where t_k(p) is defined in (10) and p is determined by

  t_{L^{(N)}_max}(p) = 2/N.   (13)

Theorem 1 implies that t_k ≈ F^{−1}(p)/k for large k, where F^{−1}(·) denotes the inverse of the cumulative distribution function of the chi-square distribution with one degree of freedom. From (13) we have

  F^{−1}(p) = 2 L^{(N)}_max / N,   (14)
Fig. 1. Illustration of the penalty curve (penalty versus L) for different sample sizes N = 1000, 10000, 100000.

Fig. 2. Comparison of the penalty curves of BC with AIC, HQ, and BIC for N = 1000 (without loss of generality, the latter three curves are shifted to be tangent to the BC curve, and the tangent points are denoted by circles).
and therefore,

  t_k(p) ≈ (2 L^{(N)}_max / N) (1/k).   (15)

We thus propose the following order selection criterion, which we refer to as the bridge criterion (BC). Let L^{(N)}_max be a sequence of positive integers which determines the largest possible order to be selected. Then the estimated order Lˆ is

  Lˆ = arg min_{1 ≤ L ≤ L^{(N)}_max} BC(N, L),  where  BC(N, L) = log eˆ_L + (2 L^{(N)}_max / N) Σ_{k=1}^{L} 1/k.   (16)
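To make the definition concrete, here is a minimal Python sketch (ours, not the authors' code) of how (16) could be evaluated: the eˆ_L sequence is produced by the Levinson-Durbin recursion (3)-(5), and the harmonic-sum penalty of (16) is then added.

```python
import numpy as np

def levinson_prediction_errors(x, max_order):
    """Return e_hat[0..max_order], where e_hat[L] is the one-step prediction error
    of the fitted AR(L) model, computed via the recursion (3)-(5).
    (Sample autocovariances are taken over all n, a slight simplification of (2).)"""
    x = np.asarray(x, dtype=float)
    N = len(x)
    gamma = np.array([x[: N - i] @ x[i:] / N for i in range(max_order + 1)])
    e_hat = np.zeros(max_order + 1)
    e_hat[0] = gamma[0]
    psi = np.zeros(0)                                  # coefficients of the current AR(k) fit
    for k in range(1, max_order + 1):
        num = gamma[k] + psi @ gamma[1:k][::-1]        # numerator of Equation (3)
        psi_kk = -num / e_hat[k - 1]                   # the denominator of (3) equals e_hat[k-1]
        psi = np.concatenate([psi + psi_kk * psi[::-1], [psi_kk]])   # Equation (4)
        e_hat[k] = e_hat[k - 1] * (1.0 - psi_kk ** 2)  # Equation (5)
    return e_hat

def bc_order(x, L_max=None):
    """Select the AR order by minimizing the bridge criterion (16)."""
    N = len(x)
    L_max = int(np.floor(np.log(N))) if L_max is None else L_max   # the floor(log N) choice suggested below
    e_hat = levinson_prediction_errors(x, L_max)
    harmonic = np.cumsum(1.0 / np.arange(1, L_max + 1))
    bc = np.log(e_hat[1:]) + (2.0 * L_max / N) * harmonic
    return int(np.argmin(bc)) + 1
```

AIC and BIC differ from (16) only in the penalty line, so the same prediction-error recursion can be reused for all three criteria.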
Due to the theoretical results to be introduced in Sections IV and V, we suggest L^{(N)}_max = O(log(N)) when no prior is given. We note that Σ_{k=1}^{L} 1/k ≈ log L + c_1 for large L, where c_1 is a constant. Fig. 1 illustrates the penalty terms for different N and L^{(N)}_max = ⌊log N⌋; without loss of generality, the curves are shifted to be at the same position for L = 1. Fig. 2 illustrates the penalty curve of BC together with those of AIC, BIC, and the criterion proposed by Hannan and Quinn (HQ) [11]. The curves are plotted using BC: (2 log N/N) Σ_{k=1}^{L} 1/k, AIC: (2/N)L, BIC: (log(N)/N)L, and HQ: (c log log(N)/N)L, for L = 1, ..., ⌊log N⌋, where c is chosen to be 1.1. Since only the slope of each curve matters for the order selection, the latter three criteria are shifted to be tangent to the (log-like) curve of BC in order to highlight the differences between the penalty terms. The tangent points (marked by circles) for AIC, HQ, and BIC are respectively 6, 2, and 1. Take the HQ curve as an example: the meaning of the tangent point is that BC penalizes more than HQ for L ≤ 2 and vice versa for L > 2. Thus, definition (13) guarantees that BC never penalizes less than AIC, so that it does not cause significant overfitting. In addition, BC is consistent just like BIC and HQ. The intuition is this: given sample size N, the tangent point between the BC and HQ curves is at L^{(N)}_max / log log N. If one chooses L^{(N)}_max = ⌊log N⌋, since log N / log log N → ∞ as N tends to infinity, this point will eventually be larger than any finite true order L_0 (if it exists), so that BC penalizes more than HQ
in the region where L_0 falls. This implies that asymptotically BC never overfits. On the other hand, BC will not underfit, because the largest penalty increment preventing L from moving to L + 1 is 2L^{(N)}_max/N → 0, which will eventually be smaller than any fixed positive number g_{L_0} (defined in (8)). BC is therefore consistent. Moreover, since BC penalizes less for larger orders and finally becomes similar to AIC (at the end point L^{(N)}_max), it shares the asymptotic optimality that AIC has. The related theory is introduced in the following sections. In summary, the bent curve of BC builds a bridge between BIC and AIC so that a good balance between the underfitting and overfitting risks is achieved.

IV. CONSISTENCY

Theorem 2. Assume that {x_n}_{n=1}^N is generated from a finite-order AR filter. If

  L^{(N)}_max / log log N → ∞   and   L^{(N)}_max / N^{1/2} → 0,

then the bridge criterion is consistent, i.e., the Lˆ produced in (16) satisfies

  lim_{N→∞} Pr(Lˆ = L_0) = 1.
The proof is available in the longer version of this paper.

Remark 1. Assume that the filter has finite length L_0 and the sample size is N; then the underfitting and overfitting probabilities of the bridge criterion can be bounded analytically. Let A(k) = {(a_1, a_2, ..., a_k) : a_1 + 2a_2 + ··· + k a_k = k, a_1, ..., a_k are nonnegative integers}. If L > L_0 (overfitting), then we have

  Pr{Lˆ = L} ≤ Σ_{A(L−L_0)} Π_{j=1}^{L−L_0} (1/a_j!) (δ_j/j)^{a_j},

where

  δ_j = ( (2 L^{(N)}_max / L) exp(−2 L^{(N)}_max / L + 1) )^{j/2}.
TABLE II
THE FREQUENCY (IN PERCENT) OF THE SELECTED ORDERS FROM AR(8) PROCESSES AND THE THEORETICAL UPPER BOUNDS

L              5      6      7      9      10     11     12
Pr{Lˆ = L}     0      2.62   16.32  0.89   0.08   0      0
Upper bound    8.99   17.4   39.19  15.19  4.04   1.60   0.73
If L < L_0 (underfitting), assuming that ψ_{L_0} ∼ U(S_{L_0}(1)), then we have

  Pr{Lˆ = L} ≤ Σ_{A(L_0−L)} Π_{j=1}^{L_0−L} (1/a_j!) (τ_j/j)^{a_j},

where

  τ_j = ( (2 L^{(N)}_max L_0 / (N(L + 1))) exp(−2 L^{(N)}_max L_0 / (N(L + 1)) + 1) )^{j/2}.
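Both bounds share the same combinatorial structure: a sum over the multiplicity vectors in A(k). A small Python sketch (our own, written under the reconstruction of δ_j above; the underfitting bound is obtained by swapping in τ_j) evaluates such a bound by recursing over the part sizes j:

```python
from math import exp, factorial

def partition_bound(k, weight):
    """Evaluate sum over A(k) of prod_{j=1}^{k} (1/a_j!) * (weight(j)/j)**a_j,
    where A(k) = {(a_1,...,a_k) nonnegative integers : a_1 + 2 a_2 + ... + k a_k = k}."""
    def rec(j, remaining):
        if j > k:
            return 1.0 if remaining == 0 else 0.0
        total, a = 0.0, 0
        while a * j <= remaining:
            total += (weight(j) / j) ** a / factorial(a) * rec(j + 1, remaining - a * j)
            a += 1
        return total
    return rec(1, k)

def overfit_bound(L, L0, L_max):
    """Upper bound on Pr{L_hat = L} for L > L0 (overfitting), using delta_j as above."""
    x = 2.0 * L_max / L
    return partition_bound(L - L0, lambda j: (x * exp(1.0 - x)) ** (j / 2.0))
```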
The derivation is provided in the longer version of this paper. As an illustration of Remark 1, 100 filters are generated uniformly at random from S_8(1), and for each filter 100 AR(8) time series of length 1000 are independently generated. The frequencies of Lˆ = 5, ..., 7, 9, ..., 12 (in percent), along with the derived upper bounds, for the 100 × 100 time series are summarized in Table II. The frequency of choosing the correct order 8 is 80.09%.

V. ASYMPTOTIC EFFICIENCY

We are usually interested in the one-step prediction error when a mismatched filter is specified [1], [2]. Assume that the data is generated from a filter ψ_{L_0} as in Equation (1). The one-step prediction error of using a filter ζ_{L′} minus that of using the true filter is referred to as the mismatch error, and it is defined by

  D(ψ_{L_0}, ζ_{L′}) = E{((ψ_{L_m} − ζ_{L_m})^T x_{n:n−L_m+1})²} = (ψ_{L_m} − ζ_{L_m})^T Γ_{L_m} (ψ_{L_m} − ζ_{L_m}),
where L_m = max{L_0, L′}, x_{n:n−L_m+1} = [x_n, ..., x_{n−L_m+1}]^T, and ψ_{L_m}, ζ_{L_m} are filters of length L_m obtained by concatenating ψ_{L_0}, ζ_{L′} with zeros. We note that the true filter may have infinite order; if so, it is denoted by ψ_∞. Here, we show that for two special cases of interest the proposed order selection asymptotically minimizes the mismatch error. To that end, the following assumptions are needed.

(A.1) The data {x_n}_{n=1}^N is generated from

  x_n + ψ_1 x_{n−1} + ψ_2 x_{n−2} + ··· = ε_n,

where ψ_i ∈ R, Σ_{i=1}^{∞} |ψ_i| < ∞, ε_n ∼ N(0, σ²) is i.i.d. Gaussian noise, and the associated power series Ψ(z) = 1 + ψ_1 z^{−1} + ψ_2 z^{−2} + ··· converges and is not zero for |z| > 1.

(A.2) {L^{(N)}_max} is a sequence of positive integers such that L^{(N)}_max → ∞ and L^{(N)}_max = o(N^{1/2}) as N → ∞.

(A.3) The order of the AR process (or the length of the filter ψ) is either infinite, or finite but bounded below by L_0(N), which satisfies N^{1/2} ≤ O_N(L_0(N)) as N → ∞.
We note that Assumption (A.1) is more general than the assumption made in previous sections, since the AR filter ψ_{L_0} = [ψ_1, ψ_2, ···]^T may have infinite length. The cost function is defined by C_N(L) ≜ (L/N)σ² + D(ψ_∞, ψ_L), which is the expected mismatch error if an estimated filter of length L is used for prediction. In fact, we have the following result.

Lemma 2. ([12], Corollary 3.1) Let {L_N} be a sequence of integers such that 1 ≤ L_N ≤ L^{(N)}_max and lim_{N→∞} L_N = lim_{N→∞} L^{(N)}_max = ∞. If Assumptions (A.1) and (A.2) hold, then

  lim_{N→∞} D(ψ_∞, ψˆ_{L_N}) / C_N(L_N) = 1 in probability.
Let {L*_N} denote a sequence of positive integers which achieves the minimum of C_N(L) for each N, i.e., L*_N = arg min_{1 ≤ L ≤ L^{(N)}_max} C_N(L).
Lemma 3. ([12], Theorem 3.2) If Assumptions (A.1), (A.2), and (A.3) hold, then for any random variable L̃ possibly depending on {x_n}_{n=1}^N, and for any ε > 0,

  lim_{N→∞} Pr{ D(ψ_∞, ψˆ_{L̃}) / C_N(L*_N) ≥ 1 − ε } = 1.

The result shows that the cost of the estimate ψˆ(L̃) is no less than C_N(L*_N) in probability for any order selection L̃. An order selection L̃ is called asymptotically efficient if

  lim_{N→∞} D(ψ_∞, ψˆ_{L̃}) / C_N(L*_N) = 1 in probability.
We note that an infinite order was assumed in the proof of Theorem 3.2 in [12]. Nevertheless, it is easy to observe that the related proof also holds under the more general Assumption (A.3) (this fact was also emphasized in Section 5 of [12]). We thus have the following result, whose proof is provided in the longer version of this paper.

Theorem 3. Assume that (A.1), (A.2), and (A.3) hold.
1) Suppose that the mismatch error D(ψ_∞, ψ_L) decreases polynomially in L, namely log D(ψ_∞, ψ_L) = −γ log L + log c_L for a constant γ ≥ 1 and a series {c_L : L = 1, 2, ...} satisfying c_L > 0, |c_{L+1} − c_L| < γ c_L/(L + 1). If

  L^{(N)}_max ≤ O(N^{1/(1+γ) − ε})

holds for a fixed constant ε ∈ (0, 1/(1 + γ)), then the bridge criterion is asymptotically efficient. In particular, if ψ is uniformly distributed in S_{L_0}(1) and L_0 satisfies (A.3), then for large L (1 ≤ L ≤ L_0) the above condition is satisfied in the sense that

  E{log D(ψ_{L_0}, ψ_L)} = −log L + log L_0 + o_L(1).

2) Suppose that the mismatch error decreases exponentially in L, namely log D(ψ_∞, ψ_L) = −γL + log c_L for a constant γ > 0 and a series {c_L : L = 1, 2, ...} satisfying c_L > 0, |c_{L+1} − c_L| < γ c_L. If

  L^{(N)}_max ≤ ((1 − ε)/γ) log N

holds for a fixed constant ε ∈ (0, 1), then the bridge criterion is asymptotically efficient. A special case is when the data is generated from a finite-order moving-average process.
VI. NUMERICAL RESULTS

The performance of the proposed order selection criterion is compared to those of AIC and BIC in Tables III and IV. In Table III, the data is simulated from the AR filter ψ_2 = [α, α²]^T for α = 0.3, −0.3, 0.8, −0.8. For each α, the estimated orders are shown for 1000 independent realizations of AR(2) processes, i.e., x_n + α x_{n−1} + α² x_{n−2} = ε_n, ε_n ∼ N(0, 1). The experiment is repeated for different sample sizes N = 100, 500, 1000, 10000. As expected, the performance of BC is in between those of AIC and BIC for small N, and it is consistent as N tends to infinity.

TABLE III
SIMULATION RESULTS FOR AR(2) PROCESSES (1000 REALIZATIONS FOR EACH α AND EACH N)

                   N = 100           N = 500           N = 1000          N = 10000
α        Lˆ      BC   AIC  BIC     BC   AIC  BIC     BC   AIC  BIC     BC   AIC  BIC
 0.3      1     807   544  859    633   210  654    321    47  407      0     0    0
          2     135   301  123    331   555  341    624   720  585    983   721  998
          3      27    79   12     15   108    4     27   104    7     15   125    2
         >3      31    76    6     21   127    1     28   129    1      2   154    0
−0.3      1     784   566  853    641   227  669    312    52  403      0     0    0
          2     153   291  132    316   552  324    636   698  591    983   742  998
          3      36    88   12     22    93    7     25   111    6     14    96    2
         >3      27    55    3     21   128    0     27   139    0      3   162    0
 0.8      1       0     0    0      0     0    0      0     0    0      0     0    0
          2     860   791  971    914   748  977    925   743  991    981   743  999
          3      78   124   23     53   114   20     41   133    9      9   112    0
         >3      62    85    6     33   138    3     34   124    0     10   145    1
−0.8      1       0     0    0      0     0    0      0     0    0      0     0    0
          2     850   785  968    922   737  989    928   737  993    978   764  998
          3      73   109   27     39   116    9     40   123    6     17   106    2
         >3      77   106    5     39   147    2     32   140    1      5   130    0
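One cell of Table III can be re-created along the following lines, reusing levinson_prediction_errors and bc_order from the sketch in Section III (the exact counts will differ slightly because of Monte Carlo variation and because the paper does not spell out every implementation detail, e.g., the choice of L_max for this experiment):

```python
import numpy as np

def ic_order(x, L_max, penalty):
    """Order minimizing log(e_hat_L) + penalty * L (AIC: penalty = 2/N, BIC: penalty = log(N)/N)."""
    e_hat = levinson_prediction_errors(x, L_max)
    crit = np.log(e_hat[1:]) + penalty * np.arange(1, L_max + 1)
    return int(np.argmin(crit)) + 1

def simulate_ar2(alpha, N, rng, burn_in=200):
    """Generate x_n + alpha x_{n-1} + alpha^2 x_{n-2} = eps_n with eps_n ~ N(0, 1)."""
    eps = rng.standard_normal(N + burn_in)
    x = np.zeros_like(eps)
    for n in range(2, len(x)):
        x[n] = -alpha * x[n - 1] - alpha ** 2 * x[n - 2] + eps[n]
    return x[burn_in:]

rng = np.random.default_rng(0)
alpha, N = 0.3, 100
L_max = int(np.floor(np.log(N)))                      # assumed; the paper's choice is not stated here
counts = {"BC": {}, "AIC": {}, "BIC": {}}
for _ in range(1000):
    x = simulate_ar2(alpha, N, rng)
    for name, order in (("BC", bc_order(x, L_max)),
                        ("AIC", ic_order(x, L_max, 2.0 / N)),
                        ("BIC", ic_order(x, L_max, np.log(N) / N))):
        counts[name][order] = counts[name].get(order, 0) + 1
print(counts)   # compare with the N = 100, alpha = 0.3 block of Table III
```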
TABLE IV
SIMULATION RESULTS FOR AR PROCESSES WITH LARGE ORDER (10 EXPERIMENTS FOR EACH PROCESS)

                      Process 1                  Process 2                  Process 3
Experiment   BC  AIC  BIC  Optimum     BC  AIC  BIC  Optimum     BC  AIC  BIC  Optimum
 1            6    6    5     6         6    6    6     6        10   11    6    12
 2            6    6    5     6         6    6    6     6         9    9    7    10
 3            6    6    5     6         6    6    5     6        10   10    6    12
 4            6    6    6     6         6    6    5     6         8    8    8    10
 5            6    6    5     6         6    6    5     6        12   12    7    12
 6            5    5    5     5         6    6    6     6         9    9    7    11
 7            5    5    5     6         6    6    6     6         9    9    8    11
 8            6    6    5     6         6    6    6     6         7    7    4    12
 9            6    6    5     6         6    6    6     6         9   12    7    11
10            6    6    5     6         6    6    6     6         8    8    6    12
In Table IV, the estimated AR orders using different criteria are shown for data generated from three different processes. For each process, the experiment is repeated 10 times. For each experiment, the order whose associated estimated filter gave the minimum one-step prediction error is denoted by “optimum” in the table. In other words, if the optimum order is used, the estimated filter, when applied to another independently generated dataset from the same process, produces the least one-step prediction error. The optimum order is determined by numerical computation. The results illustrate the performance of the three criteria in the case where the true order is large relative to the sample size. They show that BC performs as well as AIC, and better than BIC. The first process is AR(8) with ψ_8 = [0.7^k]_{k=1}^{8}, i.e., x_n + ψ_{8,1} x_{n−1} + ··· + ψ_{8,8} x_{n−8} = ε_n, ε_n ∼ N(0, 1). The sample size is 1000, and L^{(N)}_max is ⌊log N⌋ = 6. The second process is AR(8) similar to the first one, but with ψ_8 = [0.7755, 0.2670, −0.0058, −0.1231, −0.4551, −0.2516, −0.5146, −0.1867], which is generated uniformly at random from S_8(1). The third process is an MA(1) process: x_n = ε_n − 0.8 ε_{n−1}, ε_n ∼ N(0, 1), which can be regarded as an AR process with infinite order. The sample size is 1000, and L^{(N)}_max is 2⌊log N⌋ = 12. These results show that our proposed technique achieves the performance that we expected. It is consistent when the
data is from a finite-order autoregression and the sample size is relatively large, and asymptotically efficient when the sample size is relatively small compared with the order. In both cases, our criterion is nearly optimal, and this optimality, unlike AIC or BIC, is automatic and does not require the specification of the candidate orders. In practice, when only the observed time series is given (without any prior information about the model order), the proposed method is flexible and robust for selecting the appropriate order of autoregression.

VII. CONCLUSIONS

A new order selection criterion named the bridge criterion is proposed to identify the optimal AR order for a given time series. While no classical approach such as AIC or BIC is both consistent and efficient, the new approach is both consistent and efficient under mild assumptions. This gives a good balance between the underfitting and overfitting risks. It would be interesting to study possible extensions of this criterion to other model selection problems, for instance the vector autoregressive model and the autoregressive moving-average model.

ACKNOWLEDGMENT

This research was funded by the Defense Advanced Research Projects Agency (DARPA) under grant number W911NF-14-1-0508.

REFERENCES

[1] H. Akaike, “Fitting autoregressive models for prediction,” Annals of the Institute of Statistical Mathematics, vol. 21, no. 1, pp. 243–247, 1969.
[2] ——, “Statistical predictor identification,” Annals of the Institute of Statistical Mathematics, vol. 22, no. 1, pp. 203–217, 1970.
[3] ——, “Information theory and an extension of the maximum likelihood principle,” 2nd Int. Symp. Information Theory, pp. 267–281, Sep. 1973.
[4] G. Schwarz, “Estimating the dimension of a model,” Ann. Statist., vol. 6, no. 2, pp. 461–464, Mar. 1978.
[5] P. Broersen, “Finite sample criteria for autoregressive order selection,” IEEE Trans. Signal Process., vol. 48, no. 12, pp. 3550–3558, 2000.
[6] R. J. Bhansali and D. Y. Downham, “Some properties of the order of an autoregressive model selected by a generalization of Akaike’s FPE criterion,” Biometrika, vol. 64, no. 3, pp. 547–551, 1977.
[7] H. Akaike, “A Bayesian extension of the minimum AIC procedure of autoregressive model fitting,” Biometrika, vol. 66, no. 2, pp. 237–242, 1979.
[8] C. M. Hurvich and C.-L. Tsai, “Regression and time series model selection in small samples,” Biometrika, vol. 76, no. 2, pp. 297–307, 1989.
[9] W. Newey and K. West, “Automatic lag selection in covariance matrix estimation,” Rev. Econ. Stud., vol. 61, no. 4, pp. 631–653, 1994.
[10] M. Morales, “Lag order selection for an optimal autoregressive covariance matrix estimator,” J. Appl. Stat., vol. 37, no. 5, pp. 739–748, 2010.
[11] E. J. Hannan and B. G. Quinn, “The determination of the order of an autoregression,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 190–195, 1979.
[12] R. Shibata, “Asymptotically efficient selection of the order of the model for estimating parameters of a linear process,” The Annals of Statistics, pp. 147–164, 1980.
[13] E. J. Hannan, “The asymptotic theory of linear time-series models,” Journal of Applied Probability, pp. 130–145, 1973.
[14] J. Durbin, “The fitting of time-series models,” Revue de l’Institut International de Statistique, pp. 233–244, 1960.
[15] T. W. Anderson, The Statistical Analysis of Time Series. John Wiley & Sons, 1971.
[16] J. Ding, M. Noshad, and V. Tarokh, “Data-driven learning of the number of states in multi-state autoregressive models,” available at http://arxiv.org/abs/1506.02107.
[17] F. L. Ramsey, “Characterization of the partial autocorrelation function,” The Annals of Statistics, vol. 2, no. 6, pp. 1296–1301, 1974.