VARIANCE INITIALISATION IN GARCH ESTIMATION

Matteo Pelagatti¹ and Francesco Lisi²

¹ Dipartimento di Statistica, Università degli Studi di Milano-Bicocca, Via Bicocca degli Arcimboldi 8, 20126 Milano, Italy (e-mail: [email protected])
² Dipartimento di Scienze Statistiche, Università degli Studi di Padova, Via Cesare Battisti 241, 35121 Padova, Italy (e-mail: [email protected])
ABSTRACT. In setting up the (quasi) maximum likelihood (QML) estimation of the unknown parameters of a GARCH model, the initial values of the conditional variance process must be chosen. Many software packages use the sample variance as the default, while others use exponentially weighted moving average schemes. Many other alternatives are of course possible, but to the best of our knowledge nobody has studied the performance of QML estimators under the different alternatives. This is probably due to the fact that, under rather weak conditions, the choice of the initial values is asymptotically irrelevant. Nevertheless, in finite samples different initialisation criteria do matter, in particular when highly persistent GARCH processes are considered. This work intends to fill this gap in the literature. The precision of QML estimates under different choices of initialisation and sample dimension is analysed, and the closeness of the actual (Monte Carlo) finite-sample distributions to the asymptotic approximation is measured.
1  INTRODUCTION
The class of GARCH processes has become very popular in modelling financial time series: a huge number of academic papers have analysed their stochastic properties, proposed variations and generalisations, and derived the asymptotic behaviour of maximum likelihood (ML) and quasi maximum likelihood (QML) inference. However, to the best of our knowledge, the finite-sample behaviour of ML and QML estimates has received less consideration in the scientific literature. In this paper we concentrate on a particular aspect of the finite-sample behaviour of the ML estimator for the simple GARCH(1,1) process
$$x_t = \sqrt{h_t}\, z_t, \qquad h_t = \omega + \alpha x_{t-1}^2 + \beta h_{t-1}, \tag{1}$$
where $z_t$ is an i.i.d. sequence with $E[z_t] = 0$ and $E[z_t^2] = 1$, and $\omega > 0$, $\alpha, \beta \geq 0$. Since the conditional variance process in (1) is recursively defined, a value for the initial variance $h_1$ has to be set. Were the marginal distribution of the variance process known, one could use this probability measure to initialise the process and update the distribution of $h_t$ by projecting it on $x_t, x_{t-1}, \ldots, x_1$ in a (nonlinear) Kalman-filter-like fashion. But,
even if we had an approximation to the marginal distribution of $h_1$, the updating process would be too complicated to carry out. Furthermore, if the strict stationarity condition $E[\log(\alpha z_t^2 + \beta)] < 0$ holds, then $h_1$ can be set equal to any finite value without affecting the asymptotic behaviour of the process; indeed, Francq and Zakoïan (2006) prove that under this condition the GARCH(1,1) process is β-mixing. So, for the two aforementioned reasons (the complication of an exact solution and asymptotic negligibility), the issue of initialising the variance process has never been given much attention, and Bollerslev's (1986) suggestion to set $h_1 = T^{-1}\sum_{t=1}^{T} x_t^2$ has become quite common. This is indeed the choice of almost all packages that estimate GARCH-type models: G@RCH and PcGive for OxMetrics, FANPAC for Gauss, Gretl, Stata, SAS, the Econometrics Toolbox for Matlab, and R all have this as the default. The only exception is EViews, which backcasts the initial variance with a (reverse) exponential smoothing with parameter 0.7. QMS's (2007) EViews 6 User's Guide II writes on page 191:

"Using the unconditional variance provides another common way to set the presample variance. Our experience has been that GARCH models initialized using backcast exponential smoothing often outperform models initialized using the unconditional variance."

We conjecture that using the (sample) unconditional variance may lead to serious distortions in the presence of highly persistent GARCH processes, since the actual realisation of $h_1$ will frequently be rather distant from its mean value. We think that some more "local" average should yield better estimates of $\sigma_1^2$ than the "global" mean. In this paper we compare the performance of six different choices of initial variance in estimating the parameters of simulated GARCH(1,1) processes of 100 through 1000 observations. We anticipate that the common choice of the unconditional variance does not seem to be the best solution.
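For concreteness, the recursion in (1) is straightforward to simulate. The following is our own illustrative sketch in Python/NumPy (the paper's experiments, described in Section 3, were run in Ox); the default parameter values and the 500-observation burn-in anticipate the Monte Carlo design used below, and starting the recursion at the unconditional variance is an arbitrary choice that the burn-in renders harmless.

```python
import numpy as np

def simulate_garch11(T, omega=0.01, alpha=0.08, beta=0.90, burn=500, rng=None):
    """Simulate T observations of the GARCH(1,1) process (1) with Gaussian z_t."""
    rng = np.random.default_rng() if rng is None else rng
    n = burn + T
    z = rng.standard_normal(n)            # i.i.d. innovations with E[z]=0, E[z^2]=1
    x = np.empty(n)
    h = np.empty(n)
    h[0] = omega / (1.0 - alpha - beta)   # unconditional variance (needs alpha + beta < 1)
    x[0] = np.sqrt(h[0]) * z[0]
    for t in range(1, n):
        h[t] = omega + alpha * x[t - 1] ** 2 + beta * h[t - 1]
        x[t] = np.sqrt(h[t]) * z[t]
    return x[burn:]                       # discard the burn-in period
```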
2  INITIALISATIONS UNDER TEST
We consider six different approaches to setting the initial variance $h_1$. Four of them are known functions of the data and involve no additional parameter to be estimated; the other two depend on the data through an additional parameter, with respect to which the log-likelihood function has to be maximised. We refer to the first four methods as fixed and to the latter two as parametric. Each method is described below, preceded by the name by which it will be referred to.

Fixed methods

UnVar  Unconditional sample variance, $h_1 = \hat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T} x_t^2$.

10Var  Variance of the first 10 observations, $h_1 = \frac{1}{10}\sum_{t=1}^{10} x_t^2$.

01Var  Square of the first observation, $h_1 = x_1^2$.

FixES  Exponential smoothing backcast with parameter $\lambda = 0.7$,
$$h_1 = \lambda^T \hat{\sigma}^2 + (1 - \lambda) \sum_{j=0}^{T-1} \lambda^j x_{1+j}^2. \tag{2}$$

Parametric methods

ParH1  The initial variance $h_1$ is an additional parameter with respect to which the log-likelihood is optimised.

ParES  The smoothing parameter $\lambda$ in equation (2) is an additional parameter with respect to which the log-likelihood is optimised.
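The four fixed initialisations reduce to one-liners. A possible Python rendering (ours, with our own function names, not the authors' Ox code):

```python
import numpy as np

def h1_unvar(x):
    """UnVar: unconditional sample variance over the whole series."""
    return np.mean(x ** 2)

def h1_10var(x):
    """10Var: sample variance of the first 10 observations."""
    return np.mean(x[:10] ** 2)

def h1_01var(x):
    """01Var: square of the first observation."""
    return x[0] ** 2

def h1_fixes(x, lam=0.7):
    """FixES: exponential-smoothing backcast of equation (2)."""
    T = len(x)
    weights = lam ** np.arange(T)      # lambda^j for j = 0, ..., T-1
    return lam ** T * np.mean(x ** 2) + (1.0 - lam) * np.sum(weights * x ** 2)
```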
3  IMPLEMENTATION
Our results are based on 10000 realisations of GARCH(1,1) processes of $T = 100, 250, 500, 1000$ observations, with $z_t$ standard normal. For every sample size $T$, a sample path of $500 + T$ observations is generated and the first 500 observations are discarded in order to allow for a sufficiently long burn-in period. As for the estimates, we maximise the Gaussian log-likelihood function
$$L_T(\theta) = \sum_{t=1}^{T} l_t(\theta), \qquad l_t(\theta) = -\frac{1}{2}\left(\log h_t + \frac{x_t^2}{h_t}\right),$$
using a BFGS algorithm with analytical scores. For the four fixed initialisation methods, the parameter vector is defined as $\theta = (\omega_l, \alpha_l, \beta_l)'$, with the subscript $l$ denoting the log-transform of the respective parameter (i.e. $\omega_l = \log \omega$). Optimising with respect to the log-parameters has the advantage of constraining the raw parameters to be positive (i.e. $\omega = \exp(\omega_l)$ is always positive). The score vector is a straightforward modification of equation (6) in Fiorentini et al. (1996):
$$\frac{\partial L_T(\theta)}{\partial \theta} = \exp(\theta) \odot \sum_{t=1}^{T} \frac{\Delta_\theta h_t}{2 h_t} \left( \frac{x_t^2}{h_t} - 1 \right), \tag{3}$$
with $\odot$ denoting the elementwise (Hadamard) product, $\Delta_\theta h_t = \partial h_t / \partial \theta$ the gradient with respect to the raw parameters, and
$$\Delta_\theta h_1 = (0, 0, 0)' \quad \text{and} \quad \Delta_\theta h_t = (1, x_{t-1}^2, h_{t-1})' + \beta\, \Delta_\theta h_{t-1} \quad \text{for } t = 2, 3, \ldots, T.$$
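The likelihood and the score recursion above translate directly into code. The sketch below (again ours, in Python, reusing the initialisation functions of Section 2) computes $L_T$ and the analytical score (3) for the fixed methods:

```python
import numpy as np

def loglik_and_score(theta_l, x, h1_fn):
    """Gaussian log-likelihood L_T and analytical score (3) under the
    log-parametrisation theta_l = (omega_l, alpha_l, beta_l)."""
    omega, alpha, beta = np.exp(theta_l)
    T = len(x)
    h = np.empty(T)
    dh = np.zeros((T, 3))        # Delta_theta h_t w.r.t. the raw (omega, alpha, beta)
    h[0] = h1_fn(x)              # fixed initialisation, hence Delta_theta h_1 = 0
    for t in range(1, T):
        h[t] = omega + alpha * x[t - 1] ** 2 + beta * h[t - 1]
        dh[t] = np.array([1.0, x[t - 1] ** 2, h[t - 1]]) + beta * dh[t - 1]
    loglik = -0.5 * np.sum(np.log(h) + x ** 2 / h)
    # Equation (3): exp(theta_l) is the chain-rule factor of the log-transform.
    u = (x ** 2 / h - 1.0) / (2.0 * h)
    score = np.exp(theta_l) * (dh * u[:, None]).sum(axis=0)
    return loglik, score
```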
When the ParH1 method is used, $\theta$ contains the additional parameter $h_l = \log h_1$. Equation (3) remains valid with the gradient
$$\Delta_\theta h_1 = (0, 0, 0, 1)' \quad \text{and} \quad \Delta_\theta h_t = (1, x_{t-1}^2, h_{t-1}, 0)' + \beta\, \Delta_\theta h_{t-1} \quad \text{for } t = 2, 3, \ldots, T.$$
When the ParES method is implemented, $\theta$ contains the additional parameter $\lambda_l = \log(\lambda) - \log(1 - \lambda)$, which constrains the raw parameter $\lambda = [1 + \exp(-\lambda_l)]^{-1}$ to the open unit interval. Now the fourth element of $\Delta_\theta h_1$ is given by
$$\Delta_\lambda h_1 = T \lambda^{T-1} \hat{\sigma}^2 + \sum_{j=0}^{T-1} \left[ (1 - \lambda)\, j\, \lambda^{j-1} - \lambda^j \right] x_{1+j}^2$$
and $\Delta_\lambda h_t = \beta\, \Delta_\lambda h_{t-1}$ for $t = 2, 3, \ldots, T$. The first three rows of the score with respect to the parameter vector $\theta = (\omega_l, \alpha_l, \beta_l, \lambda_l)'$ remain as in (3), while the fourth row is given by
$$\frac{\partial L_T(\theta)}{\partial \lambda_l} = \frac{\exp(-\lambda_l)}{[1 + \exp(-\lambda_l)]^2} \sum_{t=1}^{T} \frac{\Delta_\lambda h_t}{2 h_t} \left( \frac{x_t^2}{h_t} - 1 \right).$$
Both the generation of the GARCH(1,1) sample paths and the ML estimation have been carried out in Ox 5.10 (see Doornik 2007) using the MaxBFGS optimizer with default settings.
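A roughly equivalent estimation step with SciPy's BFGS (a stand-in for Ox's MaxBFGS, reusing the sketches above; the starting values are an arbitrary choice of ours) might look as follows:

```python
import numpy as np
from scipy.optimize import minimize

x = simulate_garch11(T=500)                    # one simulated path

def negloglik(theta_l):
    ll, score = loglik_and_score(np.asarray(theta_l), x, h1_10var)
    return -ll, -score                         # minimise the negative log-likelihood

theta0 = np.log([0.1 * np.var(x), 0.1, 0.8])   # crude starting values
res = minimize(negloglik, theta0, jac=True, method="BFGS")
omega_hat, alpha_hat, beta_hat = np.exp(res.x) # back-transform to raw parameters
```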
4  SIMULATION RESULTS
We implemented the simulation experiment described above for the parameter set $\omega = .01$, $\alpha = .08$, $\beta = .90$. Due to space restrictions we do not show results for other parameter choices; however, estimating a GARCH(1,1) on daily stock or stock-index returns usually yields estimates that lie in a small neighbourhood of the ones chosen here.¹ Table 1 contains the main results, namely $\sqrt{T}$ times the bias, $T$ times the mean squared error, and the relative efficiency with UnVar chosen as the benchmark. The first thing to notice is that no single method yields uniformly best results for all the parameters. The method 10Var seems to be a good choice for estimating $\omega$ and $\beta$, while UnVar seems more reliable for estimating $\alpha$. The uniformly worst choice is 01Var. The two parametric methods, notwithstanding their slightly higher computational complexity, are generally not the best choice, with the exception of ParH1, which shows the best efficiency for $T = 100$ and the parameter $\omega$. As far as the approximation to the asymptotic distribution is concerned, Table 2 and Figure 1 contain the relevant information for the 10Var case. The most striking evidence is that even with samples as large as 1000 observations, the Gaussian approximation is still very poor for $\omega$ and $\beta$, with extremely high kurtosis and asymmetry. For moderate samples, $\beta$ and especially $\alpha$ are bi-modal.
¹ We tried the same experiment using estimates obtained from fitting a GARCH(1,1) to the S&P100 daily returns for the period March 1984 – March 2009, and the results were virtually the same as those presented here.
                      √T × Bias                  T × MSE               Rel. Efficiency
   T   Method       ω      α      β          ω      α       β         ω      α      β
  100  UnVar      0.499 -0.049 -1.123      1.159  0.726   6.935     1.000  1.000  1.000
       10Var      0.434 -0.317 -0.670      1.148  0.692   5.398     1.009  1.050  1.285
       01Var      1.563 -0.076 -3.339      6.631  0.899  24.415     0.175  0.807  0.284
       FixES      0.603 -0.260 -1.109      1.761  0.708   7.468     0.658  1.026  0.929
       ParH1      0.461 -0.365 -0.771      1.004  0.730   6.108     1.154  0.995  1.135
       ParES      0.493 -0.319 -0.849      1.151  0.724   6.470     1.007  1.003  1.072
  250  UnVar      0.386  0.044 -0.950      0.793  0.568   5.620     1.000  1.000  1.000
       10Var      0.313 -0.158 -0.584      0.629  0.628   4.269     1.261  0.904  1.317
       01Var      1.120  0.165 -2.744      4.525  0.919  24.392     0.175  0.618  0.230
       FixES      0.392 -0.109 -0.824      0.865  0.624   5.641     0.916  0.910  0.996
       ParH1      0.349 -0.198 -0.653      0.744  0.719   5.367     1.067  0.790  1.047
       ParES      0.359 -0.169 -0.694      0.777  0.685   5.438     1.021  0.828  1.033
  500  UnVar      0.220  0.037 -0.561      0.359  0.428   2.914     1.000  1.000  1.000
       10Var      0.195 -0.069 -0.401      0.274  0.441   2.291     1.313  0.970  1.272
       01Var      0.655  0.279 -1.795      2.162  0.767  14.433     0.166  0.558  0.202
       FixES      0.224 -0.031 -0.509      0.335  0.441   2.729     1.071  0.971  1.068
       ParH1      0.209 -0.087 -0.424      0.318  0.493   2.762     1.129  0.868  1.055
       ParES      0.210 -0.078 -0.434      0.326  0.478   2.766     1.102  0.895  1.054
 1000  UnVar      0.111  0.021 -0.285      0.077  0.373   0.981     1.000  1.000  1.000
       10Var      0.109 -0.034 -0.232      0.073  0.365   0.920     1.051  1.022  1.067
       01Var      0.343  0.294 -1.061      0.571  0.654   4.977     0.134  0.571  0.197
       FixES      0.121 -0.007 -0.283      0.079  0.366   0.974     0.969  1.019  1.008
       ParH1      0.112 -0.044 -0.230      0.079  0.384   0.990     0.975  0.973  0.991
       ParES      0.112 -0.039 -0.234      0.079  0.381   0.989     0.974  0.980  0.993

Table 1. Bias and Mean Squared Errors with ω = .01, α = .08, β = .90 (best results in boldface).
In conclusion, our results suggest that the method 10Var is a competitive choice with respect to the more common UnVar, and we warn end users of GARCH-type models to be somewhat prudent with asymptotic inference when samples of up to 1000 observations are analysed.
REFERENCES

BOLLERSLEV, T. (1986): Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307-327.

DOORNIK, J.A. (2007): Object-Oriented Matrix Programming Using Ox, 3rd ed. Timberlake Consultants Press, London.

ENGLE, R.F. (1982): Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50, 987-1007.

FIORENTINI, G., CALZOLARI, G., PANATTONI, L. (1996): Analytic derivatives and the computation of GARCH estimates. Journal of Applied Econometrics, 11, 399-417.

FRANCQ, C., ZAKOÏAN, J.M. (2006): Mixing properties of a general class of GARCH(1,1) models without moment assumptions on the observed process. Econometric Theory, 22, 815-834.

QMS (2007): EViews 6 User's Guide II. Quantitative Micro Software, Irvine CA.
                            ω                            α                             β
 T =              100    250    500   1000     100   250   500  1000     100    250    500   1000
 Skewness         9.1    6.3    7.9    4.4     2.1   0.8   0.3   0.2    -2.3   -3.6   -5.1   -2.7
 Excess Kurtosis 225.6   65.4  101.7  60.9     5.6   2.3   1.2   0.3     5.3   18.8   47.6   38.2
 Minimum         -0.1   -0.2   -0.2   -0.3    -0.8  -1.3  -1.8  -2.5    -9.0  -14.2  -20.1  -21.9
 Maximum         37.2   14.3   10.7    6.1     5.3   6.3   4.2   2.7     1.3    1.8    2.3    3.2

Table 2. Distributional statistics for the 10Var case with ω = .01, α = .08, β = .90.
[Figure 1 here: kernel density plots in three panels, one each for ω, α and β, with curves for T = 100, 250, 500, 1000.]

Figure 1. Kernel densities of $\sqrt{T}(\hat{\theta}_i - \theta)$ for the 10Var case.