Problems in Model Averaging with Dummy Variables

David F. Hendry and J. James Reade

May 3, 2005

Abstract

Model averaging is widely used in empirical work, and is proposed as a solution to model uncertainty. This paper provides a range of relevant empirical contexts in which model averaging performs poorly in terms of coefficient bias and forecast errors, namely when outliers and structural breaks exist in datasets. Monte Carlo simulations support these assertions and suggest that they apply in more complicated models than the simple ones considered here. It is argued that failing to select relevant variables over irrelevant ones is precisely the cause of the poor performance: weight ascribed to irrelevant components biases that attributable to the relevant ones. Within this context, the superior performance of model selection algorithms is indicated.

1 Introduction

Model averaging, the practice of taking a weighted average of a number of regression models, is widely used, and is proposed as a method for accommodating model uncertainty in statistical analysis. However, while averaging can be shown to have desirable properties in a stationary world (see Raftery, Madigan & Hoeting 1997), extension to the non-stationary world presents difficulties. In this paper the performance of model averaging is assessed, relative to model selection, in the empirically relevant situation where dummy variables form part of the data generating process. In Section 2, model averaging is introduced, various methods of implementing it are touched upon, and the use of model averaging in the empirical literature is discussed. In Section 3 a number of simple models are introduced to highlight problems with model averaging in particular empirically relevant situations, before Monte Carlo simulations are used first to support the simple models and their conclusions, and then to suggest that the problems exist in a more general context. Section 4 concludes.


2 Model Averaging

Model averaging can be carried out in both the classical statistical framework (see Buckland, Burnham & Augustin 1997) and the Bayesian paradigm (see Raftery et al. 1997). In the empirical literature the latter has become much more common as computing power has increased. Examples in the growth literature include Fernandez, Ley & Steel (2001), who use a pure Bayesian methodology with non-informative priors, and Doppelhofer, Miller & Sala-i-Martin (2000), who calculate weights in a Bayesian manner but average over classical OLS estimates, while Koop & Potter (2003) use Bayesian model averaging to forecast US quarterly GDP, and Eklund & Karlsson (2004) forecast Swedish inflation based on predictive Bayesian densities.

When carrying out model averaging, practitioners state a set of K variables considered to have explanatory power for the variable of interest. These variables then form a set M of L models, {M1, ..., ML} ∈ M. These models could be any particular type of statistical model; here, following Raftery et al. (1997) and the other empirical studies mentioned above, linear regression models are considered. Each model is thus of the form:

$$y_l = \beta_l X_l + u_l = \beta_1^{(l)} X_1 + \cdots + \beta_K^{(l)} X_K + u_l,$$

where zeros in the β vector signify that a particular regressor is not included in model l. The models in the set M are usually every subset of the K variables specified in the initial dataset, or some subset of these models chosen by a selection algorithm. Raftery et al. (1997) advocate a Bayesian selection algorithm based on the posterior density of each individual model. However, the non-informative priors, induced by the inability to specify priors for each variable in the 2^K models that result from considering every subset of the K variables, mean that such selection algorithms favour larger models (see Eklund & Karlsson 2004). Buckland et al. (1997) appear to suggest that model selection should not be carried out at all.[1]

[1] It is not possible to challenge this claim in the small models considered here, because model selection algorithms tend to choose just one model, hence leaving nothing to average over and leaving the comparison as one between model averaging and model selection per se. It is hoped to investigate this claim in future research.

In conventional linear regression analysis, the mean of the parameters of interest conditional on the explanatory variables is usually reported, and as such one might expect the weighted average of this conditional mean over the L models, say

$$\beta^a = \sum_{l=1}^{L} w_l \beta_l, \qquad (1)$$


where w_l is the weight for model l, to be reported in model averaging, hence giving as the output of the process:

$$y = \beta^a X + u^a. \qquad (2)$$

Bayesian model averagers, such as Fernandez et al. (2001), discuss the probability of including any particular regressor in the averaged model as its importance, refraining from reporting any coefficients or a model of the form (2) in their averaging analysis: as Doppelhofer et al. (2000) point out, Bayesian statisticians reject the idea of a single, true estimate, believing instead that each parameter has a true distribution, and hence Fernandez et al. (2001) produce charts of distribution functions for each parameter. At this point a debate can be entered into about whether a true specification exists; Hoover & Perez (2004, pp. 767–769) summarise this well. Buckland et al. (1997) suggest reporting the averaged coefficients as in (1), and they see the sum of weights as a measure of the importance of each regressor. This introduces the question of how the models are weighted in the combination, which manifests itself on two levels: firstly how to construct the weights, and secondly which weighting criterion to use. Considering the first issue, for any particular weighting criterion, say C_l, the weighting method might be:

$$w_l = \frac{C_l}{\sum_{i=1}^{L} C_i}. \qquad (3)$$

This ensures that $\sum_{l=1}^{L} w_l = 1$. However, no variable appears in every model, meaning that the sum of the weights applied to any particular variable will be less than unity, and as such its averaged coefficient will be biased downwards.[2] An alternative weighting method, which accounts for this downward bias, is to rescale the weights for each regressor so that they sum to unity over the models in which that regressor appears. Letting N_k ⊂ M denote the set of models in M that contain regressor β_k, the weight for model l is then:

$$w_l = \frac{C_l}{\sum_{i \in N_k} C_i}. \qquad (4)$$

Hence the weights for any particular regressor will sum to unity. Doppelhofer et al. (2000) advocate this rescaled weighting for reporting coefficients in their averaged model, stating that the coefficients produced by this method would be the ones used in forecasting and for analysing marginal effects. Both weight-construction methods are considered in this paper. As for the weighting criterion C_l, in the Bayesian context each model is weighted by its posterior probability, which is given by:

$$\Pr(M_l \mid X) = \frac{\Pr(M_l)\Pr(X \mid M_l)}{\sum_{k=1}^{L} \Pr(M_k)\Pr(X \mid M_k)}. \qquad (5)$$

[2] Taking the simplest two-variable model illustrates this: four models result, and each variable appears in only two of them. Given a non-zero weight for each model, the sum of weights on either variable cannot equal unity.
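The arithmetic behind this footnote can be demonstrated in a few lines. The sketch below is ours, not the authors'; the equal model weights are purely illustrative, since any weights that sum to one over all models make the same point.

```python
from itertools import combinations

# Two candidate regressors generate 2**2 = 4 models:
# {}, {x1}, {x2}, {x1, x2}.
variables = ["x1", "x2"]
models = [s for r in range(len(variables) + 1)
          for s in combinations(variables, r)]

w = 1.0 / len(models)   # equal weights, purely illustrative; they sum to one

for v in variables:
    n = sum(v in m for m in models)
    print(v, "appears in", n, "of", len(models),
          "models; its weight sum is", n * w)
# Each variable appears in only 2 of the 4 models, so the weights on it
# sum to 0.5 < 1 and its averaged coefficient is biased towards zero.
```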


In non-Bayesian contexts, information criteria might be considered, such as the Akaike or Schwarz information criteria. In this paper, following Buckland et al. (1997), an approximation to the Schwarz information criterion (SIC) is employed, which uses exp(−σ̂²_{v,l}/2) (almost equivalent for the small number of parameters considered here), where σ̂²_{v,l} denotes the residual variance of the l-th model:

$$\hat\sigma^2_{v,l} = \frac{1}{T}\sum_{t=1}^{T} \hat v_t^2.$$

Estimator averaging, therefore, uses the weights given by:

$$w_l = \frac{\exp\left(-\frac{1}{2}\hat\sigma^2_{v,l}\right)}{\sum_{l=1}^{L}\exp\left(-\frac{1}{2}\hat\sigma^2_{v,l}\right)}. \qquad (6)$$

Finally, out-of-sample methods might be used for weighting: Eklund & Karlsson (2004) suggest using predictive Bayesian densities, while Hendry & Clements (2004) discuss minimising the mean squared forecast error of the averaged model as a criterion for constructing weights. The justification for using SIC-based weights in this paper is that the Schwarz information criterion does not discriminate strongly between models differing by a regressor or two, a property in keeping with the Bayesian concern for model uncertainty; further, the SIC is an approximation to the Bayes factor. Thus the analytical results and Monte Carlo simulation results, it is argued, carry over to the more widely used Bayesian model averaging.

Model averaging is just one way of carrying out a data-focussed macroeconomic modelling exercise. Another method is general-to-specific model selection (see Hoover & Perez 1999, Hendry & Krolzig 2005, Perez-Amaral, Gallo & White 2003), whereby a general model is posited to include all possible factors contributing to the determination of a variable of interest, and a process of reduction is then carried out to leave the practitioner with the most parsimonious congruent and encompassing econometric model.
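To make the mechanics of (1), (3), (4) and (6) concrete before turning to the analytic examples, the following sketch (our own; the simulated data and impulse positions are hypothetical) fits every subset of K candidate regressors by OLS, weights each model by exp(−σ̂²_{v,l}/2), and reports both the unrescaled and the per-regressor rescaled averaged coefficients.

```python
import numpy as np
from itertools import combinations

def subset_average(y, X):
    """Fit every subset of the columns of X by OLS, weight each model by
    exp(-residual variance / 2) as in (6), and return the averaged
    coefficients under unrescaled (3) and per-regressor rescaled (4) weights."""
    T, K = X.shape
    models = [s for r in range(K + 1) for s in combinations(range(K), r)]
    C = np.empty(len(models))              # unnormalised weights C_l
    B = np.zeros((len(models), K))         # coefficients; zero if excluded
    for l, s in enumerate(models):
        if s:
            b, *_ = np.linalg.lstsq(X[:, list(s)], y, rcond=None)
            B[l, list(s)] = b
            resid = y - X[:, list(s)] @ b
        else:
            resid = y                      # the empty model
        C[l] = np.exp(-0.5 * np.mean(resid**2))
    w = C / C.sum()                        # equation (3)
    beta_plain = w @ B                     # equation (1)
    beta_rescaled = np.empty(K)
    for k in range(K):                     # equation (4), regressor by regressor
        in_k = np.array([k in s for s in models])
        beta_rescaled[k] = C[in_k] @ B[in_k, k] / C[in_k].sum()
    return beta_plain, beta_rescaled

# Hypothetical example: intercept, one relevant and one irrelevant impulse.
rng = np.random.default_rng(0)
T = 25
d1 = np.zeros(T); d1[10] = 1.0
d2 = np.zeros(T); d2[20] = 1.0
y = 1.0 - 5.0 * d1 + 0.1 * rng.standard_normal(T)
X = np.column_stack([np.ones(T), d1, d2])
print(subset_average(y, X))
```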

3 The bias when dummy variables are included

3.1 Orthogonal model with irrelevant dummy

We consider the simplest location-scale data generation process (DGP) in (7) with a transient mean shift, namely:

$$y_t = \beta + \gamma 1_{\{t=t_a\}} + v_t, \quad \text{where } v_t \sim \mathsf{IN}\left[0, \sigma_v^2\right], \qquad (7)$$

where 1_{t=t_a} denotes a zero-one observation-specific indicator, unity at observation t_a and zero otherwise. The parameter of interest is β, and the forecast will be for y_{T+1}, 1-step ahead from the forecast origin T. We consider the empirically relevant case where γ = λ√T for a fixed constant λ (see, e.g., Doornik, Hendry & Nielsen 1998), and neglect terms of O_p(T^{−1/2}) or smaller in the analytic derivations; the simulation illustration confirms their small impact on the outcomes. The postulated model has an intercept, augmented by adding one relevant and one irrelevant impulse dummy, denoted d_{1,t} = 1_{t=t_a} and d_{2,t} = 1_{t=t_b} respectively. This yields the general unrestricted model (GUM):

$$y_t = \beta + \gamma d_{1,t} + \delta d_{2,t} + u_t \qquad (8)$$

for t = 1, ..., T, where in the DGP δ = 0 and γ ≠ 0, the former holding in the sense that only one transient location shift actually occurred, although the investigator is unaware of that fact. Equation (8) is the starting point for model averaging, as it defines the set of variables from which all possible models are derived; it is also the starting point for model selection, which then follows a process of reduction to arrive at the most parsimonious congruent encompassing econometric model (see Hendry 1995, ch. 9). For model averaging, a regression is run on every subset of the regressors in (8), and the following 2³ = 8 possible models result:

  M0: β = 0; δ = 0; γ = 0     M1: δ = 0; γ = 0     M2: β = 0; γ = 0
  M3: β = 0; δ = 0            M4: γ = 0            M5: δ = 0
  M6: β = 0                   M7: (no restrictions)                    (9)

This yields eight estimated models, all fitted by least squares, where estimators are denoted by the subscript of their model number:

  M0: ŷ_t = 0                                M1: ŷ_t = β̂_(1)
  M2: ŷ_t = δ̂_(2) d_{2,t}                     M3: ŷ_t = γ̂_(3) d_{1,t}
  M4: ŷ_t = β̂_(4) + δ̂_(4) d_{2,t}             M5: ŷ_t = β̂_(5) + γ̂_(5) d_{1,t}
  M6: ŷ_t = γ̂_(6) d_{1,t} + δ̂_(6) d_{2,t}      M7: ŷ_t = β̂_(7) + γ̂_(7) d_{1,t} + δ̂_(7) d_{2,t}    (10)

3.1.1 Deriving the weights and estimates

For the regressors, using least squares we find that:

$$\hat\beta_{(0)} = \hat\beta_{(2)} = \hat\beta_{(3)} = \hat\beta_{(6)} = 0,$$

$$\hat\beta_{(1)} = \hat\beta_{(4)} = \frac{1}{T}\sum_{t=1}^{T} y_t = \frac{1}{T}\sum_{t=1}^{T}\left(\beta + \gamma 1_{\{t=t_a\}} + v_t\right) \simeq \beta + \frac{\gamma}{T} = \beta + \frac{\lambda}{\sqrt{T}},$$

$$\hat\beta_{(5)} = \hat\beta_{(7)} = \frac{1}{T-1}\sum_{t=1,\,t\neq t_a}^{T} y_t = \frac{1}{T-1}\sum_{t=1,\,t\neq t_a}^{T}\left(\beta + \gamma 1_{\{t=t_a\}} + v_t\right) \simeq \beta.$$

Hence there are three possible outcomes for estimating the parameter of interest (neglecting sampling variation as second order):

• β̂_i ≃ 0, when there is no intercept (M0, M2, M3, M6);
• β̂_i ≃ β, when an intercept and d_{1,t} are included (M5, M7); and
• β̂_i ≃ β + λ/√T, when an intercept, but no d_{1,t}, is included (M1, M4).
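These three outcomes are easily checked by simulation; a minimal sketch (ours, with arbitrary seed and impulse position):

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta, lam, sigma = 25, 1.0, -1.0, 0.1
gamma = lam * np.sqrt(T)                 # gamma = lambda * sqrt(T) = -5
d1 = np.zeros(T); d1[10] = 1.0           # relevant impulse at t_a = 10
y = beta + gamma * d1 + sigma * rng.standard_normal(T)

# M1-type (intercept only): the outlier contaminates the sample mean,
# so beta-hat is approximately beta + lambda / sqrt(T).
print("M1:", y.mean(), "vs", beta + lam / np.sqrt(T))
# M5-type (intercept and d1): the dummy absorbs the outlier, so the
# intercept is the mean over the remaining observations, close to beta.
print("M5:", y[d1 == 0].mean(), "vs", beta)
# M0-type (no intercept): the implicit beta-hat is zero by construction.
```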

All the derivations of the weights follow the same formulation. First, for M0, from (7):

$$\hat\sigma^2_{v,0} = \frac{1}{T}\sum_{t=1}^{T} y_t^2 = \frac{1}{T}\sum_{t=1}^{T}\left(\beta + \gamma 1_{\{t=t_a\}} + v_t\right)^2 = \frac{1}{T}\sum_{t=1}^{T}\left(\beta^2 + \gamma^2 1_{\{t=t_a\}} + v_t^2 + 2\beta\gamma 1_{\{t=t_a\}} + 2\beta v_t + 2\gamma 1_{\{t=t_a\}} v_t\right)$$
$$= \beta^2 + \bar\sigma_v^2 + 2\beta\bar v + \frac{1}{T}\left(\gamma^2 + 2\beta\gamma + 2\gamma v_{t_a}\right) = \beta^2 + \bar\sigma_v^2 + \lambda^2 + O_p\left(\frac{1}{\sqrt T}\right) \simeq \beta^2 + \sigma_v^2 + \lambda^2 \qquad (11)$$

where:

$$\bar\sigma_v^2 = \frac{1}{T}\sum_{t=1}^{T} v_t^2 \quad \text{and} \quad \bar v = \frac{1}{T}\sum_{t=1}^{T} v_t,$$

and the last line of (11) uses the asymptotic approximations:

$$\sqrt{T}\,\bar v \xrightarrow{\;D\;} \mathsf{N}\left[0, \sigma_v^2\right] \quad \text{and} \quad \bar\sigma_v^2 \xrightarrow{\;P\;} \sigma_v^2.$$

Clearly, β̂_(0) = 0 in M0, yet its weight will be non-zero in (1). A similar approach for M1 yields:

$$\hat\sigma^2_{v,1} = \frac{1}{T}\sum_{t=1}^{T}\left(\lambda\sqrt{T}\, 1_{\{t=t_a\}} + v_t - \frac{\lambda}{\sqrt T}\right)^2 \simeq \lambda^2 + \sigma_v^2,$$

since:

$$\hat\beta_{(1)} \simeq \beta + \frac{\lambda}{\sqrt T}.$$

Continuing through the remaining models delivers the complete set of approximate error variances:

$$\hat\sigma^2_{v,0} \simeq \beta^2 + \lambda^2 + \sigma_v^2; \quad \hat\sigma^2_{v,1} \simeq \lambda^2 + \sigma_v^2; \quad \hat\sigma^2_{v,2} \simeq \beta^2 + \lambda^2 + \sigma_v^2; \quad \hat\sigma^2_{v,3} \simeq \beta^2 + \sigma_v^2;$$
$$\hat\sigma^2_{v,4} \simeq \lambda^2 + \sigma_v^2; \quad \hat\sigma^2_{v,5} \simeq \sigma_v^2; \quad \hat\sigma^2_{v,6} \simeq \beta^2 + \sigma_v^2; \quad \hat\sigma^2_{v,7} \simeq \sigma_v^2.$$
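These approximations are easy to spot-check by simulation. The sketch below is ours (arbitrary seed and impulse position); T is taken large so that the neglected O_p(T^{−1/2}) terms are negligible, while λ² = γ²/T stays fixed as T grows.

```python
import numpy as np

rng = np.random.default_rng(2)
T, beta, lam, sigma = 10_000, 1.0, -1.0, 0.1
gamma = lam * np.sqrt(T)
d1 = np.zeros(T); d1[0] = 1.0
y = beta + gamma * d1 + sigma * rng.standard_normal(T)

# M0 (no regressors): residual variance ~ beta^2 + lambda^2 + sigma_v^2.
print(np.mean(y**2), "~", beta**2 + lam**2 + sigma**2)

# M5 (intercept and d1): both effects absorbed, variance ~ sigma_v^2.
b5 = y[1:].mean()              # intercept: mean over t != t_a
g5 = y[0] - b5                 # dummy coefficient
resid5 = y - b5 - g5 * d1
print(np.mean(resid5**2), "~", sigma**2)
```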

The error variance σ_v² enters all 8 models, while β² and λ² each enter 4. Cumulating these:

$$\tilde\beta \simeq (w_5 + w_7)\beta + (w_1 + w_4)\left(\beta + \frac{\lambda}{\sqrt T}\right) = (w_1 + w_4 + w_5 + w_7)\beta + (w_1 + w_4)\frac{\lambda}{\sqrt T}. \qquad (12)$$

Simulation confirms the accuracy of these calculations for the mean estimates of β, even for T as small as 25 (where the number of parameters might matter somewhat). From (12), the averaged coefficient will not equal the true coefficient so long as λ ≠ 0 and/or w_1 + w_4 + w_5 + w_7 < 1, which $\sum_{l=1}^{L} w_l = 1$ will imply in most cases. On the other hand, rescaling the weights will make w_1 + w_4 larger, and hence the bias induced by λ/√T will be greater. Rescaling will also mean that δ̃, the coefficient on the irrelevant regressor, receives greater weight, since the rescaling applies to all regressors; previously, if weights are taken as a reflection of the importance of a parameter, it received a low weighting.

3.1.2 Model averaging for forecasting stationary data

One justification for model averaging is as a method of ‘forecast pooling’, so we consider that aspect next. The outlier was one-off, so it will not recur in the forecast period, yielding:

  M0: ŷ_{T+1,0} = 0     M1: ŷ_{T+1,1} = β̂_(1)    M2: ŷ_{T+1,2} = 0
  M3: ŷ_{T+1,3} = 0     M4: ŷ_{T+1,4} = β̂_(4)    M5: ŷ_{T+1,5} = β̂_(5)
  M6: ŷ_{T+1,6} = 0     M7: ŷ_{T+1,7} = β̂_(7)

Letting:

$$\tilde y_{T+1|T} = \sum_{i=0}^{7} w_i \hat y_{T+1,i},$$

the forecast error is ṽ_{T+1|T} = y_{T+1} − ỹ_{T+1|T}, with mean:

$$\mathsf{E}\left[\tilde v_{T+1|T}\right] = (w_0 + w_2 + w_3 + w_6)\beta - (w_1 + w_4)\frac{\lambda}{\sqrt T}. \qquad (13)$$

Thus, forecasts can be considerably biased, for similar reasons that β̃ can be biased. The mean-square forecast error (MSFE) is:

$$\mathsf{E}\left[\tilde v_{T+1|T}^2\right] = \mathsf{E}\left[\left(y_{T+1} - \sum_{i=0}^{7} w_i \hat y_{T+1,i}\right)^2\right] = \sigma_v^2 + \left[(w_0 + w_2 + w_3 + w_6)\beta - (w_1 + w_4)\frac{\lambda}{\sqrt T}\right]^2$$
$$= \sigma_v^2 + (w_0 + w_2 + w_3 + w_6)^2\beta^2 + (w_1 + w_4)^2\frac{\lambda^2}{T} - 2(w_0 + w_2 + w_3 + w_6)(w_1 + w_4)\frac{\beta\lambda}{\sqrt T}, \qquad (14)$$

which for unrescaled weights is almost bound to be worse for large λ than the general unrestricted model (GUM) or any selected model, even allowing for estimation uncertainty.

3.1.3 Numerical Example

Consider β = 1, λ = −1, σ_v² = 0.01 (i.e., a standard deviation of 10%), and T = 25. Then, using the weights based on exp(−σ̂_v²/2) in (12):

$$\tilde\beta = (w_5 + w_7)\beta + (w_1 + w_4)\left(\beta + \frac{\lambda}{\sqrt T}\right) = 0.382 + (0.305)\left(1 - \frac{1}{5}\right) = 0.626, \qquad (15)$$

which is very biased for the true value of unity. The basic problem is that the weights increase too little for better models over poorer ones; in addition, the ‘irrelevant’ impulse creates several such poorer models in the averaging pool. The bias is smaller under the second weighting methodology outlined in Section 2: rewriting (15) as in (12),

$$\tilde\beta = (w_1 + w_4 + w_5 + w_7)\beta + (w_1 + w_4)\frac{\lambda}{\sqrt T} = 1 + 0.37754\left(-\frac{1}{5}\right) = 0.924,$$

since the weights on the β coefficient, which appears only in models 1, 4, 5 and 7, now sum to unity. The MSFE from (14) when forecasting without rescaling the weights is:

$$\mathsf{E}\left[\tilde v_{T+1|T}^2\right] = 0.118,$$

so the MSFE is almost 12-fold larger than the DGP error variance.
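These numbers can be checked mechanically from the approximate error variances of Section 3.1.1; the sketch below is ours. Because the leading-order variances drop the O_p(T^{−1/2}) terms, the unrescaled estimate it produces is of the same order as, but not identical to, the 0.626 above; the rescaled weight sum 0.37754, and hence the 0.924 figure, is reproduced exactly.

```python
import numpy as np

beta, lam, sigma2, T = 1.0, -1.0, 0.01, 25

# Approximate residual variances of M0..M7 from Section 3.1.1.
s2 = np.array([beta**2 + lam**2 + sigma2,   # M0
               lam**2 + sigma2,             # M1
               beta**2 + lam**2 + sigma2,   # M2
               beta**2 + sigma2,            # M3
               lam**2 + sigma2,             # M4
               sigma2,                      # M5
               beta**2 + sigma2,            # M6
               sigma2])                     # M7
C = np.exp(-0.5 * s2)                       # equation (6) numerators

# Unrescaled weights, equation (3), and the averaged intercept (12).
w = C / C.sum()
b_hat = np.array([0.0, beta + lam / np.sqrt(T), 0.0, 0.0,
                  beta + lam / np.sqrt(T), beta, 0.0, beta])
print("unrescaled beta-tilde:", w @ b_hat)

# Rescaled weights, equation (4), over the models containing beta.
has_beta = [1, 4, 5, 7]
w_r = C[has_beta] / C[has_beta].sum()
w14 = w_r[0] + w_r[1]                        # rescaled w1 + w4
print("rescaled w1 + w4:", w14)              # 0.37754
print("rescaled beta-tilde:", beta + w14 * lam / np.sqrt(T))   # 0.924
```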

It is harder to calculate the MSFE with the rescaled weights, because each weight depends on which coefficient it multiplies; one would expect the MSFE to be smaller when the weights are rescaled, since the bias on the coefficients is smaller in that case. Finally, considering the parameter values chosen for this example, γ = −5 is large when σ_v = 0.1, but outliers of magnitude √T often occur in practical models (see Hendry 2001, Doornik et al. 1998). In the Monte Carlo simulation, a range of values of λ and T is considered.

3.1.4 Monte Carlo Evidence

A Monte Carlo simulation of 1,000 replications was run to assess the impact of sampling distributions on the bias derived in Section 3.1.1. Table 2 reports the average bias on the β (first panel) and γ (second panel) coefficients under various modelling strategies, one per column: GUM is the general unrestricted model, i.e. the regression run on the entire dataset originally specified (equation (8)); MA is model averaging; MA R is model averaging with rescaled weights; Lib is model selection using the PcGets Liberal strategy; and Cons is model selection using the PcGets Conservative strategy.[3] A range of T and λ values is considered, along with the parameterisation of the numerical example in Section 3.1.3. As λ varies, so does γ, the coefficient on the relevant dummy; Table 1 shows the value of the γ coefficient for each (λ, T) combination.

[3] For a description of these strategies, which amount to differing stringencies of tests used at various stages of the selection procedure, see Hendry & Krolzig (2005).

Table 1: Size of γ coefficient for various values of λ and T.

  T \ λ      0   -0.05   -0.25    -0.5   -0.75      -1
  25         0   -0.25   -1.25    -2.5   -3.75      -5
  50         0  -0.354  -1.768  -3.536  -5.303  -7.071

From Table 2, the bias when simply the GUM is run is tiny on both coefficients. Model averaging induces a very large bias, ranging from about 30% of the true β coefficient size when the dummies are both insignificant (λ = 0) to around 40% when λ = −1, and the calculations of the previous section are supported: the analytic bias of −0.374 is reproduced, and in fact is stronger, at −0.397, in the Monte Carlo. The bias on γ under model averaging decreases as a fraction of the true coefficient from about 100% when the dummy is barely noticeable (λ = −0.05) to about 40% when the dummy is very conspicuous at λ = −1. Thus a picture of strong bias under model averaging emerges. Rescaling the weights, as described in Section 2, improves matters markedly: the bias on β does increase with the size of the true γ coefficient, holding T fixed, but is much smaller than when weights are not rescaled. The bias calculated in the previous section (−0.076 when λ = −1) is supported by the −0.079 found in the simulation, while if λ = −0.5, the bias is around 5% of the β coefficient size.

Table 2: Bias on coefficients in the DGP (equation (7)). Based on 1,000 replications.

Bias on β:
  λ       T     GUM      MA    MA R     Lib    Cons
  0       25   0.000  -0.319   0.000   0.000   0.000
          50  -0.001  -0.316  -0.001  -0.001  -0.001
 -0.05    25   0.000  -0.323  -0.006   0.000   0.000
          50  -0.001  -0.319  -0.004  -0.001  -0.001
 -0.25    25   0.000  -0.340  -0.026   0.000   0.000
          50  -0.001  -0.332  -0.018  -0.001  -0.001
 -0.5     25   0.000  -0.362  -0.048   0.000   0.000
          50  -0.001  -0.348  -0.034  -0.001  -0.001
 -0.75    25   0.000  -0.381  -0.067   0.000   0.000
          50  -0.001  -0.363  -0.047  -0.001  -0.001
 -1       25   0.000  -0.397  -0.079   0.000   0.000
          50  -0.001  -0.376  -0.055  -0.001  -0.001

Bias on γ:
  λ       T     GUM      MA    MA R     Lib    Cons
  0       25  -0.002   0.212   0.383  -0.073  -0.116
          50   0.000   0.211   0.381   0.026  -0.042
 -0.05    25  -0.002   0.324   0.383  -0.004  -0.007
          50   0.000   0.369   0.381  -0.002  -0.002
 -0.25    25  -0.002   0.766   0.383  -0.003  -0.003
          50   0.000   0.993   0.381  -0.002  -0.002
 -0.5     25  -0.002   1.277   0.383  -0.003  -0.003
          50   0.000   1.709   0.381  -0.002  -0.002
 -0.75    25  -0.002   1.692   0.383  -0.003  -0.003
          50   0.000   2.281   0.381  -0.002  -0.002
 -1       25  -0.002   1.966   0.383  -0.003  -0.003
          50   0.000   2.645   0.381  -0.002  -0.002

The bias on γ is invariant to changes in λ, and hence to changes in the size of the γ coefficient itself; once λ is non-zero, however, the bias is smaller when weights are rescaled.[4] When λ = 0, γ is the coefficient on an irrelevant regressor, and this highlights the problem of rescaling weights discussed in Section 3.1.1: it increases the weight on irrelevant regressors.[5] Table 2 also shows the performance of model selection: the bias is generally comparable to that of the GUM, hence negligible, while Tables 3 and 4, which report the percentage of simulation replications on which the DGP was selected as the final model by the model selection algorithm, show that in almost every replication the true DGP is found.[6]

[4] This invariance of the bias to λ is to be expected, since from (10), d_{1,t} appears only in M3, M5, M6 and M7. In M5 and M7 the coefficient is unbiased, γ̂_(5) ≃ γ̂_(7) ≃ γ, since those models include an intercept, but in M3 and M6 we have:

$$\hat\gamma_{(3)} = \frac{\sum_{t=1}^{T} d_t y_t}{\sum_{t=1}^{T} d_t^2} = \frac{\sum_{t=1}^{T} d_t\left(\beta + \gamma d_t + v_t\right)}{\sum_{t=1}^{T} d_t^2} = \frac{\beta \sum_{t=1}^{T} d_t}{\sum_{t=1}^{T} d_t^2} + \gamma \simeq \beta + \gamma.$$

Thus, when we consider the averaged coefficient:

$$\hat\gamma_{(a)} = (w_3 + w_6)(\beta + \gamma) + (w_5 + w_7)\gamma = (w_3 + w_5 + w_6 + w_7)\gamma + (w_3 + w_6)\beta, \qquad (16)$$

so if w_3 + w_5 + w_6 + w_7 = 1 (rescaled weights), the bias is independent of γ = λ√T for fixed T.

[5] Furthermore, the bias on δ̂, the coefficient on the irrelevant dummy, which is not reported, is larger when weights are rescaled.

[6] In fact the two relevant regressors were retained on every replication. It was only retention …

Table 3: Percentage of replications on which the specific model selected is the DGP, using the PcGets Liberal strategy. Based on 1,000 replications.

          λ = 0   λ = -0.1   λ = -0.25   λ = -0.5   λ = -0.75   λ = -1
  T = 25  91.1%      95.4%       95.8%      95.8%       95.8%    95.8%
  T = 50  91.6%      95.3%       95.3%      95.3%       95.3%    95.3%

Table 4: Percentage of replications on which the specific model selected is the DGP, using the PcGets Conservative strategy. Based on 1,000 replications.

          λ = 0   λ = -0.1   λ = -0.25   λ = -0.5   λ = -0.75   λ = -1
  T = 25  97.3%      97.3%       99.1%      99.1%       99.1%    99.1%
  T = 50  97.2%      99.0%       99.0%      99.0%       99.0%    99.0%

Table 5 provides information on the mean-square forecast error for the different modelling strategies: the MSFE for a 1-step forecast of T + 1 from T is given for the same strategies reported in Table 2. The huge MSFEs for MA predicted in Section 3.1.3 are indeed supported, but when the weights are rescaled for model averaging the MSFE is not noticeably worse than using the GUM, and the MSFEs over the various modelling strategies are indistinguishable. Thus the Monte Carlo simulations support the assertions from Section 3.1.1 of bias and terrible forecast performance when using model averaging in the presence of indicator variables. Rescaling weights does improve the bias performance of model averaging for relevant regressors, and substantially improves the MSFE. However, upward bias is increased for irrelevant variables, potentially giving false information as to the relevance of some variables, and strong bias remains for coefficients on dummy variables. The model here is extremely simple; however, it generalises quite easily analytically to a model with a regressor in place of the constant in (7), and Monte Carlo simulations give almost identical results; in fact a stronger bias of 0.103 on β is found in the λ = −1, T = 25 case. Generalising to large models with numerous regressors and dummies is analytically fearsome, but for empirical work it is important that such generalisations can be made. Longer dummy variables, and dummy variables that take unity in a certain state of the world, are considered in the next section, while larger models with impulse dummy variables are investigated via Monte Carlo simulation in Section 3.3.
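The model-averaging columns of the experiment behind Tables 2 and 5 can be replicated compactly (PcGets itself is not reproduced here). The sketch below is ours: the seed and impulse positions are assumptions, and with the default λ = −1 and T = 25 it yields biases on β in the region of the −0.397 (unrescaled) and −0.079 (rescaled) reported, together with an MSFE far above the DGP error variance of 0.01.

```python
import numpy as np
from itertools import combinations

def mc_bias(T=25, lam=-1.0, beta=1.0, sigma=0.1, reps=1000, seed=0):
    """Bias of the averaged intercept (unrescaled and rescaled weights)
    and 1-step MSFE for the DGP (7) under the GUM (8), using the
    exp(-residual variance / 2) weights of equation (6)."""
    rng = np.random.default_rng(seed)
    gamma = lam * np.sqrt(T)
    d1 = np.zeros(T); d1[T // 2] = 1.0     # relevant impulse (position arbitrary)
    d2 = np.zeros(T); d2[T // 4] = 1.0     # irrelevant impulse
    X = np.column_stack([np.ones(T), d1, d2])
    models = [s for r in range(4) for s in combinations(range(3), r)]
    b_plain, b_resc, sq_err = [], [], []
    for _ in range(reps):
        y = beta + gamma * d1 + sigma * rng.standard_normal(T)
        C = np.empty(len(models))
        B = np.zeros((len(models), 3))
        for l, s in enumerate(models):
            if s:
                b, *_ = np.linalg.lstsq(X[:, list(s)], y, rcond=None)
                B[l, list(s)] = b
                resid = y - X[:, list(s)] @ b
            else:
                resid = y                  # the empty model M0
            C[l] = np.exp(-0.5 * np.mean(resid**2))
        w = C / C.sum()
        b_plain.append(w @ B[:, 0])
        in0 = np.array([0 in s for s in models])   # models with an intercept
        b_resc.append(C[in0] @ B[in0, 0] / C[in0].sum())
        # At T+1 both dummies are zero, so the pooled forecast is the
        # averaged intercept; compare it with a fresh draw from the DGP.
        sq_err.append((beta + sigma * rng.standard_normal() - b_plain[-1])**2)
    return np.mean(b_plain) - beta, np.mean(b_resc) - beta, np.mean(sq_err)

print(mc_bias())   # approx (-0.4, -0.08, MSFE >> 0.01) at the default settings
```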

3.2 Period and intermittent dummy variables

If the dummy variable was postulated to be a period dummy, say d_t = 1_{\{t_1 \le t \le t_2\}}, …
