Computational Economics (2005) 25: 207–228 DOI: 10.1007/s10614-005-2209-8

© Springer 2005

Model Selection Using Information Criteria and Genetic Algorithms

KELVIN G. BALCOMBE
Imperial College, Wye Campus, London, U.K.; E-mail: [email protected]

Accepted 14 February 2005

Abstract. Automated model searches using information criteria are used for the estimation of linear single-equation models. Genetic algorithms are described and used for this purpose. These algorithms are shown to be a practical method for model selection when the number of sub-models is very large. Several examples are presented, including tests for bivariate Granger causality and seasonal unit roots. Automated selection of an autoregressive distributed lag model for the consumption function in the US is also undertaken.

Key words: algorithms, autoregressive, distributed lags, genetic, information criteria, model selection

JEL classifications: C32, C69

1. Introduction

Parsimoniously specified econometric models are often achieved by imposing restrictions sequentially. However, sequential reduction can be both complicated and problematic. This paper argues that, while they are not a panacea for the ills of pretesting, automated searches that are not based on sequential reduction have been insufficiently exploited. It examines the role of automated model selection using information criteria (IC) in conducting some popular econometric procedures such as testing for seasonal unit roots, causality, and variable selection within Autoregressive Distributed Lag (ARDL) models. The second contribution of this paper is to examine the use of genetic algorithms. There are many procedures which are commonly used to select the 'best subset' model. A full search over all subsets of a k-variable linear model becomes impractical with more than 20–25 variables, even when using efficient Gray coding (George and McCulloch, 1996). Alternatives include 'stepwise' algorithms (see Miller, 1990) and Bayesian approaches (George and McCulloch, 1993, 1996). However, a class of algorithms called 'genetic' is highly appealing yet seems not to have been used extensively within applied econometrics. This paper employs model searches in several contexts using data sets which have already been published in the econometric literature, in order to facilitate a comparison between the 'standard' approaches and the automated search results. Using


automated selection, the data in Beaulieu and Miron (1993) are 'tested' for unit roots, bivariate causality between a subset of the Nelson and Plosser (1982) series is examined, and the consumption function in the US is estimated using the data in De Crombrugghe et al. (1997). For conciseness, only two information criteria are employed in this paper. The most parsimonious of the commonly employed ICs, the Schwarz-Bayes (BIC), is used and compared with the posterior information forecast criterion (PICF) of Phillips (1995a,b). The paper proceeds by giving a brief overview of the important issues and coverage of the algorithms employed in Section 2. The applications are discussed in Section 3, along with a Monte Carlo assessment of the performance of the genetic algorithms. Section 4 presents the results when using empirical data and Section 5 concludes.

2. Automated Specification Searches Using Genetic Algorithms

Sequential 'testing down' is currently considered to be orthodoxy within a large section of the applied econometrics community. However, as widespread as this convention is, it finds little justification in classical statistics. For instance, in models containing lags, lags are usually selected as a first step before applying other tests. Most researchers will be aware that the number of lags selected can be pivotal in determining the results of subsequent tests. Even in relatively large samples, overparameterisation can distort the empirical size of tests, and reducing the number of lags can improve the power of tests. If, as a result of low power, false restrictions are imposed, further tests may be highly misleading. The inclusion of irrelevant variables, or the failure to impose correct restrictions in a regression, is often not innocuous. Moreover, even the relatively simple problem of characterizing trends in univariate series can become intricate when adopting a 'general to specific' sequential reduction approach (see Holden and Perlman, 1994).
While a typical ex-post account of model selection often describes a process that could have been automated, most econometricians are unlikely to be comfortable with doing so, and prefer the 'interactive' approach to model reduction or construction. However, in reality there is seldom any single clear ex ante reduction path. Researchers are unlikely to be willing to adopt rigid levels of significance for all tests, and the decision tree may be so vast as to make a prior plan of action incomprehensible. In short, the sequential reduction approach does not lend itself easily to automation. Sequential reduction is not the only option for model selection. For example, Bayesian approaches (Fernandez et al., 2001) use 'g-priors' in the construction of posterior model probabilities. The Bayesian literature has recently focused on the use of Markov Chain Monte Carlo (MCMC) algorithms in order to generate 'likely' models in conjunction with model averaging (e.g. Smith and Kohn, 1996). An alternative approach is to adopt an automated search over a range of models, using information criteria to rank each of the models. This approach has been used for lag selection, trends and unit roots in autoregressive models (Phillips, 1995a,b). Schwarz (1978) constructed information criteria based on the asymptotic


association with what could be termed 'model probabilities'. Likewise, Fernandez et al. (2001) show that some g-priors become asymptotically equivalent to information criteria (including the Schwarz-Bayes). These priors do not explicitly assume anything about the stationarity or non-stationarity of the data. As noted by Laud and Ibrahim (1995), an inherent problem with ICs is that they do not allow prior input for model choice and they are based heavily on asymptotic considerations. However, Ploberger and Phillips (2001, 2003) argue that the penalties required in non-stationary systems need to be higher than those required by the Schwarz-Bayes criterion in order to obtain 'consistent' model choice.

2.1. INFORMATION CRITERIA

Some commonly employed criteria are the Akaike (AIC), Schwarz-Bayes (BIC), and Hannan-Quinn (HQ) which, for a single equation, share the simple structure

C(k) = ln(σ²) + f(k, T)

where σ² is the maximum likelihood estimate of the error variance, k is the number of parameters, T is the sample size, and the penalties f(k, T) are 2k/T, 2 ln(ln(T))k/T and ln(T)k/T for the AIC, HQ and BIC respectively. As noted above, the PICF variant of the PIC criterion has both an intuitive and a theoretical basis, for which readers are referred to Ploberger and Phillips (1998a,b). The PICF is explicitly constructed from forecast errors as in (1):

\[
\ln(pf_k) = \frac{1}{2}\sum_{t=K+1}^{T} \ln f_{t,k} + \frac{1}{2}\sum_{t=K+1}^{T} \frac{v_{t,k}^2}{f_{t,k}} \tag{1}
\]

where T is the sample size, K is set such that K < T, and v_{t,k} is the one-step-ahead forecast error with forecast variance f_{t,k}. K must be larger than the number of parameters in the model. However, for a comparison of models, a common K is required. This can be achieved by choosing K to be some number larger than the number of parameters in the largest model. Ploberger and Phillips (2003) show that under stationarity the PIC criterion becomes asymptotically similar to the BIC, whereas in non-stationary conditions it tends to put a higher penalty on additional parameters. Chao and Phillips (1999) recently proposed an extension of the PIC criterion for the simultaneous selection of cointegrating rank and lag length. In the light of this work, the use of ICs in both stationary and non-stationary settings has been given some justification. A drawback of the PICF criterion is that it is computationally intensive: efficient computational methods are required to limit the number of matrix inversions (e.g. see Brown et al., 1975). However, for moderate sample sizes it is not impractical to use. Information criteria have become a popular method for lag selection. However, most practitioners are reluctant to use them more generally for model selection. The selection of models by minimizing a given criterion corresponds to a reduction strategy based on 'acceptance' of a model if a hypothesis cannot be rejected at a fixed level of significance. Criteria such as the HQ and BIC have penalties which


increase with the sample size. Consequently, for large samples, the HQ and BIC would be likely to select models with restrictions that could be rejected with high levels of confidence. This outcome may be unpalatable to some. However, if one derives disutility from committing Type I as well as Type II errors, then reducing the implicit level of significance as the sample size increases is arguably sensible. ICs can also be used to compare non-nested models. As alluded to in the opening section, sequential reduction can be problematic in this regard, since candidate models may often be nested within a general model, but not nested within each other. Different non-nested tests (Davidson and MacKinnon, 1993, pp. 381–388) often disagree, do not always provide an unambiguous ranking of one model over another, and often have theoretical underpinnings requiring stationarity of the data.
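As a concrete illustration of the shared structure C(k) = ln(σ²) + f(k, T), the three criteria can be computed from a single OLS fit. This is my own minimal sketch (the function name and the simulated data are not from the paper):

```python
import numpy as np

def info_criteria(y, X):
    """AIC, HQ and BIC for an OLS fit of y on X: each is ln(sigma2) + f(k, T),
    with sigma2 the ML estimate of the error variance, k the number of
    parameters and T the sample size (penalties as given in the text)."""
    T, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / T            # ML estimate: divide by T, not T - k
    base = np.log(sigma2)
    return {"AIC": base + 2 * k / T,
            "HQ":  base + 2 * np.log(np.log(T)) * k / T,
            "BIC": base + np.log(T) * k / T}
```

For any T larger than about 16 the penalties order as AIC < HQ < BIC, so the BIC is the most parsimonious of the three, consistent with its use in the paper.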

2.2. GENETIC ALGORITHMS

If searches over large dimensional spaces are required, then a brute-force search will be impractical. Genetic algorithms are employed here for this purpose. These algorithms incorporate aspects of natural selection, or survival of the fittest. Usually, such an algorithm maintains a population of structures (usually randomly generated initially) that evolves according to rules of selection, recombination, mutation and survival. A fitness or performance must in some way be assigned to each individual in the population. The fittest individuals are more likely to be selected for reproduction, retention or duplication, while recombination and mutation modify those individuals (for a flow-chart of a typical genetic algorithm see Koza, 1992, p. 29). Dorsey and Mayer (1995) note that while these algorithms derive their intuitive appeal from natural selection, a more rigorous foundation can be found in 'schema theory' (see Dorsey and Mayer for further references). Uses for these algorithms were found in engineering and other fields in the 1980s (see Goldberg, 1989; Koza, 1991). Economists first focused on genetic algorithms in the mid 1990s (see the special issue of Computational Economics, 1995, Vol. 8, No. 3). Within the econometric literature, Dorsey and Mayer (1995) conducted an extensive study of the performance of genetic algorithms in solving problems with multiple optima or where functions are non-differentiable. However, while these algorithms are well suited to the problem of variable selection in linear models, there has been no systematic use of them for this purpose, at least within the field of econometrics. This paper uses only one type of algorithm, sketched below, which was found to be an effective tool for model selection in problems with as many as 50 variables (>2^50 sub-models). It is not claimed that the algorithms used in the study are optimal.
Readers may also refer to Dorsey and Mayer (1995) for a brief discussion of strategies that may improve computational efficiency, although their discussion pertains to solutions for parameters in a continuous space. Genetic algorithms can be described using a bivariate example of a simple autoregressive distributed lag model without any deterministic


components

\[
y_t = \alpha y_{t-1} + \gamma_0 x_t + \gamma_1 x_{t-1} + e_t \tag{2}
\]

where y_t and x_t are variables indexed to time t, and α, γ_0 and γ_1 are parameters. For the moment, consider only the exclusion restrictions

\[
\alpha = 0, \quad \gamma_0 = 0, \quad \text{and} \quad \gamma_1 = 0 \tag{3}
\]

The unrestricted model is described in binary terms as (0, 0, 0), where a '0' implies that the restriction (in the order above) has not been imposed. A pure 'distributed lag' model is described as (1, 0, 0) and a pure autoregressive model as (0, 1, 1). There are eight possible permutations of this triple. If restrictions such as γ_0 + γ_1 = 0 are added then there are 16 possible models, each described by a 1 by 4 vector of zeros and ones (some of these models may be equivalent, e.g. (0, 1, 1, 0) would be equivalent to (0, 1, 1, 1) or (0, 1, 0, 1); however, this does not present a fundamental problem). It is simple to estimate all the possible models in the case above. However, for large models the number of sub-models makes estimating them all impractical. Genetic algorithms such as Algorithm One therefore become useful:

• Algorithm One:
Step 1. Randomly generate a set (numbering G0) of models using a binary representation (vectors of ones and zeros), with each element taking the value zero or one with probability 1/2. G0 can depend on the dimensionality of the search, and the population can also include models which are thought to be highly probable.
Step 2. Estimate all these models, give them a 'score', and rank each model according to this score. The score can be constructed using information criteria or, alternatively, some other measure such as forecasting performance.
Step 3. Kill off D0 of the lowest scoring models, leaving H0 = G0 − D0 models. Alternatively, the rank could constitute a selection probability (analogous to Dorsey and Mayer, 1995, Step 3). The H0 models then constitute the generation that is used to 'breed' the next.
Step 4. 'Breeding' from H0 can be achieved by giving each model a chance of a 'mutation', in that a zero may have a chance of turning into a one, or vice versa. Alternatively, each of the surviving models may be paired.
The 'offspring' of two models will acquire a mixture of the 'genes' of the two parents, each element of the offspring's vector being acquired from one of the 'parents' with some probability. For example, (1, 0, 0) and (1, 1, 0) would be likely to have the offspring (1, 1, 0) or (1, 0, 0). Without 'mutation', the offspring of these two individuals could be assigned by taking (1, x, 0), where the element x takes the value 1 or 0 with some


probability. 'Mutations' take place when both parents share a one or a zero, but there is a small probability (say 5%) that the offspring has the opposite value. The population of models cannot be allowed to expand indefinitely. Accordingly, each pair breeds a small number of offspring (two in the algorithms used here). Therefore, the new population, numbering G1, is generated from the H0 models of the previous generation. This 'breeding' step is similar to Steps 4–6 in Dorsey and Mayer (1995), and can also be described in terms of reproduction, crossover, and mutation.
Step 5. Return to Step 2 until there is a specified number of models left, or until a fixed number of iterations has been exceeded. In each iteration Hi = Gi − Di, where breeding and mutation produce Hi → Gi+1.

The same number (Di) need not be killed off in each generation. Herein, a large 'cull' was adopted after the first generation, leaving only a small percentage of high-performing models to breed (around 10% of the initial population); otherwise, around 25% of the worst performing models were eliminated. It is effective to run several algorithms independently, with the highest scoring model ('topdog' for short) being selected in each case. These independent trials should choose the same model more than once. The best models can then form part of the initial population for a final cycle. To this end, Algorithm Two can be employed:

• Algorithm Two:
Step 1. Run Algorithm One S times.
Step 2. Record the 'topdog' model for each run (TD_1, ..., TD_S).
Step 3. Enter (TD_1, ..., TD_S) into a final round of Algorithm One.

If the same model has not occurred several times in the first S rounds then, perhaps, G0 needs to be increased (or Di decreased).

3. Models and Performance of Search Procedures

This section introduces the models that will be used, and presents Monte Carlo evidence about the performance of the search procedures.
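The two algorithms above can be sketched in code. This is my own illustrative sketch, not the author's implementation: the 5% mutation rate, the large first-generation cull and the ~25% later culls follow the text, while the population sizes, generation counts and the abstract `score` function are assumptions.

```python
import random

def algorithm_one(score, k, g0=60, n_gen=8, p_mut=0.05, seed=()):
    """Genetic search over binary restriction vectors of length k.

    `score` maps a model (a tuple of 0/1 flags) to a criterion value
    (lower is better, e.g. the BIC). `seed` lets Algorithm Two inject
    'topdog' models into the initial population. Returns the best model seen."""
    pop = list(seed) + [tuple(random.randint(0, 1) for _ in range(k))
                        for _ in range(max(0, g0 - len(seed)))]
    for gen in range(n_gen):
        ranked = sorted(set(pop), key=score)
        # large cull after the first generation (~10% survive), ~25% culled later
        keep = max(2, len(ranked) // 10) if gen == 0 else max(2, 3 * len(ranked) // 4)
        parents = ranked[:keep]
        pop = list(parents)                     # survivors are retained (elitism)
        for _ in range(len(parents)):           # each pairing breeds offspring
            a, b = random.sample(parents, 2) if len(parents) > 1 else (parents[0], parents[0])
            child = tuple(
                ((1 - a[i]) if random.random() < p_mut else a[i])  # mutate on agreement
                if a[i] == b[i] else random.choice((a[i], b[i]))   # crossover otherwise
                for i in range(k))
            pop.append(child)
    return min(pop, key=score)

def algorithm_two(score, k, s=5):
    """Run Algorithm One S times, then a final round seeded with the topdogs."""
    topdogs = [algorithm_one(score, k) for _ in range(s)]
    return algorithm_one(score, k, seed=topdogs)
```

With a separable score (here, the Hamming distance to a known 'true' vector) the search converges quickly; in the paper the score would be an information criterion computed from an estimated model.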
With regard to performance, there are two separate issues: (i) the performance of a given criterion or score in identifying the generating process within a specified search; and (ii) the performance of the genetic algorithms in obtaining the top model for a given score (IC). The second of these issues is conceptually straightforward to investigate, though computationally expensive. There are several ways of exploring (ii). The first is to conduct a Monte Carlo exercise, whereby data from a known model are generated and the GA is then used for model selection. Unfortunately, there is no guarantee that the 'true' model has the best IC unless a full search can also be conducted. The algorithm may have worked well (issue ii), but the criterion may have performed poorly (issue i). If the


dimensionality is too large to permit a full search, one can verify that the model which is chosen has an IC at least as good as that of the true generating process.1 However, this evidence is rather imprecise. Therefore, alternatively, a model can be chosen which is small enough to allow computation of all possible sub-models (hence the 'fittest' model according to a given criterion is known). The performance of the GA in finding the top model (identified by the full search) can then be assessed. This approach is pursued in the following subsections in two ways. First, for a relatively small model space, a full search and a GA are run for every single Monte Carlo trial, and whether the GA selects the top model is recorded for each trial. Second, for a larger model space, a full search is conducted for a given model and set of (empirical) data only once; the GA is then run repeatedly and the proportion of trials in which it identifies this model is recorded. The performance of a given criterion is, conceptually, a more difficult issue to explore. The 'performance' of a given IC is relative, both in comparison with other ICs and with other methods of model selection (automated or not). The obvious benchmark that may be used for comparison is the standard 'testing down' approach (e.g. an F-test for causality after lag length selection). However, if automated selection is being used as a kind of hypothesis testing device, one cannot set the level of significance. Accordingly, when evaluating automated selection as a method to 'test' for seasonal roots or causality, the procedure adopted here is as follows:
• Compute the implicit 'size' using a model search for a given IC (the proportion of trials in which a search/criterion makes the incorrect choice).
• Compare the power (the ability to reject an incorrect hypothesis) of a 'standard approach' (i.e. a t- or F-test) with that of the model search approach, where the implicit size from the model search procedure is used as the significance level of the test.

3.1. TESTING FOR SEASONAL UNIT ROOTS

Phillips (1995a,b) introduced an automated model selection approach to testing for unit roots at the 'zero frequency'. This section examines the potential of this approach in the seasonal root case. Tests for unit roots have been extended into the seasonal context by Hylleberg et al. (1990) (henceforth referred to as HEGY tests) and Beaulieu and Miron (1993). In order to test for seasonal roots, the autoregression

\[
\Delta_s y_t = \sum_{i=1}^{I} \delta_i d_{it} + \sum_{i=1}^{k_z} \lambda_i z_{it} + \sum_{i=1}^{k_y} \alpha_i \Delta_s y_{t-i} + e_t \tag{4}
\]

must be estimated, where Δ_s y_t = y_t − y_{t−s}, the z_it are lag polynomials of y_t (see Hylleberg et al., 1990; Beaulieu and Miron, 1993 for details), and the d_it are intercepts, trends and seasonal dummies. Within this approach (4) is estimated


using OLS, and single and joint exclusion restrictions on the z_it are used to infer whether unit roots exist in y_t at the corresponding frequencies. If λ1 = 0, a unit root exists at the non-seasonal (zero) frequency. If λ2 = 0, or jointly λi = λi+1 = 0 for i = 3, 5, . . . , s − 1, then there are unit roots at some of the seasonal frequencies. Thus, 'testing' for a unit root can be framed as a variable (or model) selection problem. For the monthly case (s = 12), the selection of a model with λ1 = 0 is, in essence, an 'acceptance' of a unit root at the zero frequency. Likewise, if λ2 = 0 is imposed, then a root is 'accepted' at this frequency. Conversely, if z_1t, z_2t and at least one in each of the pairs (z_3t, z_4t), (z_5t, z_6t), . . . , (z_11t, z_12t) are included in the final model, unit roots are rejected at all frequencies. Within the automated approach, the final lag length, the inclusion of deterministic components and the number of unit roots can be determined simultaneously.

3.1.1. Monte Carlo Evidence

In Table I the data have been generated by the process y_t − θy_{t−4} = e_t, where e_t is standard normal iid and θ is equal to 1 (in order to calculate the size of the seasonal root tests) or, alternatively, 0.66 (in order to calculate the power). An estimating equation was specified as in equation (4). The total number of sub-models in this instance (when including a time trend, seasonal dummies and four lags) was 2^12 = 4096. By contrast, the genetic algorithm was set so as to estimate around 1000 models in total. The relatively small number of sub-models in this case enables not only a full search to be conducted, but also a Monte Carlo exercise to be performed. For each trial the model was estimated using both a full search and the GA. Whether a given hypothesis was rejected (i.e. whether a given variable was included) was also recorded in each trial, for θ = 1 and θ = .66 respectively.
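The Monte Carlo design just described can be sketched in code. This is an illustrative toy version only: the DGP y_t = θy_{t−4} + e_t and the full search over all regressor subsets follow the text, but the HEGY z-transforms and deterministic terms are omitted, and all function names are mine.

```python
import itertools
import numpy as np

def simulate_seasonal(T, theta, s=4, rng=None):
    """Generate y_t = theta * y_{t-s} + e_t (theta = 1 gives seasonal unit roots)."""
    rng = rng or np.random.default_rng()
    e = rng.standard_normal(T + s)
    y = np.zeros(T + s)
    for t in range(s, T + s):
        y[t] = theta * y[t - s] + e[t]
    return y[s:]

def bic(y, X):
    """BIC of an OLS fit; X may have zero columns (the null model)."""
    T = len(y)
    resid = y if X.shape[1] == 0 else y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.log(resid @ resid / T) + np.log(T) * X.shape[1] / T

def full_search(y, X):
    """Score every subset of the columns of X by BIC; return the best 0/1 mask."""
    k = X.shape[1]
    return min(itertools.product((0, 1), repeat=k),
               key=lambda m: bic(y, X[:, [i for i in range(k) if m[i]]]))
```

In the experiment proper, X would hold the HEGY z-variables, deterministic terms and lagged Δ_s y; the 'implicit size' is then the proportion of θ = 1 trials in which the selected mask makes the wrong choice about the relevant z column.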
This enables the 'implicit size' for a given criterion (and search) to be computed, along with a record of whether the GA actually found the top model according to a given criterion. Further Monte Carlo trials were then performed to find the implicit critical value to be used in evaluating the power of an F-test or t-test for the hypotheses λ1 = 0, λ2 = 0 and λ3 = λ4 = 0 under the standard HEGY method. These can then be used to assess the power of the HEGY method at θ = .66, at the implicit size generated by the search procedure. In the first line of Table I, the BIC, at a sample size of 50, has an implicit size of 41.5% when testing the hypothesis of a unit root at the zero frequency. Adopting this size and applying the HEGY procedure would give a critical value of −2.33. However, the power to reject a unit root at the zero frequency when data are generated using θ = .66, at a 41.5% level of significance, is 53.7% for the HEGY procedure, whereas the search procedure is able to reject this false hypothesis around 64% of the time. This demonstrates that, at the implicit level of significance of the BIC search, the automated search has superior power compared with the conventional HEGY approach. The same picture emerges at sample sizes 100 and 200 when using the BIC. The last column of Table I indicates that the GA found the top model for a given score (found by a full search) over


Table I. Size and power of seasonal root tests: GA vs. standard tests.

                                          λ1 = 0    λ2 = 0    λ3 = 0, λ4 = 0
T = 50,  BIC   Implicit size (IS)         .415      .134      .212
               Implicit critical value    −2.33     −2.42     4.41
               Power of GA                .642      .426      .75
               Power of test (at IS)      .537      .25       .46
T = 50,  PICF  Implicit size (IS)         .306      .275      .418
               Implicit critical value    −2.57     −1.96     2.99
               Power of GA                .495      .61       .843
               Power of test (at IS)      .409      .475      .723
T = 100, BIC   Implicit size (IS)         .325      .078      .124
               Implicit critical value    −2.49     −2.7      5.41
               Power of GA                .771      .626      .946
               Power of test (at IS)      .667      .375      .689
T = 100, PICF  Implicit size (IS)         .212      .197      .289
               Implicit critical value    −2.8      −2.19     3.82
               Power of GA                .576      .732      .935
               Power of test (at IS)      .49       .68       .90
T = 200, BIC   Implicit size (IS)         .233      .042      .078
               Implicit critical value    −2.71     −2.93     6.06
               Power of GA                .980      .965      .999
               Power of test (at IS)      .914      .772      .987
T = 200, PICF  Implicit size (IS)         .123      .12       .185
               Implicit critical value    −3.038    −2.47     4.65
               Power of GA                .855      .940      .997
               Power of test (at IS)      .787      .940      .998

%GA = full search: 99.68, 99.79, 99.73, 99.76, 99.85, 99.92, 99.78, 99.94, 100.00, 100.00, 99.94, 99.98.

Note. In all cases, the results are based on 10000 trials. Power calculations are based on the DGP y_t = .66y_{t−4} + e_t, where e_t is normal iid.

99.5% in all cases. A similar story emerges when examining the PICF: at the implicit level of significance of the PICF search, the search procedure has superior power to that of the HEGY test. For both information criteria, and for tests of λ1 = 0, λ2 = 0 and λ3 = λ4 = 0, the search procedures have superior power (at the implicit level of significance) when compared to the standard HEGY procedure. Interestingly, when comparing the two criteria, quite a different pattern can be observed across the tests for roots at alternative frequencies. The implicit significance of the PICF is less than that of the BIC for a test of a unit root at the zero frequency, but the converse is true at the seasonal frequencies (e.g. λ2 = 0). Thus, compared with the BIC, the PICF appears to be less likely to reject a unit root at the zero frequency, but more likely to reject unit roots at the seasonal frequencies associated with λ2, λ3 and λ4. Finally, as would be expected, the implicit level of significance decreases as the sample size grows, but the power increases, for both criteria.

3.2. CAUSALITY

If x can be used to improve the one-step-ahead forecasts of y (over and above using past y alone), then x 'Granger-causes' y. Historically, the most popular test for causality has been an F-test for γ1 = γ2 = . . . = γ_kx = 0 in the regression

\[
y_t = \sum_{i=1}^{I} \delta_i d_{it} + \sum_{i=1}^{k_y} \alpha_i y_{t-i} + \sum_{j=1}^{k_x} \gamma_j x_{t-j} + e_t \tag{5}
\]

estimated either as a single equation, or as one part of a VAR. This requires the prior selection of both the elements of d_it as well as k_y and k_x. Commonly, a two-stage procedure is used where k_y = k_x is set to k in the first instance and k is selected by maximizing an information criterion. Alternatively, k is reduced until the remaining lags become jointly significant or there is evidence of serial correlation in the residuals. The joint test of H0: γ1 = γ2 = . . . = γk = 0 is then conducted. This approach may give results that are misleading or have poor power. It may be difficult to estimate k correctly. However, if the underlying generating process has k_y > k_x, then even if k is correctly selected (as k_y) the inclusion of redundant lags of x endows the joint test with less power. Moreover, if there is a 'lag phase' (i.e. γ1 = · · · = γ_{k−1} = 0, but γ_k ≠ 0) the test will also have poor power (for a full exploration of this issue in relation to the impact of research and development expenditures on productivity, see Balcombe et al., 2005). An alternative is to conduct a specification search over all possible sub-models of (5). In addition to the exclusion restrictions, summation restrictions of the form

\[
\sum_{i=1}^{k_y} \alpha_i = 1 \quad \text{and} \quad \sum_{j=1}^{k_x} \gamma_j = 0 \tag{6}
\]

may be added. The inclusion of at least one x_{t−j} implies Granger causality of x on y. This requires no pretesting for unit roots and so forth. That the PICF is explicitly based on one-step-ahead forecast errors makes it additionally attractive, given the definition of Granger causality.

3.2.1. Monte Carlo Evidence

In order to evaluate the automated model selection approach to 'testing' for causality, a variation on the Monte Carlo procedures used for the seasonality tests is explored in this subsection. In generating the results in Table II, the following


Table II. Causality.

                                  T = 50    T = 100    T = 200
BIC    Implicit size (IS)         .227      .147       .090
       Power of GA                .507      .650       .881
       Power of F-test (at IS)    .438      .531       .730
PICF   Implicit size (IS)         .417      .284       .200
       Power of GA                .646      .749       .917
       Power of F-test (at IS)    .442      .490       .741

process was used:

\[
y_t = \alpha y_{t-1} + \gamma x_{t-2} + e_t \tag{7}
\]

where both x_t and e_t are iid standard normal with mean zero. When calculating the 'implicit size' for each of the criteria, γ was set to 0. The power calculations used γ = .25. As with the seasonality case, Monte Carlo trials were conducted at sample sizes T = 50, 100 and 200. The finite sampling distributions of test statistics can diverge from their asymptotic distributions depending on the value of the autoregressive parameter α. However, these differences become large only when x_t is serially correlated. When x_t is iid, the results are largely invariant (this was established using Monte Carlo trials) to the value of α between 0 and 1 in absolute terms. Therefore, while in the experiments below α was set to 0.5, the results were approximately the same for other values of α between 0 and 1 (inclusive). The specification of x_{t−2}, rather than x_{t−1}, within the DGP is in order to illustrate the arguments alluded to in Section 3. In the majority of causality tests, the test for causality (t- or F-test) is preceded by lag selection. This step is likely to improve the power of tests in some circumstances, since otherwise the test will be diluted by irrelevant lags. On the other hand, if relevant lags are eliminated, then lag selection from 'long to short' lags will be counterproductive. When conducting the causality tests the encompassing model is a second-order ADL. Since this is a small dimensional model, a full search can be conducted, and there is no need for a genetic algorithm. Therefore, the results in Table II are generated by a full search; the performance of the genetic algorithm is explored in Table III, using a different approach. Thus, Table II addresses issue (i) rather than issue (ii). The results in Table II confirm that, at the implicit size of each of the criteria, both search procedures have superior power to the test which uses sequential reduction on the lag length (using the BIC criterion) followed by an F-test for causality.
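A toy version of this experiment can be sketched as follows. This is my own illustration: the DGP is equation (7), the encompassing model is a second-order ADL, and the 'implicit size' is computed as the fraction of γ = 0 trials in which the BIC-best sub-model retains a lag of x; the trial counts and helper names are assumptions.

```python
import itertools
import numpy as np

def simulate_adl(T, alpha=0.5, gamma=0.0, rng=None):
    """DGP of equation (7): y_t = alpha*y_{t-1} + gamma*x_{t-2} + e_t, x_t iid N(0,1)."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(T + 2)
    e = rng.standard_normal(T + 2)
    y = np.zeros(T + 2)
    for t in range(2, T + 2):
        y[t] = alpha * y[t - 1] + gamma * x[t - 2] + e[t]
    return y[2:], x[2:]

def implicit_size(T=100, trials=200, rng=None):
    """Fraction of gamma = 0 trials in which the BIC-best sub-model of a
    second-order ADL retains a lag of x (a spurious 'causality' finding)."""
    rng = rng or np.random.default_rng(0)
    spurious = 0
    for _ in range(trials):
        y, x = simulate_adl(T, rng=rng)
        Y = y[2:]
        cols = np.column_stack([y[1:-1], y[:-2], x[1:-1], x[:-2]])
        def bic(mask):
            X = cols[:, [i for i in range(4) if mask[i]]]
            resid = Y if not any(mask) else Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
            return np.log(resid @ resid / len(Y)) + np.log(len(Y)) * sum(mask) / len(Y)
        best = min(itertools.product((0, 1), repeat=4), key=bic)
        spurious += best[2] or best[3]      # an x lag survived the search
    return spurious / trials
```

The power is obtained the same way with γ = .25; comparing it with an F-test at the implicit size mirrors the design behind Table II.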
In order to explore the performance of the genetic algorithm (issue ii), the approach outlined earlier in the section is employed. A useful algorithm should be


Table III. Causality.

              Unemployment on wage    Wage on unemployment
n = 11600     1000/1000               1000/1000
n = 5830      998/1000                982/1000
n = 2933      974/1000                872/1000

Note. n is the number of models estimated using the genetic algorithm; 998/1000 denotes that the correct model was chosen in 998 out of 1000 trials. N = 2^16 = 65536 is the total number of possible sub-models.

able to find the correct model having performed significantly fewer regressions than the full search. One such trial was performed with real data (Nelson and Plosser, 1982) for a two-variable case with 6 lags of each variable (with a trend). The results for regressions of wage on unemployment, and unemployment on wage, are presented in Table III. The number of regressions used in the GA (n) was set to approximately 15, 10 and 5% of that required by the full search (N = 65536). The algorithm was run 1000 times in each case. As can be observed, the correct model was nearly always selected where n/N = .1 (10%), and was always chosen where n/N = .15 (15%). The gains from using a genetic algorithm increase with the number of sub-models; hence, these results massively understate the efficiency gains in large models. For instance, trials were conducted for k = 20 (over 1 million sub-models). Only 50 thousand (5%) of the models were estimated using the GAs, and the correct model was chosen 100% of the time.

3.3. AUTOREGRESSIVE DISTRIBUTED LAG MODELS (ARDLs)

The basic structure of the ARDL is

$$
y_t \;=\; \sum_{i=1}^{I} \delta_i d_{it} \;+\; \sum_{i=1}^{k_y} \alpha_i y_{t-i} \;+\; \sum_{j=1}^{J} \sum_{l=0}^{k_j} \gamma_{lj}\, x_{j,t-l} \;+\; v_t \qquad (8)
$$

where the d_it represent deterministic or unlagged stochastic variables (intercept, time trend, dummies), y is the dependent variable, and x_{j,t−l} is the lth lag of the jth variable (in many cases l is set to one rather than zero). Hendry (1995, p. 232) gives a 'typology' of linear dynamic equations which are special cases of the above. These include the static regression, along with
• the distributed lag: α_i = 0 for all i;
• the autoregressive model: γ_{lj} = 0 for all l, j;
• the error correction model: Σ_i α_i = 1; and
• the differenced model: Σ_i α_i = 1 and Σ_{l=0}^{k_j} γ_{lj} = 0 for all j.
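The regressor matrix for (8) can be assembled from explicit lag lists, which is also what makes 'holes' in the lag structure straightforward to represent. A sketch, with the function name and interface invented for illustration:

```python
import numpy as np

def ardl_design(y, xs, y_lags, x_lags):
    """Regressor matrix for the ARDL in (8), built from explicit lag lists so
    that 'holes' are allowed (e.g. y_lags=[1, 3] skips lag 2).
    xs: dict name -> series; x_lags: dict name -> list of lags (0 = current)."""
    p = max([0] + list(y_lags) + [l for ls in x_lags.values() for l in ls])
    T = len(y)
    cols, names = [np.ones(T - p)], ["const"]
    for l in y_lags:
        cols.append(y[p - l:T - l]); names.append(f"y(t-{l})")
    for name, ls in x_lags.items():
        for l in ls:
            cols.append(xs[name][p - l:T - l]); names.append(f"{name}(t-{l})")
    # rows correspond to t = p, ..., T-1; the dependent variable is y[p:]
    return np.column_stack(cols), names, y[p:]

# Illustrative use: lags {1, 3} of y (a 'hole' at lag 2) and lags {0, 2} of x
y = np.arange(10.0)
x = 2.0 * np.arange(10.0)
X, names, target = ardl_design(y, {"x": x}, [1, 3], {"x": [0, 2]})
```

Allowing holes enlarges the model space considerably: with, say, 6 lags of y and 7 lags (0–6) of a single regressor, holes give 2^13 sub-models, against 7 × 8 when only the maximum lags k_y and k_j are chosen.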


Most automated procedures for lag-selection (for example in Pesaran and Pesaran, 1997) limit themselves to the selection of k_y and the k_j's. However, we might like to allow models in which γ_{lj} = 0 for some lag l < k_j without imposing γ_{l′j} = 0 for all l′ > l. Here, this type of restriction is referred to as a 'hole' in the lag structure. Holes in the lag structure are quite a plausible phenomenon, and in the case of seasonal series lags are even more likely to contain holes. Viewed purely as forecasting tools, there can be little objection to automating model selection. However, if the resulting equation purports to have parameters which are 'behavioral', there may be considerable objections to this approach. There is a large literature in this area that cannot be done justice here. However, in defense of the automated approach, the author offers the following additional comments:
• The specification search can be made as narrow as the researcher wishes. If solid grounds for imposing, or not imposing, a prior restriction can be made from a structural point of view, the researcher has the opportunity to build this in;
• Diagnostic testing can be performed on the resulting equations and, if necessary, be integrated into the selection of the final equations;
• A model which has been automatically selected may serve as a useful 'benchmark' model against which alternative models might be evaluated.

4. Empirical Section

4.1. SEASONAL ROOTS

In order to demonstrate the application of automated model selection to the seasonal root case outlined earlier, seasonal roots were 'tested' using the data in Beaulieu and Miron (1993). In all cases except the real wage series, the regressions were run over the period 50:1–88:12 in order to correspond approximately to those used by Beaulieu and Miron (1993) (henceforth B&M). For the real wage series, the regression was run from 69:1–88:12. A maximum of 12 lags of Δ12 yt were placed into the regressions. An intercept was (as in all cases in this paper) always forced into the regression, and seasonal dummies and a linear trend were included among the candidate variables. The models were selected using the PICF and BIC criteria. Since there are 37 variables that may optionally be selected, a GA was used to select a model. Given the long sample sizes for this type of problem, the use of the PICF criterion is slow, since the one-step ahead forecast errors have to be computed for over 400 periods when estimating each model. Each model required approximately 6 hours of computing time using the PICF, whereas the BIC models could be selected within an hour. A summary of the results for the PICF and BIC criteria is given in Table IV. These results can be compared with those of Table IV in B&M. The t-statistics and standard errors for the resulting models are of limited interest and have therefore not been presented. z2 was selected along with one of the z variables in each of the

Table IV. Seasonal root tests.

Series      IC     Trend   Dummies^c            z1   z2   z3,z4   z5,z6   z7,z8   z9,z10   z11,z12   Lags
Real wage   PICF   –       1, 7, 8, 9           –    y    3, 4    5       8       10       12        1, 8, 9, 11
Real wage   BIC    –       1, 7, 9              –    y    3, 4    5       8       9        12        1, 8, 9, 11
Price       PICF   –       6, 8, 11             –    y    3, 4    5, 6    8       9        12        –
Price       BIC    –       8, 11                –    y    3, 4    5, 6    8       9        12        –
Unemp       PICF   –       1, 3, 4, 5, 6        –    y    3, 4    5       7, 8    9        12        1, 7
Unemp       BIC    –       1, 6, 8, 11          –    y    3, 4    5       8       9        12        1, 7
Ind-Prod    PICF   –       2, 6, 7, 8, 9, 10    –    y    4       5, 6    8       9        11, 12    1, 4
Ind-Prod    BIC    –       2, 6, 7, 8, 9        –    y    3, 4    5, 6    8       9        12        1, 5, 6, 11
Nom Rate    PICF   –       7                    y    y    4       5, 6    8       9, 10    12        1, 7, 9
Nom Rate    BIC    –       –                    y    y    4       5, 6    8       9, 10    12        1, 3, 4, 6, 10

Note. An en-dash (–) indicates that the variable concerned was not selected; an intercept was 'forced into' all regressions. ^c Dummies refer to the seasonal dummies January–November.


pairs when using either of the criteria. Therefore, as outlined in Section 3.1, seasonal roots have not been selected in any of the 5 series, using either of the criteria. This concurs with the results of B&M. The only substantive difference found here is that the nominal rate series is found to be stationary (again using either criterion), since z1 is included for this series. B&M did not reject unit roots at the non-seasonal frequencies for any of the series. The fact that the trend was also eliminated suggests that the nominal rate series is weakly stationary. Although not presented here, model selection was also undertaken where holes in the lag structure were not allowed; this did not make any substantive difference to the findings with regard to the unit roots. Differences between the models chosen by the BIC and PICF exist, but they do not lead to different conclusions concerning unit roots. There are similar patterns in both the dummy variables and the lag selection for most of the series.
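The computing times quoted above reflect the recursion that forecast-based criteria require: the model is re-estimated at every period. A sketch of that costly step, assuming the criterion accumulates one-step-ahead OLS forecast errors (the precise PICF definition is given earlier in the paper):

```python
import numpy as np

def one_step_errors(y, X, t0):
    """One-step-ahead OLS forecast errors: for each t >= t0, re-estimate the
    model on observations before t and predict y[t]. This per-period
    re-estimation is why forecast-based criteria are far slower than a
    single-fit criterion such as the BIC."""
    errs = []
    for t in range(t0, len(y)):
        beta, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
        errs.append(y[t] - X[t] @ beta)
    return np.asarray(errs)

# Illustrative check: with an exact linear relation, every forecast error is zero
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(60), rng.standard_normal(60)])
y = X @ np.array([1.0, 2.0])
errs = one_step_errors(y, X, 10)
```

With over 400 forecast periods per candidate model and thousands of candidates visited by the GA, this inner loop dominates the cost, consistent with the 6-hours-versus-1-hour comparison above.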

4.2. BIVARIATE CAUSALITY

Bivariate causality is examined in this section using five variables from the extended Nelson and Plosser (1982) data set: real GNP, money, wages, unemployment and interest rates. The period over which estimation took place was in each case 80 years (1909 to 1988). The results are presented in Tables V and VI: Table V gives the results for the PICF and Table VI gives the results for the BIC. The standard errors and t-statistics are not presented for the sake of conciseness, and an intercept is included in all models. The tables also indicate the findings given by an F-test for causality using a VAR at the 5% level of significance; a superscript 'A' denotes agreement, and a 'D' a disagreement, between the selected model and this F-test. Thus, in Table V, the second row and third column indicates that real GNP was Granger-caused by money, since some lags of x were included (x(3, 5, 6, h) indicates that xt−3, xt−5 and xt−6 were included, with the h indicating that the coefficients sum to zero). A unit root was selected within the lag polynomial of the dependent variable, since the ur in y(1, 2, ur) signifies a unit root. Tests for cointegration were performed in a VAR with the lags first selected using the BIC criterion, provided there was no serial correlation. Likelihood ratio tests for rank restrictions did not support cointegration between any of the variables at a 5% level of significance. All data were then differenced, and an F-test for causality was performed each way.2 With regard to causality, the PICF and BIC results unambiguously suggest that interest rates neither cause, nor are caused by, the other variables. This is also broadly consistent with the F-tests for causality. The inclusion of the interest rate variable is therefore useful, since it provides the univariate characterization of the data for each of the variables (i.e. if the univariate models were chosen as in Phillips, 1995a,b, the resulting models would be as they are in Tables V and VI). Therefore, these constitute tests for 'unit roots' in the manner of Phillips (1995a,b). As can


Table V. Results for bivariate causality (PICF). Dependent Real GNP

Money

Wages

Unemp

Interest

Real GNP

No-trend y(1, 2, ur)

No-trend y(1, 2, ur)

No-trend y(1, 2, 5, ur)

No-trend y(1, 2, ur)

x(3, 5, 6, h) D .05267

x(3, 5, h) D .05085 No-trend y(1, 2, 3, ur)

x(5, nh) D .05108 No-trend y(1, 2, 3, 6, ur)

x(·) D .05234 No-trend y(1, 2, 3, ur)

x(·) A .04369

x(4, nh) D .04330 No-trend y(1, 2, ur)

x(·) A .04369 No-trend y(1, 2, ur)

x(3, 4, nh) D .03397

x(·) A .03507 No-trend y(1, 2, 3, ur)

Money

Wages

Unemp

Interest

No-trend y(1, 2, 3, ur) x(1, 2, h) A .04368 No-trend y(1, ur)

No-trend y(1, 2, ur)

x(1, 2, 3, 5, 6 h) A .3437 No-trend y(1, 2, 3, ur)

x(1, 2, 3, h) A .03471 No-trend y(1, 2, 3, ur)

No-trend y(1, 2, 3, ur)

x(1, 3, h) A .28776 No-trend y(1, 5, 6, ur)

x(1, 2, h) A .3036 No-trend y(1, 5, 6, ur)

x(1, 3, h) A .2919 No-trend No-trend y(1, 5, 6, ur) y(1, 5, 6, ur)

x(·) A .7727

x(·) A .7727

x(·) A .7727

x(·) A .30778

x(·) A .7727

An A (D) superscript denotes agreement (disagreement) with the causality F-test; y(1, 3, ur) denotes that lags yt−1 and yt−3 have been selected, and ur (nur) indicates that their coefficients add (do not add) to one; x(1, 3, h) denotes that lags xt−1 and xt−3 have been selected, and h (nh) indicates that their coefficients add (do not add) to zero. The values given at the bottom of each cell are the average root mean square errors of the one-step ahead forecasts.

be seen in the final columns of Tables V and VI, the PICF selects unit roots for all of the variables concerned. The BIC, on the other hand, characterizes real GNP and unemployment as trend stationary and stationary respectively. Therefore, the PICF tends to characterize the data as having unit roots, with an absence of any cointegrating relationships. The BIC also tends to characterize the data in this way, except for the unemployment and real GNP series, which it suggests are stationary. With regard to causality, both criteria are broadly in agreement with the findings of the F-test. However, the PICF criterion finds more evidence of causality between variables than does the BIC. In particular, real GNP is found to be caused by both money and wages when using the PICF, whereas the BIC and the F-tests do not favor this hypothesis.


Table VI. Results for bivariate causality (BIC). Dependent

Real GNP

Real GNP

Money

Wages

Unemp

Interest

Money

Wages

Unemp

Interest

Trend y(1, 2, nur)

Trend y(1, 2, nur)

No trend y(1, 2, nur)

Trend y(1, 2, nur)

x(·) A .05853

x(·) A .05853 No trend y(1, 2, 3, ur)

x(1, nh) D .05190 No trend y(1, 2, ur)

x(·) D .05853 No trend y(1, 2, 3, ur)

x(·) A .04369

x(3, nh) D .04355 No trend y(1, 2, ur)

x(·) A .04369 No trend y(1, 2, ur)

x(3, 4, h) D .034295

x(·) A .03507 No trend y(1, 2, nur)

No trend y(1, 2, 3, ur) x(·) D .04369 Trend y(1, ur)

No trend y(1, 2, ur)

x(3, nh) A .03483 No trend y(1, nur)

x(2, 4, h) A .03480 No trend y(1, 2, 3, nur)

No trend y(1, 2, 3, nur)

x(1, 2, h) A .34968 No trend y(1, 5, 6, ur)

x(1, 2, h) A .34013 No trend y(1, 5, 6, ur)

x(2, 3, h) A .3461 No trend y(1, 5, 6, ur)

No trend y(1, 5, 6, ur)

x(·) A .7727

x(·) A .7727

x(·) A .7727

x(·) A .7727

x(·) A .3647

For description of notation and quantities in Table VI, refer to Table V.
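The benchmark F-test for causality reported in Tables V and VI can be sketched as follows; the lag order here is illustrative, and the paper's procedure additionally involves BIC lag selection and differencing, as described above.

```python
import numpy as np

def granger_f(y, x, p=2):
    """F statistic for 'x does not Granger-cause y': compare the regression of
    y_t on an intercept and p lags of y and x with the restricted regression
    that drops all lags of x."""
    T = len(y)
    Y = y[p:]
    ylags = np.column_stack([y[p - l:T - l] for l in range(1, p + 1)])
    xlags = np.column_stack([x[p - l:T - l] for l in range(1, p + 1)])
    ones = np.ones((T - p, 1))

    def ssr(X):
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        u = Y - X @ beta
        return u @ u

    ssr_r = ssr(np.hstack([ones, ylags]))           # restricted: no x lags
    ssr_u = ssr(np.hstack([ones, ylags, xlags]))    # unrestricted
    dof = T - p - (1 + 2 * p)
    return ((ssr_r - ssr_u) / p) / (ssr_u / dof)

# Illustrative use: x clearly causes y through x_{t-1}
rng = np.random.default_rng(0)
T = 300
x, e = rng.standard_normal(T), rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + 0.5 * x[t - 1] + e[t]
F = granger_f(y, x)
```

The null of non-causality is rejected when the statistic exceeds the F(p, T − 3p − 1) critical value (about 3.0 at the 5% level for p = 2 and samples of this length).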

4.3. THE CONSUMPTION FUNCTION

For this study, consumption, income and price data are used for the US over the period 1949–1989. The data have been taken from De Crombrugghe et al. (1997). The variables are: logged deflated per-capita food consumption (lc), logged real per-capita income (ly), logged food price (lpf), and the logged income deflator (lpy). This data set was chosen because its properties have already been analyzed extensively, facilitating a comparison of the automated results with those derived by sequential reduction. A full description of the data is given in De Crombrugghe et al. (1997) and the accompanying articles in the special issue of the Journal of Applied Econometrics (Magnus and Morgan, 1997a,b). The variables are calculated according to the instructions in De Crombrugghe et al. (1997). These authors find that the unit root tests for the variables tend to favor I(1) behavior, with a unique cointegrating vector between the variables. The results for three equations are presented herein. The first two result from a search over specifications that (potentially) include unit roots on the dependent variable and homogeneity restrictions on the explanatory variables. In conducting


Table VII. Consumption functions.

               PICF        BIC         PICF2
Parameters     6           9           9
PICF           4.010530    3.762227    3.786322
BIC            9.164662    9.347696    9.329043
RMSE           0.012566    0.01942     0.016388
Prob-Hetero    0.29163     0.17663     0.35233
Prob-Serial    0.93926     0.82809     0.69374
Prob-ARCH      0.2236      0.04619     0.00873
Prob-Norm      0.49016     0.0512      0.03412

this study, the maximum number of parameters was set to 10 when using the PICF, after having established that the BIC chose a specification with only 9 parameters. As indicated in Table VII, the first column gives the results for the PICF and the second for the BIC. The last column (PICF2) gives the results for the PICF where the homogeneity and unit restrictions have not been imposed a priori. The absence of these restrictions ensures that there will be 'long-run' parameters that will be consistent estimates of the cointegrating vector, if it exists. As noted later, the PICF chooses a differenced model. The last set of results is included in order to assess the 'loss' in forecasting accuracy incurred by the failure to impose these restrictions, and to compare the specification with that chosen by the BIC. These equations could be reparameterised in a number of ways; three convenient ways are presented in equations (9), (10) and (12). The standard errors are in parentheses below each of the parameters. In the case of the long-run coefficients, the standard errors have been calculated using the Sigma method (Pesaran and Pesaran, 1997). The first equation, chosen by the PICF, can be rearranged as (where Δ_k y_t = y_t − y_{t−k})

$$
\Delta_1 lc_t = -\underset{(.008)}{.040} - \underset{(.008)}{.392}\,\Delta_3 lc_{t-1} + \underset{(.064)}{.457}\,\Delta_5 ly_t - \underset{(.067)}{.401}\,\Delta_1 lpf_t - \underset{(.057)}{.215}\,\Delta_4 lpf_t + \underset{(.050)}{.303}\,\Delta_5 lpy_t \qquad (9)
$$

whereas the equation chosen by the BIC can be expressed as

$$
\begin{aligned}
\Delta_1 lc_t = {}& -\underset{(.702)}{1.13} - \underset{(.002)}{.017}\,t - \left( lc_{t-1} - \underset{(.068)}{1.049}\,ly_{t-1} + \underset{(.079)}{0.579}\,lpf_{t-1} - \underset{(.068)}{0.673}\,lpy_{t-1} \right) \\
& + \underset{(.091)}{0.378}\,\Delta_1 ly_t - \underset{(.089)}{0.369}\,\Delta_1 ly_{t-1} - \underset{(.056)}{0.736}\,\Delta_1 lpf_t - \underset{(.057)}{0.156}\,\Delta_1 lpf_{t-1} \\
& + \underset{(.101)}{0.472}\,\Delta_1 lpy_t - \underset{(.068)}{0.201}\,\Delta_1 lpy_{t-1} \qquad (10)
\end{aligned}
$$


De Crombrugghe et al. (1997) estimate the long-run relationship

$$
lc_t = 1.072\,ly_t - 0.616\,lpf_t + 0.734\,lpy_t \qquad (11)
$$

which is extremely similar to the long-run parameters in (10). The third model, (12), was selected using the PICF but with the lagged dependent variable forced in (i.e. not allowing a fully differenced model, and thus requiring long-run parameters), giving

$$
\begin{aligned}
\Delta_1 lc_t = {}& -\underset{(.743)}{1.8704} - \underset{(.002)}{.0201}\,t + \underset{(.077)}{0.200}\,\Delta_1 lc_{t-1} - \underset{(.077)}{1.20} \left( lc_{t-1} - \underset{(.058)}{1.017}\,ly_{t-1} + \underset{(.051)}{0.620}\,lpf_{t-1} - \underset{(.059)}{0.720}\,lpy_{t-1} \right) \\
& + \underset{(.091)}{0.3881}\,\Delta_1 ly_t - \underset{(.102)}{0.5119}\,\Delta_1 ly_{t-1} - \underset{(.058)}{0.7457}\,\Delta_1 lpf_t + \underset{(.101)}{0.505}\,\Delta_1 lpy_t - \underset{(.070)}{0.3596}\,\Delta_1 lpy_{t-1} \qquad (12)
\end{aligned}
$$

Thus, automated selection using the BIC criterion, (10), gives rather similar results (in this case) to the VECM approach to modelling the consumption function. Interestingly, it does not select any lags of the dependent variable. Conversely, the PICF does not select a cointegrated relationship; as with the causality results, the PICF imposes a unit root in the polynomial of the dependent variable. Moreover, in doing so there is a substantial reduction in the one-step ahead forecasting errors, listed as the RMSE in Table VII (columns 1, 2 and 3 corresponding to models (9), (10) and (12) respectively). The tendency of the PICF to characterize processes as having unit roots, but not to characterize relationships between variables as cointegrating relationships, is interesting. When equations are chosen on the basis of forecasting accuracy, unit root processes seem to be commonly chosen when examining univariate series. While this criterion lends support to the cointegration approach to modelling series (since it supports the existence of unit roots), the same criterion tends not to support the existence of cointegrating relationships. When examining (12), it is apparent that the long-run relationships chosen by the PICF criterion (when 'forced in') are remarkably similar to those chosen by De Crombrugghe et al. (1997) in (11), and this specification is also very similar to that chosen by the BIC. The error correction coefficient in (12) is greater than one in absolute value. However, given that weak exogeneity is unlikely to hold in this equation, the failure of this coefficient to lie between zero and unity is not necessarily surprising. As would be expected, having been chosen by the PICF, the RMSE for this equation improves on that of the equation selected by the BIC, but does not match that of the first PICF equation (it lies roughly in the middle of the two; see Table VII). The diagnostic tests are also presented at the bottom of Table VII. The significance values in this table indicate that the differenced form of the model has better properties in terms of ARCH and normality than both of the error correction forms.


In summary, the automated approaches give remarkably similar estimates of the long-run parameters to the values presented in De Crombrugghe et al. (1997). However, the PICF criterion selects a purely differenced model, which in purely 'predictive' terms seems to perform better than the error correction form. That different criteria select quite different models is a point of concern. Does this undermine the confidence that researchers can have in their results? The answer to this question can only be yes, at least partly. However, one cannot ignore the fact that the selection of models has always been highly dependent on the strategies employed by researchers, and that these strategies vary from researcher to researcher. If the non-uniqueness of selection is a killer criticism of automated model selection then, in fairness, the same criticism should be leveled at much of accepted econometric practice.

5. Conclusions

This article explored automated model searches using information criteria for model selection. Genetic algorithms were described and used for this purpose. These algorithms were shown to be a practical method for model selection when the number of sub-models is large. Where forecasting is the primary goal, the use of automated model searches can reduce the required number of prior decisions regarding the 'integratedness' of the data or the existence of cointegrating relationships. When modelling 'structural' relationships, it was argued that automated selection need not imply a 'black box' mentality, and this approach has some advantages over the 'interactive' sequential reduction approach currently in favor. When testing for seasonal roots, the findings of the automated approach were broadly similar to the results obtained in Beaulieu and Miron (1993), regardless of whether the BIC or the PICF criterion was used. Little support was found for the existence of seasonal roots. However, differing from B&M, a zero-frequency unit root was not found for the nominal rate series. The automated approach to testing for causality gave similar results to the conventional approach when using a sub-set of the Nelson and Plosser (1982) data. However, the findings suggested that, from a forecasting point of view, models that impose unit roots in the polynomials of the dependent variable seem to be preferred. The PICF usually, but not always, led to more parsimoniously specified models; however, here this did not lead to less evidence in favor of causality, as was expected. The ARDL approach to estimating the aggregate consumption function showed that automated selection using the BIC gave results that were remarkably similar to the approach taken by De Crombrugghe et al. (1997). However, the selection of an error correction form did not produce forecasts that were anywhere near optimal. Theoretically, estimation using differenced variables that are cointegrated in 'levels' ignores information about the long-run relationships, leading, inter alia, to suboptimal forecasts. While it is difficult to generalise on the basis of a few results,


the work here does lead one to question the practical relevance of this argument in finite samples. In conducting this study, other avenues have been opened up which may deserve further analysis. First, a direct extension of the work here would be to examine the role of automated model selection in the VECM context. Second, while information criteria have been used here, model searches could be conducted using model probabilities constructed using g-priors or informative priors.

Notes

1. Limited experiments of this sort on large model spaces seem to support the contention that GAs do well in this regard (up to 2^50 sub-models).
2. The tests for cointegration assume that the data have unit roots. Likewise, the decision to difference the data assumes this also. Given that there is debate over the characterisation of the data, this approach is therefore arbitrary, but it nevertheless serves as a useful basis for comparison.

References

Balcombe, K.G., Bailey, A. and Fraser, I. (Forthcoming). Measuring the impact of R&D on productivity from an econometric time series perspective. Journal of Productivity Analysis.
Beaulieu, J.J. and Miron, J.A. (1993). Seasonal unit roots in aggregate U.S. data. Journal of Econometrics, 55, 305–328.
Brown, R.L., Durbin, J. and Evans, J.A. (1975). Techniques for testing constancy of regression relations over time. Journal of the Royal Statistical Society B, 37(1), 149–192.
Chao, J.C. and Phillips, P.C.B. (1999). Model selection in partially nonstationary vector autoregressive processes with reduced rank structure. Journal of Econometrics, 91, 227–271.
Davidson, R. and MacKinnon, J.G. (1993). Estimation and Inference in Econometrics. Oxford University Press, Oxford.
De Crombrugghe, D., Palm, F.C. and Urbain, J.P. (1997). Statistical demand functions for food in the USA and the Netherlands. Journal of Applied Econometrics, 12(5), 615–645.
De Jong, K.A. (1975). An Analysis of the Behaviour of a Class of Genetic Adaptive Systems. Unpublished PhD dissertation, University of Michigan, Department of Computer Science.
Dorsey, R.E. and Mayer, W. (1995). Genetic algorithms for estimation problems with multiple optima, nondifferentiability, and other irregular features. Journal of Business and Economic Statistics, 13, 53–66.
Fernandez, C., Ley, E. and Steel, M. (2001). Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100, 381–427.
George, E.I. and McCulloch, R.E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881–889.
George, E.I. and McCulloch, R.E. (1996). Approaches for Bayesian variable selection. Statistica Sinica, 7, 339–373.
Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimisation and Machine Learning. Addison-Wesley, Reading, MA.
Holden, D. and Perlman, R. (1994). Unit roots and cointegration for the economist. In B. Rao (ed.), Cointegration for the Applied Economist. Chapter 7, 47–112.
Hendry, D.F. (1995). Dynamic Econometrics. Advanced Texts in Econometrics. Oxford University Press, Oxford.


Hylleberg, S., Engle, R., Granger, C. and Yoo, B.S. (1990). Seasonal integration and cointegration. Journal of Econometrics, 44, 215–238.
Koza, J.R. (1991). A genetic approach to econometric modelling. In P. Bourgine and B. Walliser (eds.), Economics and Cognitive Science. Pergamon Press, Oxford, 57–75.
Koza, J.R. (1992). Genetic Programming. A Bradford Book. MIT Press, Cambridge, MA.
Laud, P.W. and Ibrahim, J.G. (1995). Predictive model selection. Journal of the Royal Statistical Society B, 57(1), 247–262.
Magnus, J.R. and Morgan, M.S. (1997a). The data: A brief description. Journal of Applied Econometrics, 12, 651–661.
Magnus, J.R. and Morgan, M.S. (1997b). Design of the experiment. Journal of Applied Econometrics, 12, 459–465.
Miller, A.J. (1990). Subset Selection in Regression. Chapman and Hall, New York.
Nelson, C.R. and Plosser, C.I. (1982). Trends and random walks in macroeconomic time series: Some evidence and implications. Journal of Monetary Economics, 10, 139–162.
Pesaran, M.H. and Pesaran, B. (1997). Working with Microfit 4. Oxford University Press, Oxford.
Phillips, P.C.B. (1994). Bayes models and forecasts of Australian macroeconomic time series. In C.P. Hargreaves (ed.), Nonstationary Time Series Analysis and Cointegration. Advanced Texts in Econometrics. Oxford University Press, Oxford.
Phillips, P.C.B. (1995a). Bayesian model selection and prediction with empirical applications. Journal of Econometrics, 69, 289–331.
Phillips, P.C.B. (1995b). Bayesian prediction: A response. Journal of Econometrics, 69, 351–365.
Ploberger, W. and Phillips, P.C.B. (2001). Rissanen's theorem and econometric time series. In H.A. Keuzenkamp, M. McAleer and A. Zellner (eds.), Simplicity, Inference and Modelling. Cambridge University Press, Cambridge.
Ploberger, W. and Phillips, P.C.B. (2003). Empirical limits for time series econometric models. Econometrica, 71(2), 627–673.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. Journal of Econometrics, 75, 317–343.