Lett Spat Resour Sci DOI 10.1007/s12076-009-0022-z ORIGINAL PAPER
Spatial autocorrelation or model misspecification? The help from RESET and the curse of small samples
Andrea Vaona
Received: 9 October 2008 / Accepted: 6 February 2009 © Springer-Verlag 2009
Abstract In regression analysis, model misspecification can produce spurious spatial correlation in the residuals. By means of Monte Carlo simulations, I show that the RESET test can help to disentangle this conundrum in large samples. Small samples can pose a serious challenge to finding the correct model.
Keywords Spatial autocorrelation · Model misspecification · RESET · Monte Carlo simulations
JEL Classification R12 · C12 · C15 · R15 · C21
1 Introduction
McMillen (2003) has recently argued that, in regression analysis, model misspecification, due, for example, to incorrect functional form and omitted relevant variables, can lead researchers to incorrectly fit a spatial autoregressive model to the data. However, nonparametric econometrics can be helpful in detecting this problem.¹
¹ Other misspecification testing strategies were provided by Graaf et al. (2001) and by Florax et al. (2003).
A. Vaona, University of Lugano, Institute for Economic Research, Via Maderno 24, CP4361, 6904 Lugano, Switzerland; e-mail: [email protected]
A. Vaona, Department of Economic Sciences, University of Verona, Verona, Italy
A. Vaona, Kiel Institute for the World Economy, Kiel, Germany
In this note we show that the regression specification error test (RESET), proposed by Ramsey (1969), can also help to disentangle this conundrum, provided that adequate information is available. Consider the following linear model

$$y_i = x_i'\beta + u_i \qquad (1)$$

where $x_i$ and $\beta$ are two $K \times 1$ vectors, the former of independent variables and the latter of parameters, $y_i$ is the dependent variable, $u_i$ the disturbance term and $i$ a location subscript. The null hypothesis of RESET is that $E(u_i \mid x_i) = 0$ and its alternative is that $E(u_i \mid x_i) \neq 0$. RESET can be implemented as a test of the hypothesis $\delta = 0$ in the augmented regression $y_i = x_i'\beta + z_i'\delta + u_i$, where $z_i$ and $\delta$ are $M \times 1$ vectors of test variables and parameters, respectively. The literature has proposed several choices for $z_i$ (for a quick review see Hatzinikolaou and Stavrakoudis 2006). Ramsey and Gilbert (1972) recommended the second, third and fourth powers of the fitted values of (1), while Godfrey and Orme (1994) recommended only their second power.¹ Florax (1992) investigated the general properties of RESET in a spatial setting and showed that RESET is not sensitive to spatial correlation in the disturbance.
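As an illustration of the augmented-regression form of the test, the following is a minimal sketch of a RESET F-test based on powers of the OLS fitted values. The function name `reset_test` and the argument `max_power` are hypothetical and chosen only for this example; the sketch assumes homoskedastic, non-spatial errors and is not the code used for the simulations reported here.

```python
import numpy as np
from scipy import stats


def reset_test(y, X, max_power=4):
    """RESET F-test with powers 2..max_power of the OLS fitted values.

    X must already contain a constant column.
    Returns (F statistic, p-value).
    """
    n, k = X.shape

    # Restricted regression: y on X only
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    rss_restricted = np.sum((y - fitted) ** 2)

    # Test variables z_i: powers of the fitted values
    Z = np.column_stack([fitted ** p for p in range(2, max_power + 1)])
    X_aug = np.column_stack([X, Z])

    # Unrestricted (augmented) regression: y on X and Z
    beta_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    rss_unrestricted = np.sum((y - X_aug @ beta_aug) ** 2)

    m = Z.shape[1]            # number of restrictions (delta = 0)
    df_resid = n - k - m      # residual degrees of freedom
    f_stat = ((rss_restricted - rss_unrestricted) / m) / (rss_unrestricted / df_resid)
    p_value = stats.f.sf(f_stat, m, df_resid)
    return f_stat, p_value
```

Setting `max_power=4` uses squares, cubes and fourth powers, in the spirit of Ramsey and Gilbert (1972), while `max_power=2` uses squares only, as in Godfrey and Orme (1994).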
2 Simulation designs
We start by applying RESET to the context analysed by McMillen (2003), which, though specific, has a straightforward economic interpretation and has offered methodological guidance to a variety of applied studies.² Consider a polycentric city, with a central business district and an employment sub-centre. We assume $y_i$ in (1) to be the natural logarithm of employment density. We also suppose that the independent variables of the true model are a constant, the distance from the central business district ($\mathrm{DCBD}_i$), a dummy variable equal to one in proximity of the sub-centre ($\mathrm{SUB}_i$) and an interaction term between $\mathrm{SUB}_i$ and the distance from the sub-centre ($\mathrm{DSUB}_i$). Distances ($d_i$) are measured in miles, as in the case of an American metropolitan area, and they range from −50 to 50 in equal increments of 0.1001, leading to a sample of 1000 observations. The central business district is located at $d_i = 0$ and $\mathrm{DCBD}_i = |d_i|$. Further, $\mathrm{SUB}_i = 1$ if $-36 < d_i < -24$ and $\mathrm{DSUB}_i = |30 + d_i|$; in other words, the employment sub-centre is located at $d = -30$ and beyond a 6-mile radius its effect disappears. The true model is

$$y_i = 1.5 - 0.05\,\mathrm{DCBD}_i + 0.9\,\mathrm{SUB}_i - 0.15\,\mathrm{DSUB}_i \times \mathrm{SUB}_i + u_i, \qquad (2)$$
¹ One other well-known option is to use the second, third and fourth powers of $x_i$, as proposed by Thursby and Schmidt (1977). However, in the present setting this testing strategy turned out to be flawed by multicollinearity.
² See, for instance, Costa-Font and Moscone (2008), Capone and Boix (2008), Osland et al. (2007), Partridge et al. (2007), Moscone et al. (2007), Cho et al. (2007), Partridge and Rickman (2007), Ferguson et al. (2007), Bivand and Brunstad (2006), Partridge and Rickman (2005).
where the $u_i$ are normally distributed with zero mean and variance equal to 0.3623. A similar model was used in McMillen (2001). McMillen (2003) explores, by means of one thousand Monte Carlo simulations, the performance of four models:

(a) $y = \beta_0 + \beta_1 \mathrm{DCBD} + \beta_2 \mathrm{SUB} + \beta_3 \mathrm{DSUB} \times \mathrm{SUB} + u$; $u \sim N(0, \sigma_a^2)$,
(b) $y = \beta_4 + \beta_5 \mathrm{DCBD} + u$; $u \sim N(0, \sigma_b^2)$,
(c) $y = \beta_6 + \beta_7 \mathrm{DCBD} + \beta_8 \mathrm{DSUB} + u$; $u \sim N(0, \sigma_c^2)$,
(d) $y = \beta_9 + \rho W y + \beta_{10} \mathrm{DCBD} + u$; $u \sim N(0, \sigma_d^2)$,
where the labels of the variables without $i$ indicate $N \times 1$ vectors, with $N$ being the number of observations. $\beta_s$ with $s = 0, \ldots, 10$, $\rho$ and $\sigma_q^2$ with $q = a, \ldots, d$ are parameters to be estimated. $W$ is an $N \times N$ spatial contiguity matrix whose elements, $W_{ij}$, are computed as follows. First, they are set equal to 1 if $|i - j| = 1$ and to zero otherwise; then, $W$ is row standardized. McMillen (2003) also estimates Model (b) by means of nonparametric regression, using a bi-square kernel and a bandwidth equal to 5.005, and he computes the rejection frequencies of two Lagrange Multiplier (LM) tests, for heteroskedasticity and spatial correlation respectively.

We compute three RESET tests in order to check whether they can help to detect model misspecification. Given the results by Ramsey and Gilbert (1972) and Godfrey and Orme (1994), we consider different polynomials in the fitted values of the models under analysis, using as diagnostic variables first their squares only, then their squares and their cubes and, finally, their squares, cubes and fourth powers. The RESET tests for Models (a), (b) and (c) are usual F-tests. Model (d) was estimated by a maximum likelihood method following LeSage (1999). Therefore, RESET was implemented as a likelihood ratio test, distributed as a chi squared with a number of degrees of freedom equal to the number of restrictions, that is 3.

Having explored the performance of RESET in the baseline setting by McMillen (2003), we take two further steps. First, we consider a different true model, having an employment sub-centre in the same location as in (2), but with a stronger impact on employment density:

$$y_i = 1.5 - 0.05\,\mathrm{DCBD}_i + 1.5\,\mathrm{SUB}_i - 0.25\,\mathrm{DSUB}_i \times \mathrm{SUB}_i + u_i. \qquad (3)$$
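The simulation design described in this section can be sketched as follows. This is only an illustrative reconstruction: the function names `simulate_city` and `contiguity_matrix` and their arguments are hypothetical, and the equally spaced grid is assumed to be obtained with `np.linspace`, which gives a step of 100/999 ≈ 0.1001 for 1000 observations.

```python
import numpy as np


def simulate_city(n=1000, sub_effect=0.9, sub_slope=-0.15, seed=None):
    """Draw one sample from true model (2).

    Passing sub_effect=1.5 and sub_slope=-0.25 gives true model (3).
    Returns (y, DCBD, SUB, DSUB) as length-n arrays.
    """
    rng = np.random.default_rng(seed)
    d = np.linspace(-50.0, 50.0, n)              # locations in miles
    dcbd = np.abs(d)                             # distance from the CBD at d = 0
    sub = ((d > -36) & (d < -24)).astype(float)  # 1 within 6 miles of the sub-centre
    dsub = np.abs(30.0 + d)                      # distance from the sub-centre at d = -30
    u = rng.normal(0.0, np.sqrt(0.3623), n)      # disturbance with variance 0.3623
    y = 1.5 - 0.05 * dcbd + sub_effect * sub + sub_slope * dsub * sub + u
    return y, dcbd, sub, dsub


def contiguity_matrix(n):
    """Row-standardized W, with W_ij = 1 if |i - j| = 1 before standardization."""
    idx = np.arange(n)
    W = (np.abs(idx[:, None] - idx[None, :]) == 1).astype(float)
    return W / W.sum(axis=1, keepdims=True)
```

With this construction the first and last observations have a single neighbour with weight 1, while every interior observation has two neighbours with weight 0.5 each.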
Second, we reduce the sample size of model (2) to 100. In both cases, we run 1000 Monte Carlo simulations. In the second case, we also investigate the performance of bootstrapped tests. To illustrate our experiment, which is similar to those implemented by Horowitz (1997), we first take the case of Model (a). We generate an estimation dataset of size 100 by random sampling from Model (2). We estimate the parameters of Model (a) by OLS and we compute the RESET statistic, which we call $R_n$. Afterwards, we generate a bootstrap sample of size 100 from Model (a) using our estimated parameters and we re-estimate the model by OLS. We repeat this step 100 times. Each time we compute the RESET statistic, which we call $R_n^*$. We estimate the 5% critical value of the RESET statistic, $r_{0.05}^*$, from the empirical distribution of $R_n^*$ and we reject the null of no model misspecification if $R_n > r_{0.05}^*$. We repeat the steps above 1000 times. We perform the same procedure also for Models (b) and (c). For Model (d) we use a maximum likelihood estimator instead of OLS.
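One pass of the bootstrap procedure above, for a model estimated by OLS, might look like the sketch below. The names `ols_fit`, `reset_stat` and `bootstrap_reset` are hypothetical. The sketch further assumes that bootstrap samples are generated parametrically with normal disturbances, holding the regressors fixed and using the estimated residual standard deviation; the text does not spell out these details, so they should be read as assumptions rather than as the procedure actually used.

```python
import numpy as np


def ols_fit(y, X):
    """OLS coefficients of y on X (X includes the constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def reset_stat(y, X, max_power=4):
    """RESET F statistic with powers 2..max_power of the fitted values."""
    n, k = X.shape
    fitted = X @ ols_fit(y, X)
    rss0 = np.sum((y - fitted) ** 2)
    Z = np.column_stack([fitted ** p for p in range(2, max_power + 1)])
    X_aug = np.column_stack([X, Z])
    rss1 = np.sum((y - X_aug @ ols_fit(y, X_aug)) ** 2)
    m = Z.shape[1]
    return ((rss0 - rss1) / m) / (rss1 / (n - k - m))


def bootstrap_reset(y, X, n_boot=100, level=0.05, seed=None):
    """Bootstrap the RESET statistic under the fitted null model.

    Returns (R_n, bootstrap critical value, reject flag).
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape
    beta = ols_fit(y, X)
    resid = y - X @ beta
    sigma = np.sqrt(resid @ resid / (n - k))   # estimated disturbance s.d.
    r_n = reset_stat(y, X)                     # statistic on the estimation data

    r_star = np.empty(n_boot)
    for b in range(n_boot):
        # Assumed parametric resampling from the estimated model
        y_star = X @ beta + rng.normal(0.0, sigma, n)
        r_star[b] = reset_stat(y_star, X)

    crit = np.quantile(r_star, 1.0 - level)    # bootstrap 5% critical value
    return r_n, crit, bool(r_n > crit)
```

Repeating `bootstrap_reset` over many independently generated estimation datasets and averaging the reject flag would give a simulated rejection frequency of the kind reported for the bootstrapped test.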
3 Results and conclusions
The main results by McMillen (2003) are reported in Table 1, which shows the mean parameter estimates obtained from the Monte Carlo simulations, together with their standard errors and the rejection frequencies of the heteroskedasticity and spatial correlation tests. A researcher might be induced to rely on Model (d), even though it is misspecified, because it is effective in taking care of spatial autocorrelation in the residuals. McMillen (2003) finds, instead, that nonparametric regression is able to identify the effect of the sub-centre on employment density.

Table 1 Monte Carlo results for a sample size of 1000 observations

| | Model (a) | Model (b) | Model (c) | Model (d) |
|---|---|---|---|---|
| Constant | 1.4990 (0.0233) | 1.5208 (0.0233) | 1.6039 (0.0267) | 1.2696 (0.0509) |
| Distance to city centre | −0.0499 (0.0008) | −0.0487 (0.0008) | −0.0475 (0.0008) | −0.0406 (0.0017) |
| Sub-centre Dummy | 0.8965 (0.0658) | | | |
| Sub-centre Dummy × Distance to Sub-centre | −0.1491 (0.0188) | | | |
| Distance to Sub-centre | | | −0.0033 (0.0005) | |
| WY | | | | 0.1651 (0.031) |
| % rejections: Heteroskedasticity test | 6.4 | 100.0 | 100.0 | 100.0 |
| % rejections: Spatial autocorrelation test | 5.7 | 100.0 | 99.4 | 0.0 |
| % rejections: RESET test 1 | 6.3 | 88.0 | 82.7 | 81.0 |
| % rejections: RESET test 2 | 5.1 | 84.6 | 81.5 | 100.0 |
| % rejections: RESET test 3 | 4.8 | 72.8 | 75.3 | 69.4 |

Note: The results are based on one thousand Monte Carlo replications. Standard deviations are in parentheses. WY = spatial lag term. All the results but those for the RESET test are reproduced from McMillen (2003) for the reader’s convenience. Models (a), (b) and (c) are estimated by OLS, while Model (d) is estimated by a maximum likelihood spatial autoregressive estimator. RESET test 1 uses as diagnostic variables the squares, cubes and fourth powers of the fitted values of the model under analysis. RESET test 2 uses as diagnostic variables the squares and the cubes of the fitted values of the model under analysis. RESET test 3 uses as diagnostic variable only the squares of the fitted values of the model under analysis. The nominal size of the tests is 5%.

The last three lines of Table 1 show our results. RESET has the correct size, as it rejects the true model in approximately 5% of the simulations, independently
of the degree of the polynomial in the fitted values of the model under analysis. It also has good power. Specifically, RESET using the squares and the cubes of the fitted values rejects Model (d) in all our simulations, helping a researcher not to choose it. In model (3), the simulated size of the test with the squares, cubes and fourth powers of the fitted values is 4.1%, and the rejection frequencies of Models (b), (c) and (d) reach 100%. The intuition for this result is clear: a larger employment sub-centre is easier to detect.

Once the sample size is decreased to 100 observations, the pattern highlighted by McMillen (2003) changes drastically. As shown in Table 2, the rejection frequency of the LM test for spatial correlation does not change substantially across models, so the risk of choosing Model (d) though misspecified is smaller. However, RESET is also less helpful in choosing the correct model. Resorting to bootstrapping does not improve the performance of the test.³ Figure 1 shows that a nonparametric estimator based on Silverman’s optimal smoothing bandwidth and a bi-square kernel might not be helpful either. This happens because, as shown in Fig. 2, a sample size of 100 does not provide enough information to distinguish the variability deriving from the employment sub-centre from that deriving from the disturbance.

Table 2 Simulated rejection frequencies of diagnostic tests with a sample size of 100

| | Model (a) | Model (b) | Model (c) | Model (d) |
|---|---|---|---|---|
| % rejections: Spatial autocorrelation test | 6.6 | 9.0 | 6.5 | 10.0 |
| % rejections: RESET test 1 | 3.8 | 12.8 | 11.1 | 12.1 |
| % rejections: RESET test 2 | 5.9 | 13.0 | 8.2 | 11.3 |
| % rejections: RESET test 3 | 5.4 | 12.0 | 9.5 | 11.9 |
| % rejections: Bootstrapped RESET test 1 | 4.1 | 12.5 | 11.2 | 11.2 |

Note: RESET test 1 uses as diagnostic variables the squares, cubes and fourth powers of the fitted values of the model under analysis. RESET test 2 uses as diagnostic variables the squares and the cubes of the fitted values of the model under analysis. RESET test 3 uses as diagnostic variable only the squares of the fitted values of the model under analysis. The nominal size of the tests is 5%.

All in all, the results by McMillen (2003) and ours show that, though correctly specifying an econometric model remains a challenging task, researchers can jointly use parametric specification testing and nonparametric regression to identify misspecification problems. However, this is possible only when adequate information is available.

Acknowledgements The author would like to thank Roberto Patuelli, Henk Folmer and two anonymous referees for helpful comments. The usual disclaimer applies.
³ Doubling the number of bootstrap replications from 100 to 200 produces results similar to those shown in Table 2.
Fig. 1 Predictions from a nonparametric model with Silverman’s optimal smoothing bandwidth and a bi-square kernel with a sample size of 100 observations

Fig. 2 Predictions from different models and true values of the independent variable with a sample size of 100 observations. Note: The figure does not display the fitted values of Model (d) for the sake of clarity. They overlap with those of Models (b) and (c)
References
Bivand, R., Brunstad, R.: Regional growth in Western Europe: detecting spatial misspecification using the R environment. Pap. Reg. Sci. 85, 277–297 (2006)
Capone, F., Boix, R.: Sources of growth and competitiveness of local tourist production systems: an application to Italy (1991–2001). Ann. Reg. Sci. 42, 209–224 (2008)
Cho, S.H., Chen, Z., Yen, S.T., English, B.C.: Spatial variation of output-input elasticities: evidence from Chinese county-level agricultural production data. Pap. Reg. Sci. 86, 139–157 (2007)
Costa-Font, J., Moscone, F.: The impact of decentralization and inter-territorial interactions on Spanish health expenditure. Empir. Econ. 34, 167–184 (2008)
Ferguson, M., Ali, K., Olfert, M.R., Partridge, M.: Voting with their feet: jobs versus amenities. Growth Change 38, 77–110 (2007)
Florax, R.: The University: A Regional Booster? Economic Impacts of Academic Knowledge Infrastructure. Aldershot, Ashgate (1992)
Florax, R., Folmer, H., Rey, S.J.: Specification searches in spatial econometrics: the relevance of Hendry’s methodology. Reg. Sci. Urban Econ. 33, 557–579 (2003)
Godfrey, L.G., Orme, C.D.: The sensitivity of some general checks to omitted variables in linear models. Int. Econ. Rev. 35, 489–506 (1994)
de Graaf, T., Florax, R., Nijkamp, P., Reggiani, A.: A general misspecification test for spatial regression models: dependence, heterogeneity, and nonlinearity. J. Reg. Sci. 41, 255–276 (2001)
Hatzinikolaou, D., Stavrakoudis, A.: Empirical size and power of some diagnostic tests applied to a distributed lag model. Empir. Econ. 31, 631–643 (2006)
Horowitz, J.L.: Bootstrap methods in econometrics: theory and numerical performance. In: Kreps, D.M., Wallis, K.F. (eds.) Advances in Economics and Econometrics: Theory and Applications, vol. III. Cambridge University Press, Cambridge (1997). Chap. 8
LeSage, J.P.: The theory and practice of spatial econometrics. http://www.spatial-econometrics.com/ (1999). Accessed 9 October 2008
McMillen, D.P.: Nonparametric employment sub-centre identification. J. Urban Econ. 50, 448–473 (2001)
McMillen, D.P.: Spatial autocorrelation or model misspecification? Int. Reg. Sci. Rev. 26, 208–217 (2003)
Moscone, F., Knapp, M., Tosetti, E.: Mental health expenditure in England: a spatial panel approach. J. Health Econ. 26, 842–864 (2007)
Osland, L., Thorsen, I., Gitlesen, J.P.: Housing price gradients in a region with one dominating centre. J. Real Est. Res. 29, 321–346 (2007)
Partridge, M., Bollman, R.D., Olfert, M.R., Alasia, A.: Riding the wave of urban growth in the countryside: spread, backwash, or stagnation? Land Econ. 83, 128–152 (2007)
Partridge, M.D., Rickman, D.S.: High-poverty non-metropolitan counties in America: can economic development help? Int. Reg. Sci. Rev. 28, 415–440 (2005)
Partridge, M.D., Rickman, D.S.: Persistent pockets of extreme American poverty and job growth: is there a place-based policy role? J. Agric. Res. Econ. 32, 201–224 (2007)
Ramsey, J., Gilbert, R.: A Monte Carlo study of some small sample properties of tests for specification errors. J. Am. Stat. Assoc. 67, 180–186 (1972)
Ramsey, J.: Test for specification errors in classical linear least squares regression analysis. J. R. Stat. Soc., Ser. B 31, 350–371 (1969)
Thursby, J.G., Schmidt, P.: Some properties of tests for specification error in a linear regression model. J. Am. Stat. Assoc. 72, 635–641 (1977)