
Lett Spat Resour Sci (2013) 6:1–18 DOI 10.1007/s12076-012-0082-3 ORIGINAL PAPER

Dealing with spatial data pooled over time in statistical models

Jean Dubé · Diègo Legros

Received: 26 March 2012 / Accepted: 2 July 2012 / Published online: 25 July 2012 © Springer-Verlag 2012

Abstract Recent developments in spatial econometrics have been devoted to spatiotemporal data and how spatial panel data structure should be modeled. Little effort has been devoted to the way one must deal with spatial data pooled over time. This paper presents the characteristics of spatial data pooled over time and proposes a simple way to take into account unidirectional temporal effect as well as multidirectional spatial effect in the estimation process. An empirical example, using data on 25,357 single family homes sold in Lucas County, OH (USA), between 1993 and 1998 (available in the MatLab library), is used to illustrate the potential of the approach proposed.

Keywords Spatio-temporal data · Weights matrix · Spatial econometrics

JEL Classification C21 · C23 · C51 · C81 · R15

J. Dubé
Université du Québec à Rimouski, 300 allée des Ursulines, Rimouski, QC G5L 3A1, Canada

D. Legros
Laboratoire d’Économie et de Gestion, 2 boulevard Gabriel, BP 26611, 21066 Dijon Cedex, France

1 Introduction

Spatio-temporal data may be classified in two categories: spatial panel data and spatial data pooled over time. The main difference between the two data sets lies in the fact that spatial data pooled over time consist of different units observed once (or a few times) over time. Although methods and models have been recently developed for spatial panel
data (Elhorst 2003; Anselin et al. 2006; Anselin 2007; Yu et al. 2008; Monteiro and Kukenova 2009; Yu and Lee 2010), strictly spatial econometric methods and models are still applied to spatial data pooled over time, since no statistical routines or software have been developed to account for the spatial and temporal dimensions simultaneously.

However, some proposals do deal with both dimensions. The work of Pace et al. (1998, 2000) is probably the first attempt to deal with the presence of both dimensions in databases. The approach is based on the development of separate spatial and temporal weights matrices controlling for the temporal effect, the spatial effect and the indirect spatio-temporal effect (see also Tu et al. 2004; Sun and Tu 2005). Instead of considering separate spatial and temporal effects through independent weights matrices, some authors have proposed to account for both dimensions simultaneously by developing a unique spatio-temporal weights matrix (Smith and Wu 2009; Huang et al. 2010; Dubé et al. 2011a). The construction of such a weights matrix is in line with what the literature identifies as the challenge of developing adapted weights matrices (Griffith 1981, 1996; Getis and Aldstadt 2004; Fingleton 2009; Getis 2009). Moreover, it can be related to what Legendre (1993) calls the “new paradigm” of spatial autocorrelation.

Spatio-temporal weights matrices can be used to account for the fact that the spatial effect is local and decreases with distance, as stated by the first law of geography (Tobler 1970), while the temporal effect is unidirectional and lagged within a given spatial definition, as pointed out by the seminal work of Hägerstrand (1970). The reflex of using existing spatial econometric routines or software, and of assuming that the temporal dimension is neutral in the treatment and estimation of the spatial effect, automatically eliminates an important aspect of the data generating process.
Applying spatial econometric techniques and methods to spatial data pooled over time neglects the fact that the time dimension is unidirectional, as opposed to the multidirectional spatial effect. One implicit hypothesis under such mechanical applications is that individual units have perfect memory, so that two units observed in the current time period have the same spatial (spillover) effect on each other as the oldest observation can have on the newest. Moreover, it is implicitly assumed that future observations can influence the realization of past observations, which is somewhat awkward (Dubé et al. 2011b). Although observations in the near future can have some potential influence, as a reflection of anticipation, it is rare that such phenomena can be perfectly known many time periods in advance. Such assumptions may lead to overestimation of the spatial autoregressive coefficients (Lesage and Pace 2009; Dubé et al. 2011b; Dubé and Legros 2012), which may ultimately lead to the unit root problem of the spatial autocorrelation coefficient (Fingleton 1999; Lee and Yu 2009) and to spurious results.

The objective of this paper is to present a simple way to improve existing spatial techniques so that they deal explicitly with spatial data pooled over time, accounting simultaneously for spatially located temporal dynamic effects and for spatial effects given the temporal direction. The originality of the paper is threefold. First, it bridges the gap between spatial applications in a spatio-temporal context when the data differ from the panel context (Parent and Lesage 2010, 2011). Second, it makes it possible to deal with spatial and temporal considerations simultaneously, since the application of strictly spatial techniques produces a spurious measure of spatial dependence in a spatio-temporal context (Dubé and Legros 2012). Third, the structure of the spatio-temporal weights matrix is flexible and makes it possible to generate different sets of spatio-temporal variables capturing, as a special case, dynamic time-lagged effects spatially located. Different spatio-temporal weights matrices can be used to develop generalized models, incorporating spatial autoregressive (SAR) and spatial error (SEM) specifications, while results can be obtained using existing spatial estimation procedures or routines, given different definitions of the weights matrix accounting for space and time.

The paper is divided into four sections. The first section presents the structure and the particularities of the data generating process of spatial data pooled over time, and discusses how this information can be incorporated in statistical modelling. The second section presents the structure of a spatio-temporal weights matrix based on the specification of spatial and temporal weights matrices. The third section presents an empirical application based on house sale prices for Lucas County, OH, between 1993 and 1998 (data shared in the MatLab library; Lesage and Pace 2009), as used in Lesage and Pace (2004). The final section summarizes the paper and proposes some avenues for further studies.

2 Structure of spatial data pooled over time

The spatial link among data is usually derived using information on geographical coordinates. This permits a graphical representation based on two dimensions (Fig. 1) representing longitude (X) and latitude (Y). The locations of the dots (individual units) are used to calculate the distance between each pair of observations. The spatial distances are then used to build a spatial weights matrix, which helps to detect a spatial autocorrelation pattern and to evaluate the spatial spillover effect in statistical models. In this case, the spatial weights matrix accounts only for a multidirectional effect: one observation is influenced by other nearby observations while it also exerts influence on those observations.

Fig. 1 Spatial structure of observations pooled on only one layer


Fig. 2 Spatial observations collected in a discrete time structure (pooled layers)

For spatial data pooled or collected over time, this representation does not consider the constraint that the temporal relation is unidirectional: yesterday can exert influence on today’s pattern, but the inverse may not be possible. The spatial multidirectional effect holds for data collected in the same time period, but the unidirectional effect of the temporal dimension implies that the multidirectional spatial effect no longer exists among different time periods. Moreover, since spatial units are only observed once over time, the spatial and temporal relations appear to be clearly different from the panel case, for which the same spatial units are observed in each time period. Thus, a realistic representation of such a data generating process needs to account for a third (unidirectional) dimension: time (T). The third dimension may be represented by a one-way arrow, as opposed to the two-way arrows of the spatial dimension axes (Figs. 2, 3).

There exist two simple ways of dealing with the generating process of spatial data pooled over time. The first way is by assuming that the temporal representation is based on a discrete time period, such as a month, quarter or year. In this case, the data generating process consists of many different spatial layers accumulated, or pooled, on the third dimension (Fig. 2). The probability of observing many individuals in the same time period is positive and defined, while the probability of observing one spatial unit more than once in a different time period¹ is marginal or null. The total sample size of such a database is defined by the sum of the total observations in each time period t, $N_t$: $N_T = \sum_{t=1}^{T} N_t$. This is different from the panel context since the sample size in each time period varies (different from N observations repeated T times).²

¹ Two observations having the same geographical coordinates.
² Except if the panel is unbalanced.
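As a small illustration of the sample size definition above, the sketch below counts the observations per period and sums them. The list `sale_years` and the helper `pooled_sample_size` are hypothetical names introduced only for this example; they are not taken from the paper or from its data.

```python
from collections import Counter

# Hypothetical sale years for six pooled observations (illustrative values only).
sale_years = [1993, 1993, 1994, 1994, 1994, 1995]

def pooled_sample_size(periods):
    """Return the per-period counts N_t and the pooled total N_T = sum of N_t."""
    n_t = dict(sorted(Counter(periods).items()))
    return n_t, sum(n_t.values())

n_t, n_total = pooled_sample_size(sale_years)
print(n_t)      # {1993: 2, 1994: 3, 1995: 1}
print(n_total)  # 6
```

The per-period counts differ (2, 3 and 1), so the total cannot be written as a fixed N repeated T times, which is the point made in the text.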


Fig. 3 Spatial observations collected in a continuous time structure

The second way is by assuming that there is no strict definition of the time period, or that observations are collected in a continuous way. In this case, a strict representation of spatial layers does not exist since the time dimension is not discrete (Fig. 3). The probability of observing many individuals at the same time is marginal or null, as is the probability of observing the same spatial unit again in the sample. Again, the total sample size of the database is defined by $N_T = \sum_{t=1}^{T} N_t$, not by $N \times T$, since the sample size for a given time interval is defined by $N_t$ and varies across time intervals.

The similarity between both generating processes concerns the multidirectional spatial effect that can appear among spatial units within a given time period or interval (Figs. 4, 5). For other observations, the effect is no longer multidirectional since the temporal dimension is unidirectional. Thus, the usual strictly spatial representation no longer holds, and the structure clearly diverges from the panel case. In fact, a strictly spatial representation may generate spurious relations that can exert influence on the estimated spatial autocorrelation pattern or on the spatial autoregressive parameters (Dubé et al. 2011a). The data structure should account for multidirectional relations among individuals observed in the same time period, or interval, while unidirectional relations running from past observations to current observations should be explicitly specified to eliminate the possible spurious relations that link future observations to current and past observations.³

The unidirectional temporal relations may reveal relevant information about spatially located dynamic effects. By defining different spatial and temporal kernels,⁴ it is possible to exploit a full set of spatio-temporal autoregressive patterns and use this information in statistical models. It is possible to account for the unidirectional temporal effect of spatial data located within a given radius of influence to evaluate the possible dynamic (peer) effect, as represented by connecting past observations (dark blue dots) to the current observations (the solid one-way grey arrows), as well as the multidirectional spatial effect for data observed in the same time period, as represented by the two-way black arrows connecting one observation to other observations (Figs. 6, 7).

These possibilities can also be extended to consider higher order temporal and spatial effects. These effects can be captured through the construction of different spatio-temporal weights matrices. Building a spatio-temporal weights matrix can help to generate new predictors, or explanatory variables, that can be used in statistical models (Dubé and Legros 2012) to capture spatially localized dynamic temporal effects (see also Dubé et al. 2011a), approaching the idea of a peer effect (Des Rosiers et al. 2011).

³ Of course, it can also account for potential anticipation by considering that the first period after may exert influence on the current observations, as noted by Dubé et al. (2011a).
⁴ Of course, the determination of the optimal kernel size remains an important topic that is not explicitly treated in the paper.

Fig. 4 Spatio-temporal relations for a given observation (black point) (discrete time period representation) (color figure online)

Fig. 5 Spatio-temporal relations for a given observation (black point) (continuous time period representation)

Fig. 6 Possible spatio-temporal relations for a given observation (black point) (discrete time period representation) (color figure online)

Fig. 7 Possible spatio-temporal relations for a given observation (black point) (continuous time period representation) (color figure online)


3 Building spatial, temporal and spatio-temporal weights matrices

The relationships expressed in the last section can easily be handled using a spatio-temporal weights matrix. The construction of such a weights matrix, accounting for multidirectional and unidirectional relations, relies on the prior definition of a spatial weights matrix and of temporal weights matrices, which are then combined. The multidirectional spatial effect can be evaluated based on geographical coordinates. The Euclidean distance between observation i and observation j, $d_{ij}$, can be obtained from the Pythagorean theorem (Eq. 1) based on the geographical coordinates (X, Y) of each observation.

$$d_{ij} = \sqrt{(X_i - X_j)^2 + (Y_i - Y_j)^2} \qquad (1)$$
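Equation (1) can be sketched for all pairs at once with NumPy broadcasting. The function name `pairwise_distances` and the sample coordinates are assumptions made for illustration, and the sketch assumes projected (planar) coordinates in metres, since a Euclidean distance on raw longitude and latitude degrees would not be in metres.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance d_ij between every pair of (X, Y) coordinates (Eq. 1)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]  # shape (n, n, 2)
    return np.sqrt((diff ** 2).sum(axis=-1))        # shape (n, n)

# Hypothetical planar coordinates, in metres.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = pairwise_distances(coords)
print(D[0, 1], D[0, 2])  # 5.0 10.0
```

The resulting matrix is symmetric with a zero diagonal, which reflects the multidirectional nature of the purely spatial relation.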

These distances can then be used to determine the individual elements, $s_{ij}$, of a non-standardized spatial weights matrix S, of dimension $N_T \times N_T$, using a general inverse distance function (Eq. 2).

$$s_{ij} = \begin{cases} d_{ij}^{-\alpha} & \text{if } d_{ij} \le \bar{d} \\ 1 & \text{if } d_{ij} = 0 \quad \forall\, i \ne j \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $\bar{d}$ is an upper bound critical distance cut-off value that could be fixed by modellers considering prior information (Getis 2009) or determined statistically by maximizing the value of the estimated spatial autocorrelation through a measure such as Moran’s I index (Boots and Dufournaud 1994), and α is a penalty parameter imposed on distance. The contiguity matrix is a special case of this specification when α = 0 and the cut-off distance is adapted to each observation, $\bar{d}_i$, and is relatively small, so that only immediate neighbours have a defined relation.

The unidirectional temporal effect can be evaluated using the difference in the time (date) at which observations are collected. To simplify the construction of the temporal weights matrix, it is assumed that data are chronologically ordered, so the first line in the database corresponds to the earliest observation while the last line represents the newest observation (Pace et al. 1998; Smith and Wu 2009). A general function $v_i$, based on the year and month of observation i ($yyyy_i$-$mm_i$), can be developed to ensure that the first observation has the lowest value and that such values increase with time (Eq. 3).

$$v_i = 12 \times (yyyy_i - yyyy_{\min}) + mm_i \quad \forall\, i, j \qquad (3)$$

where $yyyy_{\min}$ represents the first year in which observations were collected.⁵

⁵ Of course, this definition can be extended to include more details such as days, dd. The general form of the function can be approximated by $v_i = 365 \times (yyyy_i - yyyy_{\min}) + 31 \times (mm_i - 1) + dd_i$, assuming that a difference of a few days does not affect the structure of the matrix, since the months do not all have the same number of days.
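The two building blocks defined so far, the inverse distance weights of Eq. (2) and the month index of Eq. (3), can be sketched as follows. The function names, the 500 m cut-off used in the example and the sample coordinates are illustrative assumptions rather than the authors' implementation. Setting the diagonal to zero follows the "for all i ≠ j" condition of Eq. (2).

```python
import numpy as np

def spatial_weights(D, cutoff, alpha=1.0):
    """Eq. (2): d_ij^(-alpha) within the cut-off, 1 for coincident points, else 0."""
    D = np.asarray(D, dtype=float)
    S = np.zeros_like(D)
    within = (D > 0) & (D <= cutoff)
    S[within] = D[within] ** (-alpha)
    S[D == 0] = 1.0           # coincident locations (d_ij = 0, i != j)
    np.fill_diagonal(S, 0.0)  # an observation is not its own neighbour
    return S

def month_index(year, month, first_year):
    """Eq. (3): v_i = 12 * (yyyy_i - yyyy_min) + mm_i."""
    return 12 * (np.asarray(year) - first_year) + np.asarray(month)

# Hypothetical planar coordinates (metres); points 1 and 2 coincide.
xy = np.array([[0, 0], [300, 400], [300, 400], [3000, 4000]], dtype=float)
D = np.hypot(xy[:, None, 0] - xy[None, :, 0], xy[:, None, 1] - xy[None, :, 1])
S = spatial_weights(D, cutoff=500.0, alpha=1.0)
v = month_index([1993, 1993, 1994], [1, 3, 2], first_year=1993)
print(S[0, 1], S[1, 2], S[0, 3])  # 0.002 1.0 0.0
print(v)                          # [ 1  3 14]
```

With `alpha=0` every pair inside the cut-off receives a weight of one, which corresponds to the binary special case mentioned in the text.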


A (non-symmetric and non-standardized) temporal distance between each pair of observations is obtained from the difference between the general function of observation i and that of observation j, $v_i - v_j$. The difference expresses the time elapsed between the observation of unit i and the observation of unit j. This difference, which takes a negative value if i is observed before j, is used to calculate the general elements, $t_{ij}$, of the temporal weights matrix T, of dimension $N_T \times N_T$ (Eq. 4).

$$t_{ij} = \begin{cases} |v_i - v_j|^{-\gamma} & \text{if } |v_i - v_j| < \bar{v} \\ 1 & \text{if } v_i = v_j \quad \forall\, i \ne j \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $\bar{v}$ is an upper bound critical cut-off value for temporal influence that needs to be fixed a priori by modellers⁶ and γ is an exogenously defined friction parameter.⁷ Since observations have been previously chronologically ordered, the temporal weights matrix can be decomposed through triangular components.⁸ The lower triangular specification indicates how current observations can be connected to past and current observations, while the upper triangular specification indicates how current observations can be related to current and future observations (anticipation).

Finally, to account for the fact that a multidirectional spatial relation is only valid over a given time period (or interval) while the unidirectional temporal relation is concentrated in a given spatial window, the spatial and temporal weights matrices are combined to obtain a unique spatio-temporal weights matrix (Eq. 5). Such a structure of the spatio-temporal weights matrix is obtained using the Hadamard⁹ matrix product and represents an important innovation as compared to the proposition of Pace et al. (1998, 2000) and an extension of current developments (Nappi-Choulet and Maury 2009, 2011; Smith and Wu 2009; Huang et al. 2010; Dubé et al. 2011a). A general element of the spatio-temporal weights matrix, $w_{ij}$, is defined by combining the spatial relation, $s_{ij}$, and the temporal relation, $t_{ij}$, between observations i and j.¹⁰

$$W = S \odot T = \begin{pmatrix} 0 & s_{12} \times t_{12} & s_{13} \times t_{13} & \cdots & s_{1N_T} \times t_{1N_T} \\ s_{21} \times t_{21} & 0 & s_{23} \times t_{23} & \cdots & s_{2N_T} \times t_{2N_T} \\ s_{31} \times t_{31} & s_{32} \times t_{32} & 0 & \cdots & s_{3N_T} \times t_{3N_T} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ s_{N_T 1} \times t_{N_T 1} & s_{N_T 2} \times t_{N_T 2} & s_{N_T 3} \times t_{N_T 3} & \cdots & 0 \end{pmatrix} \qquad (5)$$
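A minimal sketch of Eqs. (4) and (5), followed by row-standardization, is given below. All function names are hypothetical. One design choice here is an assumption rather than the paper's stated procedure: links to future observations are removed by comparing the time indices ($v_i \ge v_j$) instead of slicing a triangle of the matrix, so that pairs observed in the same period stay connected in both directions.

```python
import numpy as np

def temporal_weights(v, cutoff, gamma=1.0):
    """Eq. (4): |v_i - v_j|^(-gamma) below the cut-off, 1 if v_i = v_j, else 0."""
    v = np.asarray(v, dtype=float)
    gap = np.abs(v[:, None] - v[None, :])
    T = np.zeros_like(gap)
    near = (gap > 0) & (gap < cutoff)
    T[near] = gap[near] ** (-gamma)
    T[gap == 0] = 1.0
    np.fill_diagonal(T, 0.0)
    return T

def spatio_temporal_weights(S, T, v, allow_future=False):
    """Eq. (5): Hadamard product W = S * T, optionally dropping future links."""
    W = np.asarray(S, dtype=float) * np.asarray(T, dtype=float)
    if not allow_future:
        v = np.asarray(v)
        W = W * (v[:, None] >= v[None, :])  # keep past and same-period links only
    return W

def row_standardize(W):
    """Divide each row by its sum; rows without neighbours stay at zero."""
    W = np.asarray(W, dtype=float)
    rs = W.sum(axis=1, keepdims=True)
    return np.divide(W, rs, out=np.zeros_like(W), where=rs > 0)

# Three hypothetical observations: two in period 1, one in period 2, all neighbours.
v = [1, 1, 2]
S = np.ones((3, 3)) - np.eye(3)
T = temporal_weights(v, cutoff=3, gamma=1.0)
W = row_standardize(spatio_temporal_weights(S, T, v))
print(W)
```

In this example the two period-1 observations are linked to each other, while the period-2 observation is linked to both earlier ones with a weight of 0.5 each and exerts no influence on them.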

The spatio-temporal weights matrix, W, can then be row-standardized and used in the usual spatial econometric tests and models available in existing routines (Lesage 1999). The final form of the weights matrix respects the direction of spatial (multidirectional) relations and temporal (unidirectional) relations and takes into account spatio-temporal constraints, given the specifications retained by modellers.

⁶ The definition of the temporal weights matrix generalizes the specification of Pace et al. (1998, 2000) and Smith and Wu (2009).
⁷ As is the case for the definition of the spatial relation, $s_{ij}$, in Eq. (2), γ usually takes a value of 0, 1 or 2.
⁸ The definition uses absolute values to ensure that the matrix T has non-negative values since the difference in date of observation gives non-symmetric measures. Such specification ensures that the temporal weights matrix is also symmetric.
⁹ Defining a term by term multiplication of matrices and noted $\odot$.
¹⁰ $w_{ij} = s_{ij} \times t_{ij}$.

4 An empirical application

To illustrate the possibilities and the flexibility related to the use of spatio-temporal weights matrices in empirical work, an example based on a hedonic price model (Rosen 1974) using single family home transactions in Lucas County, OH, 1993–1998 is presented. The database is available in the MatLab library (Lesage and Pace 2009) and has been used by Lesage and Pace (2004). It consists of 25,357 transactions and contains information on the nominal sale price in US dollars, the dependent variable vector, $p_{it}$, of dimension ($N_T \times 1$), on the lot size (in square feet, $Z_{1it}$), on the living area (in square feet, $Z_{2it}$), and on many other physical characteristics such as the number of rooms ($Z_{3it}$), the number of bedrooms ($Z_{4it}$), the number of bathrooms ($Z_{5it}$), the type of house ($Z_{6it}$) and the presence of a garage ($Z_{7it}$). The database also contains information on the year the house was built. This makes it possible to calculate the age (in years) of the house ($Z_{8it}$). To account for some non-linearity in the effect of age on house price, a gamma transformation is considered (Des Rosiers et al. 1996, 2001). This function has proven to be more flexible than a polynomial specification and can easily be obtained by adding the logarithm of the age ($Z_{9it}$) to the price equation.¹¹ Each vector of independent variables is of dimension ($N_T \times 1$). The hedonic price model (Eq. 6) also includes a series of (T − 1) year dummy variables ($D_{it}$), of dimension ($N_T \times (T − 1)$), to control for differences in the composition of the sample in each year as well as for temporal heterogeneity (Wooldridge 2000, 2002).

$$p_{it} = \sum_{t=2}^{T} D_{it}\,\delta_t + \sum_{k=1}^{9} Z_{kit}\,\beta_k + \varepsilon_{it} \qquad (6)$$
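The structure of Eq. (6), year dummies with the first year as reference plus a set of characteristics, can be sketched with a plain least-squares fit. The helpers `year_dummies` and `ols` and the tiny noise-free data set are hypothetical and serve only to show the mechanics; they do not reproduce the Lucas County estimation.

```python
import numpy as np

def year_dummies(years):
    """T - 1 dummy columns; the first (earliest) year is the reference category."""
    years = np.asarray(years)
    levels = np.unique(years)
    return (years[:, None] == levels[None, 1:]).astype(float)

def ols(y, X):
    """OLS coefficients, residuals and R^2 (X must already contain a constant)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    centred = y - y.mean()
    return b, resid, 1.0 - (resid @ resid) / (centred @ centred)

# Hypothetical noise-free data: constant 1.0, 1994 premium 0.05, slope 0.5 on z.
years = np.array([1993, 1993, 1993, 1994, 1994, 1994])
z = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 4.0])
D = year_dummies(years)
y = 1.0 + 0.05 * D[:, 0] + 0.5 * z
X = np.column_stack([np.ones(len(y)), D, z])
b, resid, r2 = ols(y, X)
print(np.round(b, 4))  # [1.   0.05 0.5 ]
```

Because the example data are generated without noise, the fit recovers the three coefficients and the residuals are zero up to floating-point error.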

The vector of parameters $\delta_t$, of dimension ((T − 1) × 1), controls for the nominal price evolution, while the vector $\beta_k$, of dimension (K × 1), measures the implicit (hedonic) price of each characteristic, and the vector $\varepsilon_{it}$, of dimension ($N_T \times 1$), is the error term, usually assumed to be independent and identically distributed. The hedonic price model, estimated by ordinary least squares (OLS), accounts for about 73 % of the total variance of sale prices (column 1, Table 1). All coefficients have the expected sign, suggesting that better amenities lead to higher sale prices.

¹¹ The optimal impact of the age on house prices can be obtained by: $\beta_{\log(\mathrm{age})} \times (-1/\beta_{\mathrm{age}})$.
¹² The general form of the matrix is $W_0 = S_0 \odot T_0$. The matrix $W_0$ is used to detect spatial autocorrelation patterns among residuals using Moran’s I index.

A spatio-temporal weights matrix is constructed using a spatial weights matrix based on an optimal distance set to 500 m (Fig. 8), noted $S_0$, and a temporal weights matrix considering transactions occurring in the same month, designated by $T_0$.¹² The spatio-temporal weights matrix is used to detect spatial autocorrelation patterns among
residuals. Since the statistic is highly significant, an adjustment to the model is required because the error terms are not spatially independent.

Table 1 Estimation results using ordinary least squares: log (price)

Variables              HPM (Eq. 6)           CHPM (Eq. 10)         CHPMP (Eq. 12)
                       Coef.      t-stat     Coef.      t-stat     Coef.      t-stat
Constant               3.5123     47.81      3.0330     40.63      3.1131     41.28
Log (age)              0.4825     62.43      0.4632     60.51      0.4603     60.12
Age                    −0.0231    −110.40    −0.0220    −104.30    −0.0219    −103.91
Log (living area)      0.6406     63.33      0.6206     62.06      0.6191     61.96
Log (lot size)         0.1870     48.667     0.2107     54.16      0.2138     54.68
Bathrooms              0.1043     14.49      0.0100     14.09      0.1006     14.19
Garage                 0.3518     46.08      0.3367     44.61      0.3356     44.51
Stories 2 and more     0.1953     31.83      0.1932     31.94      0.1930     31.94
Sold in 1993           Ref.       Ref.       Ref.       Ref.       Ref.       Ref.
Sold in 1994           0.0469     4.94       0.0023     0.24       0.0049     0.51
Sold in 1995           0.0901     9.71       0.0428     4.55       0.0458     4.87
Sold in 1996           0.1255     13.95      0.0753     8.23       0.0790     8.63
Sold in 1997           0.1620     18.16      0.1101     12.10      0.1132     12.45
Sold in 1998           0.2078     22.68      0.1563     16.77      0.1603     17.20
Time lag (1) 1 km                            0.0180     10.27      0.0225     12.09
Time lag (2) 1 km                            0.0170     10.09      0.0190     11.15
Time lag (3) 1 km                            0.0103     6.81       0.0116     7.61
Time lag (1) 2 km                                                  −0.0164    −7.19
R2                     0.7311                0.7387                0.7392
Adjusted R2            0.7310                0.7385                0.7390
Moran I with W         0.2234     77.50      0.2180     75.70      0.2151     74.72
Moran I with S         0.3240     337.43     0.3083     321.56     0.3051     318.43
No. obs.               25,357                25,357                25,357

Significance levels: significant at 1 %, significant at 5 %, significant at 10 %

Fig. 8 Determination of the optimal distance considering Moran’s I index value with W = S ⊙ T (vertical axis: Moran’s index; horizontal axis: distance in meters)

One interesting question is whether introducing spatially located dynamic temporal effects can have a significant influence on price determination and eliminate the spatial autocorrelation pattern among residuals. Part of this question can be answered by introducing dynamic temporal effects within a 1,000 m radius of a house. The dynamic temporal effects rely on units observed r time periods before, $p_{it-r}$. These variables are built using the vector of the dependent variable, $p_{it}$, with the definition of the direct spatial effect considering houses located within a 1,000 m radius ($S_1$) and different definitions of the temporal weights matrices. The temporal weights matrices capture such effects for transactions recorded in the first preceding quarter (first time lag, $T_1$), the second preceding quarter (second time lag, $T_2$) and the third preceding quarter (third time lag, $T_3$) (Eqs. 7, 8, 9). This permits the adjustment of the hedonic price model for spatially located dynamic effects, generalized by the equation
suggested by Des Rosiers et al. (2011) (Eq. 10) and can be viewed as a natural application of the spatial autoregressive (SAR) model controlling for temporally lagged dynamic effects instead of multidirectional simultaneous spatial effects.

$$p_{0t-1} = p_{it} \times (S_1 \odot T_1) \qquad (7)$$

$$p_{0t-2} = p_{it} \times (S_1 \odot T_2) \qquad (8)$$

$$p_{0t-3} = p_{it} \times (S_1 \odot T_3) \qquad (9)$$

$$p_{it} = \sum_{r=1}^{3} p_{0t-r}\,\rho_r + \sum_{t=2}^{T} D_{it}\,\delta_t + \sum_{k=1}^{9} Z_{kit}\,\beta_k + \varepsilon_{it} \qquad (10)$$
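The lagged price variables of Eqs. (7) to (9) can be sketched as weighted averages of prices observed nearby in earlier periods. The sketch rests on two assumptions that are not stated in this form in the text: each $T_r$ is taken to be an indicator that observation j was recorded exactly r periods before observation i, and each combined matrix is row-standardized before being applied to the price vector. The names `lag_weights` and `lagged_prices` are hypothetical.

```python
import numpy as np

def lag_weights(S, period, r):
    """Row-standardized S * T_r, where T_r links i to units observed r periods earlier."""
    period = np.asarray(period)
    T_r = (period[:, None] - period[None, :] == r).astype(float)
    W = np.asarray(S, dtype=float) * T_r
    rs = W.sum(axis=1, keepdims=True)
    return np.divide(W, rs, out=np.zeros_like(W), where=rs > 0)

def lagged_prices(price, S, period, lags=(1, 2, 3)):
    """One column per temporal lag: average price of neighbours sold r periods before."""
    price = np.asarray(price, dtype=float)
    return np.column_stack([lag_weights(S, period, r) @ price for r in lags])

# Four hypothetical sales, all within the spatial radius of one another.
period = [1, 1, 2, 3]                  # e.g. quarter index of each sale
price = [100.0, 200.0, 150.0, 400.0]
S1 = np.ones((4, 4)) - np.eye(4)
P = lagged_prices(price, S1, period)
print(P)
```

In the example, the sale in period 2 receives the mean price of the two period-1 sales (150) as its first lag, and the sale in period 3 receives the period-2 price as its first lag and the period-1 mean as its second lag. Because every column uses past observations only, the columns can enter an OLS regression as ordinary regressors.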

Equation (10) is, once again, estimated using the OLS method since there is no time simultaneity between the observations: the new vectors of independent variables are based only on previous observations. The additional variables are designated, in the table, by time lag (1) 1 km, time lag (2) 1 km and time lag (3) 1 km (column 2, Table 1). The coefficients related to the new variables are all significant, suggesting that the dynamic temporal effect plays a significant role in price determination. The combined effect explains about 4.5 % of the price determination (1.8 % + 1.7 % + 1 %). The results suggest a slight increase in the total variance of sale prices explained by the model. However, spatial autocorrelation is still detected among residuals, as Moran’s I index and its significance indicate. Thus, even if the new variables help to explain part of the price determination process, they fail to control for all the latent spatial patterns hidden in the residuals.

The definition of the dynamic variables can be extended to other spatial delimitations. Such new variables may also help to diminish spatial autocorrelation. Additional spatial lag effects consider transactions recorded at a distance between 1,000 and 2,000 m ($S_2$). This permits the construction of a new independent variable (Eq. 11). Of course, this specification may be generalized to other spatial considerations. The hedonic price model adjusted for spatially located dynamic effects and including a higher spatial order is a simple extension of the last equation (Eq. 12).

$$p_{1t-1} = p_{it} \times (S_2 \odot T_1) \qquad (11)$$

$$p_{it} = \sum_{j=0}^{1} \sum_{r=1}^{3} p_{jt-r}\,\rho_{j,r} + \sum_{t=2}^{T} D_{it}\,\delta_t + \sum_{k=1}^{9} Z_{kit}\,\beta_k + \varepsilon_{it} \qquad (12)$$
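Each specification is checked with Moran's I computed on the residuals. A minimal sketch of the index itself is shown below, using the standard definition $I = (n / \sum_{ij} w_{ij}) \times (z' W z) / (z' z)$ with z the centred variable. The function name and the four-point example are assumptions for illustration, and the sketch does not compute the associated test statistic.

```python
import numpy as np

def morans_i(x, W):
    """Moran's I = (n / sum(W)) * (z' W z) / (z' z), with z = x - mean(x)."""
    z = np.asarray(x, dtype=float)
    z = z - z.mean()
    W = np.asarray(W, dtype=float)
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

# Four hypothetical residuals on a line, each linked to its immediate neighbours.
resid = [1.0, 2.0, 3.0, 4.0]
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(round(morans_i(resid, W), 4))  # 0.3333
```

Passing a spatio-temporal matrix such as $W_0 = S_0 \odot T_0$ instead of a purely spatial one restricts the comparison to pairs that are linked in both space and time, which is the distinction between the two Moran rows of Table 1.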

Again, the model is estimated with the OLS method. The additional spatially time-lagged variable for the higher spatial order is designated, in the table, by time lag (1) 2 km (column 3, Table 1). The results suggest that the higher autoregressive order only slightly affects the impact of the immediate spatial time lag variables. The new variable indicates an effect of −1.6 %, diminishing the effect of the immediate time lags. Once the higher-order spatial lag is included, the combined effect of the spatially located temporal lags explains about 3.7 % of the price determination (5.3 % − 1.6 %).

Since spatial autocorrelation is detected in all specifications, the models need to be adjusted to ensure an adequate interpretation of the coefficients (LeGallo 2002). In order to correct for spatial autocorrelation among residuals, a spatial error model (SEM) specification¹³ is estimated using a spatio-temporal weights matrix allowing for a multidirectional spatial effect only within a given time period.¹⁴ The magnitude of the autoregressive parameter associated with the SEM specification, λ, and its significance suggest that this improvement controls for the latent structure otherwise present among residuals (Table 2). The introduction of the spatial error component in the specification slightly diminishes the effect of the temporal dynamic variables, but does not affect the previous general conclusion: the variables play a significant role in price determination (columns 2 and 3, Table 2). However, the estimated cumulative impact is now
