ENVIRONMETRICS Environmetrics 2007; 18: 245–264 Published online 16 March 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/env.837
Air quality monitoring using heterogeneous networks
Alessandro Fassò∗,†, Michela Cameletti and Orietta Nicolis
University of Bergamo, Via Marconi 4, 24044 Dalmine BG, Italy
SUMMARY
In this paper, we consider some approaches to the spatio-temporal modeling of environmental data obtained from a heterogeneous network. Besides discussing modeling details for spatio-temporal dynamics and the calibration of different instruments, we consider crossvalidation issues and extensions to monitoring network assessment based on sensitivity analysis. We then consider a case study based on the heterogeneous network for fine particulate matter (PM10) in the Po Valley, Northern Italy. Copyright © 2007 John Wiley & Sons, Ltd.
key words: particulate matter; spatio-temporal models; mapping; sensitivity analysis; crossvalidation
∗ Correspondence to: A. Fassò, University of Bergamo, Via Marconi 4, 24044 Dalmine BG, Italy.
† E-mail: [email protected]
Received 4 August 2006; Accepted 28 December 2006

1. INTRODUCTION
Air quality is recognized as an issue of primary importance for human health, after a number of epidemiological studies assessed the health effects of air pollution; see, for example, Ostro et al. (1993) and Biggeri et al. (2004). Because of this, various public authorities have regulated air quality monitoring and set standards; see, for example, the European Union Council Directive 96/62/CEE (1996). Hence, local public institutions have been led to invest in networks to assess the concentration levels of various pollutants around the country.

In this work, we consider data on particulate matter with a particle diameter of less than 10 microns (PM10), collected around the Po Valley, Northern Italy. Although the attention of many researchers has recently focused on smaller particles (see, for example, Hauser et al., 2001), we consider only PM10 because neither European regulation nor the Northern Italian local agencies have yet developed procedures and monitoring networks for the smaller particles. In this area, the management of the air monitoring network falls within the competence of the local administrations, which have often taken different policy decisions for technical, administrative, and historical reasons. This has caused a heterogeneity which mainly concerns the type of measurement instrument used and the network maintenance. Moreover, network heterogeneity may arise because of different lengths of the time series, missing values, or different sampling frequencies. In fact, some monitoring instruments collect data less frequently than others, for example, daily versus hourly, or some stations change their sampling strategy, for example, from every 2 h to hourly.

The reference method for sampling and measuring PM10 concentrations is given by the European norm (EN 12341) and is based on the collection of PM10 on a filter and the determination of its mass following the gravimetric principle: the high- and low-volume gravimetric samplers (HVG and LVG) comply with this norm. Moreover, the norm suggests standardizing the measurements coming from other samplers that do not follow the gravimetric method. In this heterogeneous context, it is of great importance to unify such different data for comparative analysis, retrospective and trend analysis, assessment of quality standard attainment, and online correction.

In this work, we propose a geostatistical dynamical calibration model, named GDC, intended to manage a heterogeneous network for air quality monitoring. Thanks to its flexible formulation, the model is intended to work with measurements affected by instrumental biases, with different sampling frequencies, and with missing data. It is a spatio-temporal model based on a state-space formulation, where the so-called ‘true phenomenon’ to be monitored is given by a time-varying linear function of an unobserved Markovian process. To solve the consequent problem of the high dimension of the state equation, we follow the Empirical Orthogonal Function (EOF) approach. The model aims at the correction of biased readings using displaced data obtained by gravimetric instruments and at mapping the calibrated values in the region of interest. In particular, in our study, we use measurements coming from low-volume gravimetric monitors (LVG) to perform a dynamical calibration of data coming from automatic monitors based on a tapered element oscillating microbalance (TEOM).
These instruments are known to underestimate the ‘true’ PM10 level; in spite of this, the TEOM monitoring system is widely used in Italy because of its low cost, automatic operation, and high-frequency sampling. We then combine the state-space method with calibration techniques and apply them to PM10 data in the spatio-temporal dimension.

The calibration problem has been considered in various fields, such as metrology, chemistry, engineering, and biometrics, and different approaches have been proposed; see Osborne (1991) for a review. Recently, some authors have used state-space models for the calibration of environmental measurements: Brown et al. (2001) analyze the calibration problem for rainfall radar data; McBride and Clyde (2003) consider PM2.5 calibration from a Bayesian point of view; and Fassò and Nicolis (2005, 2006) introduce a displaced dynamical calibration model for PM10 data.

On the other hand, the geostatistical state-space approach has played a considerable role in spatio-temporal prediction modeling. In particular, the state-space formulation may be very useful for analyzing and forecasting environmental time series, even when they are generated by a non-stationary process. However, the presence of both temporal and spatial components gives rise to a multivariate Kalman filter, which can be very high dimensional, depending on the extent and resolution of the spatial component. To deal with this problem, Goodall and Mardia (1994) introduce the idea of a reduced-dimension space-time Kalman filter in which the state process is written in terms of a set of basis functions. Mardia et al. (1998) provide the details and full implementation of the general reduced-dimension model, known as the Kriged Kalman Filter (KKF). Recently, Wikle et al. (1998) and Wikle (2003) implement an empirical Bayesian state-space dynamical model.
Wikle and Cressie (1999) assume that the process is composed of a linear combination of EOFs and a non-dynamic term that captures small-scale spatial variability. Finally, Xu and Wikle (2006) propose a Bayesian parametrization for the matrices of the dynamic spatio-temporal model. Our approach to the dimension reduction problem is similar to that of Wikle and Cressie (1999) and Xu and Wikle (2006) but, in our work, the EOF method is applied to the special case of calibration and the model parameters are estimated by maximum likelihood.

The paper is organized as follows. In Section 2, we introduce the general form of the geostatistical dynamical calibration model, which includes systematic and stochastic components. Moreover, in
Section 3, it is tailored to the case study at hand. In Section 4, we discuss the model construction: the spatially descriptive data analysis with empirical orthogonal functions, the estimation of the parameters via maximum likelihood, and, finally, the estimation and mapping of the ‘calibrated values’. In Section 5, we introduce the data coming from the Po Valley area, Northern Italy. These data come from a heterogeneous network, since they are collected by three different local protection agencies, called ARPA, from the administrative regions Piemonte, Lombardia, and Emilia Romagna. Although these agencies are moving to comply with the European regulations, they share neither a common database nor a common monitoring strategy. After discussing the model building and validation strategy, in Subsection 5.6 we assess the information and redundancy of the network by using crossvalidation and network sensitivity analysis. This is done by extending to the spatio-temporal case some sensitivity analysis techniques which have been developed for independent observations and time series models. In Subsection 5.7, we discuss the role of the systematic and stochastic components. Finally, Section 6 gives concluding remarks.
2. THE GEOSTATISTICAL DYNAMICAL CALIBRATION MODEL
We consider data coming from a monitoring network $G$ composed of $n$ stations located at points $G = (z_1, \ldots, z_n)$ belonging to a certain region or domain $D$, so that $G \subset D$. We will use the coordinates $z_s \in G$ or their index $s = 1, \ldots, n$ interchangeably to identify the elements of the network. Let $y(t, s) = y(t, z_s)$ be the measured concentration of a certain pollutant at time $t = 1, \ldots, N$ in location $z_s \in G$, $s = 1, \ldots, n$. Moreover, let $y(t) = (y(t, s_1), \ldots, y(t, s_{p_t}))'$ denote the $p_t$-dimensional array of observed pollution levels at locations $s_j = s_j(t)$, $j = 1, \ldots, p_t$. Note that $p_t \le n$, with equality if there are no missing values at time $t$. The network is composed of instruments of two or more different types; some are considered unbiased whilst others have biases to be estimated. The model for the observed pollution levels is expressed by the following $p_t$-dimensional measurement equation

$$y(t) = A(t) + B(t)\,y^*(t) + C(t)\,x(t) + \varepsilon(t) \qquad (1)$$

for $t = 1, \ldots, N$. Here, $y^*(t)$ is $n$-dimensional with elements $y^*(t, z_j)$, which are realizations of the discrete-time continuous-space process $y^*(t, z)$; this can be considered as the underlying pollution level at time $t$ and location $z \in D$. Matrices $A(t)$ and $B(t)$ are the additive and multiplicative bias, respectively; $x(t)$ is a vector of meteorological and/or anthropic site-specific covariates with coefficient matrix $C(t)$. We assume that $y^*(t)$ is a linear function of a common regional $k$-dimensional process, $k \le n$, denoted by $\mu(t)$, namely

$$y^*(t, s) = \Phi(s)\,\mu(t) + \mu_0(t, s) \qquad (2)$$

where the loadings matrix $\Phi = (\Phi(s_1)', \ldots, \Phi(s_n)')'$ is obtained from a $k$-dimensional set of EOFs, as discussed in Subsection 4.1. The process $\mu_0$, which is known as the small-scale spatial variation component, can be used to describe the spatial correlation not explained by the EOF truncation; see, for example, Wikle and Cressie (1999). It is supposed to be independent of $\varepsilon(t)$ and white noise over time; it can be spatially correlated, with spatial covariance matrix $\Sigma_{\mu_0}$, which is usually related to a stationary and isotropic spatial process.
The underlying regional process $\mu(t)$ has Markovian dynamics, which can be written as

$$\mu(t) = H\,\mu(t - 1) + \eta(t) \qquad (3)$$

where $H$ is a diagonal matrix, $H = \mathrm{diag}(h_1, \ldots, h_k)$, with $h_j > 0$. The process $\eta$ is a $k$-dimensional zero-mean Gaussian white noise with diagonal covariance matrix $\Sigma_\eta = \mathrm{diag}(\sigma^2_{\eta_1}, \ldots, \sigma^2_{\eta_k})$. Note that the diagonal structure of both $H$ and $\Sigma_\eta$ is a natural consequence of the orthogonality properties of the EOF decomposition.

The matrices $A(t)$ and $B(t)$ are the calibration components accounting for heterogeneity among instruments of different types. The former is the additive bias, which can be constant or time varying and is described in the next section, where the case of a two-instrument network is considered in more detail. The latter is the multiplicative bias matrix and, in the special case of the next section, is given by constant elements, say $B^*$. In some cases $B(t) = B^* F(t)$, where $F(t)$ is a matrix of time-averaging weights accounting for changes in sampling frequency. Matrix $C(t)$ is usually composed of a possibly time-dependent set of regression coefficients. Missing data affect the time-varying dimension of measurement Equation (1) and can be managed by a $(p_t \times n)$-dimensional matrix $M(t)$ which pre-multiplies the matrices $A$, $B$, and $C$.

The error component $\varepsilon(t) = (\varepsilon(t, s_1), \ldots, \varepsilon(t, s_{p_t}))'$ is a Gaussian process, independent over space and time, with covariance matrix $\Sigma_\varepsilon(t)$ which, in the case of pure measurement error, can be assumed diagonal, $\Sigma_\varepsilon(t) = \mathrm{diag}(\sigma^2_{\varepsilon_1}, \ldots, \sigma^2_{\varepsilon_{p_t}})$, where the different variances $\sigma^2_{\varepsilon_j}$ account for the varying precision of the different instruments.
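As an illustration, the structure of Equations (1)-(3) can be simulated in a few lines. What follows is a minimal sketch under stated assumptions, not the implementation used in this paper: the function name `simulate_gdc` and its arguments are hypothetical, $A$ and $B$ are taken constant in time, the covariate term $C(t)x(t)$ and the small-scale component $\mu_0$ are omitted, and readings are dropped at random to mimic the selection matrix $M(t)$.

```python
import numpy as np

def simulate_gdc(Phi, h, sigma_eta, A, B, sigma_eps, N, missing_prob=0.0, seed=0):
    """Simulate a simplified version of Equations (1)-(3).

    Assumptions (ours): A and B are constant in time, the covariate term
    C(t)x(t) and the small-scale component mu_0 are omitted, and each
    reading is missing independently with probability `missing_prob`.
    """
    rng = np.random.default_rng(seed)
    n, k = Phi.shape
    H = np.diag(h)                      # diagonal propagation matrix
    mu = np.zeros(k)
    Y = np.full((N, n), np.nan)         # NaN marks a missing reading
    MU = np.zeros((N, k))
    for t in range(N):
        mu = H @ mu + rng.normal(0.0, sigma_eta)                  # Equation (3)
        y_star = Phi @ mu                                         # Equation (2), mu_0 = 0
        y = A + B @ y_star + rng.normal(0.0, sigma_eps, size=n)   # Equation (1)
        observed = rng.random(n) >= missing_prob                  # rows selected by M(t)
        Y[t, observed] = y[observed]
        MU[t] = mu
    return Y, MU

# Hypothetical example: 6 instruments and k = 2 latent components.
Phi, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(6, 2)))
Y, MU = simulate_gdc(Phi, h=[0.9, 0.5], sigma_eta=[1.0, 0.5],
                     A=np.zeros(6), B=np.eye(6), sigma_eps=0.1,
                     N=50, missing_prob=0.2)
```

Storing missing readings as NaN, rather than building $M(t)$ explicitly, is only a convenience of this sketch; selecting the observed rows at each time step plays the same role.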
3. THE SIMPLIFIED GDC MODEL
In this section, we discuss a special case of the GDC model which is especially useful in European PM10 networks. On the one hand, as a particular case of the previous section, we have only two kinds of instruments; on the other hand, we may have more than one instrument in the same station. This is the case for the instruments introduced above, namely LVG and TEOM, which, in some stations, are both installed for calibration and research purposes.

Considering daily data, let $y_G(t, s)$ denote the concentrations of PM10 measured by the LVG monitor at time $t = 1, \ldots, N$ and stations $s = s_1, \ldots, s_p$. Moreover, let $y_T(t, s')$ denote the PM10 measurements of the TEOM monitors at stations $s' = s'_1, \ldots, s'_q$, which may be different from the previous LVG stations but, as mentioned, overlaps are possible. The measurements $y_G$ and $y_T$ are related by the following equations

$$\begin{aligned} y_G(t, s) &= y^*(t, s) + x_G(t, s) + \varepsilon_G(t, s) \\ y_T(t, s') &= \alpha(t) + \beta\,y^*(t, s') + x_T(t, s') + \varepsilon_T(t, s') \end{aligned} \qquad (4)$$

Note that, in this setup, the number of all instruments is $n = p + q$. Moreover, we assume that the error components $\varepsilon_G(t, s)$ and $\varepsilon_T(t, s')$ are independent Gaussian white noises with mean zero and standard deviations $\sigma_{\varepsilon_G}$ and $\sigma_{\varepsilon_T}$, respectively.

The small-scale variation $\mu_0(t, s)$ can be used for refining the spatial prediction with Kriging techniques, as in Wikle and Cressie (1999), or can be encompassed in the error component, as in Xu and
Wikle (2006). In the latter case

$$y^*(t, s) = \Phi(s)\,\mu(t) \qquad (5)$$

and the error variances are correspondingly inflated by a term which depends on the variance of the residual component of the EOF truncation, namely

$$\mathrm{Var}(y_G - y_G^*) = \mathrm{Var}(\mu_0) + \sigma_G^2$$
$$\mathrm{Var}(y_T - y_T^*) = \beta^2\,\mathrm{Var}(\mu_0) + \sigma_T^2.$$

Note that the weight $\Phi(s)$ in Equation (5) depends only on $s$. Hence, in the case of two or more instruments in the same station, the corresponding rows of the matrix $\Phi$ have equal elements.

The bias components $\alpha(t)$ and $\beta$ may be relevant for TEOM calibration. For example, we may assume that $\alpha(t)$ is generated by a scalar Markovian process

$$\alpha(t) = h_\alpha\,\alpha(t - 1) + \zeta(t) \qquad (6)$$

where $0 < h_\alpha < 1$ and $\zeta(t)$ is a Gaussian white noise with variance $\sigma_\zeta^2$. This time-varying additive bias may be useful when the instrument errors depend on temperature and other meteorological quantities which are not fully observed but have time inertia. As a special case, we may have a simple constant additive bias, namely $\alpha(t) = \alpha$. In general, if proxies $x(t)$ of the above meteorological quantities are available, then they can be used to account for at least a part of such variation. In the sequel, we will call $y^*(t, s)$ and $\alpha + \beta y^*(t, s')$ the pure GDC components of model Equation (4), to emphasize the case where the observations $y$ have been adjusted for a deterministic component $x(t, s)$.
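Since Equations (3)-(5) define a linear Gaussian state-space model, one standard way to evaluate its Gaussian likelihood is the prediction-error decomposition given by the Kalman filter. The sketch below illustrates this for the two-instrument case; it is not taken from this paper. The function name `gdc_loglik`, the constant bias $\alpha$, the absence of covariates, the common error variance, and the vague initial state covariance are all assumptions of the sketch.

```python
import numpy as np

def gdc_loglik(Y, is_teom, Phi, h, sigma2_eta, alpha, beta, sigma2_eps):
    """Gaussian log-likelihood of the two-instrument model via the Kalman filter.

    Y: (N, n) array, NaN for missing readings; is_teom: (n,) boolean mask;
    Phi: (n, k) loadings; h and sigma2_eta: scalars or length-k sequences.
    Assumes constant alpha, no covariates, and a common error variance.
    """
    N, n = Y.shape
    k = Phi.shape[1]
    H = np.diag(np.broadcast_to(h, (k,)).astype(float))
    Q = np.diag(np.broadcast_to(sigma2_eta, (k,)).astype(float))
    Z = np.where(is_teom, beta, 1.0)[:, None] * Phi   # measurement matrix
    d = np.where(is_teom, alpha, 0.0)                 # additive bias
    m, P = np.zeros(k), 10.0 * np.eye(k)              # assumed vague initial state
    loglik = 0.0
    for t in range(N):
        m, P = H @ m, H @ P @ H.T + Q                 # prediction step
        obs = ~np.isnan(Y[t])
        if not obs.any():
            continue                                  # nothing observed at time t
        Zt = Z[obs]
        v = Y[t, obs] - d[obs] - Zt @ m               # innovation
        S = Zt @ P @ Zt.T + sigma2_eps * np.eye(obs.sum())
        Sinv_v = np.linalg.solve(S, v)
        loglik -= 0.5 * (obs.sum() * np.log(2 * np.pi)
                         + np.linalg.slogdet(S)[1] + v @ Sinv_v)
        K = np.linalg.solve(S, Zt @ P).T              # Kalman gain P Zt' S^{-1}
        m = m + K @ v                                 # update step
        P = P - K @ Zt @ P
    return loglik
```

Dropping the rows of the measurement matrix that correspond to missing readings is how the sketch mimics the selection matrix $M(t)$ of Section 2.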
4. METHODOLOGY
In this section, we discuss the empirical estimation of the GDC model, starting with the computation of the loading matrix $\Phi$. We then proceed with the estimation of the parameter set, say $\Theta$, which includes the error and innovation variances, the dynamical parameters $h$, and the calibration parameters $\alpha$ and $\beta$. Finally, we map the calibrated values obtained.
4.1. Empirical orthogonal functions
As a first step, we compute the loading matrix $\Phi$ by EOF analysis, a multivariate statistical technique useful for reducing the dimension of data generated by a spatio-temporal process (Mardia et al., 1998; Wikle and Cressie, 1999). This method is widely used in the oceanographic and meteorological sciences, where the great amount of data provided, for example, by satellite images needs to be reduced to a smaller number of components that retain most of the variability of the data.

Using the notation of the previous sections, the EOF method is based on the decomposition of the spatio-temporal process $y(t, s)$, $s = 1, \ldots, p + q$, using a set of deterministic spatial functions $\{\Phi_j(s), j = 1, \ldots, p + q\}$, which meet the properties of completeness and orthonormality, and a projection process $(\mu_1(t), \ldots, \mu_{p+q}(t))$, so that

$$y(t, s) = \sum_{j=1}^{p+q} \Phi_j(s)\,\mu_j(t)$$
where $\Phi_j(s)$ is the $j$-th generic eigenfunction, or the $j$-th EOF, obtained from a linear decomposition of the covariance function $\mathrm{cov}[y(t, s), y(t, s')]$. The dimension reduction is obtained by using only the first $k < p + q$ functions, weighted by the corresponding space-invariant time series of coefficients $\mu(t) = (\mu_1(t), \ldots, \mu_k(t))'$. The number $k$ of EOFs to retain is closely tied to the cumulative share of variance explained.

As described in Wikle (2003), the EOF technique corresponds to Principal Component Analysis (PCA) in a discrete framework, while in the continuous one it is based on the Karhunen-Loève (K-L) expansion. As we have discrete data defined on the network $G$ and we use the empirical covariance matrix of the observed data, we obtain the EOFs by performing an eigenvector decomposition. In particular, $\Phi_j$ is the $j$-th eigenvector ($j = 1, \ldots, k$), given by the vector $\Phi_j = [\Phi_j(s_1), \ldots, \Phi_j(s_n)]'$, that is, the $j$-th column of the loading matrix $\Phi$.

In our model, the EOF method is applied to reconstruct the process $y^*(t, s) = \Phi(s)\mu(t)$, where the $\Phi(s)$ are obtained as just described. Since the temporal component is included in the state equation (see Equation 3), the coefficients $\mu_j(t)$ are not given by the EOF analysis but are the output of the estimation procedure explained in the next subsection.

4.2. Parameter estimation
Conditionally on the loading matrix $\Phi$ of the previous subsection, we estimate the GDC model parameter set $\Theta$ using the maximum likelihood method. Considering the ‘two instruments’ model of Section 3, the vector $\Theta$ includes the propagation coefficients $h_1, \ldots, h_k$ from the Markovian Equation (3), which, under the homogeneous propagation hypothesis, reduce to a constant, $h$ say. Next, it includes the calibration constants from measurement Equation (4), namely $\alpha$, if appropriate, and $\beta$. Moreover, we have the innovation variances, namely $\Sigma_\eta = \mathrm{diag}(\sigma^2_{\eta_1}, \ldots, \sigma^2_{\eta_k})$ and $\sigma^2_\zeta$, and, finally, the error variances $\sigma^2_G$ and $\sigma^2_T$. The dimension of $\Theta$ is, of course, a concern for both interpretation and Gaussian likelihood maximization. Hence, further parameter simplification is welcome and will be considered in Section 5.

4.3. Mapping
As our aim is to have a continuous map of the daily air pollution pattern in the area where the network $G$ is located, we first need to define an arbitrarily fine grid of regularly spaced points covering the region. On this grid, we then estimate the values of the loading matrix $\Phi$ using the locally weighted polynomial regression known as loess, discussed in Cleveland and Devlin (1988). According to this, at each grid point, we fit a low-degree polynomial surface of the geographical coordinates using weighted robust least squares, even though the orthogonality of the interpolated loadings is not ensured. Once the loading matrix has been estimated over the considered area, the general calibrated values can be mapped following Equation (5), where the temporal component, $\mu(t)$, is given by the Kalman smoother.
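The EOF step of Subsection 4.1 and the grid interpolation just described can be sketched as follows. This is an illustration under stated assumptions, not the code used for the results of this paper: the function names are hypothetical, and `local_linear` is a simplified stand-in for loess (tricube weights and a local linear fit, without the robustness iterations of the full procedure).

```python
import numpy as np

def eof_loadings(Y, k):
    """First k EOFs from the empirical covariance of Y (N times n).

    Rows with any missing value are dropped (listwise deletion).
    Returns the (n, k) loading matrix and the cumulative variance shares.
    """
    complete = Y[~np.isnan(Y).any(axis=1)]
    eigval, eigvec = np.linalg.eigh(np.cov(complete, rowvar=False))
    order = np.argsort(eigval)[::-1]              # largest eigenvalue first
    eigval, eigvec = eigval[order], eigvec[:, order]
    return eigvec[:, :k], np.cumsum(eigval) / eigval.sum()

def local_linear(coords, values, grid, span=0.75):
    """Locally weighted linear fit with tricube weights at each grid point.

    A simplified stand-in for loess: no robustness iterations and
    degree one only. coords: (n, 2), values: (n,), grid: (g, 2).
    """
    n = coords.shape[0]
    q = min(n, max(4, int(np.ceil(span * n))))    # neighbours used per fit
    fitted = np.empty(grid.shape[0])
    for g, z in enumerate(grid):
        dist = np.linalg.norm(coords - z, axis=1)
        bandwidth = max(np.sort(dist)[q - 1], 1e-12)
        w = np.clip(1.0 - (dist / bandwidth) ** 3, 0.0, None) ** 3
        X = np.column_stack([np.ones(n), coords - z])   # design centred at z
        XtW = (X * w[:, None]).T
        coef = np.linalg.lstsq(XtW @ X, XtW @ values, rcond=None)[0]
        fitted[g] = coef[0]                       # intercept = fitted value at z
    return fitted

def map_calibrated(Phi, coords, grid, mu_t, span=0.75):
    """Interpolate each loading column on the grid, then apply Equation (5)."""
    Phi_grid = np.column_stack([local_linear(coords, Phi[:, j], grid, span)
                                for j in range(Phi.shape[1])])
    return Phi_grid @ mu_t
```

In this sketch `mu_t` stands for the smoothed state at one day, which would come from a separate filtering and smoothing step.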
Note that Wikle and Cressie (1999), in a similar context, applied spatial interpolation to the observed data before doing PCA. Moreover, Sahu and Mardia (2005) as well as Mardia et al. (1998) used interpolation based on spatial correlations. Although the latter approach may be superior to ours in some cases, we prefer non-parametric interpolation since, as will be clear from Section 5, our data do not show a suitable stationary spatial structure and a model based on the Kriging technique would not give useful results.

In order to assess the mapping uncertainty, we consider the uncertainty of both the spatial component $\Phi(s)$ and the time component $\mu(t)$ in Equation (5). For the former, we use the estimated uncertainty, say $\sigma^2_{\Phi_j(s)}$, given by the loess procedure, and for the latter we use the smoother conditional variance, say $\sigma^2_{\mu_j}$, given by the Kalman filter. Remembering that the components of $\mu(t)$ are approximately independent by construction, and assuming that, conditionally on the PCA decomposition, the spatial estimates of $\Phi(s)$ from loess are independent of the smoothed values $\mu(t)$, we assess uncertainty by the standard error based on the formula

$$\sigma^2(y^*(t, s)) \cong \sum_{j=1}^{k} \sigma^2_{\Phi_j(s)}\,\sigma^2_{\mu_j} + \sum_{j=1}^{k} \sigma^2_{\Phi_j(s)}\,\mu_j^2 + \sum_{j=1}^{k} \Phi_j(s)^2\,\sigma^2_{\mu_j} \qquad (7)$$
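Equation (7) is the sum over components of the variance of a product of two independent quantities. A minimal sketch of its evaluation at one grid point and one day is given below; the function name `mapping_se` and the argument names are hypothetical.

```python
import numpy as np

def mapping_se(phi, var_phi, mu, var_mu):
    """Standard error of y*(t, s) from Equation (7).

    phi, var_phi: interpolated loadings and their variances (length k);
    mu, var_mu: smoothed state and its conditional variances (length k).
    """
    phi, var_phi, mu, var_mu = (np.asarray(a, dtype=float)
                                for a in (phi, var_phi, mu, var_mu))
    variance = np.sum(var_phi * var_mu        # first term of Equation (7)
                      + var_phi * mu ** 2     # second term
                      + phi ** 2 * var_mu)    # third term
    return np.sqrt(variance)

# Hypothetical inputs with k = 2 components.
se = mapping_se(phi=[1.0, 2.0], var_phi=[0.1, 0.2],
                mu=[3.0, 4.0], var_mu=[0.5, 0.25])
```

With these inputs the three terms equal 0.1, 4.1, and 1.5, so the variance is 5.7.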
5. APPLICATION TO THE Po VALLEY
In this section, we consider the PM10 daily concentrations coming from a network of n = 54 sites in the Po Valley, Italy. The network area in Figure 1 measures approximately 400 × 200 km and is framed by mountains to the south, west, and north, with a sea coast to the east. The central plain is densely inhabited, with heavy vehicle traffic, heating, and industrial emissions. Moreover, intensive agricultural activity is an important source of land and water contamination.

Figure 1. The Po Valley PM10 monitoring network

In the year 2003, which is the year analyzed in this paper, with N = 365, we have a marked instrumental heterogeneity. As a matter of fact, as shown in Table 1, some regions have many TEOMs and few LVGs, while the opposite holds for others. Note that we have one station with both LVG and TEOM instruments. This is the Consolata station in Turin, which was conceived for instrumental testing purposes and will also be especially useful for the statistical validation of our model. Hence we use p + q = 55 instruments.

Table 1. Instrument classification among the Po Valley regions
                  Piemonte   Lombardia   Emilia   Po Valley
TEOM                     2          14        4          20
LVG                     17           2       16          35
All instruments         19          16       20          55

Unfortunately, at the time of carrying out this study, no homogeneous covariates were available for the entire area. Nevertheless, some seasonal adjustment will be made in Subsection 5.7. Moreover, Table 2 shows the amount of missing data for our network.

Table 2. Missing data in the Po Valley network
                  Piemonte   Lombardia   Emilia   Po Valley
LVG                   5.4%       12.9%     5.2%        5.7%
TEOM                  6.4%        2.0%     6.0%        3.2%
All instruments       5.5%        3.3%     5.3%        4.8%

In the rest of the paper, we discuss an empirical model for log-transformed data which is intended to explain spatio-temporal variation, calibration, and smoothing of the data. Alongside this, in Subsection 5.7 we discuss the parallel modeling of deseasonalized and mean-removed log-transformed data, which may be interpreted as explaining the local variation after spatio-temporal trend removal. Differences and peculiarities will be commented on in the sequel.

5.1. Data transformation
In order to reduce the long tails of the data, we tried square root and logarithmic transformations, which gave very close results from the point of view of the resulting distribution shape. Since Graf-Jaccottet and Jaunin (1998) and Fassò and Negri (2002) noted that high-frequency and daily air quality data often give rise to heteroskedasticity, we opted for the log-transform, as it turns multiplicative seasonality and heteroskedasticity into additive components.

5.2. Spatial component
Although particulate matter is known to show strong spatial persistence, the directional variograms of the raw data reported in Figure 2 show that this structure does not have a simple stationary
interpretation either in the isotropic or in the anisotropic case.

Figure 2. Directional variograms of raw data (distance in decimal degrees)

Note that we get a similar variogram description if we consider data coming from the log-transform, the deseasonalization, or the residuals of model Equation (4). Similar results are also obtained for the spatial covariance of the annual averages. Although, in principle, this problem could be attributed to non-stationary and nonseparable spatio-temporal structures of increasing complexity, as recently discussed by Porcu and Mateu (2005), we choose the non-stationary approach based on the EOFs of Subsection 4.1. To do this, we consider the covariance matrix with listwise deleted missing data. Figure 3 shows the cumulative percentage of variance explained by the first principal components. In usual PCA applications, the choice of the dimension k is simply based on this graph. Since our PCA is model oriented, we choose k = 14 using the mean absolute error (MAE) criterion in the crossvalidation approach of Subsection 5.6.

5.3. MLE component
The parameter estimates are given in Table 3; in this section we give some comments about the model. As a result of the discussion in Subsection 5.2, we omit the small-scale component $\mu_0(t)$ because the
residuals of Equation (1) do not show an interesting spatial correlation. Nevertheless, since the empirical residuals covariance matrix $\hat{\Sigma}_\varepsilon$ is not diagonal, we use a two-step estimation procedure. At the first step, we perform the MLE of Subsection 4.2 conditional on the spherical hypothesis $\Sigma_\varepsilon = \sigma_\varepsilon^2 I$; then, at the second step, we re-estimate the model conditioning on $\Sigma_\varepsilon = \sigma_\varepsilon^2 \hat{R}$, where $\hat{R}$ is the residual empirical correlation matrix of the previous step. In doing this, on the grounds of a preliminary unreported model, we use the homoskedastic assumption $\sigma_G^2 = \sigma_T^2$. Note that, although these two variances may have slightly different interpretations, this result is coherent with the known good precision properties of the TEOM instruments.

Table 3. MLE parameters of GDC model
            α       β       h       log σ²η1   log σ²η2   log σ²η3   log σ²η4   log σ²ε
Estimate    2.320   0.337   0.996    2.162     −0.378     −0.522     −4.354     −1.282
SE          0.029   0.007   0.002    0.082      0.148      0.161      0.117      0.011

The stochastic additive component $\alpha(t)$ has been considered only for a preliminary model applied to log-transformed data. With these data, it turns out to be unnecessary, as it improves neither the unbiasedness of the reconstructed observations nor the quadratic or absolute errors. Hence, we are led to a model with $\alpha(t) = \mathrm{const} = \alpha$.

The PCA decomposition of Figure 3 showed a marked reduction of the individual variance share after the third component. In the GDC model, the corresponding innovations $\eta_1, \ldots, \eta_k$ have variances
which decrease even faster; hence we make the simplifying assumption $\sigma^2_{\eta_4} = \cdots = \sigma^2_{\eta_k}$, which is based on empirical evidence. Moreover, note that $\sigma^2_{\eta_2}$ is quite similar to $\sigma^2_{\eta_3}$, so a further dimension reduction would be possible. Note that the calibration coefficient $\beta$ is positive and less than one, as expected. Moreover, the Markovian propagation coefficient $h$ is very close to one, which is related to the presence of a temporal component, as discussed in Subsections 5.4 and 5.7. Residuals for both measurement and dynamical equations give only partial agreement with the Gaussian distribution.

Figure 3. Cumulative percentage variance of principal components
5.4. Latent components
The first latent component $\mu_1(t)$ has a quite large variance $\sigma^2_{\eta_1} = 8.69$ and empirical average 26.76. Together with the other components among the first four, it determines the common average level and seasonality, as shown by Figure 4. As expected, the remaining components are closer to zero and describe the residual spatial variations.
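Table 3 reports the innovation variances on the log scale, so the value 8.69 quoted above can be checked by back-transformation. The short sketch below assumes that the pair (2.162, 0.082) in Table 3 is the estimate and standard error of $\log \sigma^2_{\eta_1}$; the rough interval obtained from plus or minus two standard errors on the log scale is our illustration and is not reported in the paper.

```python
import math

log_var, se = 2.162, 0.082           # log-variance estimate and its standard error
variance = math.exp(log_var)         # back-transformed innovation variance
# Rough interval: +/- 2 standard errors on the log scale, then exponentiate.
interval = (math.exp(log_var - 2 * se), math.exp(log_var + 2 * se))
print(round(variance, 2))            # 8.69
```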
Figure 4. First six components of µ(t)
Figure 5. Log values regional map of the phenomenon (21st day)
5.5. Mapping
Using the method of Subsection 4.3, we map the smoothed pattern $y^*(t, s)$. For example, Figure 5 shows the log values for 21st January 2003, in gray scale with superimposed contour lines and the station network (circles). Excluding the edge effects, which can be related to the technique used, the map gives an idea of the zones where the PM10 concentrations are higher. For example, the high-pollution, dark gray zone on the north-east side of the map covers the heavily inhabited and industrialized area of the Lombardia region, which is crossed by a motorway carrying heavy traffic. Moreover, Figure 6 gives the corresponding uncertainty map with the standard errors $\sigma(y^*)$ obtained by Equation (7). Note that this quantity is given by three additive terms. It is interesting to note that the level of variability is not very high, especially if we exclude the darker area, which is influenced by the edge effect mentioned above.

5.6. Crossvalidation and sensitivity analysis
In order to evaluate the sensitivity to the network configuration, we carry out a crossvalidation analysis by removing one station at a time. In particular, after excluding the i-th location from the data input,
we compute the loading matrix, $\Phi_{-i}$ say, and, using the GDC model of Section 5, we estimate the corresponding parameter vector, $\Theta_{-i}$ say. Since the predictions of $y_G$ and $y_T$ at site $s_i$ are given by $\Phi(s_i)\mu(t)$ and $\alpha + \beta\Phi(s_i)\mu(t)$, respectively, we predict the i-th row of the loadings matrix by the local polynomial regression of Subsection 4.3 applied to $\Phi_{-i}$. After this, we calculate the daily crossvalidation errors for each station as differences between the predicted and the observed data. Subsequently, the absolute value of the station bias, given by the yearly average error for each station, is drawn in Figure 12. Moreover, the network bias, given by the network average of the absolute values of the station bias, the root mean square error (RMSE), and the MAE are reported in Table 4, which is computed for various PCA dimensions k. Hence, minimizing the MAE gives k = 14.

Figure 6. Standard errors regional map (21st day)

In Figures 7 and 8, we plot the instrument additive bias $\alpha_{-i}$ and the calibration coefficient $\beta_{-i}$ given by each GDC model of the crossvalidation procedure. These and the remaining figures are divided into two areas: the left one is for LVG monitors and the right one for TEOM monitors. Moreover, we add a solid line referring to the parameter value of the general GDC model (see Table 3) and dashed lines for the $\pm 2\sigma$ interval, where the estimated standard deviation of the MLE is used for $\sigma$.
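The leave-one-station-out procedure and the summary measures described above can be sketched as follows. This is an illustration only: `fit_predict` is a hypothetical user-supplied function standing for the whole refitting step (EOF, MLE, and interpolation of the loadings at the excluded site), and pooling the RMSE and MAE over all stations and days is our assumption, since the aggregation is not spelled out in the text.

```python
import numpy as np

def leave_one_station_out(Y, coords, fit_predict):
    """Return the (N, n) matrix of crossvalidation errors (predicted minus observed).

    `fit_predict(Y_train, coords_train, coord_target)` is a hypothetical
    user-supplied function that refits the model without station i and
    returns the N predictions at the excluded site.
    """
    N, n = Y.shape
    errors = np.full((N, n), np.nan)
    for i in range(n):
        keep = np.arange(n) != i
        pred = fit_predict(Y[:, keep], coords[keep], coords[i])
        errors[:, i] = pred - Y[:, i]          # NaN where station i is missing
    return errors

def cv_summary(errors):
    """Station bias, network bias, RMSE and MAE from the error matrix."""
    station_bias = np.abs(np.nanmean(errors, axis=0))     # |yearly mean error|
    return {
        "station_bias": station_bias,
        "bias": float(np.mean(station_bias)),             # network bias
        "rmse": float(np.sqrt(np.nanmean(errors ** 2))),  # pooled over stations and days
        "mae": float(np.nanmean(np.abs(errors))),
    }
```

Repeating the two calls for several values of k and keeping the value with the smallest MAE mirrors the selection rule used for Table 4.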
Figure 7. Instrument additive bias $\alpha_{-i}$ for LVG (left) and TEOM (right). Legend: solid circle, Lombardia; square, Emilia Romagna; cross, Piemonte. Solid line, coefficient $\alpha$ estimated from the full network (see Table 3); dashed lines, $\alpha \pm 2\sigma_\alpha$
Table 4. Network crossvalidation performance for PCA dimension k
k          6        10       14       18       22
RMSE    0.4465   0.4506   0.4339   0.4451   0.4326
MAE     0.3454   0.3485   0.3297   0.3394   0.33
Bias    0.2056   0.2197   0.1967   0.1988   0.1926

It is apparent that the two parameters $\alpha_{-i}$ and $\beta_{-i}$ have a strong negative correlation, as shown by the correlations from both the MLE and the crossvalidation procedure, which are −0.93 and −0.85, respectively. Figure 9 shows that the network has a negligible influence on the Markovian persistence parameter $h$. This is in accord with Fassò and Nicolis (2006).

To take into account all the parameters together and their mutual correlation, we assess the influence of each station on the model using the so-called Cook distance, reported in Figure 10, which is
Copyright © 2007 John Wiley & Sons, Ltd.
Environmetrics 2007; 18: 245–264 DOI: 10.1002/env
HETEROGENEOUS NETWORKS MONITORING
259
Figure 8. Instrument multiplicative bias β−i for LVG (left) and TEOM (right). Legend: solid circle, Lombardia; square, Emilia Romagna; cross, Piemonte. Solid line, coefficient β estimated from the full network (see Table 3); dashed lines, β ± 2σβ
given by D−i = ( −i − )T V −1 ( −i − ) where is the parameter vector estimated on the full network with estimated variance– covariance matrix V. This extends sensitivity analysis, discussed for example by Fass`o and Perri (2002), to heterogeneous networks. Note that D−i has an unknown distribution, nevertheless, following Cook (1977), we use the χ2 percentile which has a reference descriptive value. Figures 11 and 12 report the MAE and absolute bias respectively allowing us to assess the reconstruction capability of the model. Note that, as often happens, largest values of MAE are associated with the largest bias. This point is further discussed in the next Subsection 5.7. Moreover, the comparison of these figures with Cook distance and related Figures 7, 8, 9, and 10 allows us to discriminate between stations which have influence on parameter estimation and spatial prediction. For example, Station n.3, located in Piacenza and called the Pubblico Passeggio Station, has extreme values in all the considered diagnostic graphical tools. Hence it is quite influential to
Copyright © 2007 John Wiley & Sons, Ltd.
Environmetrics 2007; 18: 245–264 DOI: 10.1002/env
260
Figure 9.
` M. CAMELETTI AND O. NICOLIS A. FASSO,
Propagation coefficient h−i for LVG (left) and TEOM (right). Legend: solid circle, Lombardia; square, Emilia Romagna; cross, Piemonte. Solid line, coefficient h estimated from the full network (see Table 3)
both parameter estimation and data reconstruction and, in this sense, it could be considered as an outlier.
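As an illustration of the influence measure, the following Python sketch evaluates the quadratic form $D_{-i}$ for a hypothetical two-parameter model. All numbers, the station labels and the helper names (`solve`, `cook_distance`) are assumptions made for the example; the closed-form threshold $-2\log(0.05)$ is the 95th percentile of the $\chi^2$ distribution only when there are 2 degrees of freedom.

```python
from math import log

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def cook_distance(theta_loo, theta_full, v):
    """Quadratic form (theta_loo - theta_full)^T V^{-1} (theta_loo - theta_full)."""
    d = [a - b for a, b in zip(theta_loo, theta_full)]
    w = solve(v, d)  # w = V^{-1} d without forming the inverse explicitly
    return sum(di * wi for di, wi in zip(d, w))

# Hypothetical full-network estimate and covariance (negatively correlated parameters).
theta_full = [1.0, 1.0]
v = [[2.0, -1.0], [-1.0, 1.0]]

# Hypothetical leave-one-out estimates for two stations.
theta_without = {"A": [2.0, 0.0], "B": [3.0, 3.0]}
distances = {s: cook_distance(t, theta_full, v) for s, t in theta_without.items()}

# 95th percentile of the chi-square distribution with 2 degrees of freedom (closed form).
threshold = -2.0 * log(0.05)
flagged = [s for s, d in distances.items() if d > threshold]
```

In this toy setting, the displacement of station A follows the negative correlation between the two parameters and yields a small distance, whereas station B moves both parameters in the same direction and exceeds the reference percentile. For a parameter vector of general dimension, the percentile would be taken from a statistical library, for example `scipy.stats.chi2.ppf(0.95, df)`.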
Figure 10. Cook distance $D_{-i}$. Legend: solid circle, Lombardia; square, Emilia Romagna; cross, Piemonte. Solid line, 95th percentile of the $\chi^2$ distribution with $\dim(\theta)$ degrees of freedom

Figure 11. Mean absolute error for LVG (left) and TEOM (right). Legend: solid circle, Lombardia; square, Emilia Romagna; cross, Piemonte

5.7. Separating the model components

As mentioned above, for some stations the model of Table 3 may give a non-negligible bias, and the network bias of Table 4 is 0.197. The question which arises is whether this bias depends on the pure GDC component or on some unaccounted deterministic factors. To deal with this problem, we performed parallel modeling with deseasonalized and mean-removed data. In particular, for each station, we removed the mean and a non-parametric seasonal component given by the mentioned loess procedure with $\lambda = 0.5$. Using these data, namely $y - x$, and the crossvalidation approach introduced above, we found a slightly smaller PCA dimension, $k = 10$, which is consistent with the interpretation of Subsection 5.4. The resulting model is reported in Table 5, where it can be seen that the additive calibration bias $\alpha$ has been omitted. Note that, as a consequence of the detrending procedure, the Markovian persistence coefficient $h$ is noticeably smaller than the corresponding value of Table 3. With this approach, the pure GDC component bias, or network bias, is now considerably smaller, being 0.004.

Table 5. MLE parameters of deseasonalized pure GDC model

Parameter                    Estimate   SE
$\beta$                       0.756     0.008
$h$                           0.532     0.017
$\log \sigma^2_{\eta_1}$      1.614     0.076
$\log \sigma^2_{\eta_2}$     −0.356     0.077
$\log \sigma^2_{\eta_3}$     −0.583     0.085
$\log \sigma^2_{\eta_4}$     −2.133     0.042
$\log \sigma^2_{\varepsilon}$ −3.229    0.011

In this deseasonalized model, we do not make any spatial prediction of the systematic component $x(t, s)$, which is assumed known. Since, as discussed in Subsection 5.2, the component $x(t, s)$, and in particular the annual average, does not have a simple spatial covariance, one can use loess spatial interpolation for $x(t, s)$. In doing this, an error of the same order of magnitude as that of the model of the previous section arises. In particular, the general network bias for the spatial interpolation of the yearly means is 0.18. The same problem can also be seen using the model for the undetrended data of the previous section and an ANOVA-like decomposition applied to the network average of Equation (7). In particular, Table 6 shows that the second component, which is related to the site-specific average level, dominates the other two.
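The dominance of the second component can be made explicit by expressing the three network averages of Table 6 as shares of their sum. The short Python sketch below does this arithmetic; treating the three terms as additive parts of a single total is an assumption suggested by the ANOVA-like wording, and the dictionary keys are labels chosen for the example.

```python
# Network averages of the three components, as reported in Table 6.
components = {"first": 0.0442, "second": 1.4438, "third": 0.0923}

# Assumption: the three components add up to the total being decomposed.
total = sum(components.values())

# Share of each component in the total.
shares = {name: value / total for name, value in components.items()}

# Component with the largest share.
dominant = max(shares, key=shares.get)
```

Under this assumption the second component accounts for about 91% of the total.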
Table 6. ANOVA of network averages of Equation (7)

Source                                           Network average
$\sum_i \sigma^2_{\phi_i} \sigma^2_{\mu_i}$      0.0442
$\sum_i \sigma^2_{\phi_i} \mu_i^2$               1.4438
$\sum_i \phi_i^2 \sigma^2_{\mu_i}$               0.0923

Figure 12. Absolute bias for LVG (left) and TEOM (right). Legend: solid circle, Lombardia; square, Emilia Romagna; cross, Piemonte

6. CONCLUSIONS AND FURTHER DEVELOPMENTS

We discussed spatio-temporal modeling of fine particulate data from the heterogeneous network of the Po Valley. In particular, both systematic and stochastic components are discussed. It is shown that the GDC model can be used to analyze such data, especially after removal of the systematic or deterministic component. This calls for additional site-specific information. Moreover, some techniques for model validation and sensitivity analysis are discussed, which allow us to assess the network redundancy and the station information content. The crossvalidation analysis shows that model estimation in general, and the bias coefficients $\alpha$ and $\beta$ in particular, are strongly influenced by the network configuration. Hence the network design should be carefully considered using, for example, sequential design; see Wikle and Royle (1999) and Arbia and Lafratta (2002).
ACKNOWLEDGMENTS
This work was partially supported by PRIN 2004 grant n. 137478. The authors are also grateful to an anonymous referee for the useful comments. We thank Dott.ssa R. Ignaccolo for the Piemonte data, Dott. F. Greco for the Emilia Romagna data, and Ing. V. Gianelle for the Lombardia data.
REFERENCES

Arbia G, Lafratta G. 2002. Anisotropic spatial sampling designs for urban pollution. Applied Statistics 51: 223–234.
Biggeri A, Bellini P, Terracini B. 2004. Meta-analysis of the Italian studies on short term effects of air pollution 1996–2002. Epidemiologia e Prevenzione 28-suppl: 1–100.
Brown PE, Diggle PJ, Lord ME, Young P. 2001. Space-time calibration of radar rainfall data. Journal of the Royal Statistical Society, Series C 50: 221–241.
Cleveland WS, Devlin SJ. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association 83: 596–610.
Cook RD. 1977. Detection of influential observations in linear regression. Technometrics 19: 15–18.
European Union Council Directive 96/62/CEE. 1996. Official Journal L 296: 55–63.
Fassò A, Negri I. 2002. Nonlinear statistical modelling of high frequency ground ozone data. Environmetrics 13: 225–241.
Fassò A, Nicolis O. 2004. Modelling dynamics and uncertainty in assessment of quality standards for fine particulate matters. Statistica e Applicazioni (to appear).
Fassò A, Nicolis O. 2005. Space-time integration of heterogeneous networks in air quality monitoring. Proceedings of the Italian Statistical Society Conference on ‘Statistica e Ambiente’, Messina, 21–23 September 2005, 1.
Fassò A, Perri PF. 2002. Sensitivity analysis. In Encyclopedia of Environmetrics, El-Shaarawi A, Piegorsch W (eds). Wiley: New York 4: 1968–1982.
Goodall C, Mardia KV. 1994. Challenges in multivariate spatial modelling. Proceedings of the XVIIth International Biometric Conference, Hamilton, Ontario, Canada: 8–12.
Graf-Jaccottet M, Jaunin MH. 1998. Predictive models for ground ozone and nitrogen dioxide time series. Environmetrics 9: 393–406.
Hauser R, Godleski JJ, Hatch V, Christiani DC. 2001. Ultrafine particles in human lung macrophages. International Archives of Environmental Health 56: 150–156.
Kyriakidis PC, Journel AG. 1999. Geostatistical space-time models: a review. Mathematical Geology 31: 651–684.
Mardia K, Goodall C, Redfern E, Alonso F. 1998. The kriged Kalman filter. Sociedad de Estadistica y Investigacion Operativa Test 7: 217–285.
McBride S, Clyde M. 2003. Hierarchical Bayesian calibration with reference priors: an application to airborne particulate matter monitoring data. Discussion paper n.23/2003. ISDS, Duke University; 1–36.
Osborne C. 1991. Statistical calibration: a review. International Statistical Review 59: 309–366.
Ostro B, Lipsett M, Mann J, Krupnick A, Harrington W. 1993. Air pollution and respiratory morbidity among adults in Southern California. American Journal of Epidemiology 137: 691–700.
Porcu E, Mateu J. 2005. Mixture-based modeling for space-time data. Environmetrics (to appear).
Sahu SK, Mardia KV. 2005. A Bayesian Kriged-Kalman model for short-term forecasting of air pollution levels. Journal of the Royal Statistical Society, Series C 54: 223–244.
Wikle CK. 2003. Spatio-temporal models in climatology. Encyclopedia of Life Support Systems. EOLSS, Paris.
Wikle CK, Berliner LM, Cressie N. 1998. Hierarchical Bayesian space-time models. Environmental and Ecological Statistics 5: 117–154.
Wikle CK, Cressie N. 1999. A dimension-reduced approach to space-time Kalman filtering. Biometrika 86: 815–824.
Wikle CK, Royle JA. 1999. Space-time dynamic design of environmental monitoring networks. Journal of Agricultural, Biological and Environmental Statistics 4: 489–507.
Xu K, Wikle CK. 2007. Estimation of parameterized spatio-temporal dynamic models. Journal of Statistical Planning and Inference 137: 567–588.