3B2 Version 6.05e/W (Mar 29 1999)
{Kluwer}Eest/Eest 11_1/254725/5254725.3d
Date: 26/11/03
Time 11:47am
Page 31 of 54
Environmental and Ecological Statistics 11, 31–54, 2004
A critique of statistical aspects of ecological studies in spatial epidemiology

JONATHAN WAKEFIELD

Departments of Statistics and Biostatistics, University of Washington, Seattle, USA and Small Area Health Statistics Unit, Department of Epidemiology and Public Health, Imperial College School of Medicine, London, UK

Accepted June 2003

In this article, the mathematical assumptions of a number of commonly used ecological regression models are made explicit, critically assessed, and related to ecological bias. In particular, the role and interpretation of random effects models are examined. The modeling of spatial variability is considered and related to an underlying continuous spatial field. The examination of such a field with respect to the modeling of risk in relation to a point source highlights an inconsistency in commonly used approaches. A theme of the paper is to examine how plausible individual-level models relate to those used in practice at the aggregate level. The individual-level models acknowledge confounding, within-area variability in exposures and confounders, measurement error and data anomalies and so we can examine how the area-level versions consider these aspects. We briefly discuss designs that efficiently survey individual responses and that would appear to be useful in environmental settings.
Uncorrected Proof
Keywords: cross-level bias, confounding, ecological fallacy, exposure misclassification, pure specification bias, spatial epidemiology, within-area variability

1352-8505 © 2004
Kluwer Academic Publishers
1. Introduction

Ecological studies utilize data at the level of the group rather than the individual, and have a long history in epidemiology (Morgenstern, 1998), as well as sociology (Robinson, 1950), political science (Cleave et al., 1995; King, 1997) and geography (Openshaw, 1984). In this paper we consider epidemiological studies in which the groups correspond to geographical areas. The term "ecological" has other connotations in environmetrics where studies such as those considered here may be referred to as "aggregate" data studies; we wish to avoid this term, however, since in epidemiology this is associated with a specific method of Prentice and Sheppard (1995). In this article, the groups will be geographical areas, and the response will be a measure of disease incidence/mortality. Ecological studies are controversial due to the additional biases, beyond those of individual-level observational studies, that may be present. When ecological data are used to make inference about individuals, then reported associations may be highly misleading with the discrepancy sometimes referred to as the ecological fallacy. Ecological, aggregate and cross-level bias are all terms that have been used as umbrella terms for the biases that
M11109 Kluwer Academic Publishers
Environmental and Ecological Statistics (EEST)
Tradespools, Frome, Somerset
may result from the use of ecological data. Ecological studies are appealing, however, since they may use routinely collected data, and can produce greater exposure contrasts than those available in individual-level studies, since they are typically carried out over larger geographical areas than the latter.

In addition to the usual biases that may arise in observational studies, such as those due to confounding, effect modification and exposure misclassification, each of which may be more complex in ecological studies, there are a number of additional biases that arise from within-area variability in exposures and confounders. When an aggregated risk model is formed from an individual-level risk model, the two mathematical forms will generally differ, and ignoring this can lead to what has been termed pure specification bias (Greenland, 1992). The lack of mutual standardization between the response and exposure may also produce bias (Rosenbaum and Rubin, 1984). For example, if the response is directly standardized for, say, an age distribution, then the exposure of interest must be standardized in the same way. The latter is difficult to achieve since exposure distributions by age (for example) are typically not routinely available. There is a large epidemiological literature on ecological bias and the ecological fallacy; see, for example, Richardson et al. (1987), Piantodosi et al. (1988), Greenland and Morgenstern (1989), Greenland and Robins (1994), Richardson and Montfort (2000), Wakefield and Salway (2001) and Wakefield (2003).

The uses of aggregate data include: investigation of the variability of disease risk or relative risks across a region (disease mapping), examining the association between risk and environmental and/or behavioral exposures (ecological regression), and surveillance of routine health statistics for early detection of "hot spots" of risk (cluster detection).
Recent statistical developments in environmental epidemiology may be found in the edited volume of Elliott et al. (2000). In this paper, we concentrate on ecological regression studies (which are also known as geographical correlation studies) and first describe a number of examples from the literature. There are various distinctions that may be made for ecological regression studies; "exposures" of interest may be environmental (in air, water or soil), behavioral (e.g., diet, alcohol, smoking), social characteristics (for example, measures of socioeconomic status that may or may not be related to behavioral variables), genetic, or demographic (e.g., race). The geographical scale of examination, and the suitability of a study, are determined by data availability. Exposures arising from a point or line source offer exposure contrasts at small scales and so small-area data are required. In contrast, dietary variables show little variation across small scales and so international studies are used (INTERSALT Cooperative Group, 1988; Prentice and Sheppard, 1990; Riboli, 1992). In this paper, the focus is on environmental exposures which vary on a smaller geographical scale, and this is the scale we emphasize.

A number of ecological studies of point sources of pollution have been carried out by the Small Area Health Statistics Unit (SAHSU) in the United Kingdom (Elliott et al., 1992b). We describe the form of the data in a generic study carried out by the SAHSU. Health data are available at the level of the postcode (each containing on average 14 households), with population data at the level of the census enumeration district (each containing on average 400 people). The health outcome is often cancer incidence with a variety of cancer sites being examined. Age and gender are known confounders and may be controlled for at the level of the enumeration district.
Indices of socioeconomic status are also typically constructed at this level from census variables such as the percentages of overcrowding, car ownership, unemployment, etc. Such indices only allow partial control, however, since
they are reflecting many different aspects including behavioral variables, and access to services, and they are subject to ecological bias since a single measure will not allow control for the complete distribution of socio-economic confounders within the area. The exposure measure is usually very crude and is taken as the distance from the population-weighted centroid to the point source. SAHSU examples include: all incinerators of waste solvents and oils in Great Britain (Elliott et al., 1992a); a single petrochemical works at Baglan Bay, Wales (Sans et al., 1995); radio and TV transmitters (Dolk et al., 1997a,b); municipal incinerators (Elliott et al., 1996); cokeworks (Dolk et al., 1999); a pesticides factory (Wilkinson et al., 1997); landfill sites (Dolk et al., 1998); and industrial complexes that include major oil refineries (Wilkinson et al., 1999).

Line sources include roads and railway lines. Examples of pollution from a line source include the study of Best et al. (2001) who investigated the association between benzene and childhood leukemia (a large contribution to ambient benzene exposure is from cars though point sources also contribute). Studies of air pollution usually examine the short-term effects, though a number of spatial ecological studies have been carried out; Pope and Dockery (1996) provide a review. Semi-ecological studies are those in which individual health and confounder data are available, along with an ecological exposure. Such studies are beneficial in terms of confounding, but may still suffer from problems of ecological bias. The Six Cities Study (Dockery et al., 1993) provides an example of such a study in an air pollution setting.

There are a number of examples of the use of area data to investigate associations between health endpoints and socio-economic status (e.g., Krieger, 1990; Krieger et al., 1997; Williams and Collins, 1995). Guest et al.
(1998) describe the association of education, unemployment and racial segregation with age-, sex- and race-specific mortality rates in Chicago. Using data from men recruited to the Multiple Risk Factor Intervention Trial (MRFIT), Davey Smith et al. (1996a,b) observed a clear gradient in mortality with median zip code income, for both black and white men. Blanchard et al. (2001) examined associations between Crohn's disease and ulcerative colitis and socio-economic status variables in 52 postal areas in Manitoba over the years 1987–1996.

A large number of ecological studies have been carried out to investigate the association between water hardness and cardiovascular events. A typical study carried out in the northwest of England in 1990–1992 and investigating the association between myocardial infarction and magnesium in the residential water supply is described in Maheswaren et al. (1999), along with other references to this literature. A large number of studies have examined the association between radon and lung cancer at the ecological level; see Whitley and Darby (1999) for references, and Gelman et al. (2001) for methodological considerations.

Often ecological studies have discovered increases in risk due to occupational rather than environmental exposures. In the United States, clues to etiology were obtained from associations between oral cancer and snuff dipping in the South (Winn et al., 1981), and between occupational exposure and lung and laryngeal cancers in coastal Virginia (Blot et al., 1980). In Elliott et al. (1992b), a point source study was carried out to investigate increased risk of mesothelioma in the vicinity of Plymouth docks. This analysis revealed an estimated relative risk at source of 11, but further analysis revealed that this excess was due to occupational, rather than environmental, risk factors.

Geographical correlation studies can also give clues to common etiologies. For example, Bernardinelli et al.
(1999) examine the association between insulin dependent
diabetes mellitus (IDDM) and malaria with the hypothesis being that genetic selection caused by malaria has affected the susceptibility to IDDM, and Yasui et al. (2001) have hypothesized that primary Epstein-Barr virus (EBV) infection occurring during adolescence or adulthood is associated with elevated breast cancer risk. Data on EBV are unavailable routinely but EBV is thought to be a causal factor for Hodgkin's disease (IARC, 1997), and so the latter can be used as a surrogate. Other examples of ecological studies may be found in Richardson and Montfort (2000).

In studies of environmental pollution from point sources in developed countries, unless there is an accident resulting in a large increase of pollutant (such as that at Chernobyl), the increases in risk are often modest. Occupational studies tend to produce much larger increases (Siemiatycki et al., 1988). The studies carried out by the SAHSU have reported excesses in risk at source in the range 0.1–1.0, that is, relative risks of 1.1–2.0, and are consistent with the value of 1.5 quoted by Pekkanen and Pearce (2001) as being typical. These relative risks must be viewed in light of the fact that for cancers in particular there are risk factors that are far more predictive of disease than environmental factors, for example diet, smoking, alcohol consumption and genetic factors. Consequently, the potential for confounding is strong since ecological studies do not directly utilize individual-level risk factor data though, as noted above, stratification by age and gender is routinely carried out. The effects of such bias cast doubt not only on the conclusions of studies that reveal a small detrimental effect, but also on studies that reveal no association.

Problems of interpretation in ecological studies also arise from data quality issues, beyond the observational and aggregate nature of such studies.
In particular, because of the routinely collected nature of much of the data utilized in environmental epidemiology, errors in the reported number of cases, in the population at risk, and in exposure variables may be extremely influential. The number of cases and the size of the population are often referred to as the "numerator" and "denominator" in the epidemiology literature since a common risk estimate follows from the ratio of these quantities. For cancer cases and hospital admissions, respectively, registry (Best and Wakefield, 1999) and provider (Aylin et al., 2001) effects may also be significant. With respect to the denominators, the effects of under-enumeration at census and migration must be considered, and we examine how such errors may be represented by commonly used statistical models in Section 3. These and other data anomalies are not likely to be spatially neutral and could be a significant contribution to observed variation in risk estimates. Such errors have been described in detail by Elliott and Wakefield (1999) and will not be concentrated upon here, though the acknowledgment of their presence is vital, both when statistical analyses are interpreted, and when the level of statistical sophistication of the analysis is considered. In particular, for rare diseases analyzed at the small area level a small number of cases may effectively drive the analysis and the sensitivity of the conclusions should be assessed by removing these influential cases, with the checking of the accuracy of anomalies also being desirable. In terms of bias in estimates of association between risk and an environmental pollutant, it will often be the case that modeling spatial dependence will be of secondary importance when compared to biases arising from the presence of data anomalies and confounding. Wakefield (2003) describes how sensitivity to unmeasured confounding may be considered in an ecological setting.

The structure of this article is as follows.
In Section 2, we introduce notation, review a number of approaches that are currently used to analyze ecological data, and briefly summarize a number of aspects of ecological bias arising from within-area variability in
exposures and confounders. Section 3 provides a motivation for random effects models that are often utilized in ecological analyses. In Section 4, the modeling of a continuous risk surface is considered. In the context of the modeling of disease risk in relation to a point/line source of pollution, a number of inconsistencies in parametric approaches are highlighted. In Section 5, we discuss a number of design issues that should be considered when an ecological study is envisaged. Section 6 contains a concluding discussion.
2. Statistical framework

2.1 Notation and ecological models

We consider a study area $A$ that may be partitioned into sub-areas $A_i$, $i = 1, \ldots, n$, according to data availability. Within area $i$ we suppose there are $N_i = \sum_{c=1}^{C} N_{ic}$ individuals, where $N_{ic}$ denotes the number of individuals in confounder stratum $c$, $c = 1, \ldots, C$. By a confounder we mean, roughly, a variable that is associated with both response and exposure, does not lie on the causal pathway between exposure and response, and is not caused by the response (Rothman and Greenland, 1998). Typically these confounder variables will include age and gender, and possibly a measure of socioeconomic status. Richardson and Montfort (2000) provide a review of approaches to ecological inference in epidemiology.

Let $Y_{ic}$ denote the number of cases in stratum $c$ of area $i$, and $Y_i = \sum_{c=1}^{C} Y_{ic}$ the total number of cases in area $i$, over some fixed time period. We briefly comment on spatiotemporal models in Section 4. For a non-infectious disease we may assume that

$$Y_{ic} \mid p_{ic} \sim \text{Binomial}(N_{ic}, p_{ic}), \tag{1}$$

where $p_{ic}$ is the probability of disease in stratum $c$, area $i$. There are typically insufficient cases to estimate the collection $p_{ic}$ and so a common simplification is to assume that $p_{ic} = R_i \times p_c$, where $p_c$ is a "reference probability" (and may be estimated using data from another region), and $R_i$ is the relative risk associated with area $i$; in Section 4, we discuss the exact interpretation of $R_i$. Under this model the effect of being in area $i$ is common to all strata (which is equivalent to saying that there is no interaction between stratum and area). Most diseases are statistically rare and so we may approximate (1) by $Y_{ic} \mid R_i, p_c \sim \text{Poisson}(N_{ic} \times R_i \times p_c)$, from which it follows that

$$Y_i \mid R_i \sim \text{Poisson}(E_i \times R_i), \tag{2}$$

where the expected numbers $E_i$ are given by $E_i = \sum_{c} N_{ic}\, p_c$. If we knew the "true" probabilities $p_c$, and the rest of the model assumptions were correct, then we could assume independence between the counts in different areas. In practice, however, the relative risks are modeled as a function of observed covariates, and since we do not observe all relevant covariates we might expect that the relative risks in areas that are geographically close are positively correlated; shortly we will see how spatial dependence may be modeled by allowing the relative risks to depend on random effects with spatial structure. Spatial
dependence may also arise from data anomalies with spatial structure. Due to unmeasured variables, the marginal distribution of the disease counts will rarely be binomial.

A simple approach to ecological inference is to obtain the MLE from (2) as $\hat{R}_i = Y_i / E_i$, and regress $\hat{R}_i$ on $X_i$, an area-level measure of exposure, via an additive (linear) or multiplicative (loglinear) model. Cook and Pocock (1983) carry out such an analysis of cardiovascular mortality in British towns with a number of ecological variables, including measures of water quality, accounting for spatial dependence via a simple spatial error model.

The model (2) suffers from a number of difficulties. It is well known that the MLE $\hat{R}_i$ is highly variable for rare diseases and so high/low values for particular areas may just be a reflection of sampling variability. Also, environmental epidemiological data often display overdispersion, that is, the variance exceeds the mean. As just briefly touched on, and described in greater detail in Section 3, this variability may be due to unmeasured risk factors with or without spatial dependence, as well as within-area variability in exposures/confounders, and data anomalies. Both imprecision due to small numbers, and overdispersion, may be addressed via the introduction of random effects. Clayton et al. (1993) extended the disease mapping model of Besag et al. (1991) to give the ecological regression model

$$\log R_i = \beta_0 + \beta_1 X_i + T_i + S_i, \tag{3}$$

where $T_i \mid \sigma_t^2 \sim N(0, \sigma_t^2)$ denote unstructured (independent) random effects, and $S_i$ random effects with spatial structure. Besag et al. (1991) modeled the latter using the intrinsic conditional autoregressive (ICAR) model in which one considers the conditional distribution of $S_i \mid S_j, j \in \partial i$, where $\partial i$ represents the indices of a set of "neighboring" areas. Specifically, $S_i \mid S_j, j \in \partial i \sim N(\bar{S}_i, \sigma_s^2 / m_i)$, where $\bar{S}_i = \frac{1}{m_i} \sum_{j \in \partial i} S_j$ and $m_i$ denotes the number of neighbors. This model is computationally straightforward to implement and is non-stationary. The latter is appealing in terms of flexibility, though the level of non-stationarity, in terms of the set of risks that may be represented, has not been investigated. For example, it is unlikely that discontinuities due to geographical features such as rivers or mountains could be well modeled with a normal model, though a Laplacian model may be more amenable to such features (Besag et al., 1991; Best et al., 1999). If discontinuities of this type are expected then appropriate covariates may be incorporated in (3), or the neighborhood structure may be altered to have distinct spatial models in different regions separated by physical barriers. An overall aim is always to include sufficient covariates to remove the residual variability.

The model does not consider the positions, sizes and shapes of the areas, and the interpretation of $\sigma_s^2$, a conditional variance, is not straightforward unless each area has a constant number of neighbors. This makes prior elicitation more difficult, though the latter may be carried out by simulating sets of relative risks under different prior specifications for $\sigma_s^2$. The non-stationarity also makes model checking more troublesome since the joint model is improper (it defines a singular multivariate normal distribution with the rank of the distribution being $n - 1$), and so there is no distribution with which marginal random effect estimates may be compared (though the specification of one random effect, or the mean of the collection, yields a proper prior).

An alternative approach is to model the joint collection $S = (S_1, \ldots, S_n)'$ via a multivariate specification. An example that has been considered by, amongst others, Best et al. (1999) and Wakefield and Morris (1999) (as part of a sensitivity analysis), is
$$S \mid \Sigma_s \sim N_n(0_n, \Sigma_s),$$

where $0_n$ denotes an $n \times 1$ vector of zeros, and $(\Sigma_s)_{ij} = \sigma_s^2 \exp(-d_{ij} \phi)$, where $d_{ij}$ is the distance between the centroids of areas $i$ and $j$ (this model was also used by Cook and Pocock, 1983). This model has the advantage of simple interpretation of the parameters $\sigma_s^2$ and $\phi$, so that prior distributions are more straightforward to specify; $\sigma_s^2$ is a marginal variance and so is directly comparable to $\sigma_t^2$, and $\log 2 / \phi$ denotes the distance at which correlations fall to 0.5. We note that measuring the extent of spatial variability as a function of the total variability is not straightforward since $\phi$ needs to be considered also; one possibility is to consider the posterior distribution of $|\Sigma_s|$. The joint model also ignores the topology of the areas and is far more computationally intensive than the ICAR alternative. The stationarity assumption may also be restrictive, and the importance of the choice of correlation function also needs to be determined; typically in an aggregate study there is very little information available on the spatial aspect, and directly learning about correlations at very small or very large distances is not possible. For both the conditional and the joint models, realizations of $T_i$, $S_i$ and $\log R_i$ may be generated and examined to assess the level of variability, spatial and otherwise. Rather than model conditionally or jointly, it is more natural (though more computationally challenging) to construct a model from the underlying continuous risk surface, an approach we discuss in Section 4.
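The quantities defined in this subsection can be sketched in a few lines of Python. The sketch below is illustrative only: the function names, the three-area data set, the neighborhood structure and the parameter values are all hypothetical and are not taken from any of the studies cited. It computes the expected counts $E_i$, the estimates $\hat{R}_i = Y_i / E_i$, the ICAR conditional mean and variance for one area, and the exponential covariance of the joint model, and it checks that the correlation at distance $\log 2 / \phi$ is 0.5.

```python
import math


def expected_counts(pop_by_stratum, ref_prob):
    """E_i = sum over strata c of N_ic * p_c, for each area i."""
    return [sum(n * p for n, p in zip(row, ref_prob)) for row in pop_by_stratum]


def relative_risk_mle(cases, expected):
    """MLE under model (2): R_hat_i = Y_i / E_i."""
    return [y / e for y, e in zip(cases, expected)]


def icar_conditional(i, s, neighbours, sigma2_s):
    """Mean and variance of S_i given its neighbours under the ICAR model."""
    nb = neighbours[i]
    mean = sum(s[j] for j in nb) / len(nb)
    return mean, sigma2_s / len(nb)


def exponential_covariance(centroids, sigma2_s, phi):
    """Joint model: Sigma_ij = sigma2_s * exp(-phi * d_ij), d_ij Euclidean."""
    return [[sigma2_s * math.exp(-phi * math.dist(a, b)) for b in centroids]
            for a in centroids]


# Hypothetical data: 3 areas, 2 confounder strata.
pop = [[1000, 500], [2000, 1500], [800, 1200]]  # N_ic
ref_prob = [0.001, 0.004]                        # reference probabilities p_c
cases = [4, 6, 7]                                # observed counts Y_i

E = expected_counts(pop, ref_prob)
R_hat = relative_risk_mle(cases, E)

# Hypothetical neighbourhood structure and spatial random effects.
neighbours = {0: [1, 2], 1: [0], 2: [0]}
S = [0.2, -0.1, 0.4]
cond_mean, cond_var = icar_conditional(0, S, neighbours, sigma2_s=0.5)

# Two centroids placed exactly log(2)/phi apart.
phi = 0.1
half_distance = math.log(2) / phi
Sigma = exponential_covariance([(0.0, 0.0), (half_distance, 0.0)],
                               sigma2_s=0.5, phi=phi)
half_corr = Sigma[0][1] / Sigma[0][0]
```

Note that `cond_var` depends on the number of neighbors of the area, which is one way to see why $\sigma_s^2$ in the ICAR model is harder to interpret than the marginal variance in the joint model.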
2.2 Ecological bias

We briefly discuss pure specification bias in a very simple situation in which there are no confounders and we have $N_i$ individuals within area $i$, and $Y_i = \sum_{j=1}^{N_i} Y_{ij}$ represents the total number of cases. We assume that within area $i$, the distribution of exposures is given by $X \mid \phi_i \sim f_i(\cdot \mid \phi_i)$, where $\phi_i$ denotes a set of parameters that characterize the exposure. We also assume that for individual $j$ in area $i$ the risk model is given by $p(x_{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, N_i$, where we have suppressed the dependence on the parameters of the model. From this point onwards, $\beta$ will denote the "true" values of the parameters of the risk model, and $\beta^\star$ the parameters of an assumed model. Depending on the form of the individual-level model and the within-area distribution, the latter may or may not correspond to the true values. Then,

$$Y_{ij} \mid \beta, \phi_i \sim \text{Bernoulli}\{p(\phi_i)\},$$

where

$$p(\phi_i) = E_{X \mid \phi_i}\{p(X)\} = \int p(x)\, f_i(x \mid \phi_i)\, dx, \tag{4}$$

is the average risk. If exposures are independent within areas then for a rare and non-infectious disease

$$Y_i \mid \beta, \phi_i \sim \text{Poisson}\{N_i\, p(\phi_i)\}.$$
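The average risk in (4) is an expectation over the within-area exposure distribution, so it can be approximated by simulation. The following minimal sketch does this under assumptions that are entirely hypothetical: a gamma within-area exposure distribution with mean 2 and variance 1, and two illustrative risk functions (one additive, one loglinear) with made-up coefficients. It compares the average risk with the risk evaluated at the mean exposure.

```python
import math
import random


def average_risk(risk, draw_exposure, n_draws=100_000, seed=1):
    """Monte Carlo approximation to (4): the mean of p(X) for X drawn from f_i."""
    rng = random.Random(seed)
    return sum(risk(draw_exposure(rng)) for _ in range(n_draws)) / n_draws


# Hypothetical within-area exposure distribution: gamma, mean 2, variance 1.
SHAPE, SCALE = 4.0, 0.5
mean_exposure = SHAPE * SCALE


def draw_exposure(rng):
    return rng.gammavariate(SHAPE, SCALE)


def additive_risk(x):
    """Illustrative additive (linear) individual-level risk model."""
    return 0.001 + 0.0005 * x


def loglinear_risk(x):
    """Illustrative loglinear individual-level risk model."""
    return math.exp(math.log(0.001) + 0.5 * x)


# Ratio of the average risk to the risk evaluated at the mean exposure.
ratio_additive = (average_risk(additive_risk, draw_exposure)
                  / additive_risk(mean_exposure))
ratio_loglinear = (average_risk(loglinear_risk, draw_exposure)
                   / loglinear_risk(mean_exposure))
```

Under these assumed settings `ratio_additive` should be close to 1, whereas `ratio_loglinear` exceeds 1, because averaging a convex function of the exposure gives more than the function evaluated at the average exposure.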
For the additive model $p(X) = \beta_0 + \beta_1 X$ no bias arises since $p(\phi_i) = \beta_0 + \beta_1 \bar{X}_i$, where $\bar{X}_i$ denotes the mean exposure, and hence the aggregate form is the same as the individual form and we have no pure specification bias. The latter also occurs if there is no within-area variability in the exposure. Now suppose that within area $i$ the exposure is normally distributed, that is, $X \mid \bar{X}_i, \sigma_i^2 \sim N(\bar{X}_i, \sigma_i^2)$. For the loglinear model $E[Y_{ij} \mid \beta_0, \beta_1] = \exp(\beta_0 + \beta_1 X_{ij})$, we obtain the aggregate mean

$$E[Y_i \mid \beta_0, \beta_1] = N_i \exp\left( \beta_0 + \beta_1 \bar{X}_i + \frac{\beta_1^2 \sigma_i^2}{2} \right), \tag{5}$$

(Richardson et al., 1987) and bias will result if we fit the naive model $E[Y_i \mid \beta_0, \beta_1] = N_i \exp(\beta_0 + \beta_1 \bar{X}_i)$, unless the variance is uncorrelated with the means across areas (in which case the final term of (5) is absorbed into the intercept). An understanding of pure specification bias in this scenario follows from (5) by observing that $\sigma_i^2 / 2$ is acting like an unmeasured variable which has a positive effect (since $\beta_1^2 > 0$) (Wakefield, 2003). Hence if the variances increase with the means (which would be the usual case for an environmental variable on its original scale) then if $\beta_1 > 0$ we will overestimate the true effect if we ignore the within-area variability. If $\beta_1 < 0$ then the estimate may change sign. For general within-area distributions of the exposure, higher order moments appear within the aggregate form, and the condition for no bias is that none of the higher moments are related to the means.

We now describe the aggregate data method of Prentice and Sheppard (1995). Let $X_i^{m_i} = (X_{i1}, \ldots, X_{i m_i})^{T}$ denote survey data collected within area $i$ on exposures/confounders for $m_i$ individuals; individual health/confounder/exposure data are not available and so we do not have an individual-level study. We have $Y_{ij} \mid X_i^{m_i} \sim \text{Bernoulli}\{p(X_i^{m_i})\}$ where the average risk is given by

$$p(X_i^{m_i}) = \frac{e^{\beta_0}}{m_i} \sum_{j=1}^{m_i} \exp(\beta_1 X_{ij}), \tag{6}$$

which is a Monte Carlo estimate of (4). The likelihood for $Y_i \mid X_i^{m_i}$ is no longer Poisson because the Bernoulli trials are not independent (since the $X_{ij}$ values are based on sampling without replacement). Prentice and Sheppard (1995) suggest an estimating functions approach to inference, and allow a single non-spatial random effect whose distribution is left unspecified.

The above may be extended to consider multivariate exposures and confounders. As a simple example suppose we have a single exposure and a single confounder, but we measure the area means $\bar{X}_i$ and $\bar{U}_i$. If we assume that within areas the exposure and confounder follow a bivariate normal distribution:

$$\begin{pmatrix} X_{ij} \\ U_{ij} \end{pmatrix} \sim N\left( \begin{pmatrix} \bar{X}_i \\ \bar{U}_i \end{pmatrix}, \begin{pmatrix} \Sigma_i^{x} & \Sigma_i^{xu} \\ \Sigma_i^{ux} & \Sigma_i^{u} \end{pmatrix} \right),$$

then
$$E[Y_{ij} \mid \bar{X}_i, \bar{U}_i] = \int\!\!\int \exp(\beta_0 + \beta_1 x + \beta_2 u)\, f_i(x, u)\, dx\, du,$$

and we obtain

$$E[Y_{ij} \mid \bar{X}_i, \bar{U}_i] = \exp\left\{ \beta_0 + \beta_1 \bar{X}_i + \beta_2 \bar{U}_i + \frac{\beta_1^2 \Sigma_i^{x} + \beta_2^2 \Sigma_i^{u} + 2 \beta_1 \beta_2 \Sigma_i^{xu}}{2} \right\}.$$

The terms $\exp(\beta_1^2 \Sigma_i^{x} / 2)$ and $\exp(\beta_2^2 \Sigma_i^{u} / 2)$ arise due to pure specification bias while the within-area confounding is responsible for the term $\exp(2 \beta_1 \beta_2 \Sigma_i^{xu} / 2)$. If the variances/covariances are independent of $\bar{X}_i$ then these terms will be absorbed into the intercept (so that a random effect has been induced), that is, we obtain

$$E[Y_{ij} \mid \bar{X}_i, \bar{U}_i] = \exp(\beta_{0i} + \beta_1 \bar{X}_i + \beta_2 \bar{U}_i),$$

and no bias in the estimation of $\beta_1$ will result from fitting the naive model in this special case.

If non-rare outcomes are modeled (for example, some respiratory diseases are relatively common) then a logistic model is more appropriate. We do not consider such models here, but further details may be found in Salway and Wakefield (2001).
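The direction of the pure specification bias implied by (5) can be checked with a small deterministic calculation. The sketch below is an illustration under stated assumptions rather than an analysis from the paper: the area means, the within-area variances and the coefficient values are hypothetical, and instead of fitting a Poisson likelihood it fits a least-squares line to the noise-free log aggregate rates from (5). If the within-area variance is $k \bar{X}_i$ for some constant $k$, the slope of that line is $\beta_1 + \beta_1^2 k / 2$.

```python
def aggregate_log_rate(beta0, beta1, mean_x, var_x):
    """Log of E[Y_i]/N_i from (5): beta0 + beta1*mean + beta1^2 * var / 2."""
    return beta0 + beta1 * mean_x + 0.5 * beta1 ** 2 * var_x


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx


def naive_slope(beta0, beta1, means, variances):
    """Slope obtained when log aggregate rates are regressed on area means only."""
    log_rates = [aggregate_log_rate(beta0, beta1, m, v)
                 for m, v in zip(means, variances)]
    return ols_slope(means, log_rates)


# Hypothetical area-level mean exposures.
means = [1.0, 2.0, 3.0, 4.0, 5.0]

# Variance unrelated to the mean: the extra term is absorbed into the intercept.
slope_const = naive_slope(-6.0, 0.4, means, [1.0] * len(means))

# Variance increasing with the mean (k = 0.5): the positive effect is overestimated.
slope_linked = naive_slope(-6.0, 0.4, means, [0.5 * m for m in means])

# Negative true effect with strongly mean-dependent variance (k = 8): sign change.
slope_flipped = naive_slope(-6.0, -0.3, means, [8.0 * m for m in means])
```

With these assumed values the algebra gives a naive slope of $0.4$ when the variance is constant, $0.4 + 0.4^2 \times 0.5 / 2 = 0.44$ when the variance is half the mean, and $-0.3 + 0.3^2 \times 8 / 2 = 0.06$ in the third case, so the estimated effect changes sign.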
3. Interpretation of random effects models

3.1 Overdispersion

In practice, one indicator of the presence of unmeasured variables is the extent of overdispersion. Quasi-likelihood provides a simple method for estimating the latter via specification of the first two moments only, at least when the covariance between responses in different areas is zero. Alternatively, random effects models, such as those described in Section 2, may be fitted; these are more flexible, but require more assumptions.

We begin by considering the simple situation in which there is no within-area variability in exposures and confounders and the "true" model is given by $Y_{ij} \mid X_i, Z_i \sim \text{Poisson}\{Z_i \exp(\beta_0 + \beta_1 X_i)\}$, where we suppress dependence on $\beta$ in the conditioning and $Z_i = \exp(\beta_2 U_i)$ (for example) summarizes a variable $U_i$ and its effect; we now suppose that the variable $U_i$ is unmeasured. We then obtain $E[Y_{ij} \mid X_i] = \exp(\beta_0 + \beta_1 X_i)\, E[Z_i \mid X_i]$, and

$$\text{var}(Y_{ij} \mid X_i) = \text{var}(E[Y_{ij} \mid X_i, Z_i]) + E[\text{var}(Y_{ij} \mid X_i, Z_i)] = E[Y_{ij} \mid X_i] \left( 1 + E[Y_{ij} \mid X_i] \times c_i \right), \tag{7}$$

where
$$c_i = \frac{\text{var}(Z_i \mid X_i)}{E[Z_i \mid X_i]^2}.$$

If $X_i$ and $Z_i$ are independent then there is no bias in estimation of $\beta_1$ since we have the marginal expectation

$$E[Y_{ij} \mid X_i] = \exp(\beta_0^\star + \beta_1 X_i), \tag{8}$$

where $\beta_0^\star = \beta_0 + \log E[Z_i]$. We now consider specific choices for the distribution of $Z$, in each case assuming that $X$ and $Z$ are independent, to show how these correspond to commonly used random effects models.

Suppose first that $Z_i \sim \text{Ga}(a, b)$ (with the gamma parameterization such that $E[Z_i] = a/b$). We then have (8) with $\beta_0^\star = \beta_0 + \log a - \log b$, and (7) with $c_i = 1/a$. In this case, the distribution of $Y \mid X$ is negative binomial. To summarize: the marginalization across $U$ via a gamma mixing distribution for $Z$ has modified the Poisson to a negative binomial distribution.

The choice $\log Z_i \sim N(\mu_z, \sigma_z^2)$, so that $Z_i$ is lognormal, leads to $\beta_0^\star = \beta_0 + \mu_z + \sigma_z^2 / 2$ and $c_i = \exp(\sigma_z^2) - 1$. In this case, the distribution of $Y \mid X$ is not available in closed form, but can be obtained via approximation if a likelihood approach is taken, or the complete model can be fitted in a straightforward fashion using Markov chain Monte Carlo in a Bayesian analysis. The specification
Uncorrected Proof
Zi j Xi *Gafeb0 b1 Xi mz b; eb0 b1 Xi bg
where mz EZi j Xi EZi , leads to (8) with b0 b0 log mz and var
Yi j Xi EYi j Xi 6k;
9
with k 1 1=b so that the variance is proportional to the mean (rather than being a quadratic function). The distribution of Yi j Xi is again negative binomial. This variance function was assumed by Diggle et al. (1997), within a quasi-likelihood inferential approach, as a pragmatic means of incorporating extra-Poisson variability in an environmental epidemiological setting. This model is not so easily interpretable from the random effects perspective, however, since it has a variance form given by var
Zi j Xi
exp
b0
mz ; b1 Xi b
so that the variance of the random effect depends on Xi . Note that Xi and Ui are not independent here. Although the above models do not correct for bias due to confounding, a large value of c or k does indicate there are unmeasured variables (and/or data anomalies, model misspeci®cation), some of which may be confounders, and indicate that caution should be exercised when interpreting observed associations, particularly if the estimated relative risks are small. Residuals may be examined to attempt to distinguish between the linear and quadratic models given by (7) and (9), though interpretation will be dif®cult unless there is large variability in the expected counts. Christiansen and Morris (1997) give an example in which the choice of mean±variance relationship makes a substantive difference to inference. McCullagh and Nelder (1989, pp. 198±99) also discuss these two choices of model. Wake®eld et al. (2000) brie¯y motivated the use of random effects in disease mapping;
here we expand upon their development, to determine whether they correspond to plausible data generating mechanisms. We show that when model (3) is fitted, one interpretation of the random effects is that they are accounting for unmeasured confounders, and for errors in the data. To be more explicit, suppose the ``true'' model is given by $Y_i \mid E_i, X_i, U_i \sim \text{Poisson}(E_i R_i)$, with
\[ \log R_i = \beta_0 + \beta_1 X_i + \beta_2 U_i. \]
Suppose we observe $\hat{E}_i$ where
\[ \log E_i = \log \hat{E}_i + T_i, \qquad T_i \mid \sigma_t^2 \sim_{iid} N(0, \sigma_t^2), \tag{10} \]
so that we have a Berkson errors-in-variables model (e.g., Carroll et al., 1995) for data anomalies in the denominator (e.g., migration, over/under-enumeration at census). Further assume that we do not observe the risk factor $U_i$ but we assume that $U_i \mid X_i \sim N(\mu_u, \sigma_u^2 \Sigma_u)$, $i = 1, \ldots, n$, where $\Sigma_u$ is an $n \times n$ correlation matrix. Note that we have assumed that $U_i$ and $X_i$ are independent. Under the above assumptions we now examine the consequences of modeling on the basis of observing $\{Y_i, X_i, \hat{E}_i\}$, $i = 1, \ldots, n$. We have
\[ E[Y_i \mid \hat{E}_i, X_i] = E\{ E[Y_i \mid E_i, U_i] \}, \]
where the outer expectation is with respect to the distribution of $E_i, U_i \mid \hat{E}_i, X_i$. Under the assumption that $U_i$ and $T_i$ are conditionally independent we obtain the marginal expectation
\[ E[Y_i \mid \hat{E}_i, X_i] = E\{ E_i \exp(\beta_0 + \beta_1 X_i + \beta_2 U_i) \} = E[E_i \mid \hat{E}_i] \exp(\beta_0 + \beta_1 X_i)\, E[\exp(\beta_2 U_i) \mid X_i] = \hat{E}_i \exp\left( \beta_0 + \beta_1 X_i + \beta_2 \mu_u + \frac{\sigma_t^2 + \beta_2^2 \sigma_u^2}{2} \right). \tag{11} \]
Note that although
\[ \text{cov}(Y_i, Y_{i'} \mid E_i, X_i, U_i, E_{i'}, X_{i'}, U_{i'}) = 0, \]
because of the dependence induced by the correlation between $U_i$ and $U_{i'}$ (as given by element $(i, i')$ of $\Sigma_u$), we have
\[ \text{cov}(Y_i, Y_{i'} \mid \hat{E}_i, X_i, \hat{E}_{i'}, X_{i'}) \neq 0. \]
The above model is equivalent to the model (3),
\[ E[Y_i \mid X_i, T_i, S_i] = \exp(\beta_0^\star + \beta_1^\star X_i + T_i + S_i), \]
that was described earlier (with a proper multivariate specification for $(S_1, \ldots, S_n)$), with
\[ \beta_0^\star = \beta_0 + \frac{\sigma_t^2}{2} + \mu_u \beta_2, \]
and $T_i \mid \sigma_t^2 \sim N(0, \sigma_t^2)$, $(S_1, \ldots, S_n)' \mid \Sigma_s \sim N(0, \Sigma_s)$ with $\Sigma_s = \beta_2^2 \Sigma_u \sigma_u^2$. We will have $\beta_1^\star = \beta_1$ in this case because $U_i$ and $X_i$ were assumed to be independent, but in general this will not be the case, and in general we cannot control for confounding using random effects.
Clayton et al. (1993) introduced spatial random effects in an ecological regression setting to produce appropriate standard errors, and to prevent ``confounding by location''. The latter is a risky enterprise, however, since although adding in terms should never cause bias in a loglinear model if the functional form is correct, the exposure–risk relationship may be distorted if there is measurement error in the exposure. In time series studies of the acute health effects of air pollution, generalized additive models are used to remove trends on a larger time scale, and these trends could be due to exposure from previous time periods. In a spatial setting the exposure is obtained locally, though exposures in other areas could be influential depending on an individual's movements.

In the above, we have assumed that the risk factors are constant within areas. The random effects could also be representing within-area variability in exposures/confounders. For example, in the model (5) we obtain
\[ Y_i \mid X_i \sim \text{Poisson}\left\{ N_i \exp\left( \beta_0 + \beta_1 X_i + \frac{\beta_1^2 \sigma_i^2}{2} \right) \right\}, \]
and, analogously to unmeasured confounders, random effects may be reflecting the term $\sigma_i^2/2$. So again the use of random effects models cannot hope to control for bias from within-area variability in exposures and confounders (which will arise when the variances/covariances depend on $X_i$).

We note that previously in this section we described models that were marginalized across unmeasured area-level variables. Now we are considering conditional models. In the former case, the response will usually be no longer Poisson (due to overdispersion), while in the latter the Poisson assumption may be reasonable. To summarize, random effects may be thought to be accommodating any or all of data anomalies, within-area variability in risk factors and unmeasured between-area risk factors. The spatial pattern of such quantities determines whether spatial or non-spatial random effects are dominant.
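The bias that arises when the within-area variance depends on the area mean can be made concrete. The sketch below evaluates the log mean rate implied by model (5) for a handful of areas, under the purely illustrative assumption that $\sigma_i^2 = a + b X_i$, and fits a least-squares line to it. Under that assumption the fitted slope is $\beta_1 + \beta_1^2 b / 2$ rather than $\beta_1$. The linear form for $\sigma_i^2$, the parameter values and the helper names are all hypothetical.

```python
def log_rate(x_mean, beta0, beta1, sigma2):
    """Log mean rate per person under model (5):
    beta0 + beta1 * x_mean + beta1^2 * sigma2 / 2."""
    return beta0 + beta1 * x_mean + 0.5 * beta1 ** 2 * sigma2

def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

beta0, beta1 = -3.0, 0.8
a, b = 0.2, 0.5  # hypothetical: sigma_i^2 = a + b * X_i
area_means = [0.5, 1.0, 1.5, 2.0, 2.5]
log_rates = [log_rate(x, beta0, beta1, a + b * x) for x in area_means]
naive_slope = least_squares_slope(area_means, log_rates)
print(naive_slope, beta1 + 0.5 * beta1 ** 2 * b)
```

An area-level random effect adds noise around this line but does not alter its slope, which is why it cannot remove this form of bias.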
3.2 Non-constant relative risks

We now consider the particular situation in which data from multiple point or line sources of pollution are available. Such a design is appealing since confounding is less likely over multiple sites, and problems of the lack of an a priori hypothesis are not present. A number of multiple-site studies have been carried out: for example, Elliott et al. (1996) considered 72 municipal incinerators in the United Kingdom, analyzing each separately via Stone's test (Stone, 1988) and combining the resultant p-values, and Dolk et al. (1998) examined 21 European landfill sites and modeled the relative risks from each as random effects. We examine a number of situations that motivate such models. We assume that we fit the model
\[ Y_i \mid \theta_i \sim \text{Poisson}\{ N_i \exp(\theta_i) \}, \tag{12} \]
for $i = 1, \ldots, n$, where $\exp(\theta_i)$ represents the relative risk associated with ``proximity'' (for example, within 2 km) to, and $N_i$ denotes the number of individuals close to, point/line-source $i$. For clarity we have ignored stratification variables; these may be considered by replacing $N_i$ by $E_i$ (after assuming that the exposure distribution is
constant across stratum, and a loglinear model without exposure-stratum interaction). At the second stage of the model, the log relative risks $\theta_i$ are assumed to arise from some distribution, with the usual choice being the normal. Three situations in which model (12) can arise are: (i) when the exposure distributions vary by area, (ii) when there are unmeasured variables that are associated with the response, and vary by area, and (iii) when there is effect modification by area.

Again it is beneficial to begin at the level of the individual. We begin by assuming the model $E[Y_{ij} \mid X_{ij}] = \exp(\beta_0 + \beta_1 X_{ij})$, where $X_{ij}$ represents the exposure of individual $j$ who is close to point source $i$, $j = 1, \ldots, N_i$; $i = 1, \ldots, n$. In scenario i suppose that the exposure distribution close to point source $i$ is $f_i(x \mid \phi_i)$, in which case the average risk close to the source is given by
\[ E[Y_{ij} \mid \text{area } i] = \int \exp(\beta_0 + \beta_1 x)\, f_i(x \mid \phi_i)\, dx. \]
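The integral above is easily evaluated numerically. As a minimal sketch, the code below uses a midpoint rule and, purely for illustration, takes the exposure distribution at each of two hypothetical sites to be uniform with the same mean but a different range. The uniform choice, the parameter values and the function names are assumptions rather than part of the paper.

```python
import math

def average_risk(beta0, beta1, density, lower, upper, n_steps=20_000):
    """Midpoint-rule approximation to the integral of
    exp(beta0 + beta1 * x) * f(x) over [lower, upper]."""
    width = (upper - lower) / n_steps
    total = 0.0
    for k in range(n_steps):
        x = lower + (k + 0.5) * width
        total += math.exp(beta0 + beta1 * x) * density(x) * width
    return total

def uniform_density(lower, upper):
    """Density of a uniform exposure distribution on [lower, upper]."""
    return lambda x: 1.0 / (upper - lower) if lower <= x <= upper else 0.0

beta0, beta1 = -4.0, 1.0
# Two hypothetical sites with the same mean exposure (2.0) but different spread.
narrow = average_risk(beta0, beta1, uniform_density(1.5, 2.5), 1.5, 2.5)
wide = average_risk(beta0, beta1, uniform_density(0.0, 4.0), 0.0, 4.0)
print(narrow, wide)
```

The site with the wider exposure distribution has the larger average risk even though the mean exposure and $\beta_1$ are identical, which is the between-site variability that the random effect in (12) would absorb.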
A particular example is $X_{ij} \mid X_i, \Sigma_i^x \sim N(X_i, \Sigma_i^x)$, in which case we obtain $\theta_i = \beta_0 + \beta_1 X_i + \beta_1^2 \Sigma_i^x / 2$. The response for each of the $N_i$ individuals is Bernoulli and so, under independence of the outcomes, the distribution of $Y_i$ is binomial with indices $N_i$ and $e^{\theta_i}$. For a rare disease we can take a Poisson approximation to the binomial to give (12). A constant relative risk is obtained only if the exposure distribution does not vary across sites. Hence in this scenario the random effect is accounting for between-site variability in exposure distributions.

In the second situation, we again suppose there is a constant effect $\beta_1$ but there is an additional predictor, $U$, with risk model $E[Y_{ij} \mid X_{ij}, U_{ij}] = \exp(\beta_0 + \beta_1 X_{ij} + \beta_2 U_{ij})$. Suppose also that the exposure/confounder distribution close to point source $i$ is given by $f_i(x, u) = f_i(u \mid x) f(x)$, so that the distribution of the exposure is constant across sites. We then have
\[ E[Y_{ij} \mid \text{area } i] = \int\!\!\int \exp(\beta_0 + \beta_1 x + \beta_2 u)\, f_i(u \mid x) f(x)\, dx\, du = \exp(\beta_0) \int \exp(\beta_1 x) \left\{ \int \exp(\beta_2 u)\, f_i(u \mid x)\, du \right\} f(x)\, dx. \]
If the covariate distribution is constant across areas (point sources) then we obtain constant risk, otherwise the risk changes by area. If $U$ is a confounder so that $f_i(u \mid x)$ depends on $x$, then we will also obtain bias in the estimate of the effect of the exposure due to proximity to a point source.

Finally, in scenario iii, the random effects model could be due to effect modification by area so that at the individual level the risk at site $i$ is given by $E[Y_{ij} \mid X_{ij}] = \exp(\beta_0 + \beta_{1i} X_{ij})$, $i = 1, \ldots, n$; $j = 1, \ldots, N_i$. Such modification could be due to an interaction between the effect of exposure and characteristics of the individuals in area $i$ (for example the socio-economic status of the individuals). We then obtain
\[ E[Y_{ij} \mid \text{area } i] = \exp(\beta_0) \int \exp(\beta_{1i} x)\, f(x \mid \phi)\, dx. \]
The plausibility that at least one of scenarios i, ii and iii holds indicates that it would be beneficial to incorporate random effects into the modeling of relative risks when multiple point/line-source data are available.
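Scenario ii can be illustrated with a small discrete example, in which the integrals over $x$ and $u$ become sums. In the sketch below the exposure distribution is common to both sites, while the conditional distribution of the additional predictor $U$ given $X$ differs between them. The binary coding of $X$ and $U$, the probabilities and the function name are hypothetical choices made only to show the mechanism.

```python
import math

def site_risk(beta0, beta1, beta2, exposure_dist, confounder_given_x):
    """Average risk at a site for discrete X and U.

    exposure_dist: {x: P(X = x)}, common to all sites.
    confounder_given_x: {x: {u: P(U = u | X = x)}}, specific to the site.
    """
    total = 0.0
    for x, p_x in exposure_dist.items():
        inner = sum(p_u * math.exp(beta2 * u)
                    for u, p_u in confounder_given_x[x].items())
        total += p_x * math.exp(beta1 * x) * inner
    return math.exp(beta0) * total

beta0, beta1, beta2 = -5.0, 0.4, 0.7
exposure = {0: 0.5, 1: 0.5}
site_a = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.8, 1: 0.2}}  # U independent of X
site_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # U more common among the exposed
risk_a = site_risk(beta0, beta1, beta2, exposure, site_a)
risk_b = site_risk(beta0, beta1, beta2, exposure, site_b)
print(risk_a, risk_b)
```

The two sites share $\beta_1$ and the exposure distribution, yet their average risks differ; at the second site the dependence of $U$ on $X$ would additionally bias the estimated exposure effect.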
4. Continuous risk surface modeling

The majority of the approaches to risk modeling with aggregate data have directly modeled the quantities $R_i$ in Equation (2). An alternative that has intuitive appeal is to consider the underlying relative risk surface, $R(s)$, where $s$ denotes spatial location. A variety of choices exist for the underlying continuous model; the difficulty lies in the aggregation step. Diggle (1990) originally described in detail Poisson point process models for non-aggregate data, Wolpert and Ickstadt (1998) and Best et al. (2001) gamma random field models, and Kelsall and Wakefield (2002) Gaussian random field models. Diggle et al. (1998) discuss Gaussian random field models in the context of point data. In the following, we assume that a person's risk is a function of the exposure associated with their residential location only, $X(s)$. We begin by assuming that cases follow a Poisson process with intensity
\[ \lambda(s) \times p(s), \]
where $\lambda(s)$ represents the population at risk at spatial location $s$ and $p(s)$ the risk at location $s$. For simplicity of notation, we assume there are no confounders. We assume that $p(s) = e^{\beta_0} \times R(s)$ where $e^{\beta_0}$ is the baseline risk and $R(s)$ is the relative risk at location $s$. For area $i$ we then have
\[ Y_i \mid R_i \sim \text{Poisson}(N_i \times e^{\beta_0} \times R_i), \]
where
\[ R_i = \int_{A_i} f_i(s) R(s)\, ds, \tag{13} \]
and $f_i(s) = \lambda(s)/N_i$ is a distribution representing the population density at location $s$ within area $i$, where $N_i$ is an estimate of $\int_{A_i} \lambda(s)\, ds$. The naive interpretation is that $R_i$ represents the relative risk of each of the individuals within area $i$. This is only true if $R(s) = R_i I(s \in A_i)$, that is, if there is constant relative risk, and to make this assumption would leave the analysis open to the possibility of ecological bias. The correct interpretation is that $R_i$ is the average relative risk with respect to $f_i(s)$, $s \in A_i$. As already discussed, ecological bias arises due to within-area variability in exposures/confounders. Hence the naive interpretation of $R_i$ will be less susceptible to ecological bias if the within-area variability in exposure is small, and information has been collected on confounders and adequate adjustment made. Studies with geographical areas as small as possible are clearly better equipped to produce such circumstances. ``Large'' relative risks are more indicative of a true association, since it will be difficult to make these disappear when sensitivity analyses with respect to within-area variability and confounding are carried out. Deciding
on how large ``large'' is, is difficult to characterize, but relative risks of 0.8–1.2 should be viewed with scepticism, unless the study is extremely well-controlled. In many cases, we may be able to model $f_i(s)$ as piecewise uniform, i.e., as
\[ f_i(s) = \sum_{k=1}^{K_i} I(s \in A_{ik}) \times f_{ik}, \]
where $K_i$ is the number of uniform subregions (and could be equal to $N_i$) and so
\[ \int_{A_i} f_i(s)\, ds = \sum_{k=1}^{K_i} f_{ik} |A_{ik}| = 1, \]
where $|A_{ik}|$ denotes the area of $A_{ik}$. For example, in the United Kingdom, if a study were carried out at the level of the enumeration district, then uniform population distributions could be taken over each postcode (with a geographical information system (GIS) being used to create the required boundaries). In this case, we obtain
\[ R_i = \sum_{k=1}^{K_i} R_{ik} f_{ik} |A_{ik}|, \]
where $R_{ik} |A_{ik}| = \int_{A_{ik}} R(s)\, ds$. In the limiting case of $K_i = N_i$ we obtain
\[ R_i = \frac{1}{N_i} \sum_{k=1}^{K_i} R_{ik}, \]
or, if $p_i$ represents the average risk,
\[ p_i = \frac{e^{\beta_0}}{N_i} \sum_{k=1}^{K_i} R_{ik}, \]
again providing a link with (4) and to (6) (the latter when $R_{ik} = \exp(\beta_1 X_{ik})$). This formulation shows that we have a deconvolution problem and to identify $R(s)$ we need to propose a model since there are an infinite number of possible collections $\{R_{i1}, \ldots, R_{iK_i}\}$ that result in any particular value of $R_i$. This again indicates the difficulty of ecological inference; the possibility of ecological bias in which individual and ecological inferences disagree should always be kept in mind.

As a simple example we assume that the only spatial variability in risk arises from an exposure $X(s)$ via a multiplicative exposure/risk model; in this case we have $R(s) = \exp\{\beta_1 X(s)\}$. Expression (13) may be thought of as representing the average risk corresponding to a hypothetically infinite population in area $i$, with population density $f_i(s)$. If we have a sample of $N_i$ individuals in area $i$ then we may define $f_i(\cdot)$ to be the uniform distribution with mass $N_i^{-1}$ at the locations $s_{ij}$, $j = 1, \ldots, N_i$; then again we obtain the discrete analog of (13),
\[ R_i = \frac{1}{N_i} \sum_{j=1}^{N_i} R(s_{ij}), \]
with $R(s_{ij})$ corresponding to the relative risk at location $s_{ij}$ at which there is exposure $x_{ij}$.
As an example, suppose that $X(s)$ is such that, conditional on the locations, we have $X_{ij} \mid X_i, \sigma_i^2 \sim N(X_i, \sigma_i^2)$; then evaluating (13) we obtain
\[ R_i = \exp\left( \beta_1 X_i + \frac{\beta_1^2 \sigma_i^2}{2} \right), \]
which is equivalent to Equation (5) in Section 2.2. The Poisson model is appropriate if the exposures are independent within areas (which is unlikely to be true for environmental exposures). Wakefield and Salway (2001) describe an estimating functions approach that may be used in the case of dependent exposures.

A number of authors have considered the modeling of disease risk in relation to a point source of pollution using a parametric exposure/risk model (e.g., Diggle, 1990; Diggle and Rowlingson, 1994; Lawson, 1993; Diggle et al., 1997; Wakefield and Morris, 2001). Lawson (1993) considers risk as a function of orientation and allows a non-monotonic distance–risk relationship; the remainder of the approaches model risk as a monotonic function of distance. All approaches ignore the aggregate version of the model, however. Diggle and Elliott (1995) have discussed the problems of aggregation in the situation of modeling risk in relation to a point source. We consider the model
\[ R(s) = 1 + \alpha \exp(-\beta d), \tag{14} \]
where $d = \| s - s_0 \|$ and $s_0$ is the location of a putative point source (Diggle, 1990). The ecological model
\[ R_i = 1 + \alpha \exp(-\beta d_i^{ave}), \tag{15} \]
where $d_i^{ave}$ represents the distance to the population-weighted centroid, is that which is often assumed. Then, from (14),
\[ R_i = 1 + \alpha \int_{D_i} f_i(d) \exp(-\beta d)\, dd, \]
where $f_i(d)$ represents the population density as a function of distance, and $D_i$ represents the range of distances in area $i$. If, for example, we assume $f_i(d)$ is uniform on $(d_i^{ave} - d_i', d_i^{ave} + d_i')$, then we obtain
\[ R_i = 1 + \alpha \exp(-\beta d_i^{ave}) \left\{ \frac{e^{\beta d_i'} - e^{-\beta d_i'}}{2 \beta d_i'} \right\}. \tag{16} \]
Model (16) will not necessarily be monotonic decreasing in $d_i^{ave}$, though areas of the same size and uniform density give the closest link between the point and ecological models. If information is available on the population density within area $i$ then (16) may be directly used. Although it is important to be aware of the inconsistency between (14) and (15), the latter was initially proposed as a model that could pick up broad trends in the risk surface and so its use is still merited, though estimated risk/distance functions should not be overinterpreted. Similarly, estimates for particular areas should be viewed with caution.

Stone (1988) and Bithell and Stone (1989) discuss a test of the hypothesis $H_0 : R_1 = R_2 = \cdots = R_n$ versus the alternative $H_1 : R_1 \geq R_2 \geq \cdots \geq R_n$ with at least one inequality holding strictly, and where the areas have been ordered so that $R_i$ represents the
relative risk in the area that is $i$-th closest to $s_0$. Again this approach is not consistent when the underlying population density is considered since monotonic $R(d)$ is not equivalent to monotonic $R_i$, though again there will be a small loss in power if the true risk surface is monotonic and the discretized version is not. If the areas have highly irregular shapes (stretching over large distances with the ends close and far from the point source) then the power may be more significantly reduced.

In this paper, the temporal behavior of exposures has not been considered. We now briefly discuss a space/time formulation and assume that cumulative exposure with a loglinear model is relevant. In general we have $Y_i \mid R_i \sim \text{Poisson}(N_i \times e^{\beta_0} \times R_i)$, where
\[ R_i = \int_{-T}^{T} \int_{A_i} \exp\{\beta_1 X(s, t)\}\, f_i(s, t)\, ds\, dt, \]
and $(-T, T)$ is the study period. Again we assume a hypothetical infinite population, here with space/time density $f_i(s, t)$, and with $f_i(s, t)$ constant over time (so that there is no migration). As a simple example suppose $X(s, t)$ is such that for individuals sampled at time $t$ within area $i$ the distribution of the exposures is given by $N(X_i + b_{xi} t, \sigma_i^2)$ so that the exposure follows a linear relationship over time (though the relationship is area-specific). Then
\[ R_i = \exp\left( \beta_1 X_i + \frac{\beta_1^2 \sigma_i^2}{2} \right) \times c_i, \]
where
\[ c_i = \frac{\exp(\beta_1 b_{xi} T) - \exp(-\beta_1 b_{xi} T)}{\beta_1 b_{xi}}, \]
and so the temporal aspect may be ignored (in the sense that the ``usual'' model is correct) if $b_{xi}$ is independent of $i$, i.e., if time trends are constant across areas. Hence the naive ecological regression in which the relative risk is regressed on $X_i$ implicitly assumes that any within-area variability in exposure is constant across areas, and that time trends are similar across areas. We note here the fundamental difference, in this idealized situation, between time and space: time is truly continuous while space is discrete since we have assumed each individual occupies one location only, at least in terms of their exposure. In reality, an individual passes through various exposure surfaces over time.

This brief discussion indicates that it is beneficial to have exposure/confounder data available both within areas and across time to gain an understanding of the spatio-temporal exposure/confounder surface. Shaddick and Wakefield (2001) carried out an analysis of daily monitored levels of four air pollutants across eight sites in London; the ultimate aim was to use the modeled levels within a study of the health effects of acute pollution. They found that for each pollutant the majority of the variability was across time and so modeling the spatial variability was of secondary importance. This is not likely to be the case in general, however, particularly for areas that are topologically and meteorologically heterogeneous.
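The time-trend factor $c_i$ from the space/time example above is the integral of $\exp(\beta_1 b_{xi} t)$ over the study period $(-T, T)$, which the short sketch below confirms numerically. It also shows that two areas with different exposure trends receive different factors, so the ``usual'' model would not hold for them. The slopes, $\beta_1$, $T$ and the function names are hypothetical, and the closed form requires a non-zero slope.

```python
import math

def trend_factor_closed_form(beta1, slope, half_period):
    """c_i = {exp(beta1 * b * T) - exp(-beta1 * b * T)} / (beta1 * b), for b != 0."""
    k = beta1 * slope
    return (math.exp(k * half_period) - math.exp(-k * half_period)) / k

def trend_factor_numeric(beta1, slope, half_period, n_steps=100_000):
    """Midpoint-rule integral of exp(beta1 * b * t) over (-T, T)."""
    width = 2.0 * half_period / n_steps
    total = 0.0
    for step in range(n_steps):
        t = -half_period + (step + 0.5) * width
        total += math.exp(beta1 * slope * t) * width
    return total

beta1, half_period = 0.3, 5.0
slopes = {"area_1": 0.1, "area_2": 0.4}  # hypothetical area-specific time trends
factors = {area: trend_factor_closed_form(beta1, b, half_period)
           for area, b in slopes.items()}
checks = {area: trend_factor_numeric(beta1, b, half_period)
          for area, b in slopes.items()}
print(factors, checks)
```

If both areas had the same slope the factors would be equal and could be absorbed into the intercept, which is the sense in which the temporal aspect may then be ignored.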
5. Ecological study design

There has been little consideration of design in an ecological setting; Plummer and Clayton (1996) and Sheppard et al. (1996) are two exceptions. Beyond the need for exposure variability across areas and a consideration of data quality issues, there is a need to understand the likely extent of within- and between-area confounding and the within-area variability in exposure (which unfortunately is likely to be larger if between-area contrasts are large). The fundamental difficulty of ecological inference is to characterize the within-area distribution of exposures and confounders. The collection of individual-level data is vital to aid in the alleviation of ecological bias. Richardson et al. (1987) proposed making parametric assumptions on exposures and confounders within areas (the spirit of which was followed in parts of Sections 3 and 4) while, as we briefly mentioned in Section 2.2, Prentice and Sheppard (1995) described an alternative strategy in which subsamples of exposures and confounders were obtained within areas. Both of these procedures require samples within areas, the latter explicitly, and the former implicitly in order to estimate the relevant moments (and to examine the appropriateness of the assumed distributional form). Prentice and Sheppard (1995) proposed their method in the context of the situation in which samples were routinely available, through surveys for example; hence random sampling was utilized. Here, we consider the situation in which an environmental epidemiological study is envisaged and individual-level data may be collected. Non-random sampling may be carried out to increase efficiency; for example, ``two-phase'' approaches have proved useful in a range of epidemiological contexts (White, 1982; Breslow and Cain, 1988; Breslow and Holubkov, 1997; Scott and Wild, 1997). Wakefield (2004) illustrated the efficiency gains that are possible with non-random sampling in a social science ecological context.

We briefly describe one possible two-phase design that may be potentially useful in an environmental epidemiological setting. As a context we describe a study that investigated death from myocardial infarction as a function of the water constituents magnesium, calcium, fluoride and lead, with known confounders age, gender and socio-economic status (Maheswaran et al., 1999). For each area we have the number of cases and the population at risk, by age and gender. Following Breslow and Chatterjee (1999), we let $j = 1, \ldots, J$ index the $J$ strata that consist of a stratification of the exposure variables and the confounders. For example, each of the $N$ individuals in the study region can be assigned a low/high value of each of the water constituents and socio-economic status, based on the area in which they live. Along with (say) ten age bands and gender we therefore have $2^3 \times 2 \times 10 \times 2 = 320 = J$ strata. Sampling within the $2J$ strata for cases and controls may then be carried out. Breslow and Cain (1988) advocate sampling numbers as equal as possible within each stratum for improved efficiency. A crucial assumption here is that once we have adjusted for confounders and exposures, the area is no longer important as a predictor; this will be inappropriate when, as will usually be the case, there is clustering of unmeasured risk factors within areas. To overcome this we are currently working on a hybrid two-stage, two-phase design in which the clustered sampling is accounted for. Korn and Graubard (1999) discuss multistage sampling in the context of health surveys. A difficulty with the overall approach is the retrospective
collection of the exposure data; diseases with short latencies (congenital malformations, for example) would therefore be more amenable to studies of this type. Studies investigating the link between air pollution and hospitalization may also be well-suited to this design.
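The stratification used in the two-phase design of this section can be sketched in a few lines. The code below enumerates the strata behind the count $2^3 \times 2 \times 10 \times 2 = 320$ and then spreads a second-phase sample as evenly as possible over a set of strata, capped by the number of individuals available in each, in the spirit of the equal-allocation advice attributed above to Breslow and Cain (1988). The factor names, the allocation routine and the example counts are hypothetical; this is not the authors' procedure.

```python
from itertools import product

# Hypothetical labels for the factors behind the count 2^3 x 2 x 10 x 2.
levels = {
    "constituent_1": ("low", "high"),
    "constituent_2": ("low", "high"),
    "constituent_3": ("low", "high"),
    "socio_economic_status": ("low", "high"),
    "age_band": tuple(range(10)),
    "gender": ("female", "male"),
}
strata = list(product(*levels.values()))
n_strata = len(strata)  # J

def balanced_allocation(available, total_sample):
    """Allocate total_sample units as equally as possible across strata,
    never exceeding the number available in a stratum."""
    allocation = [0] * len(available)
    remaining = min(total_sample, sum(available))
    while remaining > 0:
        open_strata = [j for j, n in enumerate(available) if allocation[j] < n]
        share = max(remaining // len(open_strata), 1)
        for j in open_strata:
            take = min(share, available[j] - allocation[j], remaining)
            allocation[j] += take
            remaining -= take
            if remaining == 0:
                break
    return allocation

# Made-up case counts in four strata and a second-phase sample of 40.
example = balanced_allocation([5, 50, 50, 2], 40)
print(n_strata, example)
```

Small strata are sampled exhaustively and the remainder is shared among the larger ones; the same routine would be applied separately to cases and to controls, giving the $2J$ sampling cells.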
6. Discussion

In this paper, we have described a number of the sources of bias in ecological studies. Although the modeling of residual spatial variability in risk is important, it will often be of secondary importance when compared to assessing/modeling the effect of unmeasured confounding and within-area variability and data anomalies. The existence and usual extent of data anomalies also indicate that robust methods are important in this setting. The discussion of Section 3 indicates that when ecological regression analyses are carried out, both spatial and non-spatial random effects should be included (though the exact form of the latter will often be of less importance), and if multiple point/line-sources are considered then a random effects approach is recommended. When multiple sites are considered the model (12) may be considered, or (15) with $\alpha$ and $\beta$, the parameters describing the risk–distance relationship, taken to be random effects; such a model was described by Wakefield and Morris (2001). Whether the quality of the data supports such a choice must be evaluated on a case-by-case basis. Whenever random effects are included, addressing the sensitivity of inference to the prior distributions, particularly on variance components, is vital. As with all random effects modeling, it is also important to try to determine sources of variability that reduce the need for the random effects; in Section 3 a number of potential sources were described. If ``small'' relative risks are envisaged then within-area samples of exposures/confounders are essential if any faith is to be placed in observed associations. In particular, an understanding of the spatial and temporal variability in exposures, and the role played by measurement error, is essential for choosing an appropriate statistical model.
Acknowledgments

The author would like to thank Ruth Salway for detailed comments on an earlier draft. Although the research described in this article has been funded in part by the United States Environmental Protection Agency through agreement CR825173-01-0 to the University of Washington, it has not been subjected to the Agency's required peer review and therefore does not necessarily reflect the views of the Agency and no official endorsement should be inferred.
References

Aylin, P., Bottle, A., Wakefield, J., Jarup, L., and Elliott, P. (2001) Proximity to coke works and hospital admissions for respiratory and cardiovascular disease in England and Wales. Thorax, 56, 228–33.
Bernardinelli, L., Pascutto, C., Montomoli, C., Komakec, J., Gilks, W., Songini, M., Fiorani, O., Lisa, A., Zie, G., Solinas, G., and Bottazzo, G.G. (1999) Bayesian analysis of ecological data for studying the association between insulin-dependent diabetes mellitus and malaria. In Statistics for the Environment 4: Statistical Aspects of Health and the Environment, V. Barnett, A. Stein and K.F. Turkman (eds), John Wiley and Sons, Chichester, pp. 29–47.
Besag, J., York, J., and Mollié, A. (1991) Bayesian image restoration with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–59.
Best, N.G., Arnold, R.A., Thomas, A., Waller, L.A., and Conlon, E.M. (1999) Bayesian models for spatially correlated disease and exposure data. In Bayesian Statistics 6, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith (eds), Oxford University Press, Oxford, pp. 131–56.
Best, N.G., Cockings, S., Bennett, J., Wakefield, J., and Elliott, P. (2001) Ecological regression analysis of environmental benzene exposure and childhood leukaemia: sensitivity to data inaccuracies, geographical scale and ecological bias. Journal of the Royal Statistical Society, Series A, 164, 155–74.
Best, N.G. and Wakefield, J.C. (1999) Accounting for inaccuracies in population counts and case registration in cancer mapping studies. Journal of the Royal Statistical Society, Series A, 162, 363–82.
Best, N.G., Ickstadt, K., and Wolpert, R.L. (2001) Spatial Poisson regression for health and exposure data measured at disparate spatial scales. Journal of the American Statistical Association, 95, 1076–88.
Bithell, J.F. and Stone, R.A. (1989) On statistical methods for analysing the geographical distribution of cancer cases near nuclear installations. Journal of Epidemiology and Community Health, 43, 79–85.
Blanchard, J.F., Bernstein, C.N., Wajda, A., and Rawsthorne, P. (2001) Small-area variations and sociodemographic correlates for the incidence of Crohn's disease and ulcerative colitis. American Journal of Epidemiology, 154, 328–35.
Blot, W.J., Morris, L.E., Stroube, R., Tagnon, I., and Fraumeni, J.F. (1980) Lung and laryngeal cancers in relation to shipyard employment in coastal Virginia. Journal of the National Cancer Institute, 65, 571–5.
Breslow, N.E. and Cain, K.C. (1988) Logistic regression for two-stage case-control data. Biometrika, 75, 11–20.
Breslow, N.E. and Chatterjee, N. (1999) Design and analysis of two-phase studies with binary outcome applied to Wilms tumor prognosis. Applied Statistics, 48, 457–68.
Breslow, N.E. and Holubkov, R. (1997) Maximum likelihood estimation of logistic regression parameters under two-phase outcome-dependent sampling. Journal of the Royal Statistical Society, Series B, 59, 447–61.
Carroll, R.J., Ruppert, D., and Stefanski, L.A. (1995) Measurement Error in Non-linear Models, Chapman and Hall.
Christiansen, C. and Morris, C. (1997) Hierarchical Poisson regression models. Journal of the American Statistical Association, 92, 618–32.
Clayton, D., Bernardinelli, L., and Montomoli, C. (1993) Spatial correlation in ecological analysis. International Journal of Epidemiology, 22, 1193–202.
Cook, D.G. and Pocock, S.J. (1983) Multiple regression in geographical mortality studies, with allowance for spatially correlated errors. Biometrics, 39, 361–71.
Cleave, N., Brown, P.J., and Payne, C.D. (1995) Methods for ecological inference: an evaluation. Journal of the Royal Statistical Society, Series A, 158, 55–75.
Davey Smith, G., Wentworth, D., Neaton, J.D., Stamler, R., and Stamler, J. (1996a) Socioeconomic differentials in mortality among men screened for Multiple Risk Factor Intervention Trial. I: white men. American Journal of Public Health, 86, 486–96.
Davey Smith, G., Wentworth, D., Neaton, J.D., Stamler, R., and Stamler, J. (1996b) Socioeconomic differentials in mortality among men screened for Multiple Risk Factor Intervention Trial. II: black men. American Journal of Public Health, 86, 497–504.
Diggle, P.J. (1990) A point process modeling approach to raised incidence of a rare phenomenon in the vicinity of a pre-specified point. Journal of the Royal Statistical Society, Series A, 153, 349–62.
Diggle, P.J. and Rowlingson, B.S. (1994) A conditional approach to point process modeling of raised incidence. Journal of the Royal Statistical Society, Series A, 157, 433–40.
Diggle, P.J. and Elliott, P. (1995) Statistical issues in the analysis of disease risk near point sources using individual or spatially aggregated data. Journal of Epidemiology and Community Health, 49, S20–7.
Diggle, P.J., Morris, S.E., Elliott, P., and Shaddick, G. (1997) Regression modeling of disease risk in relation to point sources. Journal of the Royal Statistical Society, Series A, 160, 491–505.
Diggle, P.J., Tawn, J.A., and Moyeed, R.A. (1998) Model-based geostatistics. Applied Statistics, 47, 299–350.
Dockery, D.W., Pope, C.A. III, Xu, X., Spengler, J.D., Ware, J.H., Fay, M.E., Ferris, B.G., and Speizer, F.E. (1993) Mortality risks of air pollution: a prospective cohort study. New England Journal of Medicine, 329, 1753–9.
Dolk, H., Elliott, P., Shaddick, G., Walls, P., and Thakrar, B. (1997a) Cancer incidence near radio and television transmitters in Great Britain: all high power transmitters. American Journal of Epidemiology, 145, 10–17.
Dolk, H., Shaddick, G., Walls, P., Grundy, C., Thakrar, B., Kleinschmidt, I., and Elliott, P. (1997b) Cancer incidence near radio and television transmitters in Great Britain: Sutton Coldfield transmitter. American Journal of Epidemiology, 145, 1–9.
Dolk, H., Thakrar, B., Walls, P., Landon, M., Grundy, C., Suez-Lloret, I., Wilkinson, P., and Elliott, P. (1999) Mortality among residents near cokeworks in Great Britain. Occupational and Environmental Medicine, 56, 34–40.
Dolk, H., Vrijheid, M., Armstrong, B., Abramsky, L., Bianche, F., Garne, E., Nelen, V., Robert, E., Scott, J.E.S., Stone, D., and Tenconi, R. (1998) Risk of congenital anomalies near hazardous-waste landfill sites in Europe: The EUROHAZCON study. Lancet, 352, 423–7.
Elliott, P., Hills, M., Beresford, J., Kleinschmidt, I., Jolley, D., Pattenden, S., Rodrigues, L., Westlake, A., and Rose, G. (1992a) Incidence of cancer of the larynx and lung near incinerators of waste solvents and oils in Great Britain. Lancet, 339, 854–8.
Elliott, P. and Wakefield, J.C. (1999) Small-area studies of environment and health. In Statistics for the Environment 4: Health and the Environment, V. Barnett, A. Stein, and K.F. Turkman (eds), John Wiley, New York, pp. 3–27.
Elliott, P., Westlake, A., Hills, M., Kleinschmidt, I., Rodrigues, L., McGale, P., Marshall, K., and Rose, G. (1992b) The Small Area Health Statistics Unit: a national facility for investigating health around point sources of environmental pollution in the United Kingdom. Journal of Epidemiology and Community Health, 46, 345–9.
Elliott, P., Shaddick, G., Kleinschmidt, I., Jolley, D., Walls, P., Beresford, J., and Grundy, C. (1996) Cancer incidence near municipal solid waste incinerators in Great Britain. British Journal of Cancer, 73, 702–10.
Elliott, P., Wakefield, J.C., Best, N.G., and Briggs, D.B. (2000) Spatial Epidemiology: Methods and Applications, Oxford University Press, Oxford.
Gelman, A., Park, D.K., Ansolabehere, S., Price, P.N., and Minnite, L.C. (2001) Models, assumptions and model checking in ecological regressions. Journal of the Royal Statistical Society, Series A, 164, 101–18.
Greenland, S. (1992) Divergent biases in ecologic and individual-level studies. Statistics in Medicine, 11, 1209–23.
Greenland, S. and Morgenstern, H. (1989) Ecological bias, confounding, and effect modification. International Journal of Epidemiology, 18, 269–74.
Greenland, S. and Robins, J. (1994) Ecological studies-biases, misconceptions and counterexamples. American Journal of Epidemiology, 139, 747–60.
Guest, A.M., Almgren, G., and Hussey, J.M. (1998) The ecology of race and socioeconomic distress: infant and working-age mortality in Chicago. Demography, 35, 23–34.
IARC (1997) Epstein-Barr virus and Kaposi's sarcoma herpesvirus/human herpesvirus 8. IARC Monograph on the Evaluation of Carcinogenic Risks to Humans, IARC Scientific Publication 70.
INTERSALT Cooperative Research Group (1988) INTERSALT: an international study of electrolyte excretion and blood pressure. Results for 24-hour urinary sodium and potassium excretion. British Medical Journal, 297, 319–28.
Kelsall, J.E. and Wakefield, J.C. (2002) Modeling spatial variability in disease risk. Journal of the American Statistical Association, 97, 692–701.
King, G. (1997) A Solution to the Ecological Inference Problem, Princeton University Press, Princeton, New Jersey.
Korn, E.L. and Graubard, B.I. (1999) Analysis of Health Surveys, John Wiley and Sons.
Krieger, N. (1990) Social class and the black/white crossover in the age-specific incidence of breast cancer: a study linking census-derived data to population-based registry records. American Journal of Epidemiology, 131, 804–14.
Krieger, N., Williams, D.R., and Moss, N.E. (1997) Measuring social class in U.S. public health research: concepts, methodologies and guidelines. Annual Review of Public Health, 18, 342–78.
Lawson, A.B. (1993) On the analysis of mortality events associated with a prespecified fixed point. Journal of the Royal Statistical Society, Series A, 156, 363–77.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, second edition, Chapman and Hall, London.
Maheswaran, R., Morris, S., Falconer, S., Grossinho, A., Perry, I., Wakefield, J., and Elliott, P. (1999) Magnesium in drinking water supplies and mortality from acute myocardial infarction in north west England. Heart, 82, 455–60.
Morgenstern, H. (1998) Ecologic studies. In Modern Epidemiology, K.J. Rothman and S. Greenland (eds), second edition, Lippincott-Raven, pp. 459–80.
Openshaw, S. (1984) The Modifiable Areal Unit Problem, CATMOG No. 38, Geo Books, Norwich.
Pekkanen, J. and Pearce, N. (2001) Environmental epidemiology: challenges and opportunities. Environmental Health Perspectives, 109, 1–5.
Piantadosi, S., Byar, D.P., and Green, S.B. (1988) The ecological fallacy. American Journal of Epidemiology, 127, 893–904.
Plummer, M. and Clayton, D. (1996) Estimation of population exposure in ecological studies (with discussion). Journal of the Royal Statistical Society, Series B, 58, 113–26.
Pope, A. and Dockery, D. (1996) Epidemiology of chronic health effects: cross-sectional studies. In Particles in Our Air: Concentrations and Health Effects, R. Wilson and J. Spengler (eds), Harvard University Press, Boston, pp. 149–67.
Prentice, R.L. and Sheppard, L. (1990) Dietary fat and cancer: consistency of the epidemiologic data, and disease prevention that may follow from a practical reduction in fat consumption. Cancer Causes Control, 1, 81–97.
Prentice, R.L. and Sheppard, L. (1995) Aggregate data studies of disease risk factors. Biometrika, 82, 113–25.
Riboli, E. (1992) Nutrition and cancer: background and rationale of the European Prospective Investigation into Cancer and Nutrition (EPIC). Ann. Onc., 3, 783–91.
Richardson, S. and Montfort, C. (2000) Ecological correlation studies. In Spatial Epidemiology: Methods and Applications, P. Elliott, J.C. Wakefield, N.G. Best, and D.B. Briggs (eds), Oxford University Press, Oxford, pp. 205–20.
Richardson, S., Stucker, I., and Hemon, D. (1987) Comparison of relative risks obtained in ecological and individual studies: some methodological considerations. International Journal of Epidemiology, 16, 111–20.
Robinson, W.S. (1950) Ecological correlations and the behavior of individuals. American Sociological Review, 15, 351–7.
Rosenbaum, P.R. and Rubin, D.B. (1984) Difficulties with regression analyses of age-adjusted rates. Biometrics, 40, 437–43.
Rothman, K.J. and Greenland, S. (1998) Modern Epidemiology, second edition, Lippincott-Raven.
Salway, R. and Wakefield, J.C. (2002) Sources of bias in ecological studies of non-rare events. Submitted for publication.
Sans, S., Elliott, P., Kleinschmidt, I., Shaddick, G., Pattenden, S., Walls, P., Grundy, C., and Dolk, H. (1995) Cancer incidence and mortality near the Baglan Bay petrochemical works, South Wales. Occupational and Environmental Medicine, 52, 217–24.
Scott, A.J. and Wild, C.J. (1997) Fitting regression models to case-control data by maximum likelihood. Biometrika, 84, 57–71.
Shaddick, G. and Wakefield, J. (2001) Modeling multivariate pollutants at multiple sites. Applied Statistics, 51, 351–72.
Sheppard, L., Prentice, R.L., and Rossing, M.A. (1996) Design considerations for estimation of exposure effects on disease risk, using aggregate data studies. Statistics in Medicine, 15, 1849–58.
Siemiatycki, J., Wacholder, S., Dewar, R., Cardis, E., Greenwood, C., and Richardson, L. (1988) Degree of confounding bias related to smoking, ethnic group, and socioeconomic status in estimates of the associations between occupation and cancer. Journal of Occupational Medicine, 30, 617–25.
Stone, R.A. (1988) Investigations of excess environmental risks around putative sources: Statistical problems and a proposed test. Statistics in Medicine, 7, 649–60.
Wakefield, J.C. (2003) Sensitivity analyses for ecological regression. Biometrics, 59, 9–17.
Wakefield, J.C. (2004) Ecological inference for 2 × 2 tables (with discussion). To appear in Journal of the Royal Statistical Society, Series A.
Wakefield, J.C., Best, N.G., and Waller, L.A. (2000) Bayesian approaches to disease mapping. In Spatial Epidemiology: Methods and Applications, P. Elliott, J.C. Wakefield, N.G. Best, and D. Briggs (eds), Oxford University Press, pp. 104–27.
Wakefield, J.C. and Morris, S.E. (1999) Spatial dependence and errors-in-variables in environmental epidemiology. In Bayesian Statistics 6, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith (eds), Oxford University Press, Oxford, pp. 657–84.
Wakefield, J.C. and Morris, S.E. (2001) The Bayesian modeling of disease risk in relation to a point source. Journal of the American Statistical Association, 96, 77–91.
Wakefield, J.C. and Salway, R. (2001) A statistical framework for ecological and aggregate studies. Journal of the Royal Statistical Society, Series A, 164, 119–37.
White, J.E. (1982) A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology, 115, 119–28.
Whitley, E. and Darby, S.C. (1999) Quantifying the risks from residential radon. In Statistics for the Environment 4: Statistical Aspects of Health and the Environment, V. Barnett, A. Stein, and K.F. Turkman (eds), John Wiley and Sons, Chichester, pp. 71–89.
Wilkinson, P., Thakrar, B., Shaddick, G., Stevenson, S., Pattenden, S., Landon, M., Grundy, C., and Elliott, P. (1997) Cancer incidence around the Pan Britannica Industries pesticide factory, Waltham Abbey. Occupational and Environmental Medicine, 54, 101–7.
Wilkinson, P., Thakrar, B., Walls, P., Landon, M., Falconer, S., Grundy, C., and Elliott, P. (1999) Lymphohaematopoietic malignancy around all industrial complexes that include major oil refineries in Great Britain. Occupational and Environmental Medicine, 56, 577–80.
Williams, D.R. and Collins, C. (1995) US socioeconomic and racial differences in health: patterns and explanations. Annual Review of Sociology, 21, 349–86.
Winn, D.M., Blot, W.J., Shy, C.M., Pickle, L.W., Toledo, A., and Fraumeni, J.F. (1981) Snuff dipping and oral cancer among women in the southern United States. New England Journal of Medicine, 304, 745–9.
Wolpert, R.L. and Ickstadt, K. (1998) Poisson/gamma random field models for spatial statistics. Biometrika, 85, 251–67.
Yasui, Y., Potter, J.D., Stanford, J.L., Rossing, M.A., Winget, M.D., Bronner, M., and Daling, J. (2001) Breast cancer risk and ``delayed'' primary Epstein-Barr virus infection. Cancer Epidemiology, Biomarkers and Prevention, 10, 9–16.
Biographical sketch
Dr Jon Wakefield is Professor in the Departments of Statistics and Biostatistics at the University of Washington. He received his bachelor's degree in 1985 and his Ph.D. in 1992, both from the University of Nottingham in the United Kingdom. His research interests are in spatial epidemiology, ecological inference and, more generally, in the modeling of medical data.