International Journal of Epidemiology © International Epidemiological Association 1997

Vol. 26, No. 3 Printed in Great Britain

Evaluating the Goodness of Fit in Models of Sparse Medical Data: A Simulation Approach

PAUL BOYLE,* ROBIN FLOWERDEW** AND ANDREW WILLIAMS†

Boyle P (School of Geography, University of Leeds, Leeds LS2 9JT, UK), Flowerdew R and Williams A. Evaluating the goodness of fit in models of sparse medical data: a simulation approach. International Journal of Epidemiology 1997; 26: 651–656.

Background. Epidemiological studies of rare events, which are common in the medical literature, often involve modelling sparse data sets. Assessing the fit of these models may be complicated by the large numbers of observed zeros in the data set.

Methods. Poisson models, fitted as generalized linear models, were used to investigate the referral patterns of patients suffering from end-stage renal failure in south west Wales. The usual method for assessing the goodness of fit is to compare the deviance with a χ² distribution with appropriate degrees of freedom. However, this test may be invalid when the data set is sparse, as the deviance values may be unusually low compared to the degrees of freedom. This would suggest that there is a problem with underdispersion when, in fact, the large numbers of zeros in the data set make the comparison with the χ² distribution unreliable. A simulation approach is advocated as an alternative method of assessing model fit in these situations.

Results. Three models are considered in detail here. The first modelled the total referrals in each of the 245 wards in the study area and included two explanatory variables. These observations were not unusually sparse and both the χ² goodness of fit test and the simulation methodology outlined here suggested that the model did not fit. The second model included the population ‘at risk’ as an offset and the model improved considerably. Both the χ² test and the simulation approach suggested that this model did fit. Finally, the data were disaggregated into five age groups providing 1225 observations and a very sparse data set. According to the χ² goodness of fit test, the deviance was very low, suggesting that the model was underdispersed. Using simulated data, it was shown that the deviance was not unusually low and that the model fitted the data reasonably well.

Conclusion. In cases where the data set being modelled is sparse, it is useful to test the goodness of fit of a Poisson model using a simulation approach, rather than relying on the χ² test.

Keywords: sparse data, Poisson regression models, simulation, end-stage renal failure

* School of Geography, University of Leeds, Leeds LS2 9JT, UK.
** Department of Geography, Lancaster University, Lancaster LA1 4YB, UK.
† Department of Nephrology, Morriston Hospital, Swansea SA6 6NL, UK.

MODELLING DISCRETE EVENTS

Medical studies often involve modelling disease outcomes or deaths where events are reported as aggregations of cases within some set of, often arbitrary, geographical units. Lovett et al.,¹ for example, considered geographical variations in mortality from ischaemic heart disease and drew attention to the problem of dealing with small numbers. Rather than measuring mortality as rates for the ‘at risk’ population they advocated a Poisson regression approach according to which the number of deaths has a Poisson distribution and is modelled as a function of several explanatory variables.

The sparsity of data sets of this type depends on a range of factors which may, or may not, be under the control of the researchers. The data may be provided for a set of predetermined geographical units, such as Local Health Authority Districts, and further disaggregation may be impossible. Alternatively, the original data may be provided at the individual level and aggregation is undertaken for confidentiality reasons or to allow comparison with other variables available for administrative zones, such as wards or enumeration districts (ED). The choice of appropriate units will depend on the availability of other relevant data and, perhaps, the problem of small numbers of cases. The disaggregation of the medical data into subgroups may also increase the sparsity of the data set. At the very least, it is often sensible to separate the data by age group, depending on the specific relationships being examined.


In this study Poisson regression was used to model the age-specific referral patterns of patients suffering from end-stage renal failure (ESRF) in south west Wales between April 1985 and March 1994. The count data were aggregated into wards, resulting in a very sparse data set, and the deviance values obtained from the generalized linear model suggested that the model was underdispersed. In reality, the large number of zero observations invalidated the use of the χ² goodness of fit statistic. A relatively straightforward simulation approach was implemented which allows the deviance figure to be interpreted more satisfactorily, and it is suggested that studies which model sparse data sets should consider methods of this type. This helps to prevent erroneous conclusions being drawn from models where the deviance value suggests that the models do not fit.

END-STAGE RENAL FAILURE IN SOUTH WEST WALES

In a previous analysis² the spatial distribution of 539 adult patients resident in Dyfed and West Glamorgan suffering from ESRF who were referred for renal replacement therapy between April 1985 and March 1994 was examined. The primary aim was to test whether there was a relationship between these referrals and distance from the renal unit and, if so, whether this relationship varied for patients in different age groups. The resources allocated to renal services within Wales have been increased significantly since 1985³ and the levels of treatment are known to be high in this study area compared with other parts of Britain. Despite the improvement in the renal services offered, there was still concern that the referral of patients was spatially biased. It was hypothesized that the referral of patients, particularly those in the older age groups, was unusually low from the more remote regions of the study area.

The catchment area for the renal unit at Morriston Hospital, Swansea encompasses Dyfed and West Glamorgan, with a small number of patients being referred from Powys (these were excluded from the analysis). It is a coastal area which is relatively self-contained and the problem of cross-border outflows of patients is negligible. The study area also contains a mixture of small urban centres and more remote rural areas providing a diverse set of residential environments for the analysis.

Table 1 disaggregates the patients into their respective age groups and also provides the total populations in each age group, which were extracted from the 1991 Census Local Base Statistics (LBS). As expected, the incidence of ESRF increases with age.⁴

TABLE 1 The age distribution of renal failure referrals and the total population, in south west Wales

| Age group | Patients | Population |
|-----------|----------|------------|
| 16–29     | 32       | 129 490    |
| 30–44     | 66       | 141 308    |
| 45–59     | 132      | 123 941    |
| 60–75     | 218      | 116 690    |
| 75+       | 91       | 54 205     |
| Total     | 539      | 565 634    |

The 539 cases were aggregated into the 245 wards within the two counties (the populations of the wards varied between 578 and 13 080 according to the 1991 British Census) and there were 67 wards with zero cases (27%). The maximum number of cases in any single ward was 20 and only 10 wards had 10 or more cases. As we would expect from the underlying population distribution a large proportion of these cases lived close to the renal unit.

The research hypothesis required that these 539 cases were separated into five broad age groups in order to examine whether there was evidence of a distance decay effect in the referral patterns of each age group. This produced a data set with 1225 observations (245 × 5), 882 (72%) of which were zero. In the previous analysis² a series of hypotheses concerning the distribution pattern of these referrals were tested. The optimum model included distance, which was negatively related to referrals, and the percentage of people resident in council housing, which was positively related to referrals, as explanatory variables.

For comparative purposes, three models were tested here. The first used the total number of cases in each ward as the dependent variable and did not control for the population at risk, or disaggregate by age. This provided 245 observations, only 67 (27%) of which were zero. The second included the log of population as an offset to control for the population at risk (245 observations); and the third is that used in Boyle et al.² which disaggregated the data into five age groups and included the log of age-specific populations as an offset (1225 observations). The third model was shown to fit the data significantly better than the first two, but the two ‘unsatisfactory’ models are useful for comparing the methods used to assess the model fit.

POISSON REGRESSION MODELLING

The Poisson distribution is often appropriate for the analysis of count data and has been adopted in various medical studies. This distribution has a variance equal to its mean and is the outcome of a probability process:

$$\Pr(k) = \frac{e^{-\lambda}\lambda^{k}}{k!}$$

where Pr(k) is the probability of k occurrences, e is a constant and λ is the mean number of occurrences. In this analysis a generalized linear modelling approach⁵ was adopted, which has the structure:

$$\hat{\lambda}_i = f\left(\sum_{j=1}^{J} \beta_j X_{ij}\right)$$

where $X_{ij}$ represents the value of the jth of J explanatory variables X for the ith observation of I areal units. The parameter $\lambda_i$ is estimated to be equal to a function f of a linear combination of explanatory variables, and the observed data are a realization of the process with parameter $\lambda_i$. The Poisson regression model can be written as:

$$\hat{\lambda}_i = \exp\left(\sum_{j=1}^{J} \beta_j X_{ij}\right)$$

where $\lambda_i$ has a Poisson distribution whose expected value is equal to $\exp\left(\sum_{j=1}^{J} \beta_j X_{ij}\right)$ and the explanatory variables ($X_{ij}$) may be continuous or categorical data, or some mixture of the two. A priori information may also be included in generalized linear models and, in order to control for the expected numbers of referrals in each ward, the population $n_i$ was included as an offset.⁶ Such a priori information is incorporated into the model by treating it as a covariate with a known parameter value of one and this can be expressed as:

$$\hat{\lambda}_i = \exp\left(\sum_{j=1}^{J} \beta_j X_{ij}\right)\exp\left(\beta_n \ln n_i\right)$$

where $\beta_n$ is always equal to one.
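The offset formulation can be illustrated with a short worked example. The sketch below evaluates the fitted mean using the model 3 estimates reported in Table 2; the ward itself (distance, council housing percentage and population) is invented purely for illustration, the units of distance are an assumption because they are not stated here, and the function name `expected_count` is hypothetical.

```python
from math import exp, log


def expected_count(beta, x, population):
    """Fitted Poisson mean with ln(population) as an offset (coefficient fixed at 1)."""
    linear_predictor = sum(b * v for b, v in zip(beta, x))
    return exp(linear_predictor) * exp(1.0 * log(population))


# Model 3 estimates from Table 2: intercept, ln(distance), % council housing.
beta_model3 = [-6.517, -0.2263, 0.001816]

# Hypothetical ward: all three values are invented for illustration only.
distance, council_pct, population = 10.0, 20.0, 5000
x = [1.0, log(distance), council_pct]

mu = expected_count(beta_model3, x, population)
rate = mu / population  # the offset turns the model into one for the per-person rate
print(f"expected referrals: {mu:.2f}; rate per 1000 residents: {1000 * rate:.3f}")
```

Because the coefficient on ln(population) is fixed at one, doubling the population of this hypothetical ward doubles the expected count while leaving the modelled rate unchanged.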

ASSESSING THE FIT

It is usual to assess the fit of such models using a likelihood ratio statistic known as the deviance. In the Poisson case this may be calculated as:

$$D = 2\sum_{i=1}^{I} \lambda_i \ln\left(\frac{\lambda_i}{\hat{\lambda}_i}\right)$$

where $\lambda_i$ is the observed number of cases and $\hat{\lambda}_i$ is the expected number of cases estimated by the model. The distribution of the deviance is asymptotically χ² when the sample size is large, allowing the goodness of fit of the model to be assessed by comparing the size of the deviance to the critical χ² value. The degrees of freedom are calculated as the number of observations minus the number of coefficients in the linear predictor. Normally a deviance which is close to the number of degrees of freedom suggests that the model is a good fit. If the deviance is much larger than the degrees of freedom, the model is said to be overdispersed. Overdispersion may arise if important explanatory variables are omitted from a model or if the process being modelled has a variance greater than the mean (in the Poisson distribution the mean and variance are equal); Congdon⁷ discusses ways of treating the case where the variance is larger than the mean. Underdispersion, where the deviance is less than the degrees of freedom, has received less attention; it might be regarded as implying that the variance is less than the mean, but recent work⁸,⁹ has shown that low deviance values may also be obtained when large sparse data sets are being modelled, invalidating the asymptotic convergence of the Poisson deviance to the χ² distribution.

An alternative method of interpreting the model deviance is to estimate what the deviance value should be for a sparse data set if the model fitted the data well, and this is possible using simulations of the data. The fitted values which are derived from the original Poisson model may be regarded as the means of a set of Poisson random variables and, assuming that these fitted values are correct, random numbers for each observation can be generated and compared with the Poisson cumulative distribution function to provide simulated data. A new set of fitted values may then be estimated by fitting a Poisson model to these simulated data and the deviance of this new model may be calculated by comparing the simulated data, which are now treated as the observed values, with the new set of fitted values. Because the data have been produced according to a known model, the deviance is approximately what we would expect if a correct model were fitted to the original, sparse data set. If the observed model deviance lies within the middle 95% of the simulated distribution, it is reasonable to accept the model at the 0.05 significance level. The application of this approach to the sparse data set analysed in this study is described below.
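The simulation procedure can be sketched in a few dozen lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the data are invented, the model is fitted by a plain Newton-Raphson routine written for this example, and every function name is hypothetical. The deviance includes the extra term (y − μ), which sums to zero at the fitted values of any model containing an intercept, so it matches the expression for D given in the text.

```python
import math
import random
import statistics


def poisson_draw(mean, rng):
    """Inverse-CDF draw: compare one uniform number with the Poisson CDF."""
    u = rng.random()
    k, p = 0, math.exp(-mean)      # p = Pr(0)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= mean / k              # Pr(k) obtained from Pr(k - 1)
        cdf += p
    return k


def poisson_deviance(y, mu):
    """2 * sum(y ln(y/mu) - (y - mu)); zero counts contribute only mu."""
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi > 0:
            total += yi * math.log(yi / mi)
        total -= yi - mi
    return 2.0 * total


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_poisson(X, y, offset, max_iter=50):
    """Newton-Raphson fit of a log-link Poisson model; column 0 of X must be 1."""
    p = len(X[0])
    start = math.log(max(sum(y), 0.5) / sum(math.exp(o) for o in offset))
    beta = [start] + [0.0] * (p - 1)
    for _ in range(max_iter):
        mu = [math.exp(o + sum(b * v for b, v in zip(beta, row)))
              for row, o in zip(X, offset)]
        grad = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
                for j in range(p)]
        hess = [[sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    mu = [math.exp(o + sum(b * v for b, v in zip(beta, row)))
          for row, o in zip(X, offset)]
    return beta, mu


def simulate_deviances(X, offset, fitted_mu, n_sims, rng):
    """Draw data from the fitted means, refit, and collect the deviances."""
    out = []
    for _ in range(n_sims):
        y_sim = [poisson_draw(m, rng) for m in fitted_mu]
        _, mu_sim = fit_poisson(X, y_sim, offset)
        out.append(poisson_deviance(y_sim, mu_sim))
    return out


# Invented sparse data: 200 units, one covariate, population offset.
rng = random.Random(1997)
n_units = 200
X = [[1.0, rng.uniform(0.0, 3.0)] for _ in range(n_units)]
offset = [math.log(rng.randint(500, 5000)) for _ in range(n_units)]
true_beta = [-9.0, -0.3]
y = [poisson_draw(math.exp(o + true_beta[0] + true_beta[1] * row[1]), rng)
     for row, o in zip(X, offset)]

beta_hat, mu_hat = fit_poisson(X, y, offset)
observed = poisson_deviance(y, mu_hat)
sims = sorted(simulate_deviances(X, offset, mu_hat, 200, rng))
lower, upper = sims[int(0.025 * len(sims))], sims[int(0.975 * len(sims)) - 1]
df = n_units - len(beta_hat)
print(f"deviance {observed:.1f} on {df} df; simulated mean "
      f"{statistics.mean(sims):.1f}; middle 95%: {lower:.1f} to {upper:.1f}")
print("accept at 0.05 level:", lower <= observed <= upper)
```

The inverse-CDF draw mirrors the description above of comparing random numbers with the Poisson cumulative distribution function. In this invented example the fitted means are mostly well below one, so the simulated deviances centre below the degrees of freedom, which is the behaviour the simulation approach is designed to reveal.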

INTERPRETING MODEL FITS

A range of Poisson models of increasing complexity is described below. These are used to show that the interpretation of the deviance becomes more complicated as the sparsity of the data set increases, and data simulations are used to provide an alternative method of assessing model fit.

TABLE 2 Parameters and standard errors (models 1–3)

| Variable           | Model 1 Parameter | Model 1 Standard Error | Model 2 Parameter | Model 2 Standard Error | Model 3 Parameter | Model 3 Standard Error |
|--------------------|-------------------|------------------------|-------------------|------------------------|-------------------|------------------------|
| Intercept          | 0.7885            | 0.04295                | 2.314             | 0.1475                 | −6.517            | 0.1365                 |
| $\ln d_{ij}$       |                   |                        | −0.6327           | 0.04538                | −0.2263           | 0.04224                |
| $C_i$              |                   |                        | 0.003568          | 0.000547               | 0.001816          | 0.0005267              |
| Deviance           | 742.8             |                        | 451.75            |                        | 247.06            |                        |
| Degrees of freedom | 244               |                        | 242               |                        | 242               |                        |

$\ln d_{ij}$ = the natural logarithm of distance from each ward to the renal unit. $C_i$ = the percentage of total residents in council housing in each ward.

Initially, the total referrals from each ward were modelled without accounting for age differences or the population at risk. This establishes the usefulness of the deviance measure when the data set is not especially sparse, even though the models may not be the best to test the study hypotheses. The null model is a baseline model where a single intercept is fitted and each of the fitted values is equal to the mean observed count. This produced a deviance of 742.8 with 244 degrees of freedom (model 1, Table 2), where the deviance is calculated as defined above. In the previous analysis,² two variables (the natural logarithm of distance from the renal unit and the percentage of residents in council housing) were found to be important explanatory variables and these were included in the model used here. Their inclusion reduced the deviance to 451.75 (model 2, Table 2) with 242 degrees of freedom and both of these variables were significant and in the hypothesized directions.

This deviance was higher than the upper critical χ² value of 279.0 at the 95% level and consequently this model did not appear to fit the data. As mentioned above, according to the conventional χ² test the deviance will approximately equal the degrees of freedom if the model fits. The data set used in this model was not especially sparse and consequently comparing the deviance value to the critical χ² value is valid. For the purposes of this analysis, however, it is useful to confirm this finding by simulating the data. Figure 1 provides a histogram of 1000 simulated deviance values, and a comparison of the model deviance (451.75) with the simulated mean of 270.4 supports the conclusion that the model does not fit the data because the figures are so different; the simulated average value was much closer to the number of degrees of freedom (242).

FIGURE 1 Deviances from 1000 simulations (model 2)
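The critical values quoted in this section (279.0 and 206.7 on 242 degrees of freedom; 1292 and 1130 on 1210) can be checked with a short calculation. The sketch below assumes that they are the 5th and 95th percentiles of the χ² distribution and uses Fisher's large-sample approximation; both are inferences from the numbers themselves rather than anything stated by the authors, and the function name `chi2_quantile_fisher` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile_fisher(p, df):
    """Fisher's large-df approximation: sqrt(2*chi2) ~ Normal(sqrt(2*df - 1), 1)."""
    z = NormalDist().inv_cdf(p)
    return 0.5 * (z + sqrt(2 * df - 1)) ** 2


# (degrees of freedom, model deviance) for models 2, 3 and 5 as quoted in the text.
for df, deviance in [(242, 451.75), (242, 247.06), (1210, 853.69)]:
    lower = chi2_quantile_fisher(0.05, df)
    upper = chi2_quantile_fisher(0.95, df)
    inside = lower <= deviance <= upper
    print(f"df={df}: lower={lower:.1f}, upper={upper:.1f}, "
          f"deviance={deviance}, inside={inside}")
```

Under these assumptions only the deviance of 247.06 falls between the two critical values, which agrees with the conclusions drawn from the χ² test in the text.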

This model was improved by including the natural logarithm of the ward population as an offset, which is equivalent to controlling for the expected number of cases, alongside the two explanatory variables (model 3, Table 2). The deviance was reduced to 247.06 with 242 degrees of freedom, suggesting that the model fits the data well, as the deviance was close to the number of degrees of freedom. The upper critical χ² value was 279.0 and the lower value was 206.7, and the deviance falls neatly between these two values. Figure 2 provides a histogram of the 1000 simulated deviances derived from the revised estimates from model 3. The mean value of 267.9 (with 242 degrees of freedom) was close to the actual model deviance and this confirms that the model fits the data well; this conclusion would be reached using either of the goodness of fit methods described here.

FIGURE 2 Deviances from 1000 simulations (model 3)

It was hypothesized that there would be significant differences in the referral patterns of those in different age groups and for this reason the model was extended by disaggregating the observations into five age groups (Table 1). This provides 1225 observations, five for each ward. Table 3 provides the resulting deviances and parameters. The initial null model deviance of 1564.3 (model 4, Table 3) was reduced to 1171.6 when the population offset (calculated separately for each age group within wards) was included. This result is suspicious, as the deviance of a model containing only an intercept and the offset is already lower than the number of degrees of freedom. This model was extended to include a five-level dummy variable which related to age group and the interactions between this variable and the natural logarithm of distance and the percentage of residents in council housing (model 5, Table 3). The final deviance fell to 853.69, which is far lower than the degrees of freedom. The critical χ² values were 1292 and 1130 and it is evident that the deviance has fallen well below the lower threshold. Using the χ² goodness of fit test, it would be reasonable to conclude that underdispersion was a problem.

Figure 3 provides a histogram of 1000 deviances simulated from the values estimated in model 5. The simulated deviances ranged from 805.4 to 1000.0, with a mean of 889.5 (the model has 1210 degrees of freedom). The average simulated deviance, which was close to the model deviance, was also smaller than the lower χ² value and this suggests that the unusually low model deviance resulted from the sparsity of the data set, rather than the process having a variance less than its mean. Based on the simulation results, the model appears to be a relatively good fit, and this is an example of where the sparsity of the data set results in potentially erroneous conclusions being drawn about model fit. It should be remembered of course, as with any goodness of fit test, that a model that fits the data need not necessarily be correct.
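The contrast between the two yardsticks can be made concrete by dividing each model deviance first by its degrees of freedom and then by the mean of its simulated deviances. The short sketch below uses only figures quoted in the text; the dictionary layout and variable names are illustrative choices.

```python
# Figures quoted in the text for the three fitted models.
results = {
    "model 2": {"deviance": 451.75, "df": 242, "sim_mean": 270.4},
    "model 3": {"deviance": 247.06, "df": 242, "sim_mean": 267.9},
    "model 5": {"deviance": 853.69, "df": 1210, "sim_mean": 889.5},
}

ratios = {}
for name, r in results.items():
    ratios[name] = (r["deviance"] / r["df"], r["deviance"] / r["sim_mean"])
    print(f"{name}: deviance/df = {ratios[name][0]:.2f}, "
          f"deviance/simulated mean = {ratios[name][1]:.2f}")

# Model 5 only: the text also reports the simulated extremes.
sim_min, sim_max = 805.4, 1000.0
print("model 5 deviance inside simulated range:",
      sim_min <= results["model 5"]["deviance"] <= sim_max)
```

Measured against the degrees of freedom, the ratios are about 1.87, 1.02 and 0.71, so model 5 looks underdispersed. Measured against the simulated means they are about 1.67, 0.92 and 0.96, which places models 3 and 5 close to one and leaves only model 2 clearly out of line.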

TABLE 3 Parameters and standard errors (models 4–5)

| Variable               | Model 4 Parameter | Model 4 Standard Error | Model 5 Parameter | Model 5 Standard Error |
|------------------------|-------------------|------------------------|-------------------|------------------------|
| Intercept              | −0.821            | 0.04302                | −7.537            | 0.519                  |
| age(2)                 |                   |                        | 0.2189            | 0.644                  |
| age(3)                 |                   |                        | 0.6674            | 0.5946                 |
| age(4)                 |                   |                        | 1.615             | 0.567                  |
| age(5)                 |                   |                        | 2.659             | 0.6051                 |
| age(1).$\ln d_{ij}$    |                   |                        | −0.382            | 0.1717                 |
| age(2).$\ln d_{ij}$    |                   |                        | −0.159            | 0.1161                 |
| age(3).$\ln d_{ij}$    |                   |                        | −0.0864           | 0.0851                 |
| age(4).$\ln d_{ij}$    |                   |                        | −0.2149           | 0.0695                 |
| age(5).$\ln d_{ij}$    |                   |                        | −0.6015           | 0.1091                 |
| age(1).$C_i$           |                   |                        | 0.0103            | 0.0097                 |
| age(2).$C_i$           |                   |                        | 0.0045            | 0.0081                 |
| age(3).$C_i$           |                   |                        | 0.0143            | 0.0053                 |
| age(4).$C_i$           |                   |                        | 0.0117            | 0.0041                 |
| age(5).$C_i$           |                   |                        | 0.0015            | 0.0073                 |
| Deviance               | 1564.3            |                        | 853.69            |                        |
| Degrees of freedom     | 1224              |                        | 1210              |                        |

FIGURE 3 Deviances from 1000 simulations (model 5)

CONCLUSION

A simulation method has been described which can be used to assess the fit of Poisson models in cases where the dependent variable is sparse. The resulting deviance from model 2 (Table 2), which used the smaller data set based on 245 observations, indicated that the model failed to fit, based on the usual χ² goodness of fit test. This was confirmed by the 1000 simulations, which had a mean value that was much lower than the model deviance. The deviance from model 3 (Table 2) was much lower and according to the χ² test the model now fitted the data, and once again this conclusion was confirmed using simulations. In these examples, the observed data were not especially sparse and both methods of measuring the goodness of fit provided reasonable assessments of the ability of these models.

The referrals were then disaggregated into five age groups, creating a very sparse data set. The deviance from model 5 was much smaller than both the degrees of freedom and the lower critical χ² value. If the data set being modelled were not sparse, it would be reasonable to conclude that underdispersion was a problem, but in this example the large number of zeros invalidates the χ² goodness of fit statistic. The simulation approach indicated that, in reality, the model fitted the data well. It is advisable that a method of this type is adopted to assess the fit of models when the data sets are sparse, if only to confirm conclusions that are based on comparisons of the deviance with the χ² distribution.

ACKNOWLEDGEMENTS

Peter Diggle originally suggested using a simulation approach when dealing with sparse data and Hana Kudlac helped supply the data. The Census data are Crown Copyright, were bought for academic use by the ESRC/JISC and are held at the Manchester Computer Centre.

REFERENCES

1 Lovett A, Bentham C G, Flowerdew R. Analysing geographic variations in mortality using Poisson regression: the example of ischaemic heart disease in England and Wales 1969–1973. Soc Sci Med 1986; 23: 935–43.
2 Boyle P J, Williams A J, Kudlac H. Geographical variation in the referral of patients with chronic end stage renal failure for replacement therapy. Q J Med 1996; 89: 151–57.
3 Barnes J N, Bloodworth L L O, Drew P J T et al. Letter to the editor. Br Med J 1992; 305: 1018.
4 Taube D H, Winder E A, Chisholm O S et al. Successful treatment of middle aged and elderly patients with end stage renal disease. Br Med J 1983; 286: 2018–20.
5 McCullagh P, Nelder J A. Generalised Linear Models, 2nd edition. London: Chapman and Hall, 1989.
6 Knudsen D C. Generalising Poisson regression: including a priori information using the method of offsets. Professional Geographer 1992; 44: 202–08.
7 Congdon P. Approaches to modelling overdispersion in the analysis of migration. Environ Planning A 1993; 25: 1481–1510.
8 Boyle P J, Flowerdew R. Modelling sparse interaction matrices: interward migration in Hereford and Worcester, and the underdispersion problem. Environ Planning A 1993; 25: 1201–09.
9 Flowerdew R, Boyle P J. Migration models incorporating interdependence of movers. Environ Planning A 1995; 27: 1493–1502.

(Revised version received September 1996)