LARGE-SCALE IMPUTATION FOR COMPLEX SURVEYS

David A. Marker, David R. Judkins, and Marianne Winglee

Westat

1. Introduction

Much of the recent research into imputation methodology has focused on developing optimal procedures for a single variable or set of variables, where the patterns of missingness and underlying distributions follow standard distributions. In contrast, it is frequently necessary to impute for many variables from a single survey, with an even larger set of potential covariates and complex covariance structures among the variables to be imputed. Further, the imputations need to be completed in a relatively short time frame within a constrained budget. The analyst also is unlikely to be able to anticipate all of the important analyses for which the imputed data are to be used. This often prevents analysts from being able to produce optimal imputations for each variable.

Instead, one tries to produce a set of imputed variables that minimizes the attenuation of key relationships, reduces nonresponse bias where possible, and satisfies the time and budgetary constraints. These complexities often require the selection of methods that use imputed values of one variable to impute others, possibly through iterative procedures. Both Bayesian and hot-deck procedures have been proposed for such situations. This chapter describes how three actual case studies addressed these issues when imputing for large-scale complex surveys. A number of strategies could be chosen in such situations. By providing examples from three U.S. government surveys, the chapter demonstrates how hot-deck imputation strategies can successfully be implemented in this setting. The three surveys are the 1994 National Employer Health Insurance Survey, the 1992-96 Medicare Current Beneficiaries Survey, and the 1991 Reading Literacy Study by the IEA (International Association for the Evaluation of Educational Achievement). In each survey it was necessary to develop imputation models in large data sets (tens of thousands of respondents) for dozens of variables, many of which were highly correlated with each other, requiring careful consideration of the sequence of imputation. Many of the variables had to be imputed subject to arithmetic constraints. Item nonresponse rates varied from under five percent to over 50 percent, following complex Swiss-cheese patterns.

After briefly discussing the complications in these datasets, the authors explain the strategies implemented to produce imputations that achieve many of the characteristics desired by optimal procedures.


1.1 Reasons to impute data for complex surveys

Large complex datasets typically contain large numbers of variables measured on even larger numbers of respondents. Such datasets are the logical result of surveys that attempt to understand the relationships among characteristics of the population of inference and multiple outcome measures. Such surveys are frequently conducted by or for government agencies, covering topics such as health, welfare, and education, among many others.

These data are expensive to collect, but once collected, they provide analysts with a wealth of analytic possibilities. The missing responses in the questionnaire items can be handled in one of two ways: they can be filled in by some form of imputation, or they can be left as missing with missing data codes assigned in the data files. The use of imputation to assign values for item nonresponse in large-scale surveys has a number of advantages (see, for example, Kalton, 1983). First, carefully implemented imputations can reduce the risk of bias in many survey estimates arising from missing data. Second, with data assigned by imputation, estimates can be calculated as if the dataset were complete. Thus, analyses are made easier to conduct and the results easier to present. By including the responses from partially complete cases, the power of statistical analyses for marginal means and totals is also increased. Third, the results obtained from different analyses will be consistent with one another, a feature that need not apply to the results obtained from an incomplete data set. (For example, estimates of total production will only equal the sum of the production totals by establishment size after size has been imputed for all establishments.) However, when analyzing an imputed data set, it needs to be recognized that the standard errors of the estimates are larger than those that would apply if there were no missing data (see Lee, Rancourt, and Särndal, Chapter XX of this book; Brick and Kalton, 1996; Rubin, 1987).

The alternative to imputation is to leave the task of compensating for the missing data to the data analysts. Analysts can then develop methods of handling the missing data to satisfy their specific analytic models and objectives (Little and Rubin, 1987). Such an approach may be preferable in some cases, but it is often impractical. Most analysts confronted with this task are likely to rely on the options available in software packages for handling missing data. Moreover, given the wide range of analyses that are conducted with a survey data set, it is unrealistic to believe that an efficient compensation procedure can be developed for each individual analysis, while it is possible to retain a core set of relationships when producing an imputed data set to be used by others. Finally, the data producers may have access to useful restricted information that can assist in their imputations.

1.2 Constraints limiting development of optimal imputation for each variable

Much of the research on imputation has concentrated on best methods for imputing for a single variate at a time. In large complex datasets, the situation is much harder, because the resulting data must satisfy multiple logical consistencies that are often intertwined. These relationships can take the form of one variable being the sum or ratio of others, or of a variable that, while not being the exact sum (or ratio) of others, should approximate that relationship. Also, in developing models to use in imputation, it is desirable to anticipate the main analyses that are planned for the imputed data and to try to avoid attenuating the covariances among the variables whose relationships are being investigated. By their very nature, large complex datasets are analyzed by many users over many years. It is impossible to anticipate all of the significant analyses that will be conducted by the analysts. It is only possible to work with those who designed the original study, to try to anticipate which relationships are most important to preserve accurately during the imputation process. With the number of variables collected in the dozens or even hundreds, it is often impractical to try to develop the best model for each variable. Rather, limits may be placed on the time and resources devoted to development of each model, with the goal of filling in as much missing data as possible, given finite resources. If the survey is not a repeated survey, there may be a lack of historical data on the relationships among the variables. In this situation, model development has to be based on the observed responses or on a priori beliefs. Not only is this more complicated, since the database used to develop the model contains the very biases that are hoped to be reduced, but the time between data collection and database production is often limited by the sponsors' eagerness to begin analyses.

1.3 Achievable goals of imputation

The goal of any imputation should be to provide a database containing complete cases, allowing for easy, consistent analyses. The resulting database should minimize bias from nonresponse in univariate analyses and attenuation of key multivariate relationships. Subgroup analyses should be consistent with marginal distributions. The greater the resources (statistical skill, knowledge of potential uses, budget, time, etc.) available, the better one will be able to achieve these goals. From the examples in this chapter, it is hoped that one will better understand how to address these goals with finite resources, while still trying to come as close as possible to these ideals.

2. Multivariate Approaches

2.1 Bayesian methods (Gibbs Sampling)

Bayesian methods of imputation experienced a period of exciting growth in the 1990s. This progress centered around Don Rubin, Rod Little, and Joe Schafer, and the field was thoroughly reviewed in Schafer (1997). The methods involve using Gibbs sampling and other Markov chain Monte Carlo (MCMC) methods to simulate the posterior distribution of the population given the data, a model, and a prior distribution. These Bayesian methods are a major advance over maximum likelihood methods such as the EM algorithm because they can be used with more general models than the multivariate normal. It is possible to use them to estimate the joint posterior for a mixed vector of categorical and continuous variables. Also, it is possible to reflect not just fixed population parameters but random parameters, thereby allowing clustering to be properly reflected in the imputation. Furthermore, Bayesian methods can be used to estimate variances on statistics based on partially imputed data in a way that reflects the uncertainty due to having to estimate the components of variance.

In common with model-based maximum likelihood methods, Bayesian methods allow for the preservation of a large number of low-order associations in the data. This is in contrast to hot-deck methods, which typically allow the preservation of only a small number of low-order associations. Hot-deck methods also preserve high-order associations for the variables that are used to define similarity or distance, but the preservation of a small number of high-order associations may frequently be less valuable than the preservation of a large number of low-order associations. Also in common with model-based maximum likelihood methods, Bayesian methods get around the chicken-versus-egg problem in non-nested nonresponse, first noted by David et al. (1983). This is a problem that has long bedeviled hot-deck imputation methods: any variable that has not yet been imputed may not be a good tool for defining similarity or distance, but omitting such variables from the hot-deck partitions leads to attenuation of association. By using iterative methods, both approaches extract more value out of cases with partially reported data than has traditionally been possible with hot-deck methods.

We note that these Bayesian methods are sometimes referred to as multiple imputation, but this is really a misnomer. The linkage between Bayesian imputation methods and multiple imputation is that multiple imputation is the natural tool for variance estimation given a dataset imputed by Bayesian methods. Since MCMC methods generate a continuous stream of plausible values for each variable and case requiring imputation, it is straightforward to retain several imputed values for each such variable-case combination. Armed with these multiple imputations from an MCMC-fitted model, the Bayesian imputer can then create point and interval estimates for parameters of interest (Rubin, 1996). Given this variety of very attractive features, the question arises of whether one should ever again consider using hot-deck imputation procedures. We believe that there are still reasons to consider hot-deck methods, particularly ones that have incorporated the iterative features of the other methods so that more value can be extracted from the cases with partial data. This is the subject of our next section.
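Before turning to those alternatives, it may help to state the combining step explicitly. The following are the standard multiple-imputation combining rules from Rubin (1987) for a scalar parameter; they apply to any set of m completed data sets and are not specific to a particular MCMC implementation. If the l-th completed data set yields point estimate Q_hat_l with estimated variance U_hat_l, then

    \bar{Q} = \frac{1}{m}\sum_{l=1}^{m}\hat{Q}_l, \qquad
    \bar{U} = \frac{1}{m}\sum_{l=1}^{m}\hat{U}_l, \qquad
    B = \frac{1}{m-1}\sum_{l=1}^{m}\bigl(\hat{Q}_l-\bar{Q}\bigr)^2, \qquad
    T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B .

Here Q-bar is the multiple-imputation point estimate and T is its total variance, which adds the between-imputation component B (inflated by 1/m) to the average within-imputation variance U-bar; interval estimates are then based on T with Rubin's reference distribution.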

2.2 Alternative methods

There are at least three reasons for using alternative methodologies, one practical and the others conceptual. The practical reason is that Bayesian methods can require significant computing capabilities. While computing power has been increasing impressively, it is still the case that many do not have access to the necessary high-speed computing hardware or the knowledge of how to use it to run complex iterative procedures. The development of standardized software is helping to break down this barrier, but it should be recognized that many data producers and users are completely stymied by this roadblock.

The first conceptual reason for using alternatives involves the reluctance to use complex modeling based on subjective prior distributions when producing data sets and estimates from otherwise design-based surveys. Using Bayesian models requires gaining agreement from sponsoring agencies and the main potential users on the appropriate prior distributions to use in generating the imputations. This is not difficult when a sole researcher plans, designs, carries out, and then directs the analyses from a survey. However, in many large-scale, government-sponsored surveys, these steps involve many different people, some of whom may not even be known at the time when imputations are required. Often, the government wants to produce a public-use data set that will be available to a wide range of users. These users may not be identified at the time of imputation, or they may have contradictory analytic plans. Recent research (Ghosh, 199x) has demonstrated that the use of improper prior distributions can lead to improper posterior distributions, thus necessitating the use of proper priors. The shape of such priors can significantly affect the resulting imputations. By avoiding the need for prior distributions, one gains robustness against the failure of a model that is of little direct concern to many data users.

The second conceptual concern deals with the shape of the assumed distribution for the variable being imputed. Existing Bayesian methods assume smooth standard distributions (e.g., NORM requires a multivariate normal distribution). Many variables do not have such distributions. Expenditure data often consist of many zeroes, with the remainder following some form of right-skewed distribution; the number of hours worked often has discrete lumps at multiples of 5; the amount paid for medical treatment also has lumps at common values such as $10. Environmental contamination samples frequently are a mixture of mostly values below detectable levels, with the others following a right-skewed distribution. These distributions are easily handled by hot-deck imputation methods but are difficult to fit into existing Bayesian procedures.

Alternative methods identify strong covariates for use in hot-deck or regression imputations. These covariates can be identified through examination of correlation matrices and patterns of missingness. Multivariate relationships can also be maintained by restricting donors to those with responses to an entire set of variables. Using a common donor for a set of variables can assure satisfaction of multivariate constraints across re-imputed variables, provided that the set is either jointly missing or jointly reported, or that partially reported vectors are overwritten. Logical or numeric constraints can also be retained through repeated imputation-edit cycles. Cyclical hot-deck imputation methods can also be used, which use initial imputations of one variable to improve the imputation of others. This process is repeated until some convergence criteria are met, similar to MCMC methods.

Despite the advantages of avoiding computer-intensive methods, prior distributions, and assumptions of standard, smooth distributions, hot-deck methods do have some disadvantages. Hot-deck methods preserve a set of high-order relationships, but they cannot maintain as many first-order relationships as Bayesian methods can. Hot-deck models are sometimes less transparent than Bayesian ones due to the need to make ad hoc decisions on cell collapsing to find appropriate donors. Finally, by categorizing continuous variables, hot-deck methods lose some of the explanatory power of these variables, although this can be minimized by increasing the number of categories.
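As a minimal sketch of the common-donor, within-cell hot deck that underlies the approaches described in this section, the following Python fragment may be helpful. The record structure, cell definitions, and variable names are purely illustrative and are not taken from any of the surveys discussed below; in particular, the cell-collapsing step noted in the comment is where most of the ad hoc judgment mentioned above would enter.

    import random
    from collections import defaultdict

    def common_donor_hot_deck(records, cell_vars, impute_vars, rng=random.Random(0)):
        """Fill missing values of impute_vars from a single donor per recipient.

        records     : list of dicts, one per respondent; missing items are None
        cell_vars   : covariates (assumed fully observed) defining the hot-deck cells
        impute_vars : related items taken together from one donor, so that the
                      donor's multivariate relationships among them carry over
        """
        donors = defaultdict(list)
        for r in records:
            if all(r[v] is not None for v in impute_vars):
                donors[tuple(r[v] for v in cell_vars)].append(r)

        for r in records:
            missing = [v for v in impute_vars if r[v] is None]
            if not missing:
                continue
            pool = donors.get(tuple(r[v] for v in cell_vars), [])
            if not pool:
                continue  # in practice, cells are collapsed until a donor is found
            donor = rng.choice(pool)
            for v in missing:
                r[v] = donor[v]
        return records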

3. Three Large-Scale Imputation Examples

Examples are provided from three complex surveys to demonstrate the issues faced in large-scale imputation. The imputations for all three surveys were done by Westat, but the peculiarities of each survey led to different decisions on how to impute for missing data. The three surveys are the National Employer Health Insurance Survey, the Medicare Current Beneficiaries Survey, and the Reading Literacy Study.

3.1 National Employer Health Insurance Survey (NEHIS)

3.1.1 Magnitude and complexity of problem

The 1994 National Employer Health Insurance Survey (NEHIS) was sponsored by three United States health agencies: the Health Care Financing Administration (HCFA), the Agency for Health Care Policy and Research (AHCPR), and the National Center for Health Statistics (NCHS). NEHIS collected information on the health insurance plans offered by 40,000 private-sector establishments and governments (collectively referred to as establishments), and on 50,000 health insurance plans offered by those private and public-sector respondents. More than 100 variables were collected for each establishment and each health plan. Fifty of these variables were selected for imputation. Since it was necessary to model these variables separately for the public and private sectors, and for fully-insured and self-funded health plans, this required almost 150 separate imputation models (Yansaneh et al., 1998). Each model had to evaluate dozens of potential covariates.

Ideally, covariates would be found that were highly correlated with the variable to be imputed and were present when the imputation variable was missing. Further complicating the effort was the fact that the data set was to be used by at least three government agencies and many additional unknown analysts. The anticipated uses ranged from modeling national accounts to estimating levels of health insurance coverage to understanding the types of coverage included by different health plans. Thus, it was impossible to gain agreement from sponsors on the few vital relationships that must not become attenuated through imputation. Best efforts would be needed to maintain many different relationships.

The item response rates for the 150 variables to be modeled varied from 99 percent to 25 percent, but in almost all cases the response rates were above 70 percent (see Table 1). Even though the government did not plan to publish estimates for the few variables with low response rates, it planned to use them in a variety of modeling efforts since no other source exists for this information. Imputed data based on low response rates were thought to be preferable to using the unimputed data for modeling purposes because the imputation models were likely to significantly reduce some sources of bias.

Table 1. Item response rates for imputation variables

Response rate    Percent of variables
95-100%          37
90-95%           21
85-90%           13
80-85%            8
75-80%            9
Below 75%        12

As with many complex surveys, variables were measured at different levels. For NEHIS, data were collected at the firm (corporation) level, the establishment level, the health insurance plan level, and the plan-within-establishment level. (Other examples of data collected at multiple levels include schools and students within schools, and hospitals and patients within hospitals.) It is important to retain this structure in the imputed data. If a variable is answered once for all establishments in a firm, then it should only be imputed once for a similar set of establishments when the firm did not respond. Further complicating the imputation, the data were subject to numerous logical consistency requirements.

These requirements range from situations where employer and employee contributions must add up to the total premium, to much more complex arrangements involving combinations of single- and family-coverage enrollments and contributions and total plan costs. This required frequent imputation-edit-reimputation cycling to achieve an imputed dataset that matched the cleanliness of the reported data.
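As a concrete illustration of the simplest kind of consistency requirement, a sum-to-total edit can be expressed as a small check. The field names below are hypothetical rather than the actual NEHIS items, and a small tolerance is assumed for rounded dollar amounts:

    def premium_edit_ok(record, tol=1.00):
        """Edit rule: employer plus employee contributions must add up to the total premium."""
        parts = (record["employer_contribution"], record["employee_contribution"])
        total = record["total_premium"]
        if total is None or any(p is None for p in parts):
            return True  # the rule can only be checked once every component is present
        return abs(sum(parts) - total) <= tol

An imputed value that makes such a rule fail is set back to missing and re-imputed, which is the cycling referred to above.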


3.1.2 Approach chosen

Ordering and Grouping Variables

Many of these consistency requirements could only be evaluated when the last of the variables involved was imputed. Thus, it was very important to determine an order of imputation that would maximize the available covariates at each step and simultaneously allow logical edits to be checked as soon as they could possibly be checked. To accomplish all of these tasks, the variables were broken up into chunks of related variables (e.g., enrollments in health plans). The chunks were put together into groups so that all chunks in a group could be imputed simultaneously, since a variable in one chunk would not be needed to check the imputation of a variable in another chunk in the same group. The groups were then ordered in a logical sequence to provide the maximum available covariates at each step. The basic groups were to first impute enrollments, then component costs, and finally total costs per enrollee. Figure 1 provides a simplified overview of this approach. The following factors should be taken into consideration in deciding on the order of imputation for a large complex data set when one wants to avoid large numbers of cycles (a minimal sketch of the dependency-based ordering described in factor 1 appears after the list of chunks below):

1) If one variable is used in the construction of a second variable, then the first variable should be imputed before the second.

2) The imputations should follow the logical sequence (if any) suggested by the patterns of missingness of the imputation variables, that is, the joint frequencies that identify sets of imputation variables that are missing together. For instance, if the first variable happens to be a strong covariate of the second, and is present in most cases where the second variable is missing, then the first variable should be imputed first.

3) Within groups, variables using deterministic imputation should be imputed before variables requiring stochastic imputation.

4) Within groups, decisions about the order of imputation need only be made for imputation variables that are very highly correlated with one another. The order of imputation is not crucial for variables that are not highly correlated.

5) If the best covariates are the same for a set of imputation variables within a group, and those variables are frequently all missing for the same cases, then those imputation variables should be imputed as a block.

The chunks into which the NEHIS imputation variables were partitioned covered the following data areas: fully-insured premiums; premium equivalents (self-funded plans); plan enrollments; plan costs; deductibles and co-payments; additional plan-level variables; and additional establishment-level variables.
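The dependency-based part of the ordering (factor 1 above) can be illustrated with a small sketch. The chunk names and dependencies below are hypothetical stand-ins for the NEHIS groups, in which enrollments precede component costs, which in turn precede total costs per enrollee:

    from graphlib import TopologicalSorter  # Python 3.9+

    # Hypothetical chunks; each chunk maps to the chunks it depends on.
    dependencies = {
        "enrollments": set(),
        "component_costs": {"enrollments"},
        "total_costs_per_enrollee": {"enrollments", "component_costs"},
    }

    imputation_order = list(TopologicalSorter(dependencies).static_order())
    print(imputation_order)
    # ['enrollments', 'component_costs', 'total_costs_per_enrollee']

For a real survey the dependency map would be derived from the definitions of constructed variables and the edit rules, and factors 2 through 5 would then refine the order within and across groups.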

Figure 1. Overview of NEHIS imputation. [Flowchart: (1) impute Group A, enrollments (chunks 1 and 2); (2) apply Group A edit checks, returning failures for re-imputation; (3) impute Group B, component costs (chunks 3, 4, and 5); (4) apply Group B edit checks, returning failures for re-imputation; (5) impute Group C, total costs per enrollee (chunk 6); (6) apply Group C edit checks, returning failures for re-imputation; done.]

The complex nature of the NEHIS data set had a major impact on the order in which the variables were imputed. For instance, health insurance plan costs are a function of plan enrollments and premiums. Therefore, enrollments and premiums were imputed before plan costs. Also, within the chunk consisting of premiums, the employer and employee contributions to the premiums for single coverage were found to be the most highly correlated covariates for the corresponding contributions and premiums for family coverage. Therefore, the single-coverage contributions and premiums were imputed first, and then used in the imputation of their family-coverage counterparts. Several variables within some chunks consisting of premiums and enrollments were imputed simultaneously as a block. Examples of such blocks of variables are the number of retirees under 65 and the number of retirees over 65 for all plans; the employer contributions to premiums and the premiums for single coverage for fully-insured plans in the public sector; and the number of enrollees and the number of enrollees with family coverage for private-sector plans.

The imputation process for cost variables for fully-insured plans provides an illustration of the importance of a thorough understanding of the structure of the data set to the formulation of an imputation strategy. For fully-insured plans, the plan cost variable of interest is the total annual premium. While the total annual premium was a question on the questionnaire, it should equal the sum of the annual premium estimated from monthly premiums and enrollments plus a noise factor. Adding the noise factor was deemed appropriate because the enrollment figures are for a point in time (end of plan year), while the total annual premium covers an entire year. The distribution of the noise factor was expected to be highly skewed and to contain extreme values, due primarily to retiree-only plans with large numbers of retirees and to plans with extremely large enrollments. Therefore, the imputation process started with an exploratory data analysis on the noise factor and various transformations of the factor. Two types of outliers were identified: unreasonable outliers, which arise from reported values that are clearly inconsistent with other reported data; and reasonable outliers, which have values that are consistent with the rest of the data but have other characteristics that render them undesirable as donors for imputation. In the case of unreasonable outliers, selected reported values, and other values derived from them, were deleted from the database prior to imputation. No modification of the data was made in the case of reasonable outliers, but the associated plans were excluded from the donor pool during imputation. The imputation process was further complicated by the fact that cost variables such as the total annual premium (cost) are likely to be highly correlated with the total number of enrollees (active and retiree), and that this relationship depends on whether or not a plan has enrollees, as well as on whether it is a retiree-only plan. Therefore the imputation of total annual premium was done separately for various subgroups of plans: single-service plans (e.g., dental or vision plans), major medical plans with no enrollees, retiree-only major medical plans, and all other major medical plans. Depending on the quality of their covariates, the total annual premium was either imputed directly or constructed by addition after the noise factor was imputed.

In planning any imputation process it is important to decide how consistent the imputed data should be. In NEHIS, the goal was to make the data set after imputation at least as good as the one before imputation in terms of allowable ranges and multivariate relationships between variables. For example, care was taken not to impute data values that were out of range, and to be sure that algebraic relationships among variables were preserved (such as one variable being the sum of three others). These consistency requirements frequently necessitated both an edit-impute cycle and an impute-construct cycle. The data set before imputation was edited, then missing values were imputed, and then the imputed data were edited again. Any values that failed edits and were set to missing were then re-imputed. Similarly, during the course of imputation, an impute-construct cycle was implemented for sets of variables with algebraic relationships that needed to be maintained; once a value was imputed, others could be logically constructed using that value. These constructed variables were then subject to their own edits.
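Schematically, the edit-impute and impute-construct cycles just described can be summarized as follows. The routines impute_group, construct_derived, and run_edits are placeholders for the survey-specific procedures, not actual NEHIS code:

    def impute_with_edits(data, groups, impute_group, construct_derived, run_edits,
                          max_passes=3):
        """Impute group by group, re-imputing any values that later fail edits."""
        for group in groups:                       # e.g., enrollments, component costs, ...
            for _ in range(max_passes):
                impute_group(data, group)          # fill missing items in this group
                construct_derived(data, group)     # build variables defined algebraically
                failures = run_edits(data, group)  # -> list of (record, item) pairs that fail
                if not failures:
                    break
                for record, item in failures:      # blank failed values and try again
                    record[item] = None
        return data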

Imputation Methods

To understand the extent and patterns of missingness in the NEHIS data set, frequency distributions of all imputation variables and covariates were constructed and examined. For each set of variables, an appropriate imputation strategy was selected. Alternative approaches were evaluated in terms of the quality of the imputations and the associated costs. Some approaches that may be sub-optimal were chosen because they kept the number of passes through the data to a minimum, thereby reducing both data processing costs and time while producing results that are essentially comparable to those produced by optimal but very expensive and time-consuming approaches. Other approaches were not considered for implementation in NEHIS for a variety of reasons; for instance, regression imputation was not used primarily because of the pervasive problem of missing data in the most highly correlated covariates. The cold-deck method was not used because of the lack of comparable past data on the same population. Logistic regression imputation was not used because of the relatively small number and the relatively low nonresponse rates of binary imputation variables in the NEHIS data set.

Variables missing only one or two percent of the time were deterministically imputed. Mean or modal imputation within cells was used, depending on whether the variable was continuous or categorical. Given the low rate of missing data, the resulting deflation of the variance was thought to be trivial compared to the savings in time and effort relative to stochastic imputation methods that require development of models. Variables with higher rates of missing data were generally imputed using the hot deck. Examination of bivariate correlations (with categorical variables converted to dummy variables) was used to identify the best covariates, which were then compared with patterns of missingness across imputation variables and potential covariates. Highly correlated covariates that were generally present whenever the imputation variable was missing were chosen to define the hot-deck cells. If the covariates were continuous, they were split into 3 or 5 categories based on their empirical distribution. The resulting imputed variables were then subjected to the same edits used for reported data and, if they failed edits, re-imputed.

A variant of the hot deck, which we will refer to as the Hot-Deck-Variant (HDV) method, was implemented in situations where there was only one significant continuous covariate for a given imputation variable and this covariate turned out to be a count variable (for example, the number of enrollees in a health insurance plan or the number of employees at an establishment). This method is a form of nearest neighbor imputation within cells. In this procedure, the covariate itself (rather than categories of it) was used to define boundaries that could be crossed if necessary to find a donor. This procedure has the advantage of easy implementation, and its results are comparable to those obtained from regression imputation (Aigner et al., 1975).

Once all the imputation was completed in a group of chunks, imputation moved on to the next group. At this point additional edit constraints that the reported data had passed were applied to the imputed data. If a conflict arose between two imputed values from different variable groups, the most recently imputed data were revised to minimize the need for cycling across groups of variables. There were situations, however, where earlier imputations were identified at a later stage to fail edits and therefore required re-imputation. This required a second pass through the imputation process to impute for these complex situations. Careful review of all edits for the smaller number of cases needing imputation during this second pass assured that no third pass through the data would be necessary. As an example, imputed enrollments in a health plan may have met all the required edits on enrollments (e.g., enrollment is not greater than total employment at the establishment), but the imputed enrollment causes costs per enrollee to go outside allowable ranges. This is not identified until costs have been imputed in a later group. If the cost data were reported, then it becomes necessary to go back and re-impute enrollments.

The hot deck and HDV procedures were the most frequently used imputation methods in NEHIS, primarily because the NEHIS data set contains a large number of imputation variables with weakly correlated covariates. Another reason is computational convenience: the procedures are easily implemented by the standard imputation software developed by Westat. Of the approximately 150 imputation models implemented in the NEHIS imputation task, 60% used hot deck, 30% used HDV, and 10% used deterministic imputation.
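A minimal sketch of the HDV idea is given below, assuming a single count covariate such as the number of employees. The field names are illustrative, and the cell structure (with boundaries that can be crossed when no donor is found) is omitted for brevity:

    def hdv_impute(recipients, donors, covariate, target):
        """Nearest-neighbour hot deck on a single count covariate (HDV-style sketch).

        Each recipient with a missing target takes the value from the donor whose
        covariate value (e.g., number of employees) is closest to its own."""
        donor_pool = [d for d in donors
                      if d[target] is not None and d[covariate] is not None]
        if not donor_pool:
            return recipients
        for r in recipients:
            if r[target] is not None or r[covariate] is None:
                continue
            nearest = min(donor_pool, key=lambda d: abs(d[covariate] - r[covariate]))
            r[target] = nearest[target]
        return recipients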

3.1.3 Strengths and weaknesses of approach

The combination of hot deck and HDV imputations described above allows for complete-case analyses involving these 50 variables. This significantly increases the utility of the resulting data set for multivariate analyses by providing consistency across tabular analyses. The relatively simple models used for the imputations have hopefully reduced much of the potential bias from item nonresponse. Attenuation of key relationships has been minimized for those relationships that were of joint concern to the three sponsoring agencies. For example, after imputing enrollment in family coverage for over 10 percent of health plans, its correlation with overall enrollment (single plus family coverage) remained at 0.98. The final database provides clear documentation of the source of imputed data (from a donor versus resulting from an edit constraint).

The full range of data users is unknown and thus could not be consulted on the models that were used for the hot-deck imputations. The models, however, are quite straightforward and easily described in documentation (which covariates were used for which variables), so users can decide on their appropriateness for their analyses. This approach, however, just like Bayesian methods, was very labor intensive and time consuming. It has taken multiple statisticians and statistical programmers many months to complete this work. This large effort was a result of the number of imputation variables combined with the very complex logical and edit constraints imposed on the resulting data.


3.2 Medicare Current Beneficiaries Survey (MCBS) (Cyclic n-Partition Hot-Deck)

3.2.1 Magnitude and complexity of the problem

The main activity of the MCBS is to collect data on about 660,000 health care events per year on about 12,000 beneficiaries. These events can be doctors' visits, inpatient hospitalizations, dental visits, containers of prescription drugs, purchases of adult diapers, and so on. For each event, interviewers attempt to collect the cost[i] and a complete record of payments by the patient and all third parties. An extra payer category of "uncollected liability" is also recognized. Sometimes beneficiaries are only able to report that an event took place. Other times, they will remember their out-of-pocket payment but will be unaware of the total cost or of amounts paid by third parties. Or they may have a complete paper history of payments by a series of third parties, including Medicare and a private medigap insurer. Sometimes they will have forgotten the event entirely, but it will show up in HCFA's claims system with the cost and Medicare's payment but no other payment information. Of course, reporting from claims is only possible for covered services, which exclude prescription medicines and dental care, among others. Since beneficiaries are visited three times over the course of each year for several years, there is a lot of duplicated paperwork across visits. Before imputation can start, there is an intensive editing phase, which we will only mention briefly here.

We define the following notation. Let δ = (δ_1, ..., δ_s), where s is the number of payer sources recognized in the survey; δ_i = 1 if the i-th source is known to have made a payment, δ_i = 0 if the i-th source is known not to have made a payment, and δ_i is missing otherwise. Let Y = (Y_1, ..., Y_s), where Y_i is the payment by the i-th source, and let Y_+ denote the total cost of the event. The total vector to be completed for each event is ζ = (δ, Y, Y_+). A feature of cost and payment data is that the payments should sum to the cost and all the payments should be positive.

Table 2 shows the edit and imputation rates for cost data at the event level by matched status and type of event. Imputation rates are of course lowest for events that are reported by the respondent and matched to administrative claims data. Even for these events, however, 12 percent are edited or imputed.[ii] Imputation rates for cost are 42 percent for unmatched survey-reported events (mostly prescription medicines) and 24 percent for claims-only events that were not reported by the beneficiaries (largely bills by separately billing doctors and laboratories). Table 3 shows how often a potential payer[iii] was edited or imputed to have at least partially paid for an event despite the lack of a report by the beneficiary to that effect. The questionnaire only asked what payments had been made by any source. When a payment was reported, the amount and payer were recorded. The survey never collected information from the respondent that a particular potential payer did not pay any part of the bill. Note that the most frequent unmentioned potential payers imputed to have helped pay for an event are "out of pocket" and "uncollected liability." Third parties were also edited and imputed to have helped pay for events, but for the most part, if a bill was not fully paid by Medicare, the difference was imputed to have been either paid by the beneficiary or forgiven by the provider. Potential payers were also frequently edited or imputed not to have paid for a specific event, but no data on that type of imputation are in the table. Table 4 shows the edit/imputation rates for positive payment amounts. A payment of $0 by a potential payer could also be considered to be imputed, but no data on these rates are presented. The rates of imputation are high for all payers except Medicare fee for service. They are 100 percent for Medicaid and the Veteran's Administration because patients with this coverage never have any paperwork about the costs of their covered events. The high rates for other third-party payers reflect the fact that there can always be negotiation about how much is paid by them instead of being paid by the patient or forgiven by the provider. Looking at these rates, one might almost wonder if there is any point in asking about payment data for medical events. However, there is quite a bit of payment information that is reported by respondents. Table 5 shows the level of partial data that is being preserved by imputation. Even among survey-only events, 32 percent of all events have a complete payment reconciliation provided by the respondents. Only 28 percent have no financial data on them at all. For matched events, 58 percent are fully reconciled using respondent and administrative data. It is uncommon for survey-only events to have a reported cost and not have complete payment data (just 1 percent of survey-only events), but it is quite common for claims-only events to have a reported cost and incomplete payment data (53 percent of claims-only events).

One could develop an imputation system that first deleted all partially reported financial data and then filled in complete records from similar events that were fully reconciled. Such a system would be very easy to create. The 34 percent of events that are completely reconciled prior to edit and imputation would be used to impute the remaining 66 percent. However, much partial information would be lost. Overall, 22 percent of events have at least a cost associated with them, and 29 percent have at least one payment reported even though they do not have a cost. Thus, it is important to have an imputation system that can either impute payments given a known cost or impute cost given partial payment data. It is interesting to note that among respondents who were alive and eligible at the end of the year, not eligible for Medicaid at any time during the year, not covered by capitated payment plans during the year, outside of nursing homes for at least part of the year, and who had at least one medical event during 1996, just 3.8 percent were able to give a complete cost and payment record for every event they reported during the year (or that was reported to HCFA by a provider). Even if one includes persons with no events for the year, the rate of perfect reporting only increases to 6.4 percent. Clearly, a pure weighting strategy for handling missing data would not work on this survey. Without imputation, there would be no point in conducting medical expenditure surveys.

[i] We use "cost" here as a convenient shorthand for a more subtle concept. In the simple situation where there is no involvement by Medicaid or capitated payment arrangements, the "cost" is defined to be the amount that the provider is legally entitled to collect from the patient and third parties. Thus, it is a cost to the consumer rather than the provider's cost of creating the service. For patients with capitated payment plans, the "cost" was defined to be the amount that the provider could have collected under a fee-for-service arrangement. For Medicaid beneficiaries, the "cost" was reduced to reflect legislated mandatory discounts. The "cost" might be equated with the concept of "value" except for Medicaid beneficiaries.

[ii] This mostly reflects adjustments to costs on events by dual Medicaid-Medicare beneficiaries.

[iii] Potential payers were determined by a series of questions on program participation and medical insurance that are updated each round prior to asking about utilization and costs.


Table 2. Edit and imputation rates for cost data by type of event and by match status

Matched status | Measure             | Dental | Inpatient Hospital | Short-term Institutional | Medical provider[iv] | Other medical | Outpatient | Prescribed Medicine | Separately Billing MP | Separately billing lab | Total
Survey only    | Event count         | 10,890 | 499   | 79    | 39,488  | 12,368 | 8,864  | 211,117 | 2,441  | 487    | 286,233
Survey only    | Imputation rate[v]  | 28%    | 83%   | 89%   | 81%     | 59%    | 79%    | 33%     | 25%    | 18%    | 42%
Claims only    | Event count         | 0      | 1,434 | 914   | 43,068  | 16,986 | 21,466 | 0       | 42,332 | 43,777 | 169,977
Claims only    | Imputation rate     | na     | 19%   | 18%   | 29%     | 30%    | 21%    | na      | 31%    | 10%    | 24%
Matched        | Event count         | 53     | 2,761 | 226   | 59,532  | 4,369  | 17,581 | 0       | 16,476 | 4,447  | 105,445
Matched        | Imputation rate     | 2%     | 14%   | 4%    | 11%     | 17%    | 19%    | na      | 7%     | 6%     | 12%
Total[vi]      | Event count         | 10,943 | 4,694 | 1,219 | 142,088 | 33,723 | 47,911 | 211,117 | 61,249 | 48,711 | 561,655
Total[vi]      | Imputation rate     | 28%    | 23%   | 20%   | 36%     | 39%    | 31%    | 33%     | 25%    | 10%    | 31%

[iv] Office-based medical provider visits.
[v] The imputation rate is defined as the percent of events for which the cost was either edited or imputed.
[vi] Excludes events of persons in facilities the entire year and "ghosts," placeholder records for persons eligible for Medicare in 1996 but not drawn into the sample in time for collection of event-level data.

Table 3. Frequency of editing or imputing that a potential payer did make a payment on a specific event

Payer                      | Dental | Inpatient Hospital | Short-term Institutional | Medical provider | Other medical | Outpatient | Prescribed Medicine | Separately Billing MP | Separately billing lab | Total
Medicaid                   | 0%  | 5%  | 1%  | 7%  | 10% | 7%  | 4%  | 13% | 7% | 7%
Medicare fee for service   | 0%  | 0%  | 0%  | 0%  | 0%  | 0%  | 0%  | 0%  | 0% | 0%
Uncollected liability      | 2%  | 4%  | 3%  | 3%  | 3%  | 3%  | 62% | 3%  | 1% | 25%
Medicare HMO               | 1%  | 1%  | 1%  | 1%  | 1%  | 1%  | 2%  | 0%  | 0% | 1%
Private HMO                | 0%  | 1%  | 0%  | 1%  | 1%  | 1%  | 7%  | 1%  | 1% | 3%
Out of Pocket              | 7%  | 11% | 6%  | 18% | 19% | 14% | 29% | 14% | 8% | 20%
Other                      | 0%  | 1%  | 0%  | 0%  | 0%  | 1%  | 3%  | 0%  | 0% | 1%
Employer-Sponsored FFS     | 3%  | 5%  | 7%  | 8%  | 5%  | 5%  | 10% | 11% | 7% | 9%
Individually purchased FFS | 1%  | 6%  | 5%  | 9%  | 8%  | 7%  | 4%  | 13% | 8% | 7%
FFS - unknown buyer        | 0%  | 2%  | 12% | 1%  | 1%  | 1%  | 0%  | 4%  | 1% | 1%
Veteran's Administration   | 0%  | 0%  | 0%  | 0%  | 0%  | 0%  | 2%  | 0%  | 0% | 1%

Table 4. Edit/imputation rates of positive payment amounts

Payer                      | Dental | Inpatient Hospital | Short-term Institutional | Medical provider | Other medical | Outpatient | Prescribed Medicine | Separately Billing MP | Separately billing lab | Total
Medicaid                   | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100%
Medicare fee for service   | 0%   | 2%   | 0%   | 13%  | 3%   | 7%   | 0%   | 1%   | 0%   | 6%
Uncollected liability      | 74%  | 90%  | 99%  | 84%  | 90%  | 89%  | 74%  | 87%  | 87%  | 79%
Medicare HMO               | 90%  | 91%  | 100% | 98%  | 90%  | 90%  | 100% | 89%  | 65%  | 97%
Private HMO                | 66%  | 78%  | 100% | 78%  | 89%  | 81%  | 100% | 71%  | 72%  | 93%
Out of Pocket              | 16%  | 76%  | 98%  | 60%  | 69%  | 72%  | 37%  | 85%  | 87%  | 48%
Other                      | 64%  | 72%  | 94%  | 87%  | 89%  | 85%  | 92%  | 83%  | 78%  | 89%
Employer-Sponsored FFS     | 42%  | 50%  | 93%  | 59%  | 79%  | 59%  | 85%  | 60%  | 77%  | 71%
Individually purchased FFS | 70%  | 55%  | 90%  | 58%  | 80%  | 56%  | 81%  | 61%  | 76%  | 63%
FFS - unknown buyer        | na   | 100% | 100% | 100% | 100% | 100% | na   | 100% | 100% | 100%
Veteran's Administration   | 100% | 100% | 100% | 99%  | na   | 99%  | 100% | 100% | 100% | 100%

Table 5. Frequency of edited and imputed cost and positive payment data by type of event, matched status, and level of missingness

Level of missingness                    | Dental | Inpatient Hospital | Short-term Institutional | Medical provider (MP) | Other medical | Outpatient | Prescribed Medicine | Separately Billing MP | Separately billing lab | Total

Survey-only events
Complete data                           | 72%    | 15%   | 11%   | 16%     | 41%    | 18%    | 32%     | 51%    | 47%    | 32%
Cost reported but payments incomplete   | 3%     | 4%    | 0%    | 4%      | 3%     | 4%     | 0%      | 26%    | 37%    | 1%
Cost missing and payments incomplete    | 4%     | 6%    | 5%    | 13%     | 4%     | 10%    | 50%     | 3%     | 7%     | 40%
All cost and payment data missing       | 22%    | 75%   | 84%   | 67%     | 53%    | 67%    | 17%     | 19%    | 10%    | 28%
Event count                             | 10,890 | 499   | 79    | 39,488  | 12,368 | 8,864  | 211,117 | 2,441  | 487    | 286,233

Claims-only events
Complete data                           | na     | 40%   | 34%   | 1%      | 10%    | 43%    | na      | 0%     | 64%    | 24%
Cost reported but payments incomplete   | na     | 42%   | 48%   | 70%     | 60%    | 37%    | na      | 68%    | 26%    | 53%
Cost missing and payments incomplete    | na     | 18%   | 18%   | 27%     | 25%    | 20%    | na      | 30%    | 10%    | 22%
All cost and payment data missing       | na     | 0%    | 0%    | 2%      | 5%     | 1%     | na      | 1%     | 0%     | 1%
Event count                             | 0      | 1,434 | 914   | 43,068  | 16,986 | 21,466 | 0       | 42,332 | 43,777 | 169,977

Matched events
Complete data                           | 74%    | 53%   | 74%   | 58%     | 38%    | 49%    | na      | 71%    | 67%    | 58%
Cost reported but payments incomplete   | 25%    | 33%   | 23%   | 31%     | 44%    | 32%    | na      | 22%    | 28%    | 30%
Cost missing and payments incomplete    | 2%     | 14%   | 4%    | 9%      | 16%    | 17%    | na      | 7%     | 4%     | 10%
All cost and payment data missing       | 0%     | 0%    | 0%    | 1%      | 1%     | 1%     | na      | 0%     | 1%     | 1%
Event count                             | 53     | 2,761 | 226   | 59,532  | 4,369  | 17,581 | 0       | 16,476 | 4,447  | 105,445

All events
Complete data                           | 72%    | 45%   | 40%   | 29%     | 25%    | 41%    | 32%     | 21%    | 64%    | 34%
Cost reported but payments incomplete   | 3%     | 32%   | 40%   | 35%     | 37%    | 29%    | 0%      | 54%    | 26%    | 22%
Cost missing and payments incomplete    | 4%     | 14%   | 14%   | 16%     | 16%    | 17%    | 50%     | 23%    | 9%     | 29%
All cost and payment data missing       | 21%    | 8%    | 5%    | 20%     | 22%    | 13%    | 17%     | 2%     | 0%     | 15%
Event count                             | 10,943 | 4,694 | 1,219 | 142,088 | 33,723 | 47,911 | 211,117 | 61,249 | 48,711 | 561,655

Prior to developing an imputation algorithm, the developers first set some criteria that they wanted the algorithm to satisfy. These criteria were:

1) All payments had to be positive and sum to the cost of the event;

2) All imputed payments should be consistent with δ, and δ must be consistent with the insurance coverage and program participation that was effective at the time of the event (a minimal sketch of how constraints 1 and 2 can be represented is given after this list);

3) No partial cost or payment data should be discarded;

4) Runs had to finish in a reasonable amount of time on the computer hardware then available (the HCFA IBM 9672-T26 mainframe, a VAX 4000-700A, and Pentiums);

5) Once the algorithm had been programmed and tested, it should require minimal review by human analysts;

6) Multiple events that are very similar to each other (such as multiple purchases of the same prescription drug) should have similar participation by the various payers (uniform δ);

7) Imputed cost should be consistent with the nature of the event (surgery and eyeglasses having very different costs);

8) Correlations between payments by different sources (usually negative) should be preserved;

9) Associations of payment patterns with demographic and economic variables should be preserved; and

10) Seasonal variation in payment patterns (induced by deductible requirements) should be preserved.
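Criteria 1 and 2 amount to simple checks on the completed vector ζ = (δ, Y, Y_+). A minimal sketch of such a check is shown below; the argument layout is illustrative rather than the actual MCBS file structure:

    def satisfies_criteria_1_and_2(delta, y, y_total, covered, tol=0.01):
        """delta[i]: 1/0/None payer participation; y[i]: payment by source i;
        covered[i]: whether the beneficiary had that coverage at the time of the event."""
        # Criterion 1: payments positive for participating payers and summing to the cost.
        if any(d == 1 and not yi > 0 for d, yi in zip(delta, y)):
            return False
        if abs(sum(y) - y_total) > tol:
            return False
        # Criterion 2: payments consistent with delta, and delta consistent with coverage.
        if any(d == 0 and yi != 0 for d, yi in zip(delta, y)):
            return False
        if any(d == 1 and not c for d, c in zip(delta, covered)):
            return False
        return True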

3.2.2 Approach Chosen

While the developers could not devise an algorithm that met all of these criteria, they did come up with an algorithm that meets most of them. Earlier references to this work are Judkins, Hubbell, and England (1993) and England, Hubbell, Judkins, and Ryaboy (1994). However, a large team at Westat and HCFA contributed to the effort.[vii] The algorithm has now been used on five years of MCBS data, from 1992 through 1996. That reflects processing of about 5 million events. Plans are to continue using it indefinitely. In this chapter, we provide more detail on the algorithm than has previously been given and present some new data on the magnitude of the imputation.

[vii] Some of the team members were Frank Eppig, Kim Skellan, Dave Gibson, John Poisal, Gary Olin, Mary Laschober, Ian Whitlock, and Diane Robinson.

The basic approach chosen was a two-step procedure. The first step was to use a large number of sequential common-donor hot decks to impute δ, where a separate hot deck was used to complete δ for every group of events with a different pattern of missingness in δ and a different set of available auxiliary data. This type of approach had been hinted at in Fahimi et al. (1993) and in Winglee, Ryaboy, and Judkins (1993). However, in the Fahimi paper, average values were substituted into variables with values that had not yet been imputed, and in the Winglee et al. paper, missing values were treated as legitimate match categories. The former approach did not prevent anomalies, and the second approach led to thin donor pools. The 1993 paper by Judkins, Hubbell, and England was the first in which a separate hot deck was performed for every distinct missing data pattern in a large vector. This procedure might be referred to as sequential full-information common-donor hot decks. The second step was to use cyclic n-partition hot decks (Judkins, 1997) to impute (Y, Y_+) consistent with δ. The first step was run on "super-events"; the second step was run on individual events. Super-events were defined to be clusters of nearly identical events, such as purchases of the same prescription drug or multiple visits to the same doctor. Payer participation was imputed at the super-event level because it was theorized that if a payer made a payment for one event within a super-event, then it would probably make payments for all the events. Actual costs and payments were imputed at the individual event level because it was theorized that costs and payments would be more homogeneous across beneficiaries and providers if defined at the event level rather than covering a whole course of treatment.

Let h = (h_1, ..., h_s), where h_i = 1 if δ_i is observed and 0 otherwise. Let Ω_h be the set of distinct values of h realized in the sample. Let g = (g_1, ..., g_s), where g_i = 1 if Y_i is observed and 0 otherwise. Let Y_+ be the total cost of the event and g_+ indicate whether Y_+ is observed. In the first step, events were separated by type of event, by whether Y_+ was observed, and, if Y_+ was observed, by whether there were any known payers with unknown payment amounts. For events with observed Y_+ and a valid payment amount for every known payer, a separate common-donor hot deck was run to impute the missing portion of δ for every element in the product of Ω_h with the list of event types. The reason to have a separate hot deck for every element of Ω_h was that exact matches on insurance coverage and program participation for the missing portion of δ were required. Separate hot decks were required for different event types because coverage varies by event type for most of the payers. Beyond exact matching on insurance coverage and program participation for the missing portion of δ and on the type of event, matching was attempted on the observed portion of δ and on deciles of the unaccounted cost.

For events with observed Y_+ but a missing payment amount for at least one known payer, a separate common-donor hot deck was again run to impute the missing portion of δ for every element of the product of Ω_h with the list of event types. As with the first group, exact matching on insurance coverage and program participation for the missing portion of δ was required, and matching was attempted on the reported portion of δ. However, matching could not be made within deciles of the unaccounted cost since it was not possible to define this uniformly. Instead, more detailed matching was attempted on deciles of Y_+.

For events with missing Y_+, a separate common-donor hot deck was again run to impute the missing portion of δ for every element of the product of Ω_h with the list of event types. As with the first group, exact matching on insurance coverage and program participation for the missing portion of δ and on the type of event was required. However, matches could not be made within deciles of the unaccounted cost or on cost, so the only further attempted matches were on the reported portion of δ.

In 1996, this approach required a very large number of hot-deck runs. No one has counted them, but based on the thickness of the printouts, there are at least 1,000 and possibly several thousand. This was a very large number of hot decks to run on such a large data set, but automated SAS macros made it possible. The approach was generally successful in preserving the association of payer participation. The only problems that arose were that occasionally a pattern would arise in the observed portion of δ for which there were no events with complete payment accountings. For these few events, it was necessary to develop ad hoc solutions.

After the completion of δ, work began on the actual payment and cost amounts. Let Ω_δ be the set of distinct values of δ realized in the sample. Let Ω_g be the set of distinct values of g realized in the sample. Consideration was given to running a separate hot deck for every element in the product of Ω_δ with Ω_g with the list of event types. A separate hot deck for every element of Ω_δ would be desirable so that one could work within a set of events where the same payers were known to be involved. A separate hot deck for every element of Ω_g would be desirable so that the imputed payment amounts could all come from the same donor that matched exactly on all payments by known payers. However, this was too large a set to be feasible.
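A minimal sketch of the bookkeeping behind the first step — one common-donor hot deck per combination of event type and missingness pattern h — is given below. The field names are illustrative, and the matching is far coarser than in the production system, which also matched exactly on insurance coverage and program participation for the missing components and on deciles of cost:

    from collections import defaultdict
    import random

    def impute_delta_by_pattern(events, n_sources, rng=random.Random(0)):
        """Complete the payer-participation vector delta with one hot deck per
        (event type, missingness pattern) combination."""
        groups = defaultdict(list)
        for e in events:
            h = tuple(0 if d is None else 1 for d in e["delta"])  # 1 = observed
            groups[(e["event_type"], h)].append(e)

        for (event_type, h), members in groups.items():
            if all(h):
                continue  # nothing to impute for this pattern
            # Donors: events of the same type with delta fully observed.
            donors = [d for d in events
                      if d["event_type"] == event_type and None not in d["delta"]]
            if not donors:
                continue  # would require an ad hoc solution in practice
            for e in members:
                # Prefer donors that agree with the recipient on the observed part of delta.
                eligible = [d for d in donors
                            if all(d["delta"][i] == e["delta"][i]
                                   for i in range(n_sources) if h[i] == 1)]
                donor = rng.choice(eligible or donors)
                e["delta"] = [v if v is not None else donor["delta"][i]
                              for i, v in enumerate(e["delta"])]
        return events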


Instead, three systems were developed. One system was used for events with no cost or payment data, corresponding to g=0. The second system was used for all events involving Medicaid. The third system was developed for events with partial payment data and no Medicaid involvement.

The system for events with no payment data was quite simple. First, a hot deck was run to impute cost with matching on type of event, some other event-level variables, and some person-level variables. Next, a hot deck was run with exact matching on δ and rough matching on cost. This is very similar to the system that was used on NMES II. The Medicaid system was really a logical edit system based on HCFA rules for cost sharing.

The third system was a new development, one that is part of a new class of imputation methods called cyclic n-partition hot decks. The first step with a cyclic method is to create a feasible solution with simplistic methods and then to cyclically reimpute originally missing portions of the vector conditioned on important aspects of the remainder of the vector, just as is done in the Bayesian MCMC methods. For those interested, the simplistic method for creating the initial feasible solution is in England, Hubbell, Judkins, and Ryaboy (1994). The cyclic n-partition hot decks are the more interesting feature. Once the set of payers is fixed, part of the (Y, Y+) vector is overspecified. If the cost and all but one of the payments are known, then the final payment can be completed by subtraction. If all the payments are known, then the cost can be completed by addition.

Thus, two types of hot decks were set up. The first type was to impute Yi where both Yi and Y+ were originally missing. For this type, there was exact matching on δi and on type of event and rough matching on the sum of payments by other sources. Rough matching was defined as within deciles of the sum of payments by other sources. After imputation of Yi, the value for Y+ was recalculated. The second type of hot deck was to impute the relative division between two originally missing payment amounts, Yi and Yj. For the second type of hot deck, there was exact matching on δi and δj and on type of event and rough matching on the sum Yi + Yj. Rough matching was defined as within deciles of Yi + Yj. The hot deck actually imputes Pij = Yi/(Yi + Yj). The program then calculates new corresponding values of Yi and Yj. There are s hot decks of the first type for each type of event and s(s-1)/2 of the second type for each type of event. This was found to be a manageable number of hot decks with s=9. These s + s(s-1)/2 hot decks were run sequentially in a batch to constitute one cycle of the algorithm. The entire cycle is then repeated until some measure of convergence is obtained. To speed convergence, only events with originally reported values of Yi and Y+ are allowed to serve as donors in the first type of hot deck, and only events with originally reported values of Yi and Yj are allowed to serve as donors in the second type of hot deck. Also to speed convergence, only events with at least three originally missing pieces of data are iteratively reimputed. (For events with just two missing pieces of data, the marginal expectation does not vary by iteration.) Ad hoc procedures were occasionally required where there were no potential donors; specifically, if there were no events in the database with a known split of payments between two payers, then there were no available donors. Convergence is crudely gauged by watching the marginal distributions of total cost and of each payment category.
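The following sketch shows one cycle of such a cyclic n-partition hot deck for a small set of payers. It is written in Python with illustrative column names (pay_k, delta_k, miss_k) and omits some refinements described above, such as restricting reimputation to events with three or more missing pieces; it is an assumption-laden illustration of the technique, not the MCBS software.

```python
import numpy as np
import pandas as pd

def decile(x):
    """Decile class (0-9) used for 'rough' matching."""
    return np.minimum((x.rank(method="first", pct=True) * 10).astype(int), 9)

def class_hot_deck(recipients, donors, keys, value_col, rng):
    """Random donor within exact-match classes; NaN where a class has no donor."""
    out = pd.Series(np.nan, index=recipients.index)
    groups = donors.groupby(keys)[value_col]
    for key, idx in recipients.groupby(keys).groups.items():
        try:
            pool = groups.get_group(key)
        except KeyError:
            continue  # no donors in this class: handled ad hoc in the real system
        out.loc[idx] = rng.choice(pool.to_numpy(), size=len(idx))
    return out.dropna()

def one_cycle(events, payers, rng):
    """One cycle of a cyclic n-partition hot deck (sketch).

    events holds a current feasible solution: 'event_type', 'cost', one
    'pay_<k>' column per payer k, participation flags 'delta_<k>', and flags
    'miss_cost'/'miss_<k>' marking the originally missing fields.
    """
    pay_cols = [f"pay_{k}" for k in payers]

    # Type 1: reimpute pay_i where pay_i and the total cost were both missing,
    # matching exactly on event type and roughly on the sum of other payments.
    for i in payers:
        live = events[events[f"delta_{i}"] == 1].copy()
        live["other_dec"] = decile(live[pay_cols].sum(axis=1) - live[f"pay_{i}"])
        need = live[live[f"miss_{i}"] & live["miss_cost"]]
        dons = live[~live[f"miss_{i}"] & ~live["miss_cost"]]
        imp = class_hot_deck(need, dons, ["event_type", "other_dec"], f"pay_{i}", rng)
        events.loc[imp.index, f"pay_{i}"] = imp
    # Cost is recompleted by addition wherever it was originally missing.
    events.loc[events["miss_cost"], "cost"] = events.loc[events["miss_cost"], pay_cols].sum(axis=1)

    # Type 2: reimpute the split P_ij = Y_i / (Y_i + Y_j) between two payments
    # that were both originally missing, holding their current sum fixed.
    for a, i in enumerate(payers):
        for j in payers[a + 1:]:
            live = events[(events[f"delta_{i}"] == 1) & (events[f"delta_{j}"] == 1)].copy()
            live = live[(live[f"pay_{i}"] + live[f"pay_{j}"]) > 0]
            live["pair_dec"] = decile(live[f"pay_{i}"] + live[f"pay_{j}"])
            live["p_ij"] = live[f"pay_{i}"] / (live[f"pay_{i}"] + live[f"pay_{j}"])
            need = live[live[f"miss_{i}"] & live[f"miss_{j}"]]
            dons = live[~live[f"miss_{i}"] & ~live[f"miss_{j}"]]
            p = class_hot_deck(need, dons, ["event_type", "pair_dec"], "p_ij", rng)
            total = events.loc[p.index, f"pay_{i}"] + events.loc[p.index, f"pay_{j}"]
            events.loc[p.index, f"pay_{i}"] = p * total
            events.loc[p.index, f"pay_{j}"] = (1 - p) * total
    return events

# one_cycle() would be repeated until the marginal distributions of cost and
# of each payment category stop changing appreciably.
```

Because the two types of changes either recompute cost by addition or shift amounts between two payers while holding their sum fixed, each cycle preserves the accounting identity between cost and payments.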

3.2.3

Strengths and weaknesses of approach

We now review the list of criteria that were developed in advance and discuss the extent to which the algorithm met each of them.

First, all payments were imputed to be positive and to sum to the cost of the event. This was achieved by first developing a feasible solution, then making only two types of changes to the imputed cost and payment vectors. The first type of change was to reimpute payments on records with originally missing cost and then to recompute a new cost as the sum of payments. The second type of change was to shift payment amounts among payers. Neither type of change could lead to negative payments or to payments that failed to sum to cost. This would not have been easy to achieve with model-based methods because of the strongly non-smooth nature of the distribution of payment amounts; the most common payment is $0 for most payers.

Second, all imputed payments were consistent with the post-edit δ. This means that if a person did not have a certain type of coverage, then a strictly positive payment by that source was never imputed. Also, if a person mentioned that a particular source had paid but did not know the amount of the payment, then a strictly positive payment amount was imputed. Payments were never imputed for insurance or programs for which the beneficiary had no coverage at the time of the event.

Third, no post-edit partial cost or payment data were discarded. As shown in Table 3.2-4, this resulted in a considerable saving of partial data. If a system like the NMES II system had been used, partial payment data on at least 40 percent of survey-only events, 22 percent of claims-only events, and 10 percent of matched events would have had to be discarded.

Fourth, the method was computationally feasible on the available computer hardware. The first time it was used, it was run on an HCFA IBM 9672-T26 mainframe. The runs for prescription drugs took 7.5 CPU hours on that machine; the time for every other type of event was shorter. The software was also run successfully on Pentiums and on a VAX 4000-700A.

In 1996, most of the processing was done on an HP 9000 D380 Unix server. On that machine, the CPU time was 14.3 hours combined across all the runs for each event type.

Fifth, the method has required very little manual review and retraining. There was a very large expenditure of professional labor to create and test the system in 1993 through 1995, but since that time, there has been very little professional labor. The system is basically run on each year of data by systems analysts with a light review by staff economists.

Sixth, similar events were imputed to have similar payer participation patterns by virtue of imputation of participation at the super-event level. This may not be entirely a good thing since it may obscure the effects of meeting deductibles, but this has not been examined in detail.

Seventh, the marginal distribution of costs by nature of event seemed reasonable. Exact matching on major event category was required. For the unmatched survey events with no payment data, the program uses some detailed match keys, such as whether the medical provider care involved surgery, and personal characteristics such as region and metropolitan status.

On the eighth criterion, matching on the observed components of δ usually ensured that correlations of the payments by various sources were preserved. However, there were instances where it was not possible to match on the full δ information because some payer coordination patterns were never reported on a resolved basis. That is, respondents sometimes reported odd combinations of payers without ever reporting how those payers coordinated their payments. Various ad hoc solutions were developed for this. We doubt that MCMC methods would have done any better since there were essentially no relevant complete data from which to build a model.

On the ninth and tenth criteria, the system was not successful. Matching on partial event-level features, insurance coverage, and program participation proved to be all the matching that was feasible for imputing most of the cost and use data. It was generally not possible to also match on time of year of service, personal income, sex, detailed age, geography, race, ethnicity, employment status, general health and functioning, and so on. The only exception was for total cost for unmatched survey events with no payment data, where it was possible to match on region and metropolitan status. The failure to match on time of year or more person-level variables was partly due to the difficulty in using more match keys and partly the result of a feeling that these other variables have little impact on the cost and payment data for health care events after controlling for the variables that were used. However, the failure to match at a finer level probably resulted in imputations of payments by medigap policies (policies that cover costs not covered by Medicare) too early in the year and out-of-pocket expenses too late in the year, too high personal payments by poor persons above the Medicaid threshold, too much uncollected liability for rich persons, and other distortions. This is one area where a Bayesian MCMC method might have performed better. By focusing on low-order effects, one is able to include more main effects with such a procedure.

There are three other general criteria that are usually applied to imputation systems. One, of course, is that imputations be accurate, or at least unbiased. Another is that variance estimates and coverage for confidence intervals both be good. The third is that the process be fairly transparent so that users can assess whether the imputations meet their needs.

With respect to the criterion that imputations be accurate or at least unbiased, the approach assumes ignorable nonresponse, as do the Bayesian MCMC methods. In this case, this seems like a fairly reasonable assumption given the strong auxiliary data about insurance coverage and the strong partial data from Medicare claims histories.

With respect to variance estimation and coverage of confidence intervals, this is a weak point in the system. Users of the fully imputed data set may be lulled into a false sense of security. A large percentage of total dollars and their allocation across payers is imputed, yet the user will appear to have complete data on close to 660,000 health care events for about 12,000 Medicare beneficiaries. Standard errors estimated from this data set by conventional means will not be very accurate. Resampling weights have been provided so that the variance estimates can be inflated for the complex sample design, but no satisfactory way of adjusting estimated standard errors for the MCBS imputation process has been developed. Clearly, estimated standard errors will tend to be much too small. The issue of variance estimation is discussed briefly at the end of this chapter. For the moment, the best we can advise users of MCBS data is to inflate estimated variances by the inverse of the observed item response rate. A related question is what sort of variance to associate with the exogenous imputation process that was carried out.
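A minimal illustration of that rule of thumb, with purely illustrative numbers and a hypothetical function name:

```python
def inflate_variance(naive_variance, item_response_rate):
    """Crude adjustment suggested in the text: scale a conventionally
    estimated variance by the inverse of the observed item response rate."""
    return naive_variance / item_response_rate

# With an 80 percent item response rate, a naive variance of 2.5e6 becomes
# 2.5e6 / 0.80 = 3.125e6, so standard errors grow by about 12 percent.
print(inflate_variance(2.5e6, 0.80))
```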

Regarding the last criterion, transparency, it is clear that the method is not easily understood. Some of the overall features are clear, but there are many obscure details in the system, and even the implications of the known features are not entirely clear. However, the method is probably no worse than Bayesian MCMC methods in this regard. We have to explain decisions on when to weaken match criteria; the Bayesians have to explain prior distributions on parameters. Neither job is easy. This complexity appears to be an unavoidable consequence of preserving the joint distribution of a complex vector. As such, we believe that the complexity is worthwhile. Certainly, no secondary analyst would be able to extract very much value out of the raw data or even out of the edited data. The extent of missingness is simply too broad.

3.3

IEA Reading Literacy Study

The U.S. component of the IEA Reading Literacy Study (Elley, 1992) was conducted in the 1990-91 school year. The study involved national samples of over 6,000 grade 4 and 3,000 grade 9 students. The grade 4 students were sampled from over 160 schools nationwide, and two complete classes of students were selected per school. The grade 9 students were sampled from approximately the same number of schools across the country, and one class of students was included per school. Students sampled for the study were given performance tests to evaluate their reading levels and comprehension. In addition, the students, their teachers, and school principals completed questionnaires about background factors related to the students' reading achievement. Student performance on the cognitive tests was scored using the Rasch scaling method, and nonresponses to the cognitive test items were handled within the context of the Rasch model (see Elley, 1992). The item nonresponses discussed in this chapter refer to nonresponses to student, teacher, and school-level questionnaire items in the United States.

3.3.1

Magnitude and Capacity of Problem

Need for imputation. The aims of the IEA Reading Literacy Study were to assess school children's reading proficiency in the language of their own country and to collect information from students, teachers, and schools about the factors that lead some students to become better readers than others. While the prime focus of the study is on international comparisons, an additional objective for the United States is to develop conceptual models of which factors are effective and which are ineffective in improving reading skills in U.S. school systems (NCES, 1996). In order to develop these models, questionnaire items are often used as independent variables for predicting students' reading performance. Therefore, it is important to have complete data on the questionnaire items to facilitate these analyses and the development of factors. Further, the software used to analyze such hierarchical models typically requires complete responses at certain model levels (e.g., schools), thus necessitating imputation for missing data.

Reasons for item nonresponse. Item nonresponse to the questionnaire items occurred when a student who completed the reading performance test failed to complete an item on the student background questionnaire, or when a teacher or principal failed to complete an item on the questionnaires that they completed. Possible reasons for item nonresponse include lack of knowledge, inadvertent omissions, refusals, and edit failures. As discussed below, the level of item nonresponse was generally low, but there were some items that were not answered by 10 percent or more of the respondents. While time and resources were limited, it was necessary to impute for over 1,000 variables.

Edit imputation. Questionnaire items that were unanswered by respondents were reviewed, and efforts were made to locate the missing responses or to deduce the responses by means of logical edits. For example, for schools that failed to report the type of school or the communities they served, hard copies of the questionnaire form were retrieved to check the address of the school and to deduce the missing response. For items for which data are available from other sources, those data were used to replace the missing values. For example, for schools that failed to report enrollment, the enrollment was taken from the 1989 Quality Education Data (QED) file, a comprehensive database of schools in the United States that was used as the sampling frame of schools for this study. In some situations deductive editing was used to complete the responses for items for which unique responses could be deduced from responses to other items on the questionnaires.

Extent of item nonresponse. Imputation methods were used to handle item nonresponse in the background questionnaires after the data review and editing processes. The amount of missing data for each item was relatively small. Table 6 summarizes the extent of item nonresponse in each of the six datasets corresponding to the three questionnaires for each of the grade 4 and grade 9 samples. Items in each questionnaire are separated into three categories according to the amount of missing data: 5 percent or less, between 6 and 10 percent, and 11 percent or more. Generally, the percentage of items with 11 percent or more missing data is small, the exception being the grade 4 student questionnaire, for which 20 percent of the items had 11 percent or more missing data.


Table 6.  Percentage of items with different levels of missing data

  Percentage of questionnaire items with given level of missing data

                                          Grade 4      Grade 9
  Student Questionnaire
    5 percent or less                        54%          87%
    6-10 percent                             26%           5%
    11 percent or more                       20%           8%
    Total number of items                    134          241
  Teacher Questionnaire
    5 percent or less                        92%          84%
    6-10 percent                              5%          14%
    11 percent or more                        3%           2%
    Total number of items                    250          153
  School Questionnaire
    5 percent or less                        89%          87%
    6-10 percent                              4%          12%
    11 percent or more                        7%           1%
    Total number of items                    113          117

SOURCE: Methodological Issues in Comparative Educational Studies: The Case of the IEA Reading Literacy Study, NCES 94-469, Table 3-1.

Items with high nonresponse rates. There are three types of questionnaire items with high nonresponse rates. The first type comprises factual items that require a certain degree of effort in information retrieval. For example, teachers frequently omitted the number of education courses they had taken. This information should have been available to all teachers, provided that they were willing to make the effort to review their training records. The second type comprises items that may have been unclear to some respondents. For example, teachers were asked about the amount of time they spent teaching "ESOL" (English as a second language). Since the term ESOL was not defined in the questionnaire, some teachers may have been confused and therefore did not provide a response. The third type comprises items for which the response categories may have been inappropriate. When none of the choices was suitable, respondents may have decided to skip the question. For example, school principals were asked to rate their levels of satisfaction with various forms of student assessment. A principal may have omitted an item because that form of assessment was not used in the school.


3.3.2

Approach chosen

Hot-deck imputation. Item imputation for the IEA Reading Literacy Study mostly used a form of hot-deck imputation. Hot-deck imputation procedures preserve the distribution of the observed data and ensure that the imputed values are within the valid range. Since the data collected in the background questionnaires in the IEA Reading Literacy Study were mainly categorical (discrete choice) items, the hot-deck procedure is an efficient method to fill in the data gaps. Conducting the imputations within imputation classes has the advantage of preserving the relationships between the item being imputed and the auxiliary variables used to form the imputation classes. An assumption of the hot-deck approach is that, after controlling for the auxiliary variables, the distribution of responses for the nonrespondents is the same as that for the respondents.

Preserving relationships between items. Items strongly related to each other were imputed together, assigning values from the same respondent (donor). Thus the donor pool was restricted to respondents with full information on the set of related items. For example, because the educational levels of parents are correlated, father's and mother's education were imputed together when both parents were present in the household. For students with father's education reported but mother's education missing, father's education was used as an auxiliary variable along with race of student, school type (public, private), and community size in forming the imputation classes. Thus, donors for the students with missing mother's education were selected from within imputation classes where all students had the same level of father's education. Likewise, when mother's education was reported but father's education was missing, mother's education was used in forming the imputation classes. When the educational levels of both parents were missing, the imputation classes were formed using only race, school type, and community size. The imputation of the two variables was performed jointly, for each recipient taking both values from the same donor, and restricting the choice of donor to those students with both father's and mother's education reported. This method is referred to as the full-information common-donor hot deck in Section 3.2.2. By using a common donor for both variables, this simple imputation method preserves the multivariate structure of the data. A sketch of this step is given below.

Item-by-item imputation. The items in the questionnaires were imputed sequentially, following roughly the logical sequence of the questionnaires. The imputed values of some variables were used in the subsequent imputation of other variables. For example, for the student questionnaires, race and parents' education were imputed first. The imputed values of these variables were then used to classify students into imputation classes for subsequent imputations of other items. Similarly, for nested items, the filter items that led into skip patterns were always imputed first, and the responses to the items that followed were imputed to be consistent with the imputed responses to the filter items.
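The sketch below illustrates the full-information common-donor step for parents' education. The column names (race, school_type, community_size, fath_ed, moth_ed) are hypothetical, and the IEA production system was not written in Python; this is only an assumed-variable illustration of the logic described above.

```python
import pandas as pd

def impute_parent_education(students, rng):
    """Full-information common-donor hot deck for parents' education (sketch).

    Donors always have both education values reported, and both values are
    taken from the same donor for each recipient.
    """
    donors = students[students["fath_ed"].notna() & students["moth_ed"].notna()]
    base = ["race", "school_type", "community_size"]

    def fill(recipients, class_cols):
        for key, idx in recipients.groupby(class_cols).groups.items():
            pool = donors
            for col, val in zip(class_cols, key):
                pool = pool[pool[col] == val]
            if pool.empty:
                continue  # classes would be collapsed in practice
            picks = pool.sample(n=len(idx), replace=True, random_state=rng)
            students.loc[idx, ["fath_ed", "moth_ed"]] = picks[["fath_ed", "moth_ed"]].to_numpy()

    # Mother's education missing, father's reported: father's education joins the class keys.
    fill(students[students["moth_ed"].isna() & students["fath_ed"].notna()], base + ["fath_ed"])
    # Father's education missing, mother's reported: mother's education joins the class keys.
    fill(students[students["fath_ed"].isna() & students["moth_ed"].notna()], base + ["moth_ed"])
    # Both missing: classes from race, school type, and community size only.
    fill(students[students["fath_ed"].isna() & students["moth_ed"].isna()], base)
    return students

# Example call (hypothetical data frame): impute_parent_education(students_df,
# numpy.random.default_rng(7))
```

Because the exact-match class keys include the reported parent's education whenever it is available, taking both values from the same donor never overwrites a reported value.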

3.3.3

Strengths and weaknesses of approach

To examine the effect of the IEA imputation on multivariate relationships, an analysis was conducted comparing the imputation results with alternative methodologies. This analysis is described below. (For more details, see Winglee, Kalton, and Rust, 1999.) A study was conducted to provide an empirical assessment of the impact of imputation on data analyses. In particular, regression analyses and hierarchical linear modeling (HLM) analyses (Bryk and Raudenbush, 1992) were conducted on datasets that used different methods for handling missing data, and the results were compared. The impact of treating imputed data as real reported data in data analysis has been studied by several researchers (see, for example, Santos, 1981, and Wang, Sedransk, and Jinn, 1992). Regression models estimated from the data set completed by imputation were compared with the corresponding models estimated using three other methods of handling the missing data. These other methods were: the complete case (CC) analysis, where cases with missing values for any of the variables involved in the analysis are discarded (also known as the casewise deletion method); the available case (AC) analysis, where all the reported data are used to derive the sample means and variance-covariance matrix employed in the regression analysis (also called the pairwise deletion method); and a method that assumes that the data come from a multivariate normal distribution and estimates the parameters of this distribution by maximum likelihood using the EM (expectation-maximization) algorithm (Dempster, Laird, and Rubin, 1977). In addition, HLM analyses were estimated for the data set completed by imputation and using the complete case analysis.

The complete case (CC) analysis, which is the default procedure for handling missing data in most statistical packages, is widely used in practice. It is easy to implement but clearly inefficient in its use of data. In the regression models examined here, almost a third of the sampled students were discarded. The CC approach assumes that the complete cases are a random subsample of all cases (Little, 1993), an assumption that is often unjustified in practice. For this data set, there is clear evidence that the students with one or more missing values of the predictors in the regression models differ from those with complete data in terms of reading performance, race/ethnicity of the student, the type of community served by the school (urban, suburban, non-urban), the region of the country, and type of school (public, private). As a result, the regression analyses conducted for the complete cases are likely to have produced biased results. An example of this bias is the regression coefficient for "Father absent" in predicting narrative scores, shown in Table 7. While the other three methods all estimate the coefficient between about 18 and 20, the complete cases estimate is 31.6, over a 50 percent increase. This increase is greater than a standard error away from the other coefficients.

Table 7.  Unweighted regression coefficients predicting narrative scores

                             Hot-deck           EM           Available       Complete
                            imputation       algorithm         cases           cases
  Predictor variables        b     s.e.      b     s.e.      b     s.e.      b     s.e.
  Intercept               744.7   22.4    731.0   22.6    726.5   25.0    729.3   27.9
  Gender                   16.9    2.3     17.2    2.3     17.1    2.5     16.4    2.7
  Age                      -1.8    0.2     -1.8    0.2     -1.7    0.2     -1.6    0.2
  Minority                -36.0    2.7    -36.0    2.7    -36.3    2.9    -36.3    3.3
  Father ed. - H.S.         9.8    4.7      9.3    4.7      9.2    5.2      9.9    5.7
  Father ed. - < college   15.3    5.0     17.4    5.0     18.0    5.5     16.9    6.0
  Father ed. - college     23.3    4.6     25.4    4.6     25.7    5.1     28.2    5.6
  Father absent            18.1    7.5     20.2    7.2     18.7    8.0     31.6   10.3
  Mother ed. - H.S.        16.3    4.7     19.0    4.7     19.7    5.2     17.6    6.0
  Mother ed. - < college   17.5    4.9     19.2    5.0     19.4    5.4     16.5    6.2
  Mother ed. - college     18.7    4.7     21.0    4.8     21.0    5.2     17.1    6.0
  Family wealth index       9.3    1.2      8.6    1.3      8.5    1.4      7.2    1.5
  Family composition       20.2    2.4     21.2    2.4     21.3    2.7     19.4    2.9
  Extended family         -23.0    2.4    -24.1    2.4    -23.8    2.7    -27.6    2.9
  Foreign language         -8.0    2.7     -7.5    2.7     -7.7    3.0     -9.3    3.2
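To make the mechanical difference between the complete-case and available-case columns of Table 7 concrete, the following minimal sketch computes regression coefficients from the two kinds of covariance matrices. The variable names are placeholders; this is not the software used in the study.

```python
import numpy as np
import pandas as pd

def regression_from_cov(cov, means, y, xs):
    """OLS slopes and intercept computed from a covariance matrix and means."""
    beta = np.linalg.solve(cov.loc[xs, xs].to_numpy(), cov.loc[xs, y].to_numpy())
    intercept = means[y] - means[xs].to_numpy() @ beta
    return intercept, pd.Series(beta, index=xs)

def cc_and_ac_estimates(df, y, xs):
    """Complete-case (listwise) versus available-case (pairwise) regression."""
    cols = [y] + list(xs)
    cc = df[cols].dropna()                  # listwise deletion
    cc_fit = regression_from_cov(cc.cov(), cc.mean(), y, xs)
    ac_cov = df[cols].cov()                 # pandas cov() uses pairwise-complete pairs
    ac_means = df[cols].mean()              # per-variable available cases
    ac_fit = regression_from_cov(ac_cov, ac_means, y, xs)
    return cc_fit, ac_fit
```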

The three remaining approaches, namely the hot-deck (HD) imputation approach described above, the available case (AC) approach, and the EM algorithm, all yielded very similar results in the regression analyses conducted. The EM algorithm, which is available through ROBMLE (Little, 1988) and in the BMDP and GAUSS packages, has theoretical attractions (Little, 1992). This algorithm has been found to be superior to the CC and AC analyses even when the underlying normality assumptions are violated (Azen, Van Guilder, and Hill, 1989; Little, 1988). However, it is a computer-intensive procedure, and software for its use with a particular form of analysis may not be readily available.
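For readers without access to such packages, the following is a minimal numpy sketch of the EM algorithm for the mean and covariance of a multivariate normal distribution under ignorable missingness. It is a generic illustration, not the software used in the comparison above.

```python
import numpy as np

def em_mvnormal(X, n_iter=200, tol=1e-6):
    """EM estimates of the mean vector and covariance matrix of a multivariate
    normal distribution from an (n, p) array X with np.nan marking missing
    entries (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0)) + 1e-6 * np.eye(p)

    for _ in range(n_iter):
        sum_x = np.zeros(p)
        sum_xx = np.zeros((p, p))
        for row in X:
            obs = ~np.isnan(row)
            mis = ~obs
            x_hat = row.copy()
            cov_add = np.zeros((p, p))
            if mis.any():
                if obs.any():
                    s_oo = sigma[np.ix_(obs, obs)]
                    s_mo = sigma[np.ix_(mis, obs)]
                    # E-step: conditional mean and covariance of the missing block.
                    x_hat[mis] = mu[mis] + s_mo @ np.linalg.solve(s_oo, row[obs] - mu[obs])
                    cov_add[np.ix_(mis, mis)] = (
                        sigma[np.ix_(mis, mis)] - s_mo @ np.linalg.solve(s_oo, s_mo.T)
                    )
                else:
                    x_hat[:] = mu
                    cov_add[:] = sigma
            sum_x += x_hat
            sum_xx += np.outer(x_hat, x_hat) + cov_add
        # M-step: update the mean and covariance from the expected sufficient statistics.
        mu_new = sum_x / n
        sigma_new = sum_xx / n - np.outer(mu_new, mu_new)
        converged = np.max(np.abs(mu_new - mu)) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma
```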


As compared with the CC approach, the AC approach has the attraction of making fuller use of the available data. In a simulation study, Kim and Curry (1977) found the AC approach to be superior to the CC approach with weakly correlated data. A limitation of the AC approach is that it may produce a covariance matrix that is not positive definite, an outcome that poses problems for model estimation (yielding indeterminate slopes in a regression analysis). As Little (1992) notes, this limitation is severe when the independent variables in a regression model are highly correlated (although in this case an alternative model specification that avoids the multicollinearity may be preferred). The AC approach also yields inconsistent results across analyses, since different cases are used for each analysis. A further problem with the AC approach is that it is not available in all statistical packages, and it may not be available for particular forms of analysis. It was not applied with the HLM analysis here because of a lack of available software.

A limitation of imputation is that it can lead to an attenuation in covariances between some variables, thus distorting the results of multivariate analyses (Kalton and Kasprzyk, 1986). This attenuation does not occur between a variable subject to imputation and the variables used as auxiliary variables in the imputation of that variable (e.g., the variables used to form the imputation classes with hot-deck imputation), but it does occur between the variable subject to imputation and other variables. For this reason, it is important to employ, as auxiliary variables in the imputation scheme, major variables associated with a variable subject to imputation. However, even when a variable is not used as an auxiliary variable in the imputation, the attenuation of its covariance with the variable subject to imputation is small provided that the level of item nonresponse is low, as is the case in the IEA Reading Literacy Study. The regression coefficients in the analyses reported show no sign of such attenuation. It appears that the IEA Reading Literacy Study imputed data set can be safely analyzed without concern for an appreciable attenuation of covariances.

In conclusion, this study shows that, for the U.S. component of the IEA Reading Literacy Study, data analysis using the hot-deck imputed data yielded similar results to those produced by the available case and EM methods of handling the missing data. Since analysis with the hot-deck imputed data is the simplest to implement and yields consistent results for marginal means and totals across subgroup analyses, it appears to be the best option for most analyses of the IEA Reading Literacy Study data. It should, however, be noted that analysis of the IEA Reading Literacy Study data set is not constrained to the hot-deck approach. Since flags identifying the imputed values are provided in the data set, the imputed values can readily be deleted and an alternative approach for handling the missing data can then be employed.


4.

Variance Estimation

Another concern with imputation is its effect on the standard errors of survey estimates. In essence, the hot-deck imputation used in these studies duplicates some of the values from respondents to substitute for the missing data from nonrespondents. Therefore, treating the hot-deck imputed data set as complete responses is likely to overstate the precision of the survey estimates. One approach to variance estimation with an imputed data set is to employ multiple imputation, completing the data set several (say 3 to 5) times and estimating the overall variance of a survey estimate from a combination of the average within-data-set variance and the between-data-set variance components (Rubin, 1987). This approach was not adopted on any of these three surveys because of uncertainty about the utility of multiple imputation for unplanned analyses (Fay, 1996). Other approaches to variance estimation with imputed datasets are under development (Lee, Rancourt, and Särndal, Chapter XX of this book; Shao and Sitter, 1996; Fay, 1992; Rao and Shao, 1992; Tollefson and Fuller, 1992; and Montaquila and Jernigan, 1997), but they are not yet available for multiple-variable applications. As a result, there is no ideal solution currently available for multivariate variance estimation from large-scale databases. Recently, it has been suggested that for multivariate statistics a Shao-Sitter bootstrap (Shao and Sitter, 1996) might work or a variant of the All-Cases Imputation (ACI) method (Montaquila and Jernigan, 1997) could be used on these data sets, but such extensions have not yet been developed. A Shao-Sitter bootstrap would involve repeating this whole imputation process on a series of half samples. Regarding ACI, the concept has only been developed for univariate statistics.
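For reference, the multiple-imputation combining rules mentioned above (Rubin, 1987) take the following form; the code is a generic sketch with illustrative numbers, not software supplied with any of the three surveys.

```python
import numpy as np

def mi_combine(estimates, variances):
    """Rubin's (1987) combining rules for m multiply imputed data sets."""
    q = np.asarray(estimates, dtype=float)   # point estimate from each completed data set
    u = np.asarray(variances, dtype=float)   # its estimated variance in each data set
    m = len(q)
    q_bar = q.mean()                          # combined point estimate
    w = u.mean()                              # average within-imputation variance
    b = q.var(ddof=1)                         # between-imputation variance
    total = w + (1 + 1 / m) * b               # total variance of q_bar
    return q_bar, total

# Illustrative numbers only: m = 3 completed data sets.
print(mi_combine([10.2, 9.8, 10.5], [0.40, 0.38, 0.42]))
```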

5.

Recommendations for Future Large-Scale Imputation Efforts

Large complex datasets typically contain hundreds of variables on thousands of respondents. These datasets usually contain item nonresponse for nearly all variables. The missing responses in the questionnaire items can be handled in one of two ways: they can be filled in by some form of imputation, or they can be left as missing with missing data codes assigned in the data files. This chapter strongly argues in favor of imputing for nonresponse and assigning codes to the data identifying which values were not reported.


Carefully implemented imputations can reduce the risk of bias in survey estimates arising from missing data. Also, analyses can be conducted from the imputed data set making use of respondents who had partially reported data, increasing the power of analyses. Procedures are in development for properly estimating the accuracy of such estimates. Table 8 compares the hot-deck imputation procedures used in the three large-scale databases described in this chapter.

Table 8.  Comparison of imputation procedures for three large-scale databases

                                            NEHIS       MCBS        IEA
  Number of respondents/events             40,000     660,000      6,000
  Number of variables/models to impute        150          19      1,000+
  Hot-deck imputation methods
    Univariate                                  X                      X
    Common donor                                            X          X
    Cyclic                                      X           X

The methods described here all make use of forms of hot-deck imputation. These different forms were developed to better support multivariate analyses of public use datasets. Statisticians in future years will need to choose between methods of this type and Bayesian methods that have also been developed with a particular focus on improving multivariate analyses. We think that these less parametric methods deserve continued consideration. They are generally easier to implement than Bayesian methods, easier to explain to laymen, avoid dependence on prior distributions, and are flexible enough to handle non-standard distributions.

6.

References

Aigner, D.J., Goldberger, A.S., and Kalton, G. (1975). On the Explanatory Power of Dummy Variable Regressions. International Economic Review, 16, 2, 503-510.
Azen, S.P., Van Guilder, M., and Hill, M.A. (1989). Estimation of Parameters and Missing Values Under a Regression Model With Non-Normally Distributed and Non-Randomly Incomplete Data. Statistics in Medicine, 8, 217-228.
Brick, J.M. and Kalton, G. (1996). Handling Missing Data in Survey Research. Statistical Methods in Medical Research, 5, 215-238.
Bryk, A.S. and Raudenbush, S.W. (1992). Hierarchical Linear Models: Applications and Data Analysis Methods. Advanced Quantitative Techniques in the Social Science Series. Thousand Oaks, California: Sage Publications.
David, M.A., Little, R.J.A., Samuhel, M., and Triest, R. (1983). Imputation Models Based on the Propensity to Respond. Proceedings of the Business and Economic Statistics Section of the American Statistical Association, 168-173.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum Likelihood From Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Elley, W.B. (1992). How in the World do Students Read? The Hague: The International Association for the Evaluation of Educational Achievement.
England, A., Hubbell, K., Judkins, D., and Ryaboy, S. (1994). Imputation of Medical Cost and Payment Data. Proceedings of the Section on Survey Research Methods of the American Statistical Association, 406-411.
Fahimi, M., Judkins, D., Khare, M., and Ezzati-Rice, T.M. (1993). Serial Imputation of NHANES III with Mixed Regression and Hot-Deck Techniques. Proceedings of the Section on Survey Research Methods of the American Statistical Association, 292-296.
Fay, R.E. (1992). When are Inferences From Multiple Imputation Valid? Proceedings of the Section on Survey Research Methods, American Statistical Association, 227-232.
Fay, R.E. (1996). Alternative Paradigms for the Analysis of Imputed Survey Data. Journal of the American Statistical Association, 91, 490-498.
Judkins, D.R. (1997). Imputing for Swiss Cheese Patterns of Missing Data. Proceedings of Statistics Canada Symposium 97, New Directions in Surveys and Censuses, 143-148.
Judkins, D.R., Hubbell, K.A., and England, A.M. (1993). The Imputation of Compositional Data. Proceedings of the Section on Survey Research Methods of the American Statistical Association, 458-462.
Kalton, G. (1983). Compensating for Missing Survey Data. Research Report Series. Ann Arbor, Michigan: Institute for Social Research.
Kalton, G. and Kasprzyk, D. (1986). The Treatment of Missing Survey Data. Survey Methodology, 12, 1-16.
Kim, J.O. and Curry, J. (1977). Treatment of Missing Data in Multivariate Analysis. Sociological Methods and Research, 6, 215-240.
Lee, H., Rancourt, E., and Särndal, C. (1991). Experiments With Variance Estimation From Survey Data With Imputed Values. Proceedings of the Seventh Annual Census Bureau Research Conference, 483-499.
Little, R.J.A. (1988). ROBMLE User Notes. Unpublished manuscript.
Little, R.J.A. (1992). Regression With Missing X's: A Review. Journal of the American Statistical Association, 87, 1227-1237.
Little, R.J.A. (1993). Pattern-Mixture Models for Multivariate Incomplete Data. Journal of the American Statistical Association, 88, 125-134.
Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: John Wiley.
Montaquila, J. and Jernigan, R. (1997). Variance Estimation in the Presence of Imputed Data. Proceedings of the Section on Survey Research Methods, American Statistical Association, 273-278.
National Center for Education Statistics (1996). Reading Literacy in the United States: Findings from the IEA Reading Literacy Study. Washington, DC.
Nordholt, E.S. (1998). Imputation: Methods, Simulation Experiments, and Practical Examples. International Statistical Review, 66, 2, 157-180.
Rao, J.N.K. and Shao, J. (1992). Jackknife Variance Estimation With Survey Data Under Hot Deck Imputation. Biometrika, 79, 811-822.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley.
Rubin, D.B. (1996). Multiple Imputation after 18+ Years. Journal of the American Statistical Association, 91, 473-489.
Santos, R.L. (1981). Effects of Imputation on Regression Coefficients. Proceedings of the Section on Survey Research Methods, American Statistical Association, 140-145.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman and Hall.
Shao, J. and Sitter, R.R. (1996). Bootstrap for Imputed Survey Data. Journal of the American Statistical Association, 91, 1278-1288.
Tollefson, M. and Fuller, W.A. (1992). Variance Estimation for Samples With Random Imputation. Proceedings of the Business and Economic Statistics Section, American Statistical Association, 758-763.
Wang, R., Sedransk, J., and Jinn, J.H. (1992). Secondary Data Analysis When There are Missing Observations. Journal of the American Statistical Association, 87, 952-961.
Winglee, M., Kalton, G., and Rust, K. (1999). Handling Item Nonresponse in the U.S. Component of the IEA Reading Literacy Study. Journal of Educational and Behavioral Statistics (to appear).
Winglee, M., Ryaboy, S., and Judkins, D. (1993). Imputation for the Income and Assets Module of the Medicare Current Beneficiary Survey. Proceedings of the Section on Survey Research Methods of the American Statistical Association, 463-467.
Yansaneh, I.S., Wallace, L., and Marker, D. (1998). Imputation Methods for Large Complex Datasets: An Application to the NEHIS. Proceedings of the Section on Survey Research Methods, American Statistical Association, 314-319.
