Quality & Quantity 34: 331–351, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.
Imputation of Missing Item Responses: Some Simple Techniques

MARK HUISMAN
Department of Statistics, Measurement Theory, & Information Technology, FPPSW, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands, e-mail: [email protected]
Abstract. Among the wide variety of procedures to handle missing data, imputing the missing values is a popular strategy to deal with missing item responses. In this paper some simple and easily implemented imputation techniques, like item and person mean substitution and some hot-deck procedures, are investigated. A simulation study was performed based on responses to items forming a scale to measure a latent trait of the respondents. The effects of different imputation procedures on the estimation of the latent ability of the respondents were investigated, as well as the effect on the estimation of Cronbach's alpha (indicating the reliability of the test) and Loevinger's H-coefficient (indicating scalability). The results indicate that procedures which use the relationships between items perform best, although they tend to overestimate the scale quality.

Key words: missing data, mean imputation, hot-deck imputation, item response theory, simulation.
1. Introduction

Among the wide variety of procedures to handle missing data, imputation is a popular strategy to deal with missing item responses. With imputation procedures, estimates of the missing values are obtained which replace the blanks in the data set. The completed data set can be analyzed with all techniques usually applied to complete data. Imputing missing values, however, is not without danger. Dempster and Rubin (1983) state: "The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be legitimately handled in this way and situations where standard estimators applied to the real and imputed data have substantial biases" (p. 8). The danger comes from the possible difference between responders and nonresponders. When this difference is systematic, the results of analyses may be biased and false conclusions are easily drawn. Despite the dangers, imputation is a popular technique, because it allows the researcher to use standard complete-data methods of analysis on the filled-in data. However, naive imputations may be worse than doing nothing, so care is needed (see Little, 1988).
In this paper the performance of some imputation methods is investigated. These techniques are simple and relatively easy to implement, and are used to handle missing responses to some kind of test or scale. Such tests are often used in the behavioral sciences to measure a latent trait of individuals (like emotional well-being or some aspect of a person's personality). The occurrence of missing responses in test data is discussed in Section 2. A simulation study was performed in which missing data were created in test data. The missing values were imputed, and the performance of the imputation techniques is investigated by comparing the effects of imputation on the estimation of population parameters. In this paper, the parameters of interest are the scale score and two measures of internal consistency of scales: Cronbach's alpha (the classical concept of internal consistency, see, e.g., Lord and Novick, 1968) and the Loevinger H-coefficient used in Mokken scaling (Mokken, 1997). In Section 3, the investigated imputation methods are presented, and Section 4 describes the design of the simulation study. The results are presented in Section 5 and are discussed in the last section.
2. Item Responses and Missing Data

In the behavioral sciences, inferences about a latent property of individuals are usually made by analyzing the responses of these individuals to a set of items forming a test. Each single item does not cover all aspects of the latent trait, and a person's position on the trait can only be inferred indirectly by investigating the responses to all items of the test. Measurement models are used to estimate the location of a person on the trait, but the (simple) weighted sum of the item responses is also often considered a good estimate. Measurement of latent abilities is not an easy task, and the precision with which the latent properties are measured is a topic of concern. The quality of measurement can be defined in various ways. For instance, the classical concept of reliability is defined as the degree of true (latent) score variation relative to observed score variation, and can be assessed with Cronbach's alpha. This is a measure of the internal consistency of the scale (Lord and Novick, 1968), and is a lower bound of the reliability of a test. Another aspect of scale quality is the scalability of the test items. Loevinger's H-coefficient, which is used in, e.g., Mokken scaling, is a measure of scalability (see Mokken, 1997). It equals the ratio of the covariance between the items and their maximum covariance given the marginal distributions of the items. In the presence of item nonresponse, however, the measurement task is much harder and the quality of measurement can be seriously affected. People unwilling to respond to (certain) items cause the data to be incomplete, and standard measurement techniques are more difficult to use. In the treatment of nonresponse, the nature of the missing data mechanism plays an important role. The mechanism is called ignorable if there are no systematic differences between respondents and
nonrespondents (Missing Completely at Random, MCAR; Rubin, 1976), or if these systematic differences only depend on completely observed variables (Missing at Random, MAR) and the parameters of the missing data mechanism are unrelated to those of the model for the data. In this case, an imputation technique performs well if good estimates of the missing values are obtained, and biased results are only caused by the inability of the procedure to preserve the relationships among items. On the other hand, if the missing data mechanism is nonignorable, unbiased estimates of parameters can only be obtained by including a suitable model for the missingness process (Greenlees et al., 1982). Such a model, however, may be difficult to find. In this case covariate information can be useful, and analysis under the less stringent MAR assumption may yield reasonable results. The data considered in this paper consist of responses to test items $X = [x_{vi}]$ ($v = 1, \ldots, n$ respondents; $i = 1, \ldots, k$ items) and covariates $Z = [z_{vh}]$ ($h = 1, \ldots, q$). All items have a fixed number of ordered response options, and the weighted sum of the item responses (scale score) is used as an estimate of latent ability: $r_v = \sum_{i=1}^{k} w_i x_{vi}$, with $w_i$ the weight of item $i$. In the sequel it is assumed that the covariates are completely observed and the missingness only occurs in the item responses.
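To make the quantities above concrete, the scale score, Cronbach's alpha, and Loevinger's H for a complete data matrix can be sketched as follows. This is a minimal illustration, not code from the paper; function names are my own, and the computation of the maximum covariance via sorting (the comonotonic arrangement) is an assumption that follows from the rearrangement inequality.

```python
# Sketch: scale score, Cronbach's alpha, and Loevinger's H for a
# complete item-response matrix X (rows = respondents, columns = items).
import numpy as np

def scale_scores(X, w=None):
    """Weighted sum r_v = sum_i w_i * x_vi (unit weights by default)."""
    w = np.ones(X.shape[1]) if w is None else np.asarray(w, dtype=float)
    return X @ w

def cronbach_alpha(X):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = X.shape[1]
    item_var = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def loevinger_H(X):
    """Sum of observed interitem covariances divided by the sum of their
    maxima given the item marginals (attained by sorting both items)."""
    k = X.shape[1]
    num = den = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            num += np.cov(X[:, i], X[:, j], ddof=1)[0, 1]
            den += np.cov(np.sort(X[:, i]), np.sort(X[:, j]), ddof=1)[0, 1]
    return num / den
```

For two perfectly parallel items, both alpha and H equal 1, which is a quick sanity check on the implementation.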
3. Imputation Techniques Imputing missing values means that predictions are made of the unknown values of items. Sande (1982) discussed the problems an imputer is faced with, and concluded that a procedure is needed that: (1) will impute plausibly and consistently with the edits, (2) will reduce the bias and preserve the relationship between the items as far as possible, (3) will work for (almost) any pattern of missing items, (4) can be set up ahead of time, and (5) can be evaluated in terms of impact on the bias and precision of the estimates. She states that “particular techniques of imputation vary in their ability to meet these requirements” (p. 147). Different categorizations of imputation techniques can be distinguished (cf. Schulte Nordholt, 1998). First, there are deterministic versus stochastic techniques. The values imputed by deterministic methods are uniquely determined and result in identical estimates when repeated. However, these methods tend to overestimate precision by underestimating variances. Stochastic imputation methods, on the other hand, use some kind of randomization process to impute missing values, and therefore reduce the bias in estimating variances/covariances. Refinements of stochastic methods are multiple imputations (Rubin, 1987). A second distinction is that between naive and more principled approaches. Naive methods are quick options, mainly based on analyzing complete cases (listwise or pairwise), but also the imputation of an unconditional mean is considered a naive approach. It leads to biased parameter estimates even when the data are randomly missing. Little and Rubin (1987, Chap. 3) show that the obvious corrections for
this bias lead to the same estimates found with available-case procedures. More principled approaches use models for both the observed and missing data on which the imputations are based. Finally, there is the distinction between imputation based on explicit models and implicit models. Little and Schenker (1995) define explicit models as models which are usually discussed in mathematical statistics, for instance, normal linear regression models (see also Schafer, 1997). Implicit models are models which underlie procedures for fixing up data structures in practice and often have a nonparametric flavor. Examples are hot-deck procedures, in which missing values are imputed with donor cases from the set of completely observed cases. In this paper the performance of nine imputation methods, which mainly belong to the group of naive, implicit, and deterministic methods, is investigated. The first two in particular are not recommended and will only be used as naive benchmarks; the other methods are expected to perform at least as well as these two. All methods impute values which are in accordance with the edits, except the mean imputation methods, which yield noninteger values. These values are rounded to the nearest integer to create an imputed data set which contains adequate scores that are consistent with the edits. The completed data sets can then be analyzed with standard complete-data software for scale analysis. The following simple imputation techniques are investigated:

Naive imputation methods. The benchmark methods are random draw substitution (RDS) and incorrect answer substitution (IAS). RDS replaces a missing value with a random draw from the permitted response options. IAS imputes the incorrect answer (in a test for which the items are of the correct/incorrect type) or the answer which is socially most undesirable (worst-case scenario, for attitude items).

Mean imputation.
When scale data are considered there are several possibilities to impute a mean. The first method is item mean substitution (IMS), where the item mean of the observed cases is imputed for every missing value of a particular item. A second method is person mean substitution (PMS). Here the mean scale score over the observed items is used to impute missing values of a person. The third mean substitution method investigated here is corrected item mean substitution (CIM):

$$\mathrm{CIM}_{vi} = w_v\,\bar{x}^{(i)}_{\cdot i} = \left(\frac{\sum_{h\in \mathrm{obs}(v)} x_{vh}}{\sum_{h\in \mathrm{obs}(v)} \bar{x}^{(h)}_{\cdot h}}\right)\bar{x}^{(i)}_{\cdot i} = \left(\frac{\mathrm{PMS}_v}{\frac{1}{k^{(v)}}\sum_{h\in \mathrm{obs}(v)} \mathrm{IMS}_h}\right)\mathrm{IMS}_i,$$

where $\bar{x}^{(i)}_{\cdot i}$ is the mean score on item $i$ for the nonmissing cases, $\mathrm{obs}(v)$ is the collection of observed items, and $k^{(v)}$ the number of observed items for person $v$. Where IMS only depends on the scores of all persons on a particular item, and PMS only looks at one person and depends on this person's score on all items, CIM replaces missing values by the item mean which is corrected for the 'ability'
of the respondent, i.e., the score on the observed items of the respondent compared with the mean score on these items.

Item substitution. Instead of imputing the mean scale score of a person, an observed response of this person to another item in the scale can be used to substitute a missing value. The method of item correlation substitution (ICS) replaces a missing value by the observed response on the item which has the highest correlation with the missing item. If this donor item itself has a missing value for the particular person, the value of the item with the second highest correlation will be imputed, and so on. The correlation matrix is based on the complete cases. This can result in biased estimates of the correlations (when the percentage of missing values is large or the data are nonrandomly missing), but only the order of the correlations is used here, which probably will not be affected too much even if there is a considerable amount of nonrandomly missing data.

Hot-deck imputation. Hot-deck imputation techniques use a completely observed donor case for the imputation of an incomplete case (see, e.g., Sande, 1983). The missing values are replaced by the corresponding values of the donor case. The different ways of finding a donor case define the different hot-deck procedures. The hot-deck next case (HNC) method uses the first complete case after the incomplete case as donor case and is also known as sequential hot-deck (see also Schulte Nordholt, 1998). In hot-deck nearest neighbor methods a donor is found by minimizing some distance function. Here a nearest neighbor is defined as a complete case with a response pattern similar to that of the incomplete case, and is found by minimizing

$$d_{v,v'} = \sum_{i\in \mathrm{obs}(v)} (x_{vi} - x_{v'i})^2,$$

where $v$ is the incomplete and $v'$ a complete case. Two variants of this method are investigated. The deterministic method (HDD) uses the complete case for which the distance function is minimized. When several complete cases are at the same minimal distance from the currently considered incomplete case, the complete case which is nearest to the incomplete case with respect to its place in the data matrix is used as donor case. The random method (HDR) selects several complete cases for which the distance function is small (minimum and near minimum). From these possible donors, one is randomly drawn and used for imputation.

4. Simulation Design

This section describes the design of the simulation study to investigate the performance of the imputation techniques for missing item responses. Four independent factors were used: data (d), sample size (n), missing data mechanism (m), and
Table I. Data sets serving as basis for the data used in the simulation (scale, size of original data set (N), number of items (k), number of response options (Cat.), average interitem correlation (Corr.), Cronbach's alpha, and Loevinger's H)

       Data        N      k    Cat.   Corr.   alpha   H
  d1   FFPI(E)     1295   20   1–5    0.39    0.93    0.43
  d2   RAND(PF)    2456   10   0–2    0.61    0.94    0.76
  d3   RAND(MH)    2517    5   0–5    0.58    0.85    0.65
  d4   NHPR(SI)     955    5   0–1    0.41    0.74    0.60
proportion missing values (p). With these factors incomplete data matrices are generated, which are then imputed with the nine imputation techniques.

4.1. INDEPENDENT FACTORS
Data (d). The data used in this study come from four empirical data sets from the behavioral sciences. Data encountered in actual field research are realistic and are therefore expected to provide a better picture of the accuracy and effectiveness of the imputation techniques than simulations based on data generated from some (theoretical) distribution (see also Kromrey and Hines, 1994). The original data consist of scales with different numbers of items and response options, and a large number of cases. The four data sets are presented in Table I.

FFPI(E). A subscale of the Five-Factor Personality Inventory to assess the factor Extraversion, one of the 'big five' personality traits. The scale score is computed by subtracting the sum of the last 10 items from the sum of the first 10 items. The range of the scale score is between −40 and 40; low values indicate that the respondent is (more) introverted, high scores are found for (more) extraverted persons (Hendriks, 1997).

RAND(PF). A subscale of the RAND-36 Item Health Survey measuring Physical Functioning of respondents. The scale score is computed by summing the item scores and multiplying this sum by 5 to obtain a scale score between 0 and 100. A high score indicates a more favorable functional state (Van der Zee and Sanderman, 1993).

RAND(MH). Also a subscale of the RAND-36, measuring Mental Health of persons. A scale score is computed by multiplying the sum of the items by 4 to obtain a score between 0 and 100, where a high score indicates a more favorable mental health state (Van der Zee and Sanderman, 1993).

NHPR(SI). The recoded subscale of the Nottingham Health Profile measuring Social Isolation. To obtain a scale score the individual item scores are multiplied by unique weights (see also Table III). The resulting score is between 0 and 100, low values indicating that the respondent feels socially isolated (Hunt et
al., 1993). The items are recoded to equalize the order of the scale scores in all scales: high scores indicating more favorable states (implicitly assuming that being extraverted is more favorable than being introverted).

Sample size (n). Data matrices consisting of n = 100, 200, and 400 cases are randomly drawn from the original sets.

Missing data mechanism (m). The missing values were created with three different mechanisms for missingness, labeled MCAR, NRX, and NRXZ.

(m1) MCAR – Missing Completely at Random. A row and column of the data matrix are drawn at random and the corresponding entry is deleted, until a fraction p of all entries is missing.

(m2) NRX – NonRandomness depending on X. The probability of response of person $v$ on item $i$ is computed as a logistic function of the scale score $r_v$ and the mean item score $\bar{x}_{\cdot i}$ (see, e.g., Greenlees et al., 1982):

$$P(M_{vi} = 0 \mid r_v, \bar{x}_{\cdot i}) = \frac{\exp(\alpha_0 + \alpha_1 r_v + \delta \bar{x}_{\cdot i})}{1 + \exp(\alpha_0 + \alpha_1 r_v + \delta \bar{x}_{\cdot i})},$$

where $M_{vi}$ is the missing data indicator ($M_{vi} = 0$ indicating an observed and $M_{vi} = 1$ a missing response) and $\alpha_0$, $\alpha_1$, and $\delta$ are scalar parameters (see Table II).

(m3) NRXZ – NonRandomness depending on X and Z. The probability of response of person $v$ on item $i$ is a logistic function of the scale score $r_v$, the mean item score $\bar{x}_{\cdot i}$, and the covariates sex, $z_{v1}$, and age, $z_{v2}$:

$$P(M_{vi} = 0 \mid r_v, z_{v1}, z_{v2}, \bar{x}_{\cdot i}) = \frac{\exp(\alpha_0 + \alpha_1 r_v + \gamma_1 z_{v1} + \gamma_2 z_{v2} + \delta \bar{x}_{\cdot i})}{1 + \exp(\alpha_0 + \alpha_1 r_v + \gamma_1 z_{v1} + \gamma_2 z_{v2} + \delta \bar{x}_{\cdot i})},$$
where $\alpha_0$, $\alpha_1$, $\gamma_1$, $\gamma_2$, and $\delta$ are scalar parameters (see Table II). The scale scores, mean item scores, and the covariates are standardized, and the parameters of the functions are fixed. The values of $\gamma_1$ and $\gamma_2$ are fixed at −1, indicating small response probabilities for women and older persons. The values of $\alpha_1$ and $\delta$ are fixed at 1, such that for persons with a low scale score and for items with a low mean value (indicating a 'bad performance' or, consequently, a 'difficult item') the probability of response is small. Such persons and items are often found to have missing responses (see, e.g., Huisman, 1998). The value of $\alpha_0$ is set at the values given in Table II to create the necessary fractions of missing data. An observation is classified as missing ($M_{vi} = 1$) if $P(M_{vi} = 0 \mid r_v, z_{v1}, z_{v2}, \bar{x}_{\cdot i}) \leq U_{vi}$, where $U_{vi}$ is randomly drawn from a uniform distribution on the interval [0, 1]. This means that person $v$ has a missing value for item $i$ if the probability of response is small, i.e., smaller than a randomly drawn number.

Table II. Values of the parameter $\alpha_0$ of the missing data mechanisms for different values of p and different data sets

          FFPI(E)             RAND(PF)            RAND(MH)            NHPR(SI)
  p:      0.05  0.12  0.20    0.05  0.12  0.20    0.05  0.12  0.20    0.05  0.12  0.20
  NRX     3.75  2.75  1.75    3.75  2.75  1.75    3.50  2.50  1.75    3.50  2.50  1.75
  NRXZ    4.50  3.25  2.25    4.75  3.25  1.75    4.25  3.75  2.25    4.50  3.00  1.75

Proportion missing values (p). Three levels of the proportion of missing data were used: p = 0.05, 0.12, and 0.20. These proportions indicate the percentage of missing cells in the data matrix. The different proportions were achieved by adapting the $\alpha_0$ parameter in the missing data mechanisms; the values can be found in Table II. Due to the random character of the mechanisms, the actual proportion of missing data may be somewhat smaller or larger than p. This might impair a fair comparison of the results. These possible differences are therefore resolved by randomly creating extra missing data, or by randomly removing some missing cells and re-inserting the originally observed score.

4.2. PERFORMANCE OF THE IMPUTATION TECHNIQUES
An imputation technique performs well if it is able to obtain unbiased estimates of missing values. More important, however, is its ability to preserve the relationships among items and to reduce the bias caused by the missing data (Sande, 1982). Kromrey and Hines (1994) argue that the effectiveness of an imputation technique must be evaluated against a criterion which is commonly used in applied research, such as regression coefficients. In the case of a test or scale measuring persons, the (latent) ability of the respondents is the topic of interest. This ability is estimated by (some transformation of) the scale score. Therefore, the performance of an imputation technique is investigated by comparing the scale scores after imputation with the original scores before data points were deleted. To judge the ability of the imputation techniques to preserve the relationships among the items, the level of comparison is changed from person level to scale level. This means that instead of the relation between every item pair, an overall measure of the relations between all items will be investigated. The two measures to assess the quality of the scale introduced in Section 2, Cronbach's alpha and Loevinger's H-coefficient, will be computed before and after deletion of data points and imputation, to assess the effect of the imputation techniques on the scale quality.
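As an illustration of the deletion step in the design of Section 4.1, the NRX mechanism could be simulated roughly as follows. This is a hypothetical sketch, not the author's code: the function name, the default parameter values (loosely echoing Table II), and the use of NaN to code missingness are all my assumptions.

```python
# Sketch of the NRX deletion step: a response x_vi is deleted when the
# logistic response probability, driven by the standardized scale score
# r_v and standardized item mean, falls below a uniform draw.
import numpy as np

def delete_nrx(X, alpha0=3.5, alpha1=1.0, delta=1.0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    X = X.astype(float).copy()
    r = X.sum(axis=1)
    r = (r - r.mean()) / r.std()          # standardized scale scores
    m = X.mean(axis=0)
    m = (m - m.mean()) / m.std()          # standardized item means
    lin = alpha0 + alpha1 * r[:, None] + delta * m[None, :]
    p_resp = 1.0 / (1.0 + np.exp(-lin))   # P(M_vi = 0 | r_v, item mean)
    X[p_resp <= rng.uniform(size=X.shape)] = np.nan
    return X
```

With $\alpha_1 = \delta = 1$, respondents with low scale scores and items with low means receive the smallest response probabilities, as in the paper's design.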
Person level
Let $r_v$ be the scale score of person $v$ in the original complete data before deletion and imputation of entries. Let $r_v(t)$ be the scale score of person $v$ from the $t$-th imputed data matrix (t = IAS, RDS, IMS, PMS, CIM, ICS, HDR, HDD, and HNC). The distribution of the deviation $d_v(t) = r_v(t) - r_v$ for all cases $v$ with incomplete data is used to judge the performance of the imputation techniques. The distribution of the $d_v(t)$'s is summarized by the mean $\bar{d}(t)$, the standard deviation $s_d(t)$, and the root mean-squared deviation

$$\mathrm{RMSD}(t) = \left(\frac{1}{n_{\mathrm{mis}}} \sum_{v=1}^{n_{\mathrm{mis}}} d_v(t)^2\right)^{1/2} = \left(\frac{1}{n_{\mathrm{mis}}} \sum_{v=1}^{n_{\mathrm{mis}}} \Bigl(\sum_{i\in \mathrm{mis}(v)} w_i\,(x_{vi}(t) - x_{vi})\Bigr)^2\right)^{1/2},$$

where $x_{vi}(t)$ is the imputed value for person $v$ and item $i$, $\mathrm{mis}(v)$ is the collection of missing item responses for person $v$, and $n_{\mathrm{mis}}$ is the number of persons with missing data in the data matrix. It follows that RMSD(t) is only based on the incomplete cases. This means that it will not automatically increase with an increasing fraction of missing values. In this way the RMSD(t) only reports the extra deviation on top of the deviations due to an increasing number of incomplete cases. The same holds for the mean $\bar{d}(t)$, which serves to investigate any systematic under- or overestimation of $r_v(t)$. The RMSD(t) gives an indication of the general bias of method $t$. In order to make the four different scales comparable with respect to deviations in the scale scores, the scale scores of the data sets must have the same range. Therefore, the scores are all transformed to new scores with range 0–100. For this purpose all items are given a weight $w_i$. As was mentioned earlier, for the scale scores of the RAND(PF), RAND(MH), and NHPR(SI) the transformation to the range 0–100 is instructed by the literature. The scores of the FFPI(E) are the only ones with a different range.
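The person-level criteria can be sketched as a short routine; this is an illustrative implementation under my own naming assumptions (a boolean mask marks the deleted-and-imputed cells), not code from the paper.

```python
# Sketch: given the original and imputed matrices, the missingness mask,
# and item weights, compute the mean and standard deviation of d_v(t)
# and the RMSD(t), using only the incomplete cases.
import numpy as np

def person_level_criteria(X_orig, X_imp, missing, w):
    """missing: boolean matrix, True where a value was deleted/imputed."""
    incomplete = missing.any(axis=1)       # only incomplete cases count
    r_orig = X_orig @ w                    # original scale scores r_v
    r_imp = X_imp @ w                      # scale scores r_v(t)
    d = (r_imp - r_orig)[incomplete]       # deviations d_v(t)
    rmsd = np.sqrt(np.mean(d ** 2))
    return d.mean(), d.std(ddof=1), rmsd
```

Because only the imputed cells differ between the two matrices, $d_v(t)$ automatically reduces to the inner weighted sum over $\mathrm{mis}(v)$ in the RMSD formula above.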
Except for the NHPR(SI), the weights used for the transformation of the scores equal $w_i = 100/(k \times \max)$, where $k$ is the number of items and max is the difference between the largest and the smallest response option. The latter also equals the maximum error caused by an imputation technique for a particular item in one of the four scales. For instance, the maximum error that can be made when imputing an item of the FFPI(E) is 4, which occurs when the original score was 1 (smallest response option) and the imputed value is 5 (largest option), or vice versa. Note that these weights are equal for all items. The NHPR(SI), on the other hand, has different weights for every item (see Hunt et al., 1993). Still, the average of these weights equals $100/(k \times \max)$. The values of the weights in the four scales can be found in Table III.

Table III. Weights of the items in the four scales. The weights of the RAND(PF), RAND(MH), and NHPR(SI) are those found in the literature

  Data          k    max   $w_i = 100/(k \times \max)$
  FFPI(E)^a     20   4     1.25 (i = 1, ..., 10), −1.25 (i = 11, ..., 20)
  RAND(PF)      10   2     5.00 (for all i)
  RAND(MH)       5   5     4.00 (for all i)
  NHPR(SI)^b     5   1     22.01, 19.36, 20.13, 22.53, 15.97

  ^a $r_v + 40$ was used for the transformation.
  ^b The average $w_i$ is 20, which equals $100/(k \times \max)$.

Weighting the items in a scale results in a change in the interpretation of the distribution of $d_v(t)$, and therefore of the RMSD(t). Before weighting, incorrectly imputing an item with a difference of one response option (e.g., imputing a 3 instead of a 2) resulted in a change of the RMSD(t) of $(1/n_{\mathrm{mis}})^{1/2}$. After transforming the scores, the same incorrect imputation results in different changes in the RMSD(t) for the different scales due to the item weights (when $w_i$ is the same for all items, the change in RMSD(t) is $w\,(1/n_{\mathrm{mis}})^{1/2}$). This change reflects that: (1) incorrectly imputing an item in a short scale (k small) causes more bias in $r_v(t)$ than in a long scale, and (2) incorrectly imputing an item with few response options (max small) causes more bias in $r_v(t)$ than one with many response options. Incorrectly imputing one item of the FFPI(E), for instance, causes a change in $r_v(t)$ of 1.25 when there is a difference of one response option between imputations. The same error in the NHPR(SI), however, causes on average a change in $r_v$ of 20.

Scale level
For the evaluation of the imputation methods at the scale level, the deviations of Cronbach's alpha and Loevinger's H after imputation from the values obtained before deleting data points are used as criteria: $d_a(t) = \mathrm{alpha}(t) - \mathrm{alpha}$ and $d_H(t) = H(t) - H$, for every data set created in the simulation. Both measures depend heavily on the covariance matrix of the items, which means that bias in estimating variances and covariances caused by imputation will lead to biases in alpha and H.

5. Results

5.1. PERSON LEVEL: RECOVERING $r_v$
The simulation design is a complete factorial one, resulting in 108 cells (d × n × m × p), in each of which 100 data matrices with missing data were created and repeatedly imputed with the nine imputation methods. Because of the large sample size in each cell, effect sizes will be judged on relevance rather than on significance.

Main effects
Table IV presents the average values of the RMSD(t) of the imputation techniques for all factors of the design. In the table the techniques (columns) are ordered such that on the outermost left and right sides the methods whose performance is worst
Table IV. Main effects of RMSD(t) for the imputation techniques across all factors

              IAS     RDS     IMS     PMS     CIM     ICS     HDR     HDD     HNC
  All         17.07   11.96   11.17    8.37    7.17    8.05   10.20    8.82   12.96
  FFPI(E)     10.41    4.51    4.23    2.57    2.26    3.26    5.09    3.57    5.39
  RAND(PF)    11.04   11.52   11.27    7.40    5.62    5.94    8.90    7.21   15.19
  RAND(MH)    22.02   12.66   10.57    8.29    6.46    8.36    9.25    8.26   12.30
  NHPR(SI)    24.83   19.14   18.61   15.23   14.33   14.63   17.57   16.26   18.96
  n = 100     17.08   11.89   11.09    8.28    7.09    8.16   10.27    9.02   12.85
  n = 200     17.05   11.98   11.18    8.39    7.18    8.07   10.20    8.79   13.01
  n = 400     17.08   12.01   11.23    8.44    7.23    7.91   10.14    8.66   13.03
  MCAR        18.07   10.87    6.11    5.13    4.58    5.37    5.92    5.44    7.97
  NRX         14.43   11.52   12.89    8.98    7.68    8.75   11.96   10.53   14.63
  NRXZ        18.72   13.49   14.51   11.01    9.25   10.02   12.73   10.50   16.29
  p = 0.05    12.59   10.52   10.34    7.26    6.19    6.79    8.45    7.31   11.48
  p = 0.12    16.86   11.85   11.19    8.45    7.25    8.01   10.21    8.79   13.00
  p = 0.20    21.77   13.51   11.98    9.40    8.06    9.34   11.96   10.36   14.41
can be found, while in the middle the best techniques according to the RMSD(t) are located. From Table IV it follows that across all factors CIM is the best technique. For each independent variable separately, CIM also performs best, closely followed by ICS, PMS, and HDD in varying order. The benchmark methods (IAS, RDS, and HNC) perform worst, as was expected. Imputing an item mean (IMS) is on average always worse than imputing a person mean (PMS), although the RMSD(PMS) increases more rapidly than that of IMS when the number of missing values increases. In many cases IMS is almost as bad as RDS or, to a lesser degree, HNC, and sometimes even worse, especially for nonrandomly missing data. The same general picture emerges when looking at $\bar{d}(t)$ and $s_d(t)$. These criteria are presented in Figure 1 for the four best performing techniques of Table IV: PMS, CIM, ICS, and HDD. Figure 1 shows positive mean deviations, indicating that the imputed values are larger than the original values. Only when the missing data mechanism is ignorable are the deviations small and close to zero, and the methods IAS and RDS (not reported here) even result in negative deviations. The latter method proves to be at least as good as PMS and HDD, and in case of serious nonrandomness even as good as CIM, according to $\bar{d}(t)$. When looking at the effect of the four factors the following can be said. First, the differences between the scales show the influence of the item weights $w_i$ on the deviations in the scale scores. This reflects the 'punishment' of incorrectly
Figure 1. Mean and standard deviation of dv (t) for the four best imputation techniques according to RMSD(t): PMS, CIM, ICS, and HDD (standard deviations printed within circles).
imputing a short scale (d3 and d4), or a scale with few response categories (d2 and d4). Second, sample size has a marginal effect on the performance of an imputation technique and is only relevant for the hot-deck methods. Third, the higher the percentage of item nonresponse, the more mistakes the imputation methods make. And finally, the more complicated the missing data mechanism, the more values are incorrectly imputed. IAS is the only method for which the performance does not become worse when the mechanism is nonrandom (m2), because the values deleted by this mechanism are related to low abilities and ‘difficult’ items, which are often correctly imputed by IAS. Interaction effects Figures 2 and 3 present the results of the three-way interactions d × m × p, based on the criterion RMSD(t) (because sample size was not found relevant it is not considered in the interactions). Specifically, the results for the scales FFPI(E) and RAND(PF) are presented in Figure 2, the results for RAND(MH) and NHPR(SI) in Figure 3. The rows of the figures show the effects of the data sets, the columns – from top to bottom – the effect of the missing data mechanisms. For each level of proportion missing data, a plot is made of the nine imputation techniques against their performance. The imputation methods on the horizontal axis are ordered in the same manner as they were in Table IV.
Figure 2. Comparison of nine imputation techniques based on RMSD(t). Data sets FFPI(E) and RAND(PF).
The following results are noticeable in Figures 2 and 3. First, the overall best method found earlier, CIM, also proves to be the best here. There are, however, some serious competitors. In the FFPI(E) data PMS and HDD, and in the RAND(PF) and NHPR(SI) data ICS, perform as well as CIM. Only in the RAND(MH) scale is CIM the best technique in all circumstances.
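As a concrete illustration of the best-performing method, corrected item mean substitution as defined in Section 3 might be implemented along the following lines. This is a hypothetical sketch, not the author's code; it assumes NaN codes missingness and omits the rounding to the nearest admissible score mentioned in Section 3.

```python
# Sketch of corrected item mean substitution (CIM): the item mean is
# scaled by the respondent's level on the observed items relative to the
# item means of those same items.
import numpy as np

def impute_cim(X):
    X = X.astype(float).copy()
    item_mean = np.nanmean(X, axis=0)           # IMS_i per item
    for v in range(X.shape[0]):
        obs = ~np.isnan(X[v])
        if obs.all() or not obs.any():
            continue                             # complete, or no information
        # correction factor: observed scores vs. means of observed items
        correction = X[v, obs].sum() / item_mean[obs].sum()
        X[v, ~obs] = correction * item_mean[~obs]
    return X
```

A respondent scoring below the mean on the observed items thus receives imputations below the item means, matching the 'ability correction' described in Section 3.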
Figure 3. Comparison of nine imputation techniques based on RMSD(t). Data sets RAND(MH) and NHPR(SI).
Second, some of the benchmark methods, IAS, RDS, and HNC, perform better than expected in particular situations. In many situations RDS is even better than item mean substitution or the random hot-deck procedure. The bad performance of HDR is caused by the large number of donors from which one is drawn. The average number of donors can be as small as 1 or 2, but is often large (30 or 40) or even as large as 240 or 250. This means that the set of donors also contains (many) cases
less similar to the incomplete case, and the probability of using a ‘bad’ donor is large. From these results it follows that imputation techniques which use the relationships between items perform better than those which do not. Imputing an item mean is bad, but when the relationships between items are taken into account (CIM) the bias decreases. ICS performs well, especially in scales with few response options, and the performance of HDD is good in scales with many items. Finally, when the situation becomes ‘uglier’, the performance gets worse. Here ‘uglier’ is defined as more missing values, more complicated mechanisms, and/or a scale that is more difficult to impute (according to the weights). One exception is the improvement of some techniques in the NHPR(SI) as p increases. This is, however, an artificial result caused by the interaction between the actual data, mechanisms deleting specific values, and techniques which are unable to impute particular values.
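The corrected item mean (CIM) technique adjusts the item mean by the relative ‘ability’ of the respondent, estimated from the items that respondent did answer. The following is a minimal sketch of such a multiplicative correction; the exact formula used in the study is defined earlier in the paper, so this should be read as an illustration rather than the study's implementation:

```python
import numpy as np

def cim_impute(data):
    """Corrected item mean imputation (sketch): the imputed value is the
    item mean, scaled by the ratio of the person's mean on the observed
    items to the mean of those items' item means."""
    data = np.array(data, dtype=float)       # NaN marks a missing response
    item_means = np.nanmean(data, axis=0)    # per-item mean over responders
    filled = data.copy()
    for i, row in enumerate(data):
        observed = ~np.isnan(row)
        if observed.all():
            continue
        person_mean = row[observed].mean()
        # Correction factor: how this person scores relative to the
        # average difficulty of the items he or she answered.
        correction = person_mean / item_means[observed].mean()
        filled[i, ~observed] = item_means[~observed] * correction
    return filled

# Example: respondent 2 misses item 3; a low person mean pulls the
# imputed value below the raw item mean of 4.5.
data = [[3, 4, 5],
        [1, 2, np.nan],
        [4, 4, 4]]
completed = cim_impute(data)
```

Plain item mean substitution would put 4.5 in the empty cell; the correction uses the person's low scores on the observed items to impute a lower, more plausible value.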
5.2. SCALE LEVEL: ASSESSMENT OF SCALE QUALITY
In the previous section the imputation techniques were evaluated by their ability to recover the total score of a person on a test. Good estimates of rv after imputation do not indicate that the imputation technique is able to adequately preserve the relations between the items. As a result the internal structure of the scale may change due to imputation, making conclusions based on the imputed data less reliable or valid. Therefore the change in the quality of the scales after imputation is investigated. The criteria used are da(t) and dH(t), as defined earlier.

Cronbach’s alpha

Figures 4 and 5 present the results of the three-way interactions across the levels of n, based on the criterion da(t). Again the effects of the different data sets are shown in the rows of the figures and the columns show the effect of the missing data mechanisms. In each plot the ordering of the imputation techniques is the same as in Figures 2 and 3, the three different lines represent the levels of p, and the average values of alpha (over the levels of n) are given for each level of d, m, and p.

Figure 4. Comparison of nine imputation techniques based on da(t). Data sets FFPI(E) and RAND(PF). Average values of alpha (over n) are given in each plot.

Figure 5. Comparison of nine imputation techniques based on da(t). Data sets RAND(MH) and NHPR(SI). Average values of alpha (over n) are given in each plot. For NHPR(SI) these are higher than reported in the literature (Table I).

From the figures it follows that the three mean substitution procedures perform very differently from each other. On the one hand IMS, which substitutes values at the center of the distribution and therefore underestimates the variances and covariances, systematically underestimates alpha. On the other hand, PMS and CIM systematically overestimate alpha. For the long tests in Figure 4 this overestimation is small; for the scales with few items it is considerably larger. In the case of NHPR(SI), mechanism NRXZ, and p = 0.20, for instance, CIM results in an increase of alpha from 0.640 to 0.828, making the scale seem more reliable. In the short scales ICS performs somewhat better than CIM, but this is not true for the longer scales, although it does not overestimate alpha there. The hot-deck methods always underestimate alpha, but the absolute differences are in some cases smaller than those of CIM. Only HDD, however, can be considered a serious competitor of CIM, as the performance of HDR and HNC is bad. Just as for RMSD(t), when the situation becomes ‘uglier’, the performance of all techniques becomes worse. The bias in alpha increases as p increases or the mechanism becomes more complicated. Only when the mechanism is most complicated (dependent on covariates) is the bias less severe, because then the cause of missingness depends only partly on scale characteristics. Longer scales are also more robust against imputation with respect to internal consistency than shorter scales. Because of the dependence of alpha on the number of items (when k increases, alpha increases) this is not surprising. The same holds for scales with polytomous items, although length has a larger impact on da(t) than the number of response options.

Loevinger H-coefficient

The results obtained with dH(t) as a criterion of the performance of imputation techniques are very similar to those obtained with da(t). Therefore only two important differences in the results are reported:
1. The absolute values of dH(t) are all larger than those of da(t), except for the NHPR(SI), for which the deviations are equally large. Moreover, the deviations in H are of the same order when the four scales are compared. This shows that the length of a scale and the number of response options do not have as much influence on H as they have on alpha, and that therefore imputation has a larger impact on the H-coefficient than it has on alpha (see, e.g., Molenaar and Sijtsma, 1984, who report that test length has little or no systematic influence on H). For example, imputing item means in the FFPI(E) scale (mechanism NRX, p = 0.20) results in an H-coefficient of 0.296, which classifies the scale as very weak (see Mokken, 1997). CIM, on the other hand, causes H to increase to 0.492, which almost classifies the scale as strong.
2. The overestimation of alpha by the methods PMS and ICS in some cases does not occur for the H-coefficient, or is much smaller. Moreover, the deviations caused by ICS are very small in the short scales, making its performance as good as that of CIM.
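Cronbach’s alpha itself is simple to compute, which makes the biases discussed above easy to reproduce on one’s own data. A minimal sketch (the toy score matrix is illustrative only, not data from the study):

```python
import numpy as np

def cronbach_alpha(data):
    """Cronbach's alpha for a respondents-by-items score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    data = np.asarray(data, dtype=float)
    k = data.shape[1]
    item_vars = data.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = data.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

scores = [[1, 2, 3],
          [2, 3, 4],
          [3, 4, 5],
          [4, 5, 6]]
print(cronbach_alpha(scores))  # perfectly parallel items give alpha = 1.0
```

The formula makes the biases plausible: replacing missing entries by a constant item mean shrinks the item variances and, more importantly, the inter-item covariances that drive the total-score variance, pushing alpha downward (IMS), whereas person-level corrections (PMS, CIM) add respondent variation to every imputed item and inflate the covariances, and hence alpha.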
6. Discussion

The results of the simulation study clearly show that the proportion missing data, the missing data mechanism, and the characteristics of the scale (length and number of response options) all have effects on the performance of any imputation technique. Only the effect of sample size proved not to be relevant, except perhaps for the hot-deck methods, and even there the effects were very small. When examining the scale
score rv of the respondents, imputing an item mean corrected for ability (CIM) emerged as the overall best of the nine simple imputation techniques presented, in almost all combinations of the studied factors. The only method that can compete with CIM is item correlation substitution. This method, however, only performs well in scales with few response options. This is caused by the fact that high correlations do not necessarily indicate similar responses, and ICS may therefore yield wrong answers for items with many response options (e.g., 0 0 1 0 1 correlates perfectly with 3 3 4 3 4, but the values are different). In the study this is shown by HDD and PMS, which are best after CIM when the number of response options is large. Considering the quality of measurement after imputation, an overall best method cannot be presented. Person mean and corrected item mean imputation result in an overestimation of the internal consistency, as does ICS in some cases. The other techniques always yield scores of which the quality is less than that of the original scores. Imputation of an item mean is always one of the worst techniques, and may result in serious biases. The overestimation of the quality of the scales by CIM is not surprising. Imputing an item mean that is corrected for the ability of the respondent can be seen as a very simple case of a model from item response theory (IRT). These models are used to analyze scale data and to estimate latent abilities of respondents. When measurement models, even very simple versions, are used for both the imputation of missing item responses and the analysis of the imputed data, the model fit will increase. The values of alpha and H, which can be seen as goodness-of-fit measures of IRT models, will therefore be overestimated when the missing item responses are imputed with an IRT model. The same is found by Kromrey and Hines (1994) in regression analysis.
They report an overestimation of R² when regression imputation is used to handle missing data. The assessment of the quality of measurement reveals how mean imputation fails to accurately reflect the uncertainty in individual missing values. Moreover, the use of single imputation techniques is often considered improper from the perspective of ensuring proper coverage of interval estimates (Rubin, 1987; Schafer, 1997). This is reflected by the under- and overestimation of the scale quality by several methods. The methods, however, may still be useful in situations where the researcher is not in a position to use multiple imputation procedures, which are computationally much harder to use and are difficult to implement. And, as stated by Landerman et al. (1997): “precision is more adversely affected by a poor imputation model than by the use of single-value imputation methods. The probability of obtaining a poor estimate increases dramatically as the predictive power of the imputation model decreases, regardless of whether single-value or multiple-imputation methods are used” (p. 25). The performance of the corrected item mean imputation for recovering rv shows that using information from both persons and items results in better estimates of the missing values. This indicates that using imputation models from the class of
measurement models, like IRT models, may lead to improved imputations. But estimation of such models may be difficult or even impossible. CIM, however, is easy to compute and yields good estimates of rv, although one should bear in mind the overestimation of the scale quality. In general the performance of CIM is best, and it shows that there is much to gain when measurement models are used for the imputation of missing values of test items.

Acknowledgements

The author is grateful to Ivo Molenaar for suggested improvements on the design of the study and comments on earlier versions of the paper. Thanks are also due to Anne Boomsma and Herbert Hoijtink for helpful comments. This research was supported by the Netherlands Research Council (NWO), Grant 575-67-048.

References

Dempster, A. P. & Rubin, D. B. (1983). Overview. In: W. G. Madow, I. Olkin & D. B. Rubin (eds), Incomplete Data in Sample Surveys, Vol. II: Theory and Bibliographies. New York: Academic Press, pp. 3–10.
Greenlees, J. S., Reece, W. S. & Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. Journal of the American Statistical Association 77: 251–261.
Hendriks, A. A. J. (1997). The Construction of the Five-Factor Personality Inventory (FFPI). PhD Dissertation, The Netherlands: University of Groningen.
Hunt, S. M., McKenna, S. P. & McEwen, J. (1993). Nottingham Health Profile (NHP). In: C. Koenig-Zahn, J. W. Furer & B. Tax (eds), Het meten van de gezondheidstoestand; 1-Algemene gezondheid. Assen: Van Gorcum, pp. 100–114.
Huisman, M. (1998). Missing data in behavioral science research: Investigation of a collection of data sets. Kwantitatieve Methoden 57: 69–93.
Kromrey, J. D. & Hines, C. V. (1994). Nonrandomly missing data in multiple regression: An empirical comparison of common missing-data treatments. Educational and Psychological Measurement 54: 573–593.
Landerman, L. R., Land, K. C. & Pieper, C. F. (1997).
An empirical evaluation of the predictive mean matching method for imputing missing values. Sociological Methods & Research 26: 3–33.
Little, R. J. A. (1988). Missing-data adjustments in large surveys. Journal of Business & Economic Statistics 6: 287–296.
Little, R. J. A. & Rubin, D. B. (1987). Statistical Analysis with Missing Data. New York: Wiley.
Little, R. J. A. & Schenker, N. (1995). Missing data. In: G. Arminger, C. C. Clifford & M. E. Sobel (eds), Handbook of Statistical Modeling for the Social and Behavioral Sciences. New York: Plenum Press, pp. 39–75.
Lord, F. M. & Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Reading: Addison-Wesley.
Mokken, R. J. (1997). Nonparametric models for dichotomous responses. In: W. J. van der Linden & R. K. Hambleton (eds), Handbook of Modern Item Response Theory. New York: Springer-Verlag, pp. 351–367.
Molenaar, I. W. & Sijtsma, K. (1984). Internal consistency and reliability in Mokken’s nonparametric item response model. Tijdschrift voor Onderwijsresearch 9: 257–268.
Rubin, D. B. (1976). Inference and missing data. Biometrika 63: 581–592.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Sande, I. G. (1982). Imputation in surveys: Coping with reality. The American Statistician 36: 145–152.
Sande, I. G. (1983). Hot-deck imputation procedures. In: W. G. Madow, I. Olkin & D. B. Rubin (eds), Incomplete Data in Sample Surveys, Vol. III: Proceedings of the Symposium. New York: Academic Press, pp. 339–349.
Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Schulte Nordholt, E. (1998). Imputation: Methods, simulation experiments and practical examples. International Statistical Review 66: 157–180.
van der Zee, K. I. & Sanderman, R. (1993). Het meten van de algemene gezondheidstoestand met de RAND-36: Een handleiding [Measuring the general state of health with the RAND-36: A manual]. Groningen: Northern Center for Healthcare Research (NCG).