Statistics Netherlands
Division of Technology and Facilities
Methods and Informatics Department
P.O. Box 4000
2270 JM Voorburg
The Netherlands
The AUTIMP-project: Evaluation of Imputation Software
R.L. Chambers (University of Southampton) J. Hoogland (Statistics Netherlands) S. Laaksonen (Statistics Finland) D.M. Mesa (University of Southampton) J. Pannekoek (Statistics Netherlands) P. Piela (Statistics Finland) P. Tsai (University of Southampton) T. de Waal (Statistics Netherlands)
Project number: TMO-102139
BPA number: 1881-01-TMO
Date: 24 July 2001
Remarks: The views expressed in this paper are those of the authors and do not necessarily reflect the policies of Statistics Netherlands.
THE AUTIMP-PROJECT: EVALUATION OF IMPUTATION SOFTWARE Summary: The AUTIMP project was a European project that was partly funded by the European Commission as part of its 4th Framework Programme (Esprit). It lasted from the 1st of February 1999 till the 31st of October 2000. One of the aims of AUTIMP was the evaluation of software that can be used for imputation. Evaluation reports have been written for several computer programs as part of the AUTIMP project. In the present paper those reports are compiled. Keywords: imputation software, CART, CHAID, Imputer 2, Solas 2.1, S-Plus, SPSS Missing Value Analysis 7.5
Introduction to “The AUTIMP-project: Evaluation of Imputation Software” Ton de Waal – Statistics Netherlands July, 2001
1. AUTIMP The AUTIMP project was a European project that was partly funded by the European Commission as part of its 4th Framework Programme (Esprit). It lasted from the 1st of February 1999 till the 31st of October 2000. Institutes participating in the AUTIMP project were University of Southampton (UoS), Statistics Finland (StatFi), Office for National Statistics UK (ONS), Instituto Nacional de Estatística de Portugal (INE PT) and Statistics Netherlands (CBS). CBS acted as project coordinator. The aims of the AUTIMP project as described in the synopsis of the project proposal are quoted below. In Official Statistics, for business surveys too much effort is spent in correcting minor flaws in the data. Imputation of erroneous or missing fields should be as automated as possible. With Censuses and large-scale person surveys automatic imputation is also vital, because the amount of data is so large. However, corrections are often still carried out by hand or with crude approximations. Therefore CBS, ONS, UoS, StatFi and INE PT propose to co-operate on innovative imputation software for both business surveys and population Censuses, which will deal with both numerical and categorical data. ‘Hot-deck’ imputation selects values from a donor record that is similar to the recipient record in some sense. The research aims at the development of a fast algorithm that automatically selects criteria to match recipient and donor records, ‘tuned’ to reflect both accurate individual imputations and accurate distribution imputations. This algorithm will be implemented in a computer programme. UoS is primarily responsible for methods, CBS for programming, and ONS, StatFi and INE PT for testing. Several software packages for imputation, e.g. SurFox and SOLAS, are already available. These packages support several interesting imputation methods (not including the methodology described in this proposal). We propose to evaluate several of these imputation software packages.
2. Organisation of this report The present report is a compilation of reports on available software that can be used for imputation. Reports on the imputation software developed within the AUTIMP project (WAID) are compiled in Chambers et al. (2001). The remainder of this report is organised as follows. On pages 5-37 an evaluation report by Mesa, Tsai and Chambers on using classification and/or regression trees for imputation of missing values can be found. Three software packages for generating classification and/or regression trees have been evaluated, namely S-Plus, CHAID in SPSS AnswerTree, and CART. An appendix to this report can be found on pages 39-43.
Three software packages that have been developed especially for imputation of missing data have also been evaluated within the context of the AUTIMP project, namely:
• Imputer 2. The evaluation report by Piela can be found on pages 45-50.
• Solas 2.1. The evaluation report by Laaksonen and Piela can be found on pages 51-67.
• SPSS Missing Value Analysis 7.5. The evaluation report by Hoogland and Pannekoek can be found on pages 69-95. An appendix can be found on pages 97-100.
Part of the latter evaluation study has also been carried out using CART. The evaluation report by Tsai can be found on pages 101-110. Its appendix can be found on pages 111-122.
References Chambers, R.L., T. Crespo, S. Laaksonen, P. Piela, P. Tsai and T. De Waal, 2001, The AUTIMP Project: Evaluation of WAID. Report, Statistics Netherlands, Voorburg.
Using Tree-Based Models for Missing Data Imputation: An Evaluation Using UK Census Data D.M. Mesa, P. Tsai and R.L. Chambers – University of Southampton October 21, 2000
1. Overview This report describes the results from an analysis of the imputation performance of three software packages for building classification tree models. The application that underpins this analysis is imputing for missing data in the UK Census, and the numerical results quoted later in this report reflect imputation performance using a “tree-based” approach for this situation. The basic idea in tree-based missing data imputation is very straightforward. Given a categorical response variable for which data are missing, and a set of categorical explanatory variables, the method works by first using the complete data cases for these variables to build a classification tree that “explains” the distribution of the response variable in terms of the values of the explanatory variables. The terminal sets or nodes in this tree are then treated as “impute classes” and imputed values for the missing values of the response variable for cases that “fall into” a node are drawn in some appropriate way from the “complete data” cases in that node. Thus there is an implicit assumption that missingness within a node is at least missing at random (MAR), or typically missing completely at random (MCAR). In the following section we first introduce and develop the idea of tree-based models for data, distinguishing between classification trees (the primary type of tree used in this report) and regression trees. In section 3 we go on to discuss the attributes of the three tree-modelling software packages that are the focus of this report (CART, S-Plus and CHAID), distinguishing in particular between the binary recursive algorithms used by CART and S-Plus for fitting tree models and the non-binary algorithm used by CHAID. In section 4 we move on to introduce some methods for missing data imputation that can be used with a tree-model, focusing on issues of the size of tree used, how the tree is used for imputation and methods of imputation within a node of the tree. In order to assess how well these ideas work with “real” data we need to confront the tree-based method of imputation with a realistic missing data scenario. The one we examine in this report is described in section 5 and is based on replicating the pattern of missing data observed in the 1991 UK Census. In section 6 we then develop the methods we use subsequently to evaluate the performance of the tree-based imputation approach. Finally in section 7 we present summary results from our evaluation (the complete results can be obtained by sending an e-mail to
[email protected]), and provide our general recommendations for using tree-based imputation models with the type of data we investigated. Our overall conclusion is that the use of tree-models for missing data imputation is a promising approach, provided “sensible” methods are used to select values for imputation. In general, such methods tended to preserve the distribution of the missing data. However, the structure of trees that are optimal for imputation are different from those that are optimal for classification, and no method we investigated proved entirely satisfactory in terms of actually recovering the values that were missing.
2. Building Tree-Based Models A tree-based model is a set of classification rules that partitions a data set into exhaustive and non-overlapping subsets. These rules are defined in terms of the values of a group of categorical explanatory variables, with the model constructed by successively splitting the data set into subsets that are increasingly more homogeneous with respect to a response
variable of interest. This splitting continues until a stopping criterion is met. The tree model is then represented by the hierarchy of splits that eventually lead to the final subsets or “terminal nodes” of the tree. Since the output of a tree-based model is a set of terminal nodes that are (relatively) homogeneous as far as the response variable is concerned, they are particularly suited to imputation of missing values for this variable. The basic idea is to take a case for which the value of the response variable is missing, determine the terminal node defined by the tree-based model to which this case belongs, and impute a value for the response variable for this case by selecting one from the distribution of values for the response variable within this node. The AID (Automatic Interaction Detection) program of Sonquist, Baker and Morgan (1971) represents one of the first methods for fitting a tree-based model to data. AID is based on a recursive algorithm which successively splits the original data set into smaller, more homogeneous subsets via a sequence of binary splits. A similar recursive binary segmentation algorithm underpins the CART (Classification and Regression Tree) program developed by Breiman et al (1984). These ideas have also been implemented in the regression and classification tree analysis modules in S-Plus (Martin and Minardi, 1995). An alternative, non-binary, recursive splitting algorithm underpins the CHAID program (Kass, 1980). Two types of tree-based models are usually referred to in the literature. These are (a) Classification Tree models; and (b) Regression Tree models. Their basic difference is the scale of measurement of the response variable. In a classification tree model the response variable is assumed to be categorical, and measures of homogeneity appropriate to categorical data are used to determine the splits in the tree. In a regression tree the response variable is assumed to be continuous and measures of homogeneity relevant to the distribution of a continuous variable are used to determine the splits in the tree. In all cases explanatory variables are treated as categorical. In particular, a continuous explanatory variable, unless explicitly categorized before analysis, is treated as categorical with total number of classes equal to the total number of distinct values taken by this variable in the data set being modelled. Use of tree-based models therefore requires some initial preprocessing of the data to ensure all explanators are categorical. This report evaluates three tree-based modelling programs (CART, S-Plus and CHAID) from the viewpoint of their use for missing data imputation.
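To make the preceding idea concrete, the sketch below grows a classification tree on the complete cases and uses its terminal nodes as impute classes. It is a minimal illustration only, written in Python with scikit-learn as a stand-in for the packages evaluated in this report (neither Python nor scikit-learn was used in the original study), and it assumes the categorical explanators have already been numerically coded.

    # Minimal sketch of tree-based imputation via terminal nodes as impute classes.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def tree_impute(X_complete, y_complete, X_missing, min_node=30):
        # Grow the tree on complete cases only; min_samples_split approximates
        # the stopping rule used later in this report (no split below 30 cases).
        tree = DecisionTreeClassifier(min_samples_split=min_node, random_state=0)
        tree.fit(X_complete, y_complete)
        # "Drop" each case down the tree: apply() returns its terminal node id.
        donor_nodes = tree.apply(X_complete)
        recipient_nodes = tree.apply(X_missing)
        imputed = np.empty(len(X_missing), dtype=y_complete.dtype)
        for i, node in enumerate(recipient_nodes):
            donors = y_complete[donor_nodes == node]
            # Random-record selection within the node (method (c) of section 4.3).
            imputed[i] = rng.choice(donors)
        return imputed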
3. The Tree-Modelling Software used in this Report Three tree-modelling software packages were investigated for this report: CART (Classification and Regression Trees; Steinberg and Colla, 1995), Tree-Based Models in S-Plus (S-Plus Version 4.5; MathSoft, 1998), and CHAID (CHi-squared Automatic Interaction Detection; SPSS, 1998). The performance characteristics of these three software packages and their features are summarised in Table 3.1. Although these software packages adopt different tree-growing algorithms, the basic ideas underpinning the algorithms they use are similar. Given a data set containing values for a response variable (categorical or continuous) and a set of explanatory variables (all categorical, or if containing some continuous variables, considered as categorical with categories defined by unique values on the data set), all cases in the data set are examined to find the best rule for classification or prediction of the response variable within a set of rules defined by hierarchically splitting the original sample into subgroups on the basis of the values of the
explanatory variables. This sequence of splits defines a tree. Each split is chosen to optimise classification or prediction for the subset of cases being split, without attempting to optimise the overall classification or predictive performance of the tree. The splitting process is recursive, in that the same procedure is used to split the subgroups obtained from earlier splits, stopping only when the tree is complete, as defined by an appropriate stopping criterion. In the analysis described later in this report, this stopping criterion was defined by no further splitting once a node has less than 30 cases. The main difference among the three software packages investigated in this report is that while CART and S-Plus use binary recursive partitioning to grow a tree, CHAID uses non-binary (k-child) recursive partitioning. It is also important to note that although CART and S-Plus use the same binary tree building method, they adopt different criteria for deciding on where to split subgroups. Below we describe in more detail the algorithms adopted by these packages.
Table 3.1 Software Characteristics

|                                                            | CART | S-Plus | CHAID |
| Kind of variables (input)                                  | Reads numeric variables but can treat them as either categorical or numerical | Reads either categorical or numerical variables | Reads either categorical or numerical variables |
| Explanators                                                | Categorical and numerical variables | Categorical and numerical variables | Categorical and numerical variables |
| Response                                                   | Categorical (classification trees) and continuous (regression trees) | Categorical (classification trees) and numerical (regression trees) | Categorical variables |
| Capacity for handling large numbers of cases               | Yes | Yes | Yes (but relatively slower than CART & S-Plus) |
| Capacity for handling missing data (while growing a tree)  | Yes | No | Yes |
| Segmentation                                               | Binary | Binary | K-child |
| Splitting rules                                            | Classification trees: Gini, symmetric Gini, class probability, twoing, ordered twoing; splits on linear combinations and misclassification costs can be used. Regression trees: least squares, least absolute deviations | Classification trees: deviance. Regression trees: least squares | F test if the dependent variable is continuous; Pearson chi-squared test or likelihood-ratio test if nominal; likelihood-ratio test if ordinal; p-values modified by a Bonferroni multiplier |
| Pruning rules                                              | Cost-complexity measure | Cost-complexity measure | Cost-complexity measure |
| Optimal tree given                                         | Yes | No | No |
3.1. Tree-based Modelling via Recursive Binary Segmentation (CART and S-Plus) The basic idea in recursive binary segmentation is to break up the data set of interest by a succession of binary splits. After an appropriate “parent” node has been identified, two “child” nodes are created by splitting the parent node on the basis of the values of an explanator. This is accomplished by placing in one child node all cases from the parent node with a specified range of values of this explanator. The remaining parent node cases are then placed in the other child node. The algorithm then proceeds recursively, treating all unsplit nodes formed up to that point as potential parents and repeating the above process. The first parent node to be split is the original data set. After this split, the decision on which parent node to split and the explanatory variable and associated value range determining the split is made on the basis of minimisation of an appropriate criterion for minimising within node heterogeneity across all unsplit nodes available at each stage in the tree-building process. In particular, the algorithm proceeds by evaluating all possible splits of candidate parent nodes at any particular stage, and chooses the one that leads to the largest reduction in within node heterogeneity in the resulting child nodes. The result of these binary splits is an inverted “decision tree” with branches determined by the split points or nodes, with each split defined in terms of the values of an explanator. Clearly, the homogeneity of unsplit nodes increases the farther down the tree one goes. However at the same time the size of each node decreases, so it becomes more and more difficult to split the node. Eventually, no further nodes can be split and all remaining nodes are “terminal” nodes. 3.1.1. Tree-building in CART The procedure followed in CART for growing a tree is quite simple. The only necessary requirement for growing a tree using this software is specification of the set of explanatory variables and identification of the response variable. The software contains many options for growing better or more efficient trees. Some of the options that can be specified are the method used for splitting a group and the method used for measuring the homogeneity of the resulting subgroups. It is also possible to use a subset of the input data set to build a tree and to use cross-validation methods for deciding on an optimal tree. One of the most important characteristics of the software is that it is possible to handle missing values for explanatory variables by the use of surrogates defined in terms of linear combinations of the explanatory variables with non-missing values. The default option for the CART software is to generate an optimal tree, selected by minimising a compromise between tree complexity and misclassification cost (“cost complexity”). However, the output from the package is such that it is possible to obtain complete information for any of the different trees grown in the analysis. The report generated by CART contains a considerable amount of information, including misclassification costs, complete node information, misclassification by class, tables for the learning sample and cross-validation (classification tables and classification probability tables), variable importance, surrogates and complete terminal node information.
3.1.2. Tree-building in S-Plus This software can be used to construct a tree provided there are no missing values in the data, no categorical response variable has more than 128 levels, and no categorical explanatory variable has more than 32 levels. The software automatically decides whether to fit a regression or classification tree, according to the type of response variable. If the response variable is
categorical, a classification tree is built. If the response is continuous, a regression tree is built. As with any binary tree the algorithm used by the software splits the data into successive nodes that become progressively more and more homogeneous until they contain too few observations to split further (default is ≤ 5). Unlike CART, the tree-based model in S-Plus does not produce an optimal tree. The algorithm builds a maximum tree based on a model choice criterion. The analyst must then prune the tree using the tools provided. In S-Plus, this tree pruning function is based upon the cost-complexity measure or misclassification rate. The outputs of the tree-based model include a text summary of the tree constructed and its node information. A separate function has to be used if one requires a picture of the tree. 3.1.3. Splitting Criteria for CART and S-Plus In both CART and S-Plus, tree construction revolves around two processes: the selection of the node to split, and the decision on when to stop splitting a node (i.e. declare the node to be a terminal node). Each split of a parent node is carried out so that the data in each of the child nodes are “purer” (i.e. less heterogeneous) than the data in the parent node. The main criteria used by CART and S-Plus to determine within node heterogeneity are described below. For a categorical response variable, CART offers a number of options for “growing” a tree with increasing node homogeneity. See Table 3.1. The one investigated in this report bases the splitting process on the Gini measure of within node heterogeneity
$$G(t) = \sum_{j \in t} p(j|t)\,[1 - p(j|t)] = 1 - S(t)$$

Here t denotes the node, p(j|t) denotes the proportion of cases in node t that fall into category j of the response variable and j ∈ t denotes all categories of the response that occur in node t. S(t) is then the sum of the squares of the p(j|t). Observe that if a node is pure (i.e. contains only cases that have the same category for the response), then G(t) is zero. Consequently splits are chosen in order to minimise the sum of the resulting values of G(t) for the child nodes created by the split. The partition methods used by S-Plus for tree construction are described in Clark and Pregibon (1992). For a classification tree, overall within node heterogeneity is measured by the deviance statistic:
$$D = -2 \sum_t \sum_{j \in t} n(j|t) \log p(j|t)$$
where n(j|t) denotes the number of cases in node t with response category j. For a regression tree (i.e. where the response variable Y is continuously distributed), S-Plus measures within node heterogeneity by the within node sum of squares
$$L_2 = \sum_t \sum_{k \in t} (y_k - \bar{y}_t)^2$$

CART can also build a tree on the basis of L_2. In addition CART can use the sum of within node absolute deviations
$$L_1 = \sum_t \sum_{k \in t} \left| y_k - \mathrm{median}(y)_t \right|$$

Here median(y)_t denotes the median value of Y in node t.
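For reference, the sketch below computes the four within node heterogeneity measures just defined for a single node, given the vector of response values in that node (Python/NumPy assumed; these helper functions are illustrative and not part of the packages evaluated here).

    # Within-node heterogeneity measures for one node's response values.
    import numpy as np

    def gini(y):
        # G(t) = sum_j p(j|t)[1 - p(j|t)] = 1 - S(t)
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - float(np.sum(p ** 2))

    def deviance(y):
        # Node contribution to D = -2 * sum_j n(j|t) log p(j|t)
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -2.0 * float(np.sum(counts * np.log(p)))

    def least_squares(y):
        # Within-node sum of squares around the node mean (L2)
        y = np.asarray(y, dtype=float)
        return float(np.sum((y - y.mean()) ** 2))

    def least_abs_dev(y):
        # Within-node sum of absolute deviations from the node median (L1)
        y = np.asarray(y, dtype=float)
        return float(np.sum(np.abs(y - np.median(y))))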
3.2. Tree-based Modelling using K-Child Segmentation (CHAID) This method allows both categorical and continuous response variables and allows each parent node in the tree to be split into more than two child nodes. Taking each potential parent node in turn, the CHAID algorithm first evaluates all combinations of the values of a potential explanator (assumed to be categorical), collapsing values that are judged to be statistically homogeneous (similar) with respect to the response variable and maintaining all other values that are heterogeneous (dissimilar). It then selects the best “merged” explanator to form a set of child nodes defining the next branch in the tree. For a categorical response variable, this is accomplished by computing the cross-classification of each such explanator with the response variable in the parent node and selecting the explanator that generates the smallest chi-square p-value, appropriately corrected for multiple comparisons. For a continuous response variable an equivalent F statistic p-value is computed. A similar likelihood ratio statistic is used to generate p-values for an ordinal categorical response variable. Once an explanator (and associated cross-classification) has been chosen, the child nodes are defined by the classes of this explanator in the cross-classification. This process continues recursively until the tree is fully grown. This process is described fully (with the aid of an example) in Appendix 1. In order to account for multiple comparisons effects when calculating p-values to split a node, CHAID adjusts these by appropriately selected Bonferroni multipliers. Let c = number of original categories of the explanator, and r = number of reduced categories. Then the different types of multipliers used by CHAID are given by: Monotonic (Ordinal) Explanators:
$$B_{\mathrm{monotonic}} = \binom{c-1}{r-1}$$
Non-Monotonic (Nominal) Explanators:
$$B_{\mathrm{free}} = \sum_{i=0}^{r-1} (-1)^i \, \frac{(r-i)^c}{i!\,(r-i)!}$$
“Floating” Explanators: A floating explanator is one whose categories lie on an ordinal scale with the exception of a single category that either does not belong with the rest, or whose position on the ordinal scale is unknown (Kass, 1980). In this case, the Bonferroni multiplier is defined as
$$B_{\mathrm{float}} = \binom{c-2}{r-2} + r \binom{c-2}{r-1} = \frac{r-1+r(c-r)}{c-1}\, B_{\mathrm{monotonic}}$$

The CHAID software gives a wide range of outputs, which includes a picture of the tree, node information, gains charts, risk charts, and the rules that make up particular nodes. In the case of a categorical response variable, gains charts provide the user with node statistics relative to the target category of the response variable. If the response variable is continuous, gains charts provide the user with node statistics relative to the mean of the response variable. Risk charts tabulate misclassification statistics.
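The three Bonferroni multipliers can be computed directly from the formulas above. The following sketch (Python assumed, not part of the original study) does so exactly, using rational arithmetic for B_free since its individual terms are not integers even though the sum is.

    # CHAID Bonferroni multipliers (Kass, 1980) for c original and
    # r reduced categories of an explanator.
    from fractions import Fraction
    from math import comb, factorial

    def b_monotonic(c, r):
        return comb(c - 1, r - 1)

    def b_free(c, r):
        # Equals the number of ways to partition c categories into r groups.
        total = sum(Fraction((-1) ** i * (r - i) ** c,
                             factorial(i) * factorial(r - i))
                    for i in range(r))
        return int(total)

    def b_float(c, r):
        return comb(c - 2, r - 2) + r * comb(c - 2, r - 1)

    # Example: c = 3 original categories reduced to r = 2.
    print(b_monotonic(3, 2), b_free(3, 2), b_float(3, 2))  # 2 3 3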
4. Imputation via Tree-Based Models This approach to imputation involves three steps: selecting the tree to be used for imputation, classifying a record to be imputed (i.e. determining the terminal node in which it should be “placed”), and then imputing an appropriate value for the response variable for this record by selecting a value from the distribution of values in this terminal node. The following subsections discuss these three steps in more detail.
4.1. Growing the Imputation Tree Once a variable is selected for imputation, a tree is grown for that variable using either CART, S-Plus or CHAID. These trees can have a very large number of terminal nodes. Thus, it is usually necessary to find ways to “prune” a tree without compromising imputation performance. One way to prune a classification tree is to use its misclassification rate. This is a measure of the percentage of cases misclassified if one classifies cases allocated to a particular terminal node as belonging to the modal category for that node (often referred to as the node category). Let χ(•) be 1 if the condition inside the parentheses is true and 0 otherwise. Then, the
misclassification rate R(d) can be calculated as follows:

$$R(d) = \frac{1}{N} \sum_{i=1}^{N} \chi\big(d(x_i) \neq j_i\big)$$

where
N denotes the total number of cases to be classified;
d(x_i) denotes the node category for case i; j_i denotes the true category for case i. Figure 4.1 is a plot of the change in the cross-validation misclassification rate of a CART tree for the variable ALWPRIM (see section 5 for a description of the variables used in the analysis described in this report) as the number of terminal nodes increases. It is clear that the misclassification rate drops markedly until the tree has 7 terminal nodes, and then it stays relatively constant. Similar patterns were observed in the misclassification rate plots of all other trees investigated in this study. It was therefore decided that three different sizes of tree would be included in the subsequent analysis: a “small” tree with around 7 terminal nodes, a “medium” tree with around 15 terminal nodes and a “large” tree with around 30 terminal nodes.
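As an illustration, R(d) above amounts to a simple mean of disagreements between the node (modal) categories and the true categories. A minimal sketch (Python/NumPy assumed; the helper is hypothetical, not from the original study):

    # Misclassification rate R(d): pred holds the node category assigned to
    # each case, true holds its real category.
    import numpy as np

    def misclassification_rate(pred, true):
        pred, true = np.asarray(pred), np.asarray(true)
        return float(np.mean(pred != true))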
Figure 4.1. Misclassification Rate Plot
(Cross-validation misclassification percentage, CART output for ALWPRIM, plotted against the number of terminal nodes as it falls from 523 to 2.)
The CART software also allows an “optimal” tree to be built. This is done by initially growing a much larger tree than is actually needed and then “pruning” it back to avoid over-fitting. That is, the “branches” of this tree are “cut back” (i.e. split nodes are merged) until a smaller tree with minimum “cost complexity” or one that minimises a cross-validation criterion for classification performance is reached. This approach is automated in CART and is described in Breiman et al (1984). However, it is unclear whether the optimality possessed by such a tree is relevant when it is used in imputation. Furthermore, in most of the analyses described in this report, the optimal trees reported by CART were in fact the maximal trees for the data sets on which they were based. Consequently, we did not use these trees in our analyses in general. However, in one case the optimal tree was of a reasonable size (Response = ALWPRIM - LTILL; 38 terminal nodes).
4.2. Classifying Records for Imputation A tree can only be grown using a data set where all values of the response variable are present. In the imputation situation this data set consists of those records that have no missing values for the variable to be imputed (the response variable). Once this tree has been grown then cases where this variable is missing are “dropped” down the tree to identify the terminal node from which imputed values for these cases will be obtained. This requires each such case to have sufficient information, in terms of the explanatory variables that define the tree, to allow it to be classified using the tree. The tree shown in Figure 4.2 is a classification tree for the variable ALWPRIM. There is a set of conditions associated with each terminal node. For example to reach terminal node T1, a record has to have the variable AGE equal to 1, 2 or 3 and the variable LTILL equal to 1. Suppose there is a record with ALWPRIM missing but with AGE = 4 and SEX = 2. By dropping this record down the tree in Figure 4.2 we see that it ends up in terminal node T4. The value of ALWPRIM can then be imputed from the range of values of ALWPRIM for those “complete” cases in this node using the imputation methods described in the next subsection.
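The “dropping down” of a record can be pictured as a walk from the root to a terminal node. The sketch below (Python, illustrative only) encodes a tree of the shape of Figure 4.2 as nested tuples and classifies a record by following its splits; the node labels T1-T4 stand in for the terminal nodes, as the full tree is not reproduced here.

    # A tree encoded as (split_variable, {value: subtree}); strings are
    # terminal nodes. Shape follows Figure 4.2 (AGE, then LTILL or SEX).
    LEFT = ("LTILL", {1: "T1", 2: "T2"})
    RIGHT = ("SEX", {1: "T3", 2: "T4"})
    TREE = ("AGE", {1: LEFT, 2: LEFT, 3: LEFT,
                    4: RIGHT, 5: RIGHT, 6: RIGHT, 7: RIGHT})

    def classify(record, node=TREE):
        # Follow splits until a terminal node (a string) is reached.
        while not isinstance(node, str):
            variable, children = node
            node = children[record[variable]]
        return node

    print(classify({"AGE": 4, "SEX": 2}))  # -> 'T4', as in the text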
Figure 4.2. Classification of Records with Missing Values
(Classification tree for ALWPRIM: the root splits on AGE (1,2,3 versus 4,5,6,7), the left branch splits on LTILL (1 versus 2) and the right branch splits on SEX (1 versus 2); records from the database are dropped down the tree to their terminal nodes.)
4.3. Imputation Methods This report describes the imputation performance of five within node imputation methods. These methods are (a) probability distribution; (b) highest probability; (c) random record selection; (d) random category selection; and (e) nearest neighbour. Neither CART nor CHAID offer facilities for within node imputation, so this has to be implemented “outside” the tree growing environment. S-Plus provides a function that can be used to predict missing values, using highest probability imputation, but none of the other methods. Consequently separate programs were written in Visual FoxPro 6.0 to classify records with missing values for a particular response variable using the tree based on cases where this response variable is non-missing. The algorithm then imputes the missing values using one of the five methods described below. Probability Distribution In this case the probability distribution of the response in a terminal node determines the value of that variable to be imputed. To illustrate, suppose Figure 4.3 is a summary of the probability distribution in a particular terminal node of the tree in Figure 4.2. Then, all the records with missing values for ALWPRIM that end up in this node will be imputed as follows: the first 18.36% of the records will have ALWPRIM imputed as category 1, the next 10.27% imputed as category 2, the next 56.74% imputed as category 3 and the last 14.63% imputed as category 4.
Figure 4.3. Imputation Based on Probability Distribution
(Distribution of real values in the node: 18.36% in category 1, 10.27% in category 2, 56.74% in category 3 and 14.63% in category 4.)
Highest Probability This method imputes the value that is “most likely” (i.e. has the highest probability) in a terminal node. Again, suppose Figure 4.4 is a summary of the probability distribution in a particular terminal node of the tree in Figure 4.2. Then, under highest probability imputation, all records with missing values of ALWPRIM that end up in this node will have ALWPRIM imputed as category 1. On the other hand, if the terminal node has a probability distribution like the one in Figure 4.3, then all the records that end up in this node will be imputed as category 3.
Figure 4.4. Imputation Based on Highest Probability
(Distribution of real values in the node: 64.98% in category 1, 17.00% in category 2, 10.12% in category 3 and 7.91% in category 4.)
Random Selection of Record Given that a case with a missing value for the response variable “drops” into a particular terminal node, this method randomly selects a “complete” record within the node and uses its response value as the imputed value. Thus, if Figure 4.5 represents the composition of a particular terminal node of the tree in Figure 4.2, with ten complete records, then for each of the “missing” records that end up in this node, the algorithm first randomly selects a complete record from the ten in the node (without replacement) and uses its value of ALWPRIM as the required imputation for the missing record. Thus, if complete record 3 is selected then the imputed value for ALWPRIM is 4. Alternatively, if complete record 7 is selected then ALWPRIM is imputed as 1.
Figure 4.5. Imputation Based on Random Selection of Record

record    | alwprim | other variables
record 1  | 1       | ...
record 2  | 3       | ...
record 3  | 4       | ...
record 4  | 2       | ...
record 5  | 2       | ...
record 6  | 3       | ...
record 7  | 1       | ...
record 8  | 4       | ...
record 9  | 3       | ...
record 10 | 2       | ...
Random Selection of Category This method randomly selects a category from the distribution of response variable categories in a terminal node to serve as the required imputed value. Thus, suppose Figure 4.6 represents the distribution of ALWPRIM categories in a particular terminal node of the tree in Figure 4.2. We see that in this node only categories 1, 2 and 3 of ALWPRIM are represented.
Consequently, when imputing a value for ALWPRIM for a record with this value missing that falls into this terminal node, we randomly select one category from the set {1, 2, 3}.
Figure 4.6. Imputation Based on Random Selection of Category
(Number of records per category of the real values in the node; only categories 1, 2 and 3 occur.)
Nearest Neighbour Given that a case with a missing value for the response variable “drops” into a particular terminal node, this method first calculates the total distance between this case and each of the “complete” records within the node. This total distance is the sum of distances for all the explanatory variables that define the tree. For each nominal scale explanatory variable used to define the tree (i.e. all except the variable “AGE”), the distance between the “missing value” case and a “complete data” case is calculated by comparing the value of this variable for the “missing data” case with the corresponding value for the “complete data” case. If these values are the same then the distance is zero; otherwise, the distance is one. For an ordinal scale explanatory variable (e.g. “AGE”), the distance is measured by the absolute difference between these two values. Having calculated the total distances between the “missing data” case and all the “complete data” cases in the node, the method then selects the “complete data” case with minimum total distance and uses its value for the response variable as the required imputation for the “missing data” case. If there is more than one “complete data” case with the same minimum total distance, the method randomly selects one of these cases as the donor. To illustrate, Figure 4.7 represents the composition of a particular terminal node of the tree in Figure 4.2, with six “complete data” cases (C1,…,C6), and one record with ALWPRIM missing (M1). The algorithm first calculates the distance between this case and each of the six “complete data” cases within the node. For example, the distance between cases C1 and M1 is the sum of the distances for the four explanatory variables WELSH, SEX, AGE and ETHNIC. For the predictor variable WELSH, the value of this variable in complete data case C1 is the same as the value of the same variable in the missing data case M1. Thus, the distance is zero for this variable. Similarly, the distance is zero for the variable SEX. However, the values of the variable ETHNIC are different for cases C1 and M1 so the distance for this variable is equal to 1. Finally, the distance for the variable AGE is the absolute difference between the values of this variable for cases C1 and M1, which is 2 (= 3 - 1). Summing these variable specific distances, the total distance between cases C1 and M1 is 3 (= 0+0+2+1). The total distance between case M1 and the other five complete data cases in the node can be similarly calculated. This distance is 3 for case C2, 3 for case C3, 2 for case C4, 2 for case C5 and 4 for case C6. The minimum distance here is 2, and there are two complete data cases with this value, cases C4 and C5. In this situation, the algorithm randomly selects one of these cases and uses its value of ALWPRIM as the required imputation for the missing data case M1. Thus, if case C4
is selected, M1 has its ALWPRIM value imputed as category 3, while if case C5 is selected, then the imputed value of ALWPRIM for case M1 is category 4.
Figure 4.7. Imputation Based on Nearest Neighbour Selection

“Complete” records in a terminal node (C):

record        | alwprim | Welsh | Sex | Age | Ethnic
record 1 (C1) | 1       | 1     | 2   | 3   | 1
record 2 (C2) | 2       | 1     | 1   | 2   | 2
record 3 (C3) | 3       | 2     | 1   | 1   | 2
record 4 (C4) | 3       | 0     | 2   | 1   | 3
record 5 (C5) | 4       | 1     | 2   | 3   | 4
record 6 (C6) | 2       | 0     | 1   | 2   | 3

Record with missing value (M):

record        | alwprim | Welsh | Sex | Age | Ethnic
record 1 (M1) | missing | 1     | 2   | 1   | 4
5. Data Description The data set used for the analysis presented in this report consists of the values reported for a selection of variables measured in the 1991 UK Census. The cases underpinning the data set corresponded to Census returns from a single county in 1991. All variables were person level variables. No household variables or identification variables were included in the analysis. Table 5.1 sets out the complete list of variables on the data base and their status in the analysis. Only cases for which all variables were 100% edited were used in the database. Initially the database contained 222872 cases and 23 variables. However, a number of the variables were dropped because they corresponded to identifying variables or because the information they contained was not relevant to the imputation process. These included the variables dobdy (day of the month on which the person was born), dobmt (month of birth of the person), doby3 (year of birth of the person), alwsec (secondary activity last week) and alwtert (tertiary activity last week). Table 5.2 shows the final list of variables used for the analysis and their descriptions. The second stage of the analysis was to determine the pattern of missing data in the data base. This was necessary in order to create realistic artificial “holes” in the 100% complete records on the data base in order to test the imputation process. A complete list of all possible combinations of missing variables on the original data base with their corresponding percentage “missingness” is shown in Table 5.3.
Table 5.1. Variables

Identification Variables: Rectype, Edcode, Formnum, Persnum, Uresdet, Urondet, Termdet, Werabout, Errimpin, Errcorin
Variables Dropped: Gaelic, Dobdy, Dobmt, Doby3, Alwsec, Alwtert
Other Variables: Sex, Marcon, Cob, Ethnic, Ltill, Alwprim, Age, Welsh
Table 5.2

Variable | Definition
AGE      | Age of the person, calculated from date of birth
ALWPRIM  | Primary activity last week
COB      | Country of birth
ETHNIC   | Ethnic origin
LTILL    | Long term illness
MARCON   | Marital status
SEX      | Sex
WELSH    | Welsh language ability
Table 5.3
(Each row of this table flags which of the variables age, alwprim, cob, ethnic, ltill, marcon, sex and welsh are missing in a given combination, together with the total number of cases showing that combination and the corresponding percentage of all missing cases. Counts range from 1 to 3992 cases per combination, and percentages from 0.00% to 17.81%.)
The total percentage of missingness in the data base is 10.05%, corresponding to 22399 cases out of a total of 222872. As can be seen in Table 5.3, the pattern of missingness in the data base is not simple. There are a large number of combinations of variables that are missing, with up to 6 different variables missing at the same time. Building a tree model for each combination of missing variables is therefore out of the question. However, these “missing combinations” are relatively rare. The approach taken in our analysis is to recreate the pattern of missingness shown in Table 5.3 in the 100% complete records in the original database. That is, a synthetic data base with holes in it similar to the holes in the original data base was created by randomly deleting variables following the pattern of missingness shown in Table 5.3. The sizes of the different data bases involved in this missing data simulation process are set out in Table 5.4, while Figure 5.1 shows how the holes in the synthetic data base were created. The advantage of this new synthetic data base is that the “true” values for the holes are known and so imputation performance on the synthetic data base can be assessed.
Table 5.4

Database           | Size   | Complete Information | Missing Information
Original Database  | 222872 | 200457               | 10.057%
Complete Database  | 200457 | 200457               | None
Synthetic Database | 200457 | 180294               | 10.059%
Figure 5.1
(The pattern of missing values across Alwprim, Ltill, Sex, Cob, Welsh, Marcon, Age and Ethnic in the original database is replicated in the complete records to create the holes in the synthetic database.)
Since all variables on the data base are categorical, combinations of these variables also correspond to categorical variables, with categories defined by the cross-classification of categories of the components of the combination. In some cases, however, this led to some “composite” categorical variables with too many categories to allow tree construction. Consequently, variables with more than ten categories were collapsed. A complete list of variables with their original categories and their new ones is set out below:
AGE Grouping 1

Group | New Code
0-4   | 1
5-9   | 2
10-15 | 3
16-18 | 4
19-21 | 5
22-24 | 6
25-29 | 7
30-34 | 8
35-39 | 9
40-44 | 10
45-49 | 11
50-54 | 12
55-59 | 13
60-64 | 14
65-69 | 15
70-74 | 16
75-79 | 17
80-84 | 18
85+   | 19

Grouping 2

Group | New Code
0-4   | 7
5-15  | 6
16-24 | 5
25-34 | 4
35-54 | 3
55-64 | 2
65+   | 1
ETHNIC

Group                                | Codes             | New Code
White                                | 00                | 1
Any black including mixed           | 01 / 02 / 70-80   | 2
Asian                                | 03-05             | 3
China / Other including other mixed | 06 / 81-97        | 4
COB

Countries                                 | Codes                                     | New Code
UK                                        | 601-609                                   | 1
Europe / USA                              | 610-612 / 639-641 / 645-671 / 679         | 2
Indian Sub-continent                      | 632-635                                   | 3
Africa / Caribbean                        | 613-631 / 642-644 / 672-678 / 680         | 4
Asia / Central and South America / Other  | 636-638 / 681-702                         | 5
ALWPRIM

Primary Activity                                                                                                                                                              | Codes                    | New Code
Employee working full time / Employee working part time / Self employed, employing others / Self employed, not employing others / Government employment or training scheme   | 01 / 02 / 03 / 04 / 05   | 1
Waiting to take/start a job / Unemployed, looking for work                                                                                                                    | 06 / 07                  | 2
At school or in full time education / Unable to work because of long term disability / Retired from paid work / Looking after home/family / Other economically inactive      | 08 / 09 / 10 / 11 / 12   | 3
No code required                                                                                                                                                              | $                        | 4
Two variables were then selected for analysis. The first was ALWPRIM, which had more than 6% of missing values on the data base. The second was a composite variable, defined by the cross-classification including ALWPRIM that had the most missing values. This was the ALWPRIM by LTILL cross-classification. The codes defining this composite variable are set out below.
ALWPRIM - LTILL (Composite)

Combination            | Code | Definition
Alwprim=1 + Ltill=1    | 1    | Employed + health problems
Alwprim=1 + Ltill=2    | 2    | Employed + no health problems
Alwprim=2 + Ltill=1    | 3    | Unemployed + health problems
Alwprim=2 + Ltill=2    | 4    | Unemployed + no health problems
Alwprim=3 + Ltill=1    | 5    | Inactive + health problems
Alwprim=3 + Ltill=2    | 6    | Inactive + no health problems
Alwprim=4 + Ltill=1    | 7    | No code required + health problems
Alwprim=4 + Ltill=2    | 8    | No code required + no health problems
Since CART and CHAID can grow a tree even when there are missing values for some of the explanatory variables, two extracts from the synthetic data base were in fact used when tree building for a particular response variable. We describe these below in the context of ALWPRIM being the variable for which imputations are required.
1. The subset of the synthetic database corresponding to complete records. The size of this data set was 180294 records.
2. The subset of the synthetic database excluding only those records for which ALWPRIM was missing (i.e. records with missing values for explanators were kept). This resulted in a data set of 200457 records.
Using trees grown on these data sets, two subsets of the synthetic data base were imputed:
1. Records with ALWPRIM missing but no other missing information.
2. All the records with ALWPRIM missing, irrespective of whether the remaining variables were missing or not.
The data sets used for both tree growing and imputation for the two variables analysed in this report were therefore:
Response = ALWPRIM

Process       | Database Size                                                                          | Missing Information                | Notes
Growing trees | 180294 (complete records)                                                              | None                               |
Growing trees | 200457 (all the records) - 2819 (total of records with ALWPRIM missing)               | Only for the explanatory variables | Records with no information for the response variable were deleted
Imputing      | 180294 (complete records) + 1335 (records with ALWPRIM missing, all other variables complete) | Only for the response variable     |
Imputing      | 180294 (complete records) + 2819 (all records with ALWPRIM missing)                   | Only for the response variable     | Assuming complete information for all of the explanatory variables
Response = ALWPRIM - LTILL

Process       | Database Size                                                                                          | Missing Information                | Notes
Growing trees | 180294 (complete records)                                                                              | None                               |
Growing trees | 200457 (all the records) - 1045 (total of records with ALWPRIM - LTILL missing)                       | Only for the explanatory variables | Records with no information for the response variable were deleted
Imputing      | 180294 (complete records) + 200 (records with ALWPRIM - LTILL missing, all other variables complete)  | Only for the response variable     |
Imputing      | 180294 (complete records) + 1045 (all records with ALWPRIM - LTILL missing)                           | Only for the response variable     | Assuming complete information for all of the explanatory variables
A graphical representation of the various data sets used in imputation is shown below.

Imputation for missing ALWPRIM
(Of the records with ALWPRIM missing, 2819 in total, 1335 have complete values for Ltill, Sex, Cob, Welsh, Marcon, Age and Ethnic.)
Imputation for missing ALWPRIM - LTILL
(Of the records with ALWPRIM - LTILL missing, 1045 in total, 200 have complete values for the remaining variables.)
6. Evaluating Imputation Performance A key aspect of our analysis of the different tree-based imputation methods considered in this report is a rigorous statistical evaluation of their comparative performance. In particular, we assess this performance by:
1. comparing the differences in the marginal distributions of the real and the imputed values;
2. comparing the differences in individual values (real vs. imputed values for each record).
In order to carry out these comparisons, the real and the imputed values for the imputed variables were cross-tabulated. An example of such a cross-tabulation is set out below. In this case, the variable imputed was ALWPRIM, using the data set where 2819 records had this variable missing, and using a specific tree size, a specific software for growing the tree and a specific method for imputation.
2819 cases in table
+----------+
|N         |
|N/RowTotal|
|N/ColTotal|
|N/Total   |
+----------+
Alwprim| Alwprim (real values)
imputed|1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      |787    |142    |306    |  0    |1235   |
       |0.6372 |0.1150 |0.2478 |0.0000 |0.438  |
       |0.6542 |0.5703 |0.3579 |0.0000 |       |
       |0.2792 |0.0504 |0.1085 |0.0000 |       |
-------+-------+-------+-------+-------+-------+
2      |138    | 27    | 68    |  0    |233    |
       |0.5923 |0.1159 |0.2918 |0.0000 |0.083  |
       |0.1147 |0.1084 |0.0795 |0.0000 |       |
       |0.0490 |0.0096 |0.0241 |0.0000 |       |
-------+-------+-------+-------+-------+-------+
3      |278    | 80    |481    |  0    |839    |
       |0.3313 |0.0954 |0.5733 |0.0000 |0.298  |
       |0.2311 |0.3213 |0.5626 |0.0000 |       |
       |0.0986 |0.0284 |0.1706 |0.0000 |       |
-------+-------+-------+-------+-------+-------+
4      |  0    |  0    |  0    |512    |512    |
       |0.0000 |0.0000 |0.0000 |1.0000 |0.182  |
       |0.0000 |0.0000 |0.0000 |1.0000 |       |
       |0.0000 |0.0000 |0.0000 |0.1816 |       |
-------+-------+-------+-------+-------+-------+
ColTotl|1203   |249    |855    |512    |2819   |
       |0.427  |0.088  |0.303  |0.182  |       |
-------+-------+-------+-------+-------+-------+

Similar tables were constructed for both ALWPRIM and ALWPRIM-LTILL for all combinations of different tree sizes, different methods for imputation, different software for growing trees, different data sets used in tree growing (i.e. including or excluding missing information for explanators) and different data sets used in imputation. Equality of the marginal distributions of the imputed and actual values was assessed using a Wald statistic of the form
$$W = (R - S)^t \left[\, \mathrm{diag}(R + S) - T - T^t \,\right]^{-1} (R - S)$$

where

R is the vector of imputed counts;
S is the vector of actual counts;
T is the matrix corresponding to the cross-classification of actual and imputed counts. Under the hypothesis that both imputed and actual marginal distributions are identical, W should have a large-sample chi-square distribution with degrees of freedom equal to p - 1, where p is the order of the actual vs. imputed cross-classification. In order to assess how well an imputation procedure recreates the missing data values, the following statistic was computed:
$$t_D = \frac{D}{\sqrt{\hat{V}(D)}}$$

where D is the proportion of incorrectly imputed cases and

$$\hat{V}(D) = \frac{1}{n} - \frac{1}{2n^2}\, \mathbf{1}^t \left\{ \mathrm{diag}(R + S) - T - \mathrm{diag}(T) \right\} \mathbf{1}.$$

Under the hypothesis that the individual values are “preserved” under imputation, t_D should have an approximate N(0,1) distribution. Values for both statistics and their p-values are shown in the results section that follows.
7. Imputation Test Results This section contains the results from our analysis of the imputation performance of different-sized tree models generated by the software packages CART, S-Plus and CHAID and the five different methods of within node imputation described in section 4. These results are for the variables ALWPRIM and ALWPRIM - LTILL and for data sets which were complete as well as incomplete in terms of the explanatory variables used to build the trees. There are a very large number of possible combinations of scenarios that need to be considered in presenting these results. Because this report includes many graphs and statistics which assess the performance of each scenario, we provide below a brief description of each kind of output. Our conclusions are then presented, including specific conclusions for each software package, as well as general conclusions. Detailed results for each scenario are contained in Appendix 2, which can be obtained by sending an e-mail to
[email protected]. The performance of an imputation method can be assessed from two different perspectives: preservation of marginal distribution and preservation of the individual values. To show how different methods perform in this regard, we use two different kind of graphs as well as provide values for the W statistic (called the Wald Statistic below) and the tD statistic (called the Diagonal Statistic below) that were described in the previous section. The first type of graph compares the marginal distribution of the actual missing values with the corresponding marginal distribution of the imputed values.
Software: CART. Variable: ALWPRIM. Database: 2819 records. Method: probability distribution.
(Marginals bar chart: for each of the four categories of ALWPRIM, the number of records with that imputed value shown next to the number with that real value.)
In this case, the blue columns represent the distribution of the imputed values for the categories of the variable ALWPRIM and the red columns represent the distribution of the real values for the same categories of ALWPRIM. This graph was based on imputations obtained for the data set containing 2819 records with missing values for this variable, where these imputations were defined by using CART to grow a 7 node tree based on complete data only and using the probability distribution method for imputation. The second type of graph shows how well the individual values are preserved. It compares each value of the variable before and after imputation.
Software: CART. Variable: ALWPRIM. Database: 2819 records. Method: probability distribution.
(Diagonals bar chart: for each category, the percentage of records whose real value was recovered by imputation and the percentage imputed incorrectly.)
Here, the blue part of a column represents the percentage of cases in that category whose values were recovered by imputation. The red part of the column on the other hand represents the percentage of records in that category whose values were incorrectly imputed. This graph was made using the same conditions as the previous graph (same variable, same tree size, same software, same imputation method, etc.). In this example, the percentage of category 1 records is 42.67%. After imputation, 27.46% of all records were correctly imputed as category 1 (64.35% of all category 1 records); the remaining 35.65% of category 1 records were incorrectly imputed. The statistics mentioned in the previous section were also calculated for all possible combinations of imputation scenarios considered in our study. These values were then tabulated in order to allow comparison of either different methods for imputation or different tree sizes or different databases. An example of tabulation of these values is set out below.
These tables correspond to the same example used for the preceding graphs, but include different tree sizes and different methods of within node imputation.
Wald Statistic (value) - Variable: ALWPRIM, Database: 2819 Records

Terminal Nodes | Probability Distribution | Highest Probability | Random Record | Random Category | Nearest Neighbour
7              | 3.3784                   | 361.1724            | 0.5973        | 377.7455        | 0.3714
14             | 4.3084                   | 336.0628            | 0.9053        | 323.6367        | 3.6640
29             | 7.0330                   | 309.3230            | 3.1293        | 350.1329        | 7.0167

Wald Statistic (p-value) - Variable: ALWPRIM, Database: 2819 Records

Terminal Nodes | Probability Distribution | Highest Probability | Random Record | Random Category | Nearest Neighbour
7              | 0.3369                   | 0                   | 0.8971        | 0               | 0.9460
14             | 0.2300                   | 0                   | 0.8236        | 0               | 0.3001
29             | 0.0785                   | 0                   | 0.3721        | 0               | 0.0713

Diagonal Statistic (value) - Variable: ALWPRIM, Database: 2819 Records

Terminal Nodes | Probability Distribution | Highest Probability | Random Record | Random Category | Nearest Neighbour
7              | 24.4128                  | 15.7666             | 24.2910       | 43.2407         | 21.8876
14             | 23.4767                  | 15.9461             | 23.5366       | 42.8823         | 23.2083
29             | 22.5288                  | 15.8435             | 23.7166       | 43.6470         | 22.5875
The first table above shows the values for the Wald statistic W generated by different imputation methods and different sized trees grown by CART for imputing the variable ALWPRIM in the data set containing 2819 records where just this variable is missing. The next table shows the p-values associated with these W statistic values, while the last table shows the values of the Diagonal Statistic t_D for the same set of situations. In the following subsections we present summary results, based on the above statistical and graphical measures, for imputations generated from trees grown using each of the three software packages CART, S-Plus and CHAID. The complete results from our analysis, including all graphs and tables, can be found in Appendix 2, which can be obtained by sending an e-mail to
[email protected].
7.1. Summary Results for CART Preservation of Marginal Distributions •
Using different tree sizes: - Our main conclusion here is that increasing the size of the tree does not necessarily improve the results of the imputation. That is, increasing tree complexity does not lead to better imputation. Furthermore, even using the optimal trees generated by CART do not guarantee better imputation results. - There is some variation in the values of the Wald statistic, although they do not follow any specific pattern. In general, the largest p-value for ALWPRIM is obtained using the smallest tree, and the largest for ALWPRIM-LTILL is obtained using the medium tree. However, since all of these p-values are over 0.05, one cannot really say anything about which tree size is better for maintaining marginal distributions.
• Using different imputation methods:
  - Normally, random selection of records performs best.
  - However, probability distribution, random selection of records and nearest neighbour are all good methods for preserving marginal distributions.
  - Within node imputation based on highest probability or random selection of categories is not recommended for preserving marginal distributions.
  - For ALWPRIM, around 10% of records are category 2. However, when using highest probability as the imputation method, none were imputed as category 2. The same phenomenon was observed for some of the categories of ALWPRIM-LTILL (e.g. categories 1, 3, 4, 7, and sometimes category 5).
  - Under random selection of categories, the nodal distributions of the imputed values are almost uniform (as one would expect), which is not the case for the corresponding distributions of real values.
• Inclusion of cases with missing information in the explanatory variables when growing trees:
  - There is no significant improvement in the imputation results when missing explanatory information is allowed when growing trees.
  - The p-values for the Wald statistic for ALWPRIM both increase and decrease compared to the situation where data with missing explanators are excluded. For ALWPRIM-LTILL, however, they decrease as the tree size increases, though still remaining always over 0.05. In general, the highest p-value tends to be associated with the smallest tree.
Preservation of Individual Values

• Using different tree sizes:
  - As in the comparison of marginal distributions, increasing tree size does not improve the imputation results.
  - There is little variation in the values of the Diagonal Statistic when different tree sizes are used.
• Using different imputation methods:
  - The smallest values of the Diagonal Statistic tend to be associated with the highest probability imputation method, and the largest are always associated with the random selection of categories method.
  - However, one cannot say that individual values are preserved by any of the methods of within node imputation. In particular, no records were imputed for some categories of the variable of interest by the best performing method (highest probability imputation). The worst within node imputation method in terms of preservation of values is clearly random selection of category. None of the methods we considered could be recommended as preserving individual values.
• Inclusion of cases with missing information in the explanatory variables when growing trees:
  - There is no significant improvement in the individual value imputation results when missing explanatory information is allowed when growing a tree.
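Before turning to S-Plus, it may help to make the five within node imputation methods compared above concrete. The sketch below is a hypothetical re-implementation based on the descriptions in this report (the function name and the covariate handling are our own assumptions); it is not code from any of the evaluated packages.

    import numpy as np

    rng = np.random.default_rng(0)

    def impute_within_node(donor_y, donor_x, recipient_x, method):
        # donor_y: observed target categories of the complete cases in a
        # terminal node; donor_x / recipient_x: numeric covariates of the
        # donors and of the cases to be imputed in that node.
        n = len(recipient_x)
        cats, counts = np.unique(donor_y, return_counts=True)
        probs = counts / counts.sum()
        if method == "probability distribution":
            # draw from the node's observed category distribution
            return rng.choice(cats, size=n, p=probs)
        if method == "highest probability":
            # every recipient receives the node's modal category
            return np.full(n, cats[np.argmax(probs)])
        if method == "random record":
            # copy the target value of a randomly chosen donor record
            return donor_y[rng.integers(0, len(donor_y), size=n)]
        if method == "random category":
            # uniform draw over the categories present in the node
            return rng.choice(cats, size=n)
        if method == "nearest neighbour":
            # copy the value of the donor closest in covariate space
            idx = [int(np.argmin(((donor_x - x) ** 2).sum(axis=1)))
                   for x in recipient_x]
            return donor_y[idx]
        raise ValueError(method)

The uniform draw under "random category" makes immediately visible why that method distorts nodal distributions, as noted above.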
7.2. Summary Results for S-Plus

Preservation of Marginal Distributions

• Using different tree sizes:
  - Increasing the size of a tree does not necessarily improve the imputation results. In most cases smaller trees are better for imputation than larger, more complex trees.
  - There is some variation in the value of the Wald statistic. In general, the largest p-value for this statistic for the variable ALWPRIM is obtained using the smallest tree, and the largest for the composite variable ALWPRIM-LTILL is obtained using the medium tree. However, since all of the p-values are over 0.05, one can argue that any tree size can be used for maintaining marginal distributions. Moreover, these statistics do not follow any specific pattern; that is, they do not always decrease as the tree size increases, or vice versa.
• Using different imputation methods:
  - From the graphs it is clear that within node imputation methods based on probability distribution, random selection of record and nearest neighbour perform better than methods based on highest probability and random selection of category. The Wald statistic values confirm this. They also suggest that there is little difference in performance between the random selection of record and nearest neighbour methods. For ALWPRIM, random selection of record is the “best” performing method, while for ALWPRIM-LTILL the best within node imputation method is nearest neighbour.
  - Based on the Wald statistic values generated, the highest probability and random selection of category imputation methods are not recommended for preserving marginal distributions.
  - For the variable ALWPRIM, around 10% of the records being imputed are category 2. When imputing using the highest probability method, none of these records were imputed as category 2. This also happens with some of the categories of ALWPRIM-LTILL (i.e. categories 1, 3, 4 and 7, and sometimes category 5).
  - In the case of random selection of category, the distribution of the imputed values is almost uniform, which does not reflect the true distribution of values.
• Inclusion of cases with missing information in the explanatory variables when growing trees:
  - Tree-based modelling in S-Plus does not allow missing values in explanators.
Preservation of Individual Values

• Using different tree sizes:
  - As is the case when evaluating preservation of marginal distributions, increasing the tree size does not improve the imputation results.
  - There is little variation in the values of the Diagonal Statistic when different tree sizes are used. Smaller trees perform better than more complex trees.
• Using different imputation methods:
  - All five methods are good at imputing category 4, but they are very bad at predicting category 2. There is not much difference between the methods when imputing category 3. The highest probability imputation method seems to be “best” for predicting category 1.
  - The high values of the Diagonal Statistic indicate that none of the five methods is good at preserving individual values. Nevertheless, the highest probability method appears to perform better than the other methods in this regard, with the worst performing method being random selection of category.
  - Based on the values of the Diagonal Statistic, none of the methods can be recommended for preserving individual values.
• Inclusion of cases with missing information in the explanatory variables when growing trees:
  - Tree-based modelling in S-Plus does not allow missing values in explanators.
7.3. Summary Results for CHAID

Preservation of Marginal Distributions

• Using different tree sizes:
  - As with CART and S-Plus, the main conclusion drawn for CHAID was that increasing the size of the tree does not necessarily improve imputation. In most cases smaller trees impute better than larger trees.
  - In general, the largest p-values for the variable ALWPRIM were obtained using the smallest tree, and the largest p-values for the composite variable ALWPRIM-LTILL were also obtained using the smallest tree. However, since all of the p-values are over 0.05, one can argue that any tree size could be used for maintaining marginal distributions. As with CART and S-Plus, these statistics do not follow any specific pattern.
• Using different imputation methods:
  - Imputation methods based on probability distribution, random selection of record and nearest neighbour appear to perform better than methods based on highest probability and random selection of category. For ALWPRIM, there is very little difference in performance between the random selection of record and nearest neighbour methods, while for ALWPRIM-LTILL the best imputation method is probability distribution.
  - The highest probability and random selection of category imputation methods are not recommended for preserving marginal distributions.
  - Around 10% of records are category 2 for ALWPRIM. However, when imputing using the highest probability method, none of these records were imputed as category 2. The same behaviour can be seen with some of the categories of ALWPRIM-LTILL (i.e. categories 1, 3, 4, 5 and 7, and sometimes categories 6 and 8).
  - Random selection of category imputation induces a uniform distribution for imputed categories within a node, which is typically inappropriate.
• Inclusion of cases with missing information in the explanatory variables when growing trees:
  - There is no significant improvement in the imputation results when missing information is allowed for when growing a tree.
Preservation of Individual Values

• Using different tree sizes:
  - Increasing the tree size does not improve imputation.
  - There is little variation in the values of the Diagonal Statistic when different tree sizes are used. Small trees typically perform better than large trees with respect to this measure.
• Using different imputation methods:
  - For ALWPRIM, all five methods are very bad at predicting category 2. The highest probability method is best at imputing category 1 but the worst at predicting category 4. There is not much difference between the methods when imputing category 3. The nearest neighbour method is the best method for predicting category 4.
  - The high values of the Diagonal Statistic indicate that all five methods are poor at preserving individual values. Nevertheless, the nearest neighbour imputation method appears to perform better overall than the other methods, with the worst performing method being random selection of category.
• Inclusion of cases with missing information in the explanatory variables when growing trees:
  - There is no significant improvement in the imputation results when missing explanatory information is allowed for when growing a tree.
7.4. A "No Tree" Analysis How important is the tree structure for an imputation method? To answer this question, an analysis using the imputation methods described above was carried out, but in this case using the whole database as a single impute class. This lead to the following conclusions. Preservation of Marginal Distributions •
Using different imputation methods: Methods that preserve marginal distributions within a tree-based approach also preserve them when no tree structure is used. That is, probability distribution, random selection of record and nearest neighbour imputation methods all preserve marginal distributions, as well as joint distributions, when no trees are used. However, highest probability and random selection of category methods perform worse compared to when trees are used.
Preservation of Individual Values

• Using different imputation methods:
  - When no trees are used, imputation performance in terms of preserving individual values is worse than when trees are used. With the exception of the nearest neighbour method, this conclusion applies to all the methods considered in our analysis. Although none of the methods preserve individual values (in terms of p-values for the Diagonal Statistic), all methods except nearest neighbour displayed a large increase in the value of this statistic when no tree structure was incorporated into the imputation process. In contrast, the value of the Diagonal Statistic for the nearest neighbour method was little affected by the presence of tree structure.

The preceding analysis suggests that the nearest neighbour imputation method does not depend on the use of tree structure, while all other methods require tree structure to be present for "reasonable" imputation performance (though individual values are still not preserved). This result is not unexpected. The distance metric underlying the nearest neighbour method itself defines an "implicit" tree structure, in the sense that cases that are "far" from one another in a tree will also be "distant" with respect to this metric. Consequently there is little to be gained statistically by combining nearest neighbour imputation with a formal tree structure. However, there may be a considerable gain in operational efficiency from tree-based nearest neighbour imputation, since this approach can substantially reduce the extent of the "search" for the nearest neighbour donor case.
7.5. Overall Conclusions

A number of general conclusions can be drawn from the results described above. The first is that tree-based models are useful for imputation, and furthermore that their imputation performance, when teamed with probability distribution or random record selection imputation within terminal nodes, is acceptable for preserving the distribution of the missing data in the synthetic database we created from the 1991 UK Census data. What is also of some interest is that, for the variables collected in this Census, the “size” and/or “complexity” of a tree has little to do with its imputation performance, with small trees often providing superior imputation performance to large trees. Our results also indicate that the particular type of tree-modelling software used is of comparatively little importance, with CART, S-Plus and CHAID growing trees with similar imputation performance, even though the “appearance” of these trees was often very different. We also note that the highest probability and random selection of category methods of within node imputation often perform very poorly, particularly as far as maintaining the distribution of the missing values of a variable is concerned. Furthermore, none of the methods we investigated was satisfactory from the viewpoint of recovering the actual missing values (though this may be an impossible requirement for any statistically-based imputation method).
References

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Pacific Grove, CA: Wadsworth.
Clark, L. A. and Pregibon, D. (1992). Tree-Based Models. In Statistical Models in S (eds. J. M. Chambers and T. J. Hastie). AT&T Bell Laboratories and Wadsworth & Brooks/Cole.
Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics 29, 119-127.
Martin, R. D. and Minardi, J. (1995). S-PLUS: Tools for Engineers and Scientists. Seattle, WA: MathSoft Inc.
MathSoft (1998). S-Plus User's Guide Version 4.5. Seattle, WA: MathSoft Inc.
Sonquist, J. N., Baker, E. L. and Morgan, J. A. (1971). Searching for Structure. Institute for Social Research, University of Michigan.
SPSS (1998). AnswerTree 2.0 User's Guide. Chicago, IL: SPSS Inc.
Steinberg, D. and Colla, P. (1995). Tree-Structured Non-Parametric Data Analysis. San Diego, CA: Salford Systems.
Appendix 1: The CHAID Algorithm

In this appendix we describe the steps in the CHAID algorithm by applying it to a simple data set consisting of just a categorical response variable ALWPRIM (Y) and two explanatory variables ETHNIC (X1) and WELSH (X2). The variable Y has four categories, Y = 1, 2, 3, 4; the variable X1 has four categories, X1 = 1, 2, 3, 4; the variable X2 has three categories, X2 = 0, 1, 2. The steps in the CHAID algorithm are then:

1. Calculate the distribution of the response variable Y in the “root” node.

      Cat.        %        n
      1         35.00     35
      2          8.00      8
      3         35.00     35
      4         22.00     22
      Total   (100.00)   100
2. For each explanatory variable X, find the pair of categories of X that is least significantly different (that is, has the largest p-value) with respect to the distribution of Y within this node. The method used to calculate this p-value depends on the measurement level of Y. In this example Y is categorical, and so a series of chi-square tests is performed:

   i. The relationship between the explanator ETHNIC (X1) and the response ALWPRIM (Y) within the node is given by the following crosstabulation:
      X1/Y      1     2     3     4   RowTotl
      1        23     5    19     4      51
      2        12     2    15    13      42
      3         0     1     0     1       2
      4         0     0     1     4       5
      Coltotl  35     8    35    22     100
      Chi^2 = 25.63559, d.f. = 9 (p = 0.002342955)

      Since X1 has four categories, there are six 2 × 4 sub-crosstabulations to consider:

      X1/Y      1     2     3     4   RowTotl
      1        23     5    19     4      51
      2        12     2    15    13      42
      Coltotl  35     7    34    17      93
      Chi^2 = 9.193281, d.f. = 3 (p = 0.02682849)

      X1/Y      1     2     3     4   RowTotl
      1        23     5    19     4      51
      3         0     1     0     1       2
      Coltotl  23     6    19     5      53
      Chi^2 = 8.019281, d.f. = 3 (p = 0.0456149)

      X1/Y      1     2     3     4   RowTotl
      1        23     5    19     4      51
      4         0     0     1     4       5
      Coltotl  23     5    20     8      56
      Chi^2 = 19.72078, d.f. = 3 (p = 0.0001939266)

      X1/Y      1     2     3     4   RowTotl
      2        12     2    15    13      42
      3         0     1     0     1       2
      Coltotl  12     3    15    14      44
      Chi^2 = 7.23356, d.f. = 3 (p = 0.0648145)

      X1/Y      1     2     3     4   RowTotl
      2        12     2    15    13      42
      4         0     0     1     4       5
      Coltotl  12     2    16    17      47
      Chi^2 = 4.962482, d.f. = 3 (p = 0.1745651)

      X1/Y      2     3     4   RowTotl
      3         1     0     1       2
      4         0     1     4       5
      Coltotl   1     1     5       7

      (The Y = 1 column, which is zero for both rows, is omitted, hence d.f. = 2.)

      Chi^2 = 3.08, d.f. = 2 (p = 0.2143811)
   ii. The algorithm then identifies the pair of categories of X1 with the largest p-value above, and compares this p-value to a prespecified alpha level, α_merge (= 0.05, the default value). In this example, this is the pair defined by categories 3 and 4 of X1. Since the p-value for this pair (0.2144) is greater than α_merge, the two categories are merged to form a single compound category. As a result, a new set of categories of X1 is formed, and the sub-crosstabulation analysis of (i) is repeated. There are now three such sub-crosstabulations:

      X1/Y      1     2     3     4   RowTotl
      1        23     5    19     4      51
      3,4       0     1     1     5       7
      Coltotl  23     6    20     9      58

      Chi^2 = 20.25577, d.f. = 3 (p = 0.0001502343)
      X1/Y      1     2     3     4   RowTotl
      2        12     2    15    13      42
      3,4       0     1     1     5       7
      Coltotl  12     3    16    18      49

      Chi^2 = 6.408565, d.f. = 3 (p = 0.09333908)

      X1/Y      1     2     3     4   RowTotl
      1        23     5    19     4      51
      2        12     2    15    13      42
      Coltotl  35     7    34    17      93

      Chi^2 = 9.193281, d.f. = 3 (p = 0.02682849)
   iii. Again, the pair that results in the largest p-value greater than α_merge = 0.05 is merged. Here this corresponds to merging category 2 with the compound category 3,4. If further mergers were possible, this process would continue. However, since there are now only two (merged) categories remaining for X1, the merging process stops.
   iv. The algorithm now calculates an adjusted p-value for the crosstabulation of the merged categories of X1 against the categories of Y, using a Bonferroni adjustment.

      X1/Y       1     2     3     4   RowTotl
      1         23     5    19     4      51
      2,3,4     12     3    16    18      49
      Coltotl   35     8    35    22     100

      Chi^2 = 13.08861, d.f. = 3 (p = 0.004448843)
      The chi-square p-value above is adjusted using a Bonferroni multiplier. Since X1 is nominal, the Bonferroni multiplier is calculated as

         B_free = \sum_{i=0}^{r-1} (-1)^i (r-i)^c / (i!(r-i)!)

      where c = the number of original categories of X1 (here 4) and r = the number of merged categories (here 2), so that B_free = 2^4/2 - 1 = 7. This leads to an adjusted p-value of 0.0311 (= 0.004448843 × 7).

3. Steps (i) to (iv) above are now repeated, replacing X1 by X2 (WELSH).

   i. The crosstabulation of X2 by Y for the root node is:
      X2/Y      1     2     3     4   RowTotl
      0        33     8    34    22      97
      1         1     0     1     0       2
      2         1     0     0     0       1
      Coltotl  35     8    35    22     100

      Chi^2 = 2.768778, d.f. = 6 (p = 0.8372577)
   ii. X2 has three categories, so there are only three 2 × 4 sub-crosstabulations:

      X2/Y      1     2     3     4   RowTotl
      0        33     8    34    22      97
      1         1     0     1     0       2
      Coltotl  34     8    35    22      99

      Chi^2 = 0.8881097, d.f. = 3 (p = 0.8282962)

      X2/Y      1     2     3     4   RowTotl
      0        33     8    34    22      97
      2         1     0     0     0       1
      Coltotl  34     8    34    22      98

      Chi^2 = 1.901759, d.f. = 3 (p = 0.5930452)

      X2/Y      1     3   RowTotl
      1         1     1       2
      2         1     0       1
      Coltotl   2     1       3

      (The Y = 2 and Y = 4 columns, which are zero for both rows, are omitted, hence d.f. = 1.)

      Chi^2 = 0.75, d.f. = 1 (p = 0.3864762)
   iii. Here, the pair of categories of X2 with the largest p-value is the pair of categories 0 and 1, with a p-value of 0.8283. We therefore merge categories 0 and 1 to form a single compound category. The merging process for this variable now stops, since there are just two categories left.
   iv. The final p-value for the crosstabulation of X2 and Y is then calculated:

      X2/Y      1     2     3     4   RowTotl
      0,1      34     8    35    22      99
      2         1     0     0     0       1
      Coltotl  35     8    35    22     100

      Chi^2 = 1.875902, d.f. = 3 (p = 0.5985587)

      The adjusted p-value in this case is 1 (since 0.5985587 × 3 is greater than 1).

4. The final step is to split the node on the basis of the “merged” explanator with the smallest adjusted p-value less than α_split = 0.05. Here this means we split the root node into two subnodes on the basis of the merged categories of X1. That is, one subnode contains the 35 cases with X1 = 1, while the other contains the remaining 65 cases with X1 = 2, 3 or 4.

5. Continue to grow the tree until the stopping criteria are satisfied. Figure A1 shows the (two branch) tree constructed by CHAID in steps 1 - 4.
Figure A1. The two-branch tree constructed by CHAID in steps 1 - 4.
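For completeness, the Bonferroni multiplier B_free used in steps 2(iv) and 3(iv) can be checked numerically; the sum is in fact the Stirling number of the second kind, i.e. the number of ways of merging c original categories into r non-empty groups. A minimal sketch (the function name is ours):

    from math import factorial

    def bonferroni_free(c, r):
        # B_free = sum_{i=0}^{r-1} (-1)^i (r-i)^c / (i! (r-i)!)
        total = sum((-1) ** i * (r - i) ** c / (factorial(i) * factorial(r - i))
                    for i in range(r))
        return round(total)

    # Worked example from step 2(iv): c = 4 original categories of X1
    # merged into r = 2 groups gives a multiplier of 7, so the adjusted
    # p-value is 0.004448843 * 7 = 0.0311.
    assert bonferroni_free(4, 2) == 7

For X2 in step 3(iv) the multiplier is bonferroni_free(3, 2) = 3, which is why the adjusted p-value 0.5985587 × 3 is capped at 1.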
Short Evaluation of Imputer 2 Pasi Piela – Statistics Finland November 3, 2000
Summary

The evaluation of Imputer 2.0 (Oct. 1999) has been carried out as part of the Eurostat project AutImp (Automatic Imputation Methods and Software for Business Surveys and Population Censuses). In the task of evaluating existing imputation software, Statistics Finland has concentrated on Solas 2.1®. Imputer 2.0 was selected for a short evaluation because of its uniqueness as hot deck imputation software. It was developed at Infostat Bratislava. Contact person for Imputer 2.0: Ladislav Meszaros,
[email protected].
1. Introduction

Imputer, written in Delphi for Windows, is a program for hot deck imputation. Imputer 2.0 basically provides three types of imputation methods. It has been used within the project Demography of Small and Medium Enterprises (DosmE) in 12 Eastern European countries. Imputer is a non-commercial product.

The necessary steps to test or run the program can be outlined as follows:
i. defining the general settings
ii. choosing common variables
iii. setting up stratification key variables
iv. defining the variable/nonresponse variable relations
v. creating parameter files for the imputation procedures.

The program includes two main modules: the SQL Module and the Run Module. The built-in SQL module offers the user a quick data view and some very simple analytical outputs, but without any buttons or menus. That is, all analyses before and after imputation have to be done with some other program, such as Excel or SAS.
Picture 1. Main screen (Run Module) of Imputer 2.0.
The Run Module (Picture 1) is the main working screen of the program, where the user defines the components listed above as well as the imputations.
2. Evaluation Criteria

Imputer 2.0 has been evaluated using the following criteria, imposed in the AUTIMP Contract 1999, Portugal (p. 12):
1. The imputation methods that are supported by the software package
2. The quality of the description of these imputation methods in the documentation
3. Technical aspects, such as user-friendliness, speed of the package, supported languages, etc.
4. The help-facilities
5. Importing/exporting data
6. Graphical tools
7. Diagnostics, in particular how easy it is to evaluate the quality/effect of imputations
8. Comparison of the supported imputation methods with alternative methods
9. Estimation (of totals and variances) after imputation
10. Dissemination of the data (e.g. are the imputed data flagged, how are they flagged, and are alternatives reported?)
3. Evaluation

1. The supported imputation methods are stratified mean imputation, sequential hot deck and hierarchical hot deck. In stratified mean imputation a missing value is replaced by the average of the values of the same variable in the same stratum or class. Classes are simply defined by the user-specified key variables. In the hot deck methods the user can choose traditional random imputation within classes, where a donor observation is randomly chosen among the observations in the same class. The chosen donor donates its value of the imputation variable to the observation with the missing value. Note that this is not nearest neighbour imputation: if a donor cannot be found in exactly the same class as defined by the explanatory/key variables, then imputation is impossible. Variables can also be chosen hierarchically, so that when the primary variable is imputed, the secondary or tertiary variables are imputed at the same time from the same donor, regardless of their true missingness. That is, true values of variables other than the primary variable might be overwritten.
Figure 1. Imputation using one primary variable (VAR1) and two secondary variables. When the value of VAR1 is imputed, the values of VAR2 and VAR3 are imputed as well, regardless of their true missingness.

   Var1   Var2   Var3   Var4   Var5
   205      1     32     35     10
     .      2     34     35     10
   276      2     34     35     12
     .      .     35     35     12
Specifically, imputation can be controlled by using different filters and values of the Acceptor Ratio. The Acceptor Ratio defines the ratio of acceptors (observations to be imputed) to possible donors in the imputation stratum. The maximum value of the Acceptor Ratio is 200%, which means that the same donor can be used twice.

2. Good documents/manuals were not available at the time of the evaluation.
3. To get a data view the user has to write SQL code. Different files have to be created before the imputation processes, and imputation variables need corresponding nonresponse indicator variables (nrvar = 1 when the value is missing, otherwise 0). Thus, the user interface is far from ideal. It is, however, satisfactory and very logical. The speed of the package is good enough.

4. Help-facilities without any help-menus are not satisfactory at all, although there is some self-documentation with hint lines. The imputation methods, especially, should be documented in the program.

5. The only supported data type is dBase (.dbf). There were some technical problems in opening the dBase files with other programs.

6. Graphical tools are not available.

7, 9. The program gives very detailed results for every imputed value together with its donor. That is, the user can easily see how the imputation has been done. However, no diagnostics or estimation functions are available, nor any information about the effect of the imputations. Diagnostics are meant to be produced in some other program, such as Excel, which supports dBase files.

8. Comparisons of imputation methods are not available.

10. The program only produces an output data set, without any flagging or other separation of the imputed values.

In general, Imputer 2 is an interesting hot deck imputation program, being a non-commercial product. It has been used by some Eastern European statistical offices. Still, it should be made more versatile and user-friendly for wider use.
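As an illustration of the methods described under point 1 above, random hot deck within classes, with hierarchical donation of secondary variables and an Acceptor Ratio check, might be sketched as follows. This is a hypothetical re-implementation based on the description, not Imputer 2.0 code; all names are ours.

    import numpy as np
    import pandas as pd

    def hot_deck_within_classes(df, target, secondary, keys,
                                max_acceptor_ratio=2.0, seed=0):
        # Random hot deck: for each class defined by the key variables,
        # draw a donor at random; the donor also donates the hierarchically
        # linked secondary variables (possibly overwriting true values, as
        # noted above). Classes with no donor, or with an acceptor ratio
        # above the limit (200% by default), are left unimputed.
        rng = np.random.default_rng(seed)
        out = df.copy()
        for _, grp in df.groupby(keys):
            donors = grp[grp[target].notna()]
            acceptors = grp[grp[target].isna()]
            if donors.empty or acceptors.empty:
                continue
            if len(acceptors) / len(donors) > max_acceptor_ratio:
                continue  # imputation impossible in this class
            picked = donors.sample(n=len(acceptors), replace=True,
                                   random_state=int(rng.integers(1 << 31)))
            cols = [target] + list(secondary)
            out.loc[acceptors.index, cols] = picked[cols].to_numpy()
        return out

A second imputation round, as mentioned in the test results below, would simply rerun this with coarser key variables so that the remaining acceptors find donors.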
4. Some Test Results

Some of the test results are given in Table 1. They are based on the Finnish consumption data (a description can be found in the AutImp reports on Solas 2.1 and WAID 4.0 by S. Laaksonen & P. Piela). Mean imputation naturally gives smaller variance estimates than hot decking. The number of zeros, which is a rather important feature of the data, is usually very easy to estimate with hot deck methods. There are a lot of unimputed observations, which can be imputed in a second round. These results are easily comparable with the results from Solas 2.1 and WAID 4.0.
Table 1. Imputation results for yearly consumption of health of household member (HP5). M = Number of Household Members, C = Number of Children, D = Decile of Disposable Income Distribution, A = Classified Age of Member, S = Socio-Economic Status of Member.

   Method                             Mean    Std. Dev.  Median  75% Quantile  Number of 0s (%)  Unimputed Obs.
   True values (N = 6011)             395.7   786.1      119     469           21.7              -
   Available cases (N = 4563)         396.3   810.5      114     459           22.2              -
   Mean imputation, MC                401.1   716.2      201     552           16.9              25
   Mean imputation, MCDAS             404.6   792.0      134     498           19.8              442
   Hot decking, random donor, MCDAS   401.6   837.8      113     470           22.5              442
Evaluative Comments on Imputation Methods of Solas 2.1 Seppo Laaksonen and Pasi Piela – Statistics Finland October 17, 2000
Summary

The evaluation of Solas 2.1® has been carried out as part of the Eurostat project AutImp (Automatic Imputation Methods and Software for Business Surveys and Population Censuses). In the task of evaluating existing imputation software packages, Statistics Finland has mainly concentrated on this software. The following evaluative comments also include practical analyses of a real data set that is appropriate for testing the imputation methods of Solas 2.1.
1. Introduction

An accompanying task of AutImp is to describe and evaluate the characteristics of the existing software packages available for imputation. A new software package, called Solas, was especially mentioned in the contract. During the AutImp project, three different versions of Solas have been available. The first version was evaluated in 1999 and many comments were sent to the developers of Solas. These comments, among others, were utilised when creating the next versions, 2.0 and 2.1. Some bugs were even corrected in the latest version on the basis of the strange behaviour we found. This being the case, version 2.1 of Solas differs essentially from the version available at the beginning of the AutImp project. As a general comment, we can say that this version is a practicable and user-friendly tool for some common situations when dealing with missing item values in survey and census data.

This note considers in more detail the imputation methods available in Solas, and also presents results based on the Finnish consumption data (Annex 1). First, we evaluate the single imputation techniques of Solas, and then the multiple imputation techniques, respectively. As far as we have understood, Solas is focused on multiple imputation, a family of methods that has not been included in most other packages.
2. Solas approaches to imputation

Solas also includes some tools for standard statistical analysis, such as graphs, tabulations and regression analysis. The package is able to read several data file formats, including SAS, SPSS, Statistica, S-Plus, dBase and Excel, and the output Solas files may be saved in these formats as well. This is a clear advantage of the software, and it facilitates data handling with other tools, since a single package is rarely sufficient for the full analysis of complex data. So, it is usual that a user first makes a good input data file using SAS, for example, including editing and variable transformations, before running Solas, and, respectively, continues his/her analysis after the Solas imputation task. Solas has other useful tools as well, such as the opportunity to look at ‘missing data patterns,’ which helps in understanding the different types of missingness in the data. A user can also specify a grouping variable, in which case separate patterns will be generated for each group.

Solas supports two types of imputation methods: single imputation and multiple imputation. The details of these are presented in Sections 3 and 4. Four single imputation methods are available, that is, using the terminology of the Solas manual: ‘Group Mean,’ ‘Hot-Deck Imputation,’ ‘Last Value Carried Forward (LVCF)’ and ‘Predicted Mean.’ Moreover, the following two multiple imputation methods are covered: ‘Predictive Model Based Method’ and ‘Propensity Score Method.’ A user, of course, should understand which method is best for each case. Not every user may realise that those two multiple imputation methods may in fact be used as single imputation techniques as well, in which case the user should choose one of the outputs obtained with the multiple method. Thus, ‘Predictive Model Based Method’ is a (multiple) extension of ‘Predicted Mean,’ and ‘Propensity Score Method’ is a special (multiple) case of the ‘Hot-Deck Imputation Method.’ This would be easier to see if a user could first choose an imputation method, then apply it multiply, if possible, to find a ‘proper’ technique, or be satisfied with the application of the corresponding single imputation technique.
AUTIMP Evaluative Comments on Imputation Methods of Solas 2.1
3. Single imputation

1. Group mean
Solas allows the use of one grouping variable, which may be cut down (categorised) in the same procedure. This cutting operation is not ideally comfortable, but it works. It may be easier to do this before the Solas step.
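A minimal sketch of group mean imputation of this kind (our own illustration, not Solas itself; the names are hypothetical):

    import pandas as pd

    def group_mean_impute(df, target, group):
        # Replace each missing value of `target` by the mean of the
        # observed values within the same group (stratum).
        filled = df.groupby(group)[target].transform(
            lambda s: s.fillna(s.mean()))
        return df.assign(**{target: filled})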
2. Hot deck
The method may be used cell-based, so that five auxiliary variables may be chosen for grouping. There are two options for choosing the donor: (i) the donor is the first respondent in the cell, or (ii) the donor is randomly selected. The former is not really good to use, unless the respondents have been ordered appropriately before this operation. Even then, the first respondent is not necessarily the closest to a nonrespondent, and thus this option does not find a nearest neighbour.
3. Predicted mean imputation
This method utilises a simple multivariate regression model, so that in the case of a metric variable the imputed values are equal to the predicted values, without a random term. Usually this method is called ‘regression imputation’ or something similar. When the variable to be imputed is a binary or categorical variable, a discriminant method is applied to impute the most likely value. We do not give exact test results for the latter method, but we made some simple tests in which first the variable K1 and then the variable DRINKS was binarised (if > 0 then the new variable = 1, otherwise = 0). The results, when this variable was used as a continuous one, were not ideal: in the former case the maximum imputed value was larger than 1, while the minimum was correctly 0; in the latter case the minimum was incorrectly less than 0 but the maximum was correctly 1. A good point was that the average was fairly correct. On the other hand, when this binary variable was used as a nominal one, the Solas run stopped. In the case of an ‘integer’ variable, all the values were correct in principle, but the average for DRINKS was much underestimated, due to the fact that most imputed values were zeros.
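For a metric variable, the method just described amounts to the following (a sketch of regression imputation under our own assumptions, not Solas's implementation):

    import numpy as np

    def predicted_mean_impute(X, y):
        # Fit OLS on the complete cases and replace each missing y by its
        # predicted value; no random term is added. For a binarised target
        # treated as continuous, predictions can fall outside [0, 1],
        # consistent with the behaviour observed in the tests above.
        obs = ~np.isnan(y)
        Xd = np.column_stack([np.ones(len(X)), X])  # add an intercept
        beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
        y_imp = y.copy()
        y_imp[~obs] = Xd[~obs] @ beta
        return y_imp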
4. Last value carried forward method
Imputed values are based on the previously observed value. We have not tested this method because our data are not suitable for this purpose.
4. Multiple imputation

1. Predictive model-based multiple imputation
In effect, this means that random noise, generated with different random numbers, is added to the predicted mean imputation method. We have not been able to verify whether any bounds or other robustness techniques are used; apparently they are not.
2. Propensity score-based multiple imputation
The method first uses logistic regression, using the “information contained in a set of covariates to predict missingness in the variable to be imputed.” The software thus gives a propensity score for each individual, i.e. an estimated probability of missingness. For imputation, the propensity scores may be sub-grouped in three ways:
- “divide score into N sub-classes (default option is 5),”
- “use the N cases which are closest on propensity score,”
- “use N% of cases which are closest on propensity score.”
After that, random hot decking within each sub-group may be used for imputation. When different random numbers are used in each drawing, multiple imputation has been applied. This method thus takes much advantage of the missingness mechanism of each variable being imputed, and since three different ‘donor selection’ criteria may be used, the benefits of the method are enhanced. The approach is analogous to a reweighting method used for unit nonresponse, such as the one presented in Ekholm and Laaksonen (1991). If the imputation model is the best possible, the missingness mechanism is ignorable within the imputation cells (or ‘adjustment cells,’ the corresponding term in reweighting), and these cells have a reasonable number of available donors, the method should work well. A disadvantage of this approach may be that it does not fully exploit the auxiliary data, since it only operates on missingness, not on the values of the target variables together with missingness.
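A hypothetical sketch of this scheme, using the first of the three sub-grouping options (N equal-count sub-classes) and a generic logistic regression; this is our own illustration of the described method, not Solas code:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def propensity_score_mi(X, y, m=5, n_subclasses=5, seed=0):
        # Model missingness of y from the covariates X, divide the cases
        # into sub-classes by propensity score, then randomly hot deck
        # within each sub-class; repeating m times with different random
        # draws gives m multiply imputed data sets.
        rng = np.random.default_rng(seed)
        missing = np.isnan(y)
        model = LogisticRegression(max_iter=1000).fit(X, missing)
        score = model.predict_proba(X)[:, 1]  # estimated P(missing)
        edges = np.quantile(score, np.linspace(0, 1, n_subclasses + 1))
        sub = np.clip(np.searchsorted(edges, score, side="right") - 1,
                      0, n_subclasses - 1)
        data_sets = []
        for _ in range(m):
            y_imp = y.copy()
            for s in range(n_subclasses):
                donors = y[(sub == s) & ~missing]
                holes = np.where((sub == s) & missing)[0]
                if len(donors) and len(holes):
                    y_imp[holes] = rng.choice(donors, size=len(holes))
            data_sets.append(y_imp)
        return data_sets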
MI features
Solas may produce a maximum of 10 multiply imputed data sets, which may be viewed quite nicely, like different sheets in Excel; e.g. the same part of the data file is shown when looking at each particular set. The software also easily provides so-called roll-up statistics, that is, summary statistics over these imputations; see the example below.
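Roll-up statistics of this kind are typically combined using Rubin's (1987) rules; a minimal sketch of the standard combination (our illustration, not a description of Solas's output):

    import numpy as np

    def rubin_combine(estimates, variances):
        # estimates, variances: the m point estimates and their
        # within-imputation variances from the m imputed data sets.
        m = len(estimates)
        q_bar = np.mean(estimates)        # combined point estimate
        w_bar = np.mean(variances)        # within-imputation variance
        b = np.var(estimates, ddof=1)     # between-imputation variance
        t = w_bar + (1 + 1 / m) * b       # total variance
        return q_bar, t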
5. Results Based on Consumption Data

DRINKS
The results in Table 1 concern the variable DRINKS, at household level. We were able to use several auxiliary variables with fully non-missing values, these typically being categorical, such as gender, living region and characteristics of the dwelling, including whether the household has its own sauna and whether it has a mobile phone, as well as education level; moreover, there are some metric variables, such as income (categorised into 10 decile groups), age (which may be categorised) and the number of household members/children. Various applications are presented in Table 1. As expected, model-donor methods do not succeed as well as real-donor methods, especially as far as indicators other than the means are concerned. The best results are from the hot decking applications in the cases where reasonably good imputation cells have been found.
HP5/KP5 = Health, K500 = Medicines

Imputation results for the yearly consumption of health of a household member (HP5) are given in Table 2. As we can see, the nonresponse rate is 24.1%. The estimates calculated straight from the available cases are very close to the corresponding true values. That is, nonresponse is quite ignorable for many estimators, with the exception of the standard deviations, as is easy to see. Because the data are full of 0s, it is also important to consider how well these can be estimated. Hot deck methods seem to work pretty well in estimating the mean of HP5, although it is clearly overestimated due to the skewness of the distribution. The standard deviation, being the hardest to estimate, is usually underestimated by mean-based imputations and over-estimated by hot deck methods, which is rather obvious. Clearly, hot deck methods are more suitable for estimating the number of zeros; those results are very good.

The yearly consumption of health of a household (KP5) is calculated by summing over the yearly consumptions of its members. Missingness is 33.4%, but the proportion of 0s, 3.6%, is much smaller than the proportion of 0s at the member level. Now nonresponse is ignorable in every case, which can also be seen in the imputation results given in Table 3. Simple hot deck imputation with three sorting variables gives the best results. In this method the data are first sorted by the number of members in the household, which seems to be the most important explanatory variable, then by the age of the breadwinner, and then by the socio-economic status of the breadwinner, in order to get suitable donors for the missing values of KP5.

Consumption of medicines (K500) is very hard to estimate because of the large proportion of zeros (57%) in the data, see Table 4. However, nonresponse is clearly non-ignorable for the number of 0s, although it is ignorable in the other cases. The mean and variance are now over-estimated by all methods. Missingness of consumption of health implies missingness of consumption of medicines, so that KP5/HP5 can only be used as a sorting variable if it is first imputed well enough. In one case KP5 was imputed using the sorting order MAS; K500 was then imputed by also using the imputed values of KP5. Usually this method gives at least slightly better results.

Table 5 includes imputation results for K500 from the special member-level imputations, although K500 is a household-level variable. One way to impute a household-level variable is to choose the values of the breadwinners (as in Tables 3 and 4), but tests have shown that information on all members should be taken into account in the imputations.
6. Conclusions

This evaluation of Solas does not cover all aspects of the package (see also another evaluation, by Larsen and Madsen 2000). In particular, monotone missing data patterns have not been dealt with, and not all potentially useful options have been considered thoroughly. Although Solas is user-friendly in general, it requires considerable knowledge of imputation techniques in order to exploit its full potential. The manual is practicable, but still fairly technical.

There are many software packages available for imputation, Solas being one of the pioneers. Its newest version, 2.1, could be used in many real-life applications, including statistical
AUTIMP Evaluative Comments on Imputation Methods of Solas 2.1
institutes. But it requires an experienced user who understands its limitations and is able to fill the gaps of this tool with his/her own programs before and after the Solas step. The software is not fast at all, compared, for example, with SAS, with which we were able to test many of the same methods (based on our own programming). This slowness concerns hot decking especially. Our data file was not very big, so we do not know how long one has to wait when using large data sets. Furthermore, a user should be well informed about his/her data set and its variables: Solas does not give any automatic solutions for any method, or even for any variable. That is, the software does not recommend anything as the expected choice for the case in question. However, it is still very easy to ask the program to do imputations using any method and any variables.
References

Ekholm, A. and Laaksonen, S. (1991). Weighting via Response Modeling in the Finnish Household Budget Survey. Journal of Official Statistics, 325-337.
Laaksonen, S. (2000). Regression-Based Nearest Neighbour Hot Decking. Computational Statistics 15, 1, 65-71.
Larsen, B.S. and Madsen, B. (2000). Evaluation of Solas 2.0 for Imputing Missing Values. Paper prepared for the UN/ECE Work Session on Statistical Data Editing, Cardiff, UK, 18-20 October 2000.
Little, R. (1988). Missing-Data Adjustments in Large Surveys. Journal of Business & Economic Statistics 6, 287-297.
Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. John Wiley & Sons.
Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
Rubin, D., with papers and discussion by B. Fay, J. Rao, D. Binder, J. Eltinge and D. Judkins (1996). Multiple Imputation After 18+ Years. Journal of the American Statistical Association 91, 473-520.
Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. Chapman & Hall.
Schulte Nordholt, E. (1998). Imputation: Methods, Simulation Experiments and Practical Examples. International Statistical Review 66, 157-180.
Solas for Missing Data Analysis, Version 2.1 (2000). Statistical Solutions, Ltd, Cork, Ireland.
Annex 1. Description of the test data file

The initial file is created from the Finnish Household Expenditure Survey (HES) 1996. This survey has been conducted since 1966. Its background is very international, especially concerning the consumption classifications. The survey design is also harmonised to some extent, although the sampling designs may vary more. The survey tries to estimate the yearly consumption of households and, partially, that of household members. The sample survey data are gathered from several sources, but the most difficult part is based on the diaries kept by the households and their members. The reference period in the diaries is only two weeks, but the results are extrapolated to the yearly level. It is natural that the individual values may vary a lot and do not necessarily illustrate the yearly consumption well. Hence, for individual consumption items, there may be a high number of zeros. Nevertheless, the aggregate estimates may (and should) be of high quality. Some consumption items are directly based on interviews covering the full year or 6 months, and, respectively, their values are less sensitive. On the other hand, if a consumption item is of a higher level, such as all alimentation, the individual values are also less sensitive and less often equal to zero.

These characteristics should be understood when approaching an imputation task on this test data file. There is no need to impute the individual-level missing values exactly correctly, but the distribution should be tracked well. In our test results, we thus present averages, standard deviations and some distributional statistics.

The data set is thus derived from the 1996 Finnish HES, which covers about 1000 consumption item variables. In this data set we have, however, only chosen the 137 variables presented in the following SAS table:

   CONTENTS PROCEDURE
   -----Alphabetic List of Variables and Attributes-----
   #     Variable   Type   Len   Pos    Label
   ----------------------------------------------------------------
   16    ADULTS     Num    8     117    Number of ADULTS
   133   AGEM       Num    8     1020   Age of Member
   *     AGEMC                          Classified AGEM (1=(AGEM