arXiv:1311.2276v1 [cs.LG] 10 Nov 2013


A Quantitative Evaluation Framework for Missing Value Imputation Algorithms

Vinod Nair, Rahul Kidambi, Sundararajan Sellamanickam, S. Sathiya Keerthi, Johannes Gehrke, Vijay Narayanan
Microsoft Research, India; CISL, Microsoft, CA; CISL, Microsoft, WA; CISL, Microsoft, CA

Abstract

We consider the problem of quantitatively evaluating missing value imputation algorithms. Given a dataset with missing values and a choice of several imputation algorithms to fill them in, there is currently no principled way to rank the algorithms using a quantitative metric. We develop a framework that treats imputation evaluation as a problem of comparing two distributions and show how it can be used to compute quantitative metrics. We present an efficient procedure for applying this framework to practical datasets, demonstrate several metrics derived from the existing literature on comparing distributions, and propose a new metric called the Neighborhood-based Dissimilarity Score which is fast to compute and provides similar results. Results are shown on several datasets, metrics, and imputation algorithms.

1. Introduction

Many commonly used machine learning algorithms assume that the input data matrix does not have any missing entries. In practice this assumption is often violated. Imputation is one popular approach to handling missing values: it fills in the dataset by inferring the missing values from the observed ones. Once filled, the dataset can be passed on to an ML algorithm as if it were fully observed. Multiple imputation (Little & Rubin, 1986) takes into account the uncertainty in the missing values by sampling each missing variable from an appropriate distribution to produce multiple filled-in versions of the dataset.


Multiple imputation has been an active area of research in Statistics and ML (Burgette & Reiter, 2010; Buuren et al., 1999; Gelman & Raghunathan, 2001; Li et al., 2012; Raghunathan et al., 2001; Su et al., 2011; Templ et al., 2009; 2011; van Buuren, 2007; van Buuren & Groothuis-Oudshoorn, 2011) and has been implemented in several popular Statistics/ML packages (Su et al., 2011; van Buuren & Groothuis-Oudshoorn, 2011). Although there has been much work on multiple imputation, there is little work on quantitatively measuring the relative performance of a set of imputation algorithms on a given dataset. In the special case where it is known ahead of time that the dataset is to be used only for a specific task (e.g. classification), one can use the evaluation metric for that task (e.g. 0/1 loss) to indirectly measure the performance of an imputation algorithm. But here we are interested in directly evaluating the quality of the imputation, without using a subsequent task as a proxy. Current procedures for evaluating imputations are qualitative. Some imputation tools, such as the R packages mi and MICE, allow visual comparison of the histograms of observed values and imputed values for each column of the dataset. Such comparisons can be done only for one or two variables at a time and so are at best a weak sanity check. van Buuren and Groothuis-Oudshoorn mention criteria for checking whether the imputations are plausible, e.g. imputations should be values that could have been obtained had they not been missing, they should be close to the data, and they should reflect the appropriate amount of uncertainty about their 'true' values. But such criteria have not yet been translated into quantitative metrics.


One possibility for quantifying imputation performance is to hold out a subset of observed values from the dataset, impute them, and measure the "reconstruction error" between the known and predicted values using an appropriate distance metric, e.g. Euclidean distance for real-valued variables and Hamming distance for categorical variables. As Rubin (1996) explains, this is not a good metric because there can be uncertainty in the values of the missing variables (conditioned on the values of the observed variables), and directly comparing imputed samples to a single ground-truth value cannot measure this uncertainty. The key quantity needed to evaluate an imputation is the distribution of the missing variables conditioned on the observed variables. Consider a data vector $Y$ with observed variables $Y_{obs}$ and missing variables $Y_{mis}$, with $Y = \{Y_{obs}, Y_{mis}\}$. Let $R$ be a vector of indicator variables indicating which variables in $Y$ are missing or observed. As explained in Little and Rubin (1986) (Chapter 11, page 219, equation 11.4), the joint distribution of $Y$ and $R$ can be written using Bayes rule as

$$P(Y, R) = P(Y_{mis} \mid Y_{obs}, R) \, P(Y_{obs}, R). \quad (1)$$

A dataset with missing values contains samples of $Y_{obs}$ and $R$ generated according to some ground-truth joint distribution $P_{true}(Y, R)$. An imputation algorithm generates samples from an estimate $\hat{P}(Y_{mis} \mid Y_{obs}, R)$. Note that $P(Y_{obs}, R)$ does not depend on the imputation. Thus, if we can compare $P_{true}(Y_{mis} \mid Y_{obs}, R)$ to the $\hat{P}(Y_{mis} \mid Y_{obs}, R)$ estimated by an imputation algorithm, then it is possible to quantify how good the imputation is. The imputation evaluation problem can therefore be turned into a problem of comparing two conditional distributions. In this work we use this connection between imputation evaluation and distribution comparison to develop a quantitative framework for evaluating imputations that can be tractably applied to real-world datasets.

Problem setup: We have 1) a dataset with values missing according to some (possibly unknown) missingness type, and 2) the output of K different imputation algorithms, each producing multiple filled-in versions of the dataset. The goal is to rank the K algorithms according to a chosen metric. Several metric choices are discussed in section 3.

Assumption: It should be possible to estimate a model of $P_{true}(Y_{mis} \mid Y_{obs}, R)$ from the data for each unique missingness pattern $R$. Section 3 shows how this can be done tractably. The samples given by an imputation algorithm from its estimated conditional $\hat{P}(Y_{mis} \mid Y_{obs}, R)$ are then compared against this model, rather than against the original data itself.

Algorithm: The main steps are the following:

• A) Construct a model of $P_{true}(Y_{mis} \mid Y_{obs}, R)$ for each unique $R$ in the dataset.

• B) For each row of the dataset:
1. Generate samples from the model of $P_{true}(Y_{mis} \mid Y_{obs}, R)$.
2. Generate samples from the conditional $\hat{P}_k(Y_{mis} \mid Y_{obs}, R)$ estimated by the k-th imputation algorithm for the current row.
3. Compute a score $s_k$ for the k-th algorithm according to a chosen evaluation metric that compares the two conditional distributions using their corresponding samples.
4. Sort the scores to rank the K algorithms for the current row.

• C) Use the method recommended by Demsar (2006) to combine the rankings over all rows into a single average ranking of the K algorithms. Confidence intervals are assigned to the average ranks to indicate which algorithms have significantly different imputation quality.

There are several possible choices for 1) the model of $P_{true}(Y_{mis} \mid Y_{obs}, R)$ and 2) the metric for distribution comparison. Our evaluation framework is general in that it is not tied to any specific choice for either of these. Nevertheless, in section 3 we discuss a few specific choices that are tractable and work well for high-dimensional data, e.g., problems with hundreds of features. For modeling $P_{true}(Y_{mis} \mid Y_{obs}, R)$ we propose a Markov Random Field-based approach. We consider two choices: (a) a separate model for each missingness pattern, and (b) a single model for all missingness patterns. For distribution comparison metrics we consider 1) Kullback-Leibler (KL) divergence, 2) symmetric KL divergence, and 3) two-sample testing for equality of two distributions. The challenge is that only samples of the true and imputed distributions are available. Recent advances in kernel-based approaches to estimating these metrics from samples have made it possible to apply them to high-dimensional data. We estimate KL divergence using the convex risk minimization approach of Nguyen et al. (2010), symmetric KL using the cost-sensitive binary classification approach of Reid & Williamson (2011), and perform two-sample testing using Maximum Mean Discrepancy (MMD) (Zaremba et al., 2013; Gretton et al., 2012; Gretton et al.). In addition, we propose a new metric called the Neighborhood-based Dissimilarity Score (NDS) that is faster to compute while providing similar results. We demonstrate the framework on several real datasets to compare four representative imputation algorithms from the literature.
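To make steps A-C concrete, the following is a minimal Python sketch of the evaluation loop. It is an illustration, not the authors' implementation: the helper sample_true and the per-algorithm samplers are hypothetical stand-ins for the model and imputation code, the metric is any of the scores from section 3 with lower meaning better, and the Nemenyi constant q_beta is an assumed look-up-table value for four algorithms at beta = 0.05.

```python
import numpy as np
from scipy.stats import rankdata

def evaluate_imputers(rows, metric_fn, sample_true, samplers, L=25):
    """Rank K imputation algorithms following steps A-C.

    rows        : iterable of (y_obs, pattern) pairs for rows with missing values
    metric_fn   : callable(samples_true, samples_imputed) -> score (lower = better)
    sample_true : callable(y_obs, pattern, L) -> (L, #missing) samples from the true-model conditional
    samplers    : list of K callables with the same signature, one per imputation algorithm
    """
    K = len(samplers)
    rank_rows = []
    for y_obs, pattern in rows:                          # step B
        s_true = sample_true(y_obs, pattern, L)          # B1: samples from the true model
        scores = [metric_fn(s_true, s(y_obs, pattern, L))
                  for s in samplers]                     # B2-B3: samples + score per algorithm
        rank_rows.append(rankdata(scores))               # B4: rank 1 = best (lowest score)
    R = np.array(rank_rows)
    avg_rank = R.mean(axis=0)                            # step C: average rank per algorithm
    # Nemenyi critical difference (Demsar, 2006); q_beta = 2.569 is an assumed
    # table value for K = 4 algorithms at beta = 0.05.
    q_beta = 2.569
    cd = q_beta * np.sqrt(K * (K + 1) / (6.0 * len(R)))
    return avg_rank, cd
```

Two average ranks are treated as significantly different only when their gap exceeds the returned critical difference.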


Figure 1. Results of evaluating imputation algorithms on the Mixture of Gaussians (MoG) dataset and USPS. (a) Mixture components are shown as one-standard-deviation ellipses. Columns are different missing examples, rows are algorithms. The true conditional distributions for the columns are: 1) P(x, y | label = blue), 2) P(x | y = 4, label = blue), 3) P(x, y | label = red), 4) P(y | x = 6, label = blue), 5) P(x, y | label = red). The quality of the imputations can be judged by visually comparing the samples to the correct conditional distribution. (b) Average ranks computed by KL, MMD score, and NDS for the full MoG dataset. (c) Each of three sets of images is a different USPS digit with missing values. The top row of each set shows the true image (left) and the missing pixel indicators as a binary mask (right). The next four rows show five samples generated by A1 to A4, from top to bottom. (d) Average ranks computed by KL, MMD score, and NDS for the full USPS dataset.

Notes: 1) The evaluation does not require knowledge of the missingness type, such as whether values are Missing Completely at Random (MCAR) or Not Missing at Random (NMAR) (Little & Rubin, 1986). It simply checks whether the samples generated by the imputation algorithm are distributed correctly compared to the true distribution. The imputation algorithm needs to be aware of the missingness type, but the evaluation does not. 2) In our experiments we only consider up to 50% missingness. Imputation is typically used for problems where the missingness percentage is not very high. At much higher percentages, e.g. 99% missingness in collaborative filtering datasets, more specialized approaches are used and evaluation focuses more on predicting a few relevant missing entries rather than all of them.

We can use this framework to evaluate the quality of imputation algorithms on different datasets, or to understand the sensitivity of various imputation algorithms to different missingness types. The experiments presented in this paper are a first step towards such deeper investigations.

Contributions: To our knowledge this is the first work that proposes the following.

• A principled framework for computing quantitative metrics to evaluate imputations and rank algorithms with statistical significance tests. Its generality allows different choices of distribution models and distribution comparison metrics to be plugged in according to the problem.

• The Neighborhood-based Dissimilarity Score, which is two orders of magnitude faster than KL, symmetric KL, and MMD, while giving similar results.

• An efficient algorithm for jointly building distribution models for each unique missingness pattern.

Outline: Section 2 presents two motivating examples that demonstrate our evaluation framework on datasets where it is possible to visually check the quality of the imputations. Section 3 explains the details of constructing the distribution models, the metrics used for comparing distributions, and statistical significance testing for the ranking of the algorithms. Section 4 presents the experimental results on several datasets, followed by a closing discussion in section 5.

2. Illustrative Examples

Before explaining the details of our framework, we first show results on two datasets for which the imputations can be visualized. This allows a visual comparison of how good the different imputation algorithms are, which can then be checked against the evaluation produced by our framework.


Datasets: The first dataset is generated from a two-dimensional mixture of Gaussians with two components having equal prior probability. The Gaussians are shown in figure 1 by their one-standard-deviation ellipses. The value of the component variable is also included in the dataset as a class label, color-coded in the figure by red and blue. We use 2000 data points sampled from the distribution. The second dataset is the USPS handwritten digits, which contains 16x16 binary images flattened into 256-dimensional vectors. We select the images for digits 2 and 3 to construct a dataset with 1979 rows and 256 columns. 50% of the values are removed from both datasets completely at random.

Imputation algorithms & metrics: The missing values are imputed using four different algorithms: Mode/Mean, Mixture Model, k-Nearest Neighbors (k-NN), and MICE (labeled A1, A2, A3, A4, respectively, in figure 1). These algorithms are described in section 4. For the 2-D mixture of Gaussians, we also include the true distribution itself as an imputation algorithm (labeled A5 in the plots) by using its conditionals to sample the missing variables given the observed ones. This is a good sanity check, since a good metric should rank it as the best algorithm. We use KL divergence, MMD score, and NDS as the metrics (defined in section 3).

Results: Figure 1 summarizes the results for both datasets. To give a visual feel for the quality of the imputations, we show a few specific data vectors and the samples generated by the algorithms to fill them in. To interpret these plots, the caption of figure 1 lists the variables that are missing for each example from the MoG dataset and explains which pixels are missing in the USPS images. The rank plots show the average rank attained by each algorithm over all the rows with missing values in the dataset. Error bars show 95% confidence intervals for significant differences. Visually judging the imputations, A3 (k-NN) and A4 (MICE) produce the most plausible imputations for both datasets, while A1 (Mode/Mean) and A2 (Mixture Model) produce the worst ones. The ordering computed by all three metrics in figure 1 agrees with this judgement. The true distribution for the MoG dataset has the best average rank for all three metrics, which is reassuring. Interestingly, MICE ties with the true distribution as the best algorithm. In the case of USPS, the ranking A1 to A4 (worst to best) is visually clear and is consistent with the average ranks and the small error bars.

The results in section 4 show a similar ordering of algorithms on datasets where the rankings cannot be verified in some other way. Performing this visual verification and corroborating its results with those on other datasets gives us more confidence that the evaluation framework is sensible.
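For reference, the MoG dataset above can be emulated with a few lines of code. The sketch below uses made-up component means and covariances (the paper does not list them) and removes 50% of all entries completely at random.

```python
import numpy as np

def make_mog_dataset(n=2000, missing_frac=0.5, seed=0):
    """Two-component 2-D mixture of Gaussians with equal priors, plus a class label.
    Means and covariances are illustrative placeholders, not the paper's parameters."""
    rng = np.random.default_rng(seed)
    means = np.array([[2.0, 2.0], [7.0, 7.0]])
    covs = np.array([[[1.0, 0.5], [0.5, 1.0]],
                     [[1.5, -0.4], [-0.4, 0.8]]])
    labels = rng.integers(0, 2, size=n)                  # equal prior probability
    X = np.stack([rng.multivariate_normal(means[c], covs[c]) for c in labels])
    data = np.column_stack([X, labels.astype(float)])    # columns: x, y, label

    # MCAR removal: each entry is dropped independently with the same probability.
    mask = rng.random(data.shape) < missing_frac
    data_missing = data.copy()
    data_missing[mask] = np.nan
    return data, data_missing, mask
```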

3. Evaluation Approach

We provide details of the three key parts of the algorithm: (1) model construction and sample generation (steps A and B1) (section 3.1), (2) score computation using metrics (step B3) (section 3.2), and (3) ranking of algorithms with statistical significance testing (steps B4 and C) (section 3.3). We make the following assumptions. (a) There is a sufficient number of examples with no missing values for learning a good true model; this is true in many practical scenarios. (b) A conditional probabilistic model powerful enough to capture the underlying distribution can be built using these examples. (c) Generating samples from the model distribution can be done easily via Gibbs sampling.

Notation: Let $D_N$ and $D_M$ denote the sets of examples with no missing values and with missing values, respectively. Let $m_j \subset \{1, \ldots, d\}$, $j = 1, \ldots, J$, denote a subset of variables for which values are missing in $D_M$, with $d$ denoting the number of variables. Each subset $m_j$ refers to a unique missingness pattern, and there are $J$ missingness patterns in $D_M$. Also, let $o_j = \{1, \ldots, d\} \setminus m_j$ be the observed variables. Let $Y_m^{(i)}$ and $Y_o^{(i)}$ denote the sets of missing and observed variables in the $i$-th example, respectively.

3.1. Distribution Model Learning

The first step is to build a distribution model such that for each missing pattern $m_j$ we can generate samples given the values of the corresponding observed pattern $o_j$. One possibility is to build a joint model $P(Y; \theta)$ using the set $D_N$ (where $\theta$ is the model parameter). However, this may be hard and intractable; more importantly, it is not necessary. This is because we always have some observed variables that can be used to predict the missing variables. Furthermore, conditional models are relatively easier to build. For these reasons, for each $j$ we build a conditional distribution model $P(Y_{m_j} \mid Y_{o_j}; \theta_j)$ for the missing pattern $m_j$ in $D_M$ using the fully observed dataset $D_N$; here, $\theta_j$ denotes the $j$-th model's parameter vector.

Model and Learning: While any good conditional model is sufficient for our method to work, we use the pairwise Markov random field (MRF) in this work. For ease of notation, let us drop the subscript from $m_j$. We use the model

$$P(Y_m \mid Y_o; \theta) = \frac{1}{Z(Y_o)} \exp\big(s(Y_m; \theta_m) + s(Y_m; Y_o, \theta_{mo})\big),$$

where $Z(Y_o) = \sum_{Y_m} \exp\big(s(Y_m; \theta_m) + s(Y_m; Y_o, \theta_{mo})\big)$ is the partition function, and $s(Y_m; \theta_m)$ and $s(Y_m; Y_o, \theta_{mo})$ denote parametrized scoring functions involving self and cross features among the missing and observed variables. In our experiments we use a simple linear model for the scoring function: $s(Y_m; \theta_m) = \sum_{k \in m} \sum_{l: \bar{y}_l \in \mathcal{Y}_k} \theta_{k,l} \, I(y_k = \bar{y}_l)$, where $\mathcal{Y}_k$ is the label space of the $k$-th variable and $I(\cdot)$ is the indicator feature function. The scoring function $s(Y_m; Y_o, \theta_{mo})$ is defined using similar feature functions involving a pair of variables and associated parameters. Given a missing pattern $m$, we construct a training set comprising input-target pairs $\{(Y_o^{(i)}, Y_m^{(i)}) : i = 1, \ldots, n\}$ from $D_N$ and learn the model parameter $\theta$ using the piecewise pseudo-likelihood maximization training approach suggested by Sutton and McCallum (2008).
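The learned model is used to generate imputations by Gibbs sampling, described next. As an illustration (not the authors' code), here is a sketch of one Gibbs update for a single missing categorical variable under a pairwise MRF with indicator features; the containers theta_self and theta_pair holding the learned parameters are hypothetical.

```python
import numpy as np

def gibbs_update(k, y, theta_self, theta_pair, n_labels, rng):
    """Resample variable k from its conditional under a pairwise MRF.

    y          : current assignment of all variables (integer category labels)
    theta_self : theta_self[k][l] = weight of the indicator I(y_k = l)
    theta_pair : dict; theta_pair[(k, j)] is a (|Y_k| x |Y_j|) weight matrix
                 for the pairwise indicators I(y_k = l, y_j = l')
    n_labels   : n_labels[k] = size of the label space of variable k
    """
    scores = np.array(theta_self[k], dtype=float)       # self terms for each label of y_k
    for j in range(len(y)):
        if j != k and (k, j) in theta_pair:
            scores += theta_pair[(k, j)][:, y[j]]        # pairwise terms with the current y_j
    probs = np.exp(scores - scores.max())                # softmax over the label space of y_k
    probs /= probs.sum()
    y[k] = rng.choice(n_labels[k], p=probs)
    return y
```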

Samples from the distribution model: Using the model parameters learned for each missing pattern $m_j$, we generate multiple imputations for each example in $D_M$ (having the missing pattern $m_j$) via Gibbs sampling. For instance, sample generation for a discrete variable involves sampling a multinomial distribution with probabilities that can be easily computed from the conditional MRF distribution.

Single Model: One disadvantage of the above modeling is that we need to build $J$ conditional models. However, these models can be built in parallel, and this is a one-time effort. Alternatively, we can build a single conditional model to deal with all missing patterns. The key difference is that we learn a single model parameter vector $\theta$ (instead of one model parameter vector $\theta_j$ per missing pattern $m_j$). In this case there is only one training set, and the log-likelihood term for each example involves a different missing pattern. Therefore, the subset of model parameters used for each example is different, depending on the missing and non-missing variables involved; clearly, parameters overlap across examples. Building a single model helps in reducing the computational complexity. Though such a model is expected to be inferior in prediction quality compared to the per-missing-pattern model, we found it adequate to meet the requirement of ranking the methods correctly.

3.2. Metrics

Given samples ($Y_m$) from the model (true) conditional distribution ($P$) and multiple imputations ($\hat{Y}_m$) obtained from an imputation algorithm $k$ (with its distribution $\hat{P}_k$), we need a score/metric for measuring the discrepancy between the samples $Y_m$ and $\hat{Y}_m$. In this section we briefly describe the metrics (popular ones selected from the literature) used in our experiments.

It is worth pointing out that the challenge here is the reliable estimation of metrics between two distributions using samples generated (independently) from them. Given its importance in applications, this problem has received much attention in the literature and several good methods have been proposed. Our work is the first to make use of them for comparing missing value imputation algorithms. For ease of notation, in the discussion below we drop the subscript $k$ indicating the dependency on the imputation algorithm from $\hat{P}_k$. Below we assume that $|Y_m| = |\hat{Y}_m|$; for all metrics other than KL divergence estimation, this condition can be relaxed.

KL divergence via convex optimization: Nguyen et al. (2010) show that the metric estimation problem can be posed as an empirical risk minimization problem and solved using standard convex programs. Estimating KL divergence is one instantiation of this problem; the estimate is obtained from the solution of the optimization problem

$$\min_{\alpha > 0} F(\alpha) = -\frac{1}{L} \sum_{u=1}^{L} \log(L \alpha_u) + \frac{1}{2 \lambda_L L} \, g(\alpha, Y_m, \hat{Y}_m), \quad (2)$$

where $\lambda_L$ is a regularization constant, $L$ is the number of imputation samples, $g(\alpha, Y_m, \hat{Y}_m)$ is a quadratic form in $\alpha$ built from the kernel evaluations $K(y_m^{(u)}, y_m^{(v)})$, $K(y_m^{(u)}, \hat{y}_m^{(v)})$, and $K(\hat{y}_m^{(u)}, \hat{y}_m^{(v)})$ within and across the two sample sets, and $K(\cdot, \cdot)$ is a kernel function. We used the Gaussian kernel $K(y, \hat{y}) = \exp(-\|y - \hat{y}\|^2 / \sigma)$ and set the kernel width parameter $\sigma$ to 1 in our experiments. Finally, the KL divergence value is computed from the solution $\hat{\alpha}$ of (2) as $-\frac{1}{L} \sum_{u=1}^{L} \log(L \hat{\alpha}_u)$.

Symmetric KL divergence via binary classification: It is known that KL divergence is not symmetric in its two arguments, and it is often useful to consider the Jensen-Shannon divergence, which is symmetric and is defined as $\frac{1}{2}\big(\mathrm{KL}(P, \frac{P + \hat{P}}{2}) + \mathrm{KL}(\hat{P}, \frac{P + \hat{P}}{2})\big)$. While it is theoretically possible to adapt the estimation approach suggested by Nguyen et al. (2010) to estimate the symmetric KL divergence, it does not result in a simple convex program like (2). So we take a different route. Reid and Williamson (2011) developed a unified framework showing that many machine learning problems can be viewed as binary experiments, and that various divergence measures can be estimated by solving a binary classification problem with suitably weighted loss functions for the two categories. In our context, the symmetric KL divergence can be estimated as follows: (a) assign class labels +1 and -1 to the samples from the model distribution and the imputation algorithm, respectively, and (b) build a classifier using the standard logistic loss function. The intuition is that the more divergent the distributions are, the easier the classification task becomes; therefore, the error rate measures the discrepancy between the distributions.

Maximum Mean Discrepancy: Gretton et al. (2012) proposed a framework in which a statistical test can be conducted to compare samples from two distributions. The main idea is to compute a statistic known as the maximum mean discrepancy (MMD) and use an asymptotic distribution of this statistic to conduct the hypothesis test $H_0: P = \hat{P}$ with a suitably defined threshold. The MMD statistic is defined as the supremum, over a function class, of the difference between mean function values computed from the samples of the respective distributions. Gretton et al. (2012) show that functions from a reproducing kernel Hilbert space (RKHS) are rich enough for computing this statistic from finite samples and derive conditions under which MMD can be used to distinguish the distributions. Given the MMD function class $\mathcal{F}$, taken as the unit ball in an RKHS $\mathcal{H}$ with kernel $K(y, \hat{y})$ (e.g., Gaussian), the statistic is

$$\widehat{\mathrm{MMD}}_u^2(\mathcal{F}, Y_m, \hat{Y}_m) = g_{yy}(Y_m) + g_{\hat{y}\hat{y}}(\hat{Y}_m) - g_{y\hat{y}}(Y_m, \hat{Y}_m), \quad (3)$$

where $g_{y\hat{y}}(Y_m, \hat{Y}_m) = \frac{1}{L(L-1)} \sum_{u=1}^{L} \sum_{v \neq u} K(y_m^{(u)}, \hat{y}_m^{(v)})$, and $g_{yy}(Y_m)$ and $g_{\hat{y}\hat{y}}(\hat{Y}_m)$ are defined similarly.
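The statistic in (3) can be computed directly from the two sample matrices. The sketch below assumes a Gaussian kernel with width sigma = 1 (the setting used in the experiments) and follows equation (3) as written above; it is an illustration of that definition, not a library-grade MMD implementation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / sigma)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma)

def mmd_score(Y_true, Y_imp, sigma=1.0):
    """MMD-style score between true-model samples and imputed samples, per (3).
    Note: the usual unbiased MMD^2 estimator doubles the cross term and averages
    it over all L^2 pairs; here we follow (3) as stated, with diagonals excluded."""
    L = Y_true.shape[0]
    K_tt = gaussian_kernel(Y_true, Y_true, sigma)
    K_ii = gaussian_kernel(Y_imp, Y_imp, sigma)
    K_ti = gaussian_kernel(Y_true, Y_imp, sigma)
    off = 1.0 / (L * (L - 1))
    g_yy = off * (K_tt.sum() - np.trace(K_tt))
    g_ii = off * (K_ii.sum() - np.trace(K_ii))
    g_yi = off * (K_ti.sum() - np.trace(K_ti))
    return g_yy + g_ii - g_yi
```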

To reduce the computational complexity and produce low-variance estimates despite the finite-sample issues associated with computing (3), Zaremba et al. (2013) propose a statistical test known as the B-test. Based on this theoretical backing, we pick two metrics for our study on ranking imputation algorithms: (a) the MMD statistic and (b) the B-test result. The B-test gives 0 or 1 as its output (with 1 indicating acceptance of $H_0$); therefore, there will be ties. However, the rank aggregation algorithm that we describe in section 3.3 can handle this.

Neighborhood-based Dissimilarity Score: One disadvantage of the metrics described above is that they are computationally expensive, since we need to solve the associated problems for each example in $D_M$. This becomes an issue especially when $J$ (the number of missingness patterns) is large. We propose a neighborhood-based dissimilarity score (NDS) that is cheaper to compute and is defined as

$$\mathrm{nds}(h, Y_m, \hat{Y}_m) = \frac{1}{L^2} \sum_{u=1}^{L} \sum_{v=1}^{L} w_{uv} \, h(y_m^{(u)}, \hat{y}_m^{(v)}), \quad (4)$$

where $h(\cdot, \cdot)$ is a distance (e.g., Hamming distance) computed between a pair of examples on the imputed values, and $w_{uv}$ is a weighting factor given by $w_{uv} = \lambda h(y_m^{(u)}, \hat{y}_m^{(v)}) + (1 - \lambda)\big(h_{\max} - h(y_m^{(u)}, \hat{y}_m^{(v)})\big)$; here $\lambda$ is a scaling parameter and $h_{\max}$ is the maximum Hamming distance possible. Note that $w_{uv}$ gives more weight to neighbors, and $\lambda$ controls the rate at which the weight falls off with distance. We used $\lambda = 0.1$ in our experiments. The overall complexity is $O(J L^2)$, but $L$ is typically small (e.g., 25). NDS is closely related to (3), but with three key differences: (a) we consider only the cross terms, (b) there are no kernel parameters to learn in NDS, and (c) the weighting is not uniform ($w_{uv}$ compared to the uniform weighting $\frac{1}{L(L-1)}$ used in (3)). Our experimental results show that NDS is a good metric in the sense of producing results similar to the MMD and KL based metrics.

3.3. Rank Aggregation

Consider one metric $\mathcal{M}$. For each algorithm $k \in \mathcal{A}$ (the set of algorithms to compare), we compute the score $s_{k,\mathcal{M}}^{(i)}$ for the $i$-th example in $D_M$ using the procedures presented in the previous subsection. Using these scores, we rank the algorithms for each $i$. Collecting the ranks over all $i$ results in a rank matrix $R_{\mathcal{M}}$, with each row representing an example and the columns representing the ranks of the algorithms in $\mathcal{A}$. To compare the algorithms, we draw an analogy to the problem of statistically comparing multiple classifiers over multiple datasets (Demsar, 2006). In particular, we treat the multiple imputations for each example as one dataset, so that we have scores on $|D_M|$ datasets for a given algorithm (and the chosen metric). Our goal is to statistically compare the performance of the algorithms in $\mathcal{A}$ on these datasets. Following Demsar (2006), we conduct the Friedman test with the null hypothesis that all algorithms are equivalent. This test involves computing the average rank of each algorithm, followed by computing a statistic and checking it against the $\chi_F^2$ distribution for a given confidence level $\beta$ (we used 0.05). It turned out that the null hypothesis was always rejected for the algorithms we considered, so we conducted the Nemenyi test as the post-hoc test (Demsar, 2006). In this test we compute the critical difference (CD), defined as $\mathrm{CD} = q_\beta \sqrt{\frac{|\mathcal{A}|(|\mathcal{A}|+1)}{6 |D_M|}}$, where $q_\beta$ is obtained from a look-up table. The observed average rank difference between any two methods is statistically significant when it is greater than CD. We demonstrate this in our experiments and make key observations.
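As an illustration of equation (4) for categorical data with Hamming distance, here is a sketch of NDS (not the authors' implementation); the weighting follows the definition of $w_{uv}$ above with $\lambda = 0.1$.

```python
import numpy as np

def nds(Y_true, Y_imp, lam=0.1):
    """Neighborhood-based Dissimilarity Score for categorical samples (equation (4)).

    Y_true : (L, p) samples of the missing variables from the true-model conditional
    Y_imp  : (L, p) multiple imputations from one algorithm (integer category labels)
    """
    L, p = Y_true.shape
    h_max = p                                   # maximum possible Hamming distance
    # Pairwise Hamming distances between every true sample and every imputed sample.
    h = (Y_true[:, None, :] != Y_imp[None, :, :]).sum(axis=2).astype(float)
    # Weight: small distances (neighbors) receive most of the weight when lam is small.
    w = lam * h + (1.0 - lam) * (h_max - h)
    return (w * h).sum() / (L * L)
```

With samples encoded as integer category labels, a lower score suggests that the imputed samples lie closer to the true-model samples.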

4. Experiments

We apply our framework to four UCI datasets and four representative imputation algorithms from the literature.


Table 1. Datasets used for experiments.

Dataset    #Rows  #Cols
Flare        838      9
Mushroom    2027     10
SPECT        213     22
Yeast       1200     14


Figure 3. Average rank of the four imputation algorithms on the KL (left column), MMD score (middle column) and NDS (right column) metrics, for the Yeast dataset with missing entries ranging from 10% (top row), 30% (middle row) and 50% (bottom row).

We present the ranking results for the per-pattern model and show that the rankings are similar to those in section 2. We then compare the results of the per-pattern model approach and the single-model approach and show that the latter gives nearly identical results but is much cheaper to train. We then consider various properties of the metrics, such as the self-consistency of the rankings for different choices of the "true" distribution and their ability to discriminate among the algorithms with statistically significant rank differences.

Datasets: See table 1. In the case of Mushroom, we selected the first 10 columns of the original dataset to keep the running times low. None of the datasets originally contain missing values. Values are removed completely at random, i.e., the probability of a variable being missing is independent of its value and of the values of the other observed variables. The missing percentage is varied from 10% to 50% in steps of 10. We use the full datasets without any missing values to build the distribution models.

Imputation algorithms: The following algorithms are included in the experiments:

A1: Mode/Mean: A simple baseline that samples from the marginal distribution of each column, fit by a Gaussian for continuous variables and a multinomial distribution for categorical variables.

A2: Mixture Model: A mixture model in which each mixture component is a product of univariate distributions. Inference and learning are done using variational Bayesian inference (Winn & Bishop, 2005).

A3: k-Nearest Neighbors: We implement the k-NN algorithm of Hastie et al. (1999). Multiple samples are drawn from the distribution of values for the missing variable given by the set of nearest neighbors.

A4: MICE: Multiple Imputation using Chained Equations (MICE) (van Buuren & Groothuis-Oudshoorn, 2011) is a state-of-the-art imputation algorithm which builds a predictor for each variable using all other variables as inputs. The learning alternates between predicting the missing values using the current predictors and updating the predictors using the current estimates of the missing values.

These algorithms are selected as representatives of commonly used approaches, rather than as a comprehensive set.

Ranking results: We have computed results using the per-pattern distribution models for 4 imputation algorithms, 5 metrics, 4 datasets, and 5 missing percentages, for a total of 400 combinations. The results can be analyzed along any of these dimensions to derive insight. For example, figure 2 shows the results for the four algorithms and five metrics on Yeast with 30% missing values; all metrics pick MICE as the best algorithm. If we want to understand the effect of the missing percentage on the performance of the imputation algorithms, we can compute the average ranks over all the datasets for each missing percentage and metric. Figure 3 shows the results for three missing percentages and three metrics (we omit some plots to save space). As the missing percentage increases, the algorithms become harder to distinguish under the KL and MMD score metrics, while NDS maintains discriminability. If we want to know the overall ranking of the imputation algorithms for each metric, we can aggregate the results across all datasets and all missing percentages. We observe that MICE is overall the best imputation algorithm and Mode/Mean the worst, with Mixture Model and k-NN in between. These are some examples of the kinds of analysis that become possible with our framework.

Per-pattern vs. single models: In section 3 we propose an alternative to learning a separate distribution model for each unique missingness pattern: learning all of them jointly with a single set of parameters.
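For concreteness, here is a minimal sketch of the A1 (Mode/Mean) baseline described above. It is an illustration under the stated description, not the authors' implementation, and assumes the column types are supplied explicitly.

```python
import numpy as np

def marginal_imputer(data, is_categorical, n_imputations=5, seed=0):
    """A1 baseline: sample each missing entry from its column's marginal distribution.

    data           : 2-D float array with np.nan marking missing entries
    is_categorical : list of booleans, one per column
    Returns a list of n_imputations completed copies of the data matrix.
    """
    rng = np.random.default_rng(seed)
    completed = []
    for _ in range(n_imputations):
        filled = data.copy()
        for j in range(data.shape[1]):
            col = data[:, j]
            observed = col[~np.isnan(col)]
            missing_idx = np.where(np.isnan(col))[0]
            if len(missing_idx) == 0:
                continue
            if is_categorical[j]:
                # Multinomial fit to the observed categories of this column.
                values, counts = np.unique(observed, return_counts=True)
                probs = counts / counts.sum()
                filled[missing_idx, j] = rng.choice(values, size=len(missing_idx), p=probs)
            else:
                # Gaussian fit to the observed values of this column.
                mu, sd = observed.mean(), observed.std()
                filled[missing_idx, j] = rng.normal(mu, sd, size=len(missing_idx))
        completed.append(filled)
    return completed
```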


Figure 2. Average rank of the four imputation algorithms on each of the five metrics (KL, symmetric KL, MMD B-test, MMD score, NDS) for the Yeast dataset with 30% entries missing at random.

Table 2. % change in average rank between the per-pattern model and the single model, averaged over all imputation algorithms, datasets, and missing percentages.

Metric    KL     Sym. KL  MMD B-test  MMD score  NDS
% change  2.27%  1.10%    1.40%       2.06%      0.36%


Figure 5. Inconsistency scores for various metrics on Flare with different missing percentages (panels: 20%, 30%, 40%, and 50% missing). The metric labels are: MMD B-test = 1, MMD score = 2, SymmKL = 3, KL = 4, NDS = 5.

Figure 4. Rankings of imputation algorithms with respect to different 'true' distributions for Flare with 10% missing values and the NDS metric (left panel: true distribution = A1; right panel: true distribution = A4). The algorithm labels are: A1 = Mode/Mean, A2 = Mixture Model, A3 = k-NN, A4 = MICE, A5 = Single model distribution.

Here we verify whether learning a combined model in this manner introduces any approximation error that causes the average rankings to change. We re-run the same 400 combinations of algorithms, metrics, datasets, and missing percentages from the previous experiment, but now using the single distribution model instead of the per-pattern model. The percentage difference in the average rank values between the two models for each metric is given in table 2. The differences in the average rank values are too small to alter the rankings. (Note that the critical difference values do not depend on the choice of per-pattern vs. single model.) So both models give identical rankings over all metrics, datasets, and missing percentages.

Measuring metric inconsistency: We can use our framework to investigate whether an evaluation metric has certain desirable properties and, based on the results, decide on its usefulness.

One such property we examine here is how "inconsistent" a metric is. Suppose we have K algorithms A1 to AK. We can rank all algorithms except Ai by treating its samples as coming from the 'true' distribution and comparing the rest against it. K such rankings can be computed, for i = 1 to K, and we can check how inconsistent these rankings are with each other. Consider the example in figure 4. When A1 is used as the true distribution, we see that A4 has a small average rank difference with A3 and A5, and a large difference with A2. When A4 itself is used as the true distribution, the ranking of {A2, A3, A5} follows the same pattern: A3 and A5 are ranked as closest to A4 while A2 is ranked as distant from A4. So for the pair (A1, A4) the ranking of {A2, A3, A5} given by the average rank differences in the left plot matches the ranking in the right plot. When they do not match, we call the pair inconsistent. We quantify the inconsistency by computing the Kendall-Tau distance between the two rankings for every pair of algorithms and averaging the distances. Figure 5 shows the inconsistency scores computed for five metrics on Flare for different missing percentages. Note that NDS has the lowest inconsistency score among all the metrics.


The difference between NDS and the other metrics is particularly large at the lower missing percentages. Similar results are observed on the other datasets as well. This provides further evidence that NDS is a good metric for evaluating imputations.
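A sketch of the inconsistency computation described above: for a pair (A_i, A_j), compare how the remaining algorithms are ranked when A_i is treated as the 'true' distribution versus when A_j is, using a normalized Kendall-Tau distance, and average over all pairs. The rankings dictionary is a hypothetical container; producing it reuses the per-example scoring from section 3.

```python
from itertools import combinations
from scipy.stats import kendalltau

def kendall_tau_distance(rank_a, rank_b):
    """Normalized Kendall-Tau distance between two rankings of the same algorithms
    (0 = identical orderings, 1 = completely reversed)."""
    tau, _ = kendalltau(rank_a, rank_b)
    return (1.0 - tau) / 2.0

def inconsistency_score(rankings):
    """rankings[i][j] = rank of algorithm j when algorithm i's samples are treated
    as the 'true' distribution (algorithm i itself is left out of its own ranking).

    For every pair (i, j) we compare how the *other* algorithms are ordered when i
    is the reference versus when j is the reference, and average the distances."""
    algos = sorted(rankings.keys())
    dists = []
    for i, j in combinations(algos, 2):
        others = [a for a in algos if a not in (i, j)]
        r_i = [rankings[i][a] for a in others]
        r_j = [rankings[j][a] for a in others]
        dists.append(kendall_tau_distance(r_i, r_j))
    return sum(dists) / len(dists)
```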

5. Conclusion

We presented a framework to quantitatively evaluate and rank the goodness of imputation algorithms. It allows users to (a) work with any reasonable conditional modeling choice, (b) use metrics that are efficient and consistent (any improved metrics that appear in the future as this field matures can be easily incorporated), (c) compare metrics using new consistency checks, and (d) rank algorithms using well-established statistical significance tests.

References

Burgette, L. F. and Reiter, J. P. Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology, 172(9):1070-1076, 2010.

Buuren, S. van, Boshuizen, H. C., and Knook, D. L. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, 18:681-694, 1999.

Demšar, Janez. Statistical comparisons of classifiers over multiple data sets. JMLR, 7:1-30, December 2006.

Gelman, Andrew and Raghunathan, Trivellore E. Using conditional distributions for missing-data imputation. Statistical Science, 3:268-269, 2001.

Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., and Fukumizu, K. Optimal kernel choice for large-scale two-sample tests. In NIPS, pp. 1214-1222, 2012.

Gretton, Arthur, Borgwardt, Karsten M., Rasch, Malte J., Schölkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. JMLR, 13(1):723-773, March 2012.

Hastie, Trevor, Tibshirani, Robert, Sherlock, Gavin, Eisen, Michael, Brown, Patrick, and Botstein, David. Imputing missing data for gene expression arrays. Technical report, 1999.

Li, F., Yu, Y., and Rubin, D. B. Imputing missing data by fully conditional models: Some cautionary examples and guidelines. 2012.

Little, Roderick J. A. and Rubin, Donald B. Statistical Analysis with Missing Data. John Wiley & Sons, Inc., 1986.

Nguyen, XuanLong, Wainwright, Martin J., and Jordan, Michael I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, November 2010.

Raghunathan, Trivellore E., Lepkowski, James M., Van Hoewyk, John, and Solenberger, Peter. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 2001.

Reid, Mark D. and Williamson, Robert C. Information, divergence and risk for binary experiments. JMLR, 12:731-817, July 2011.

Rubin, Donald B. Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473-489, 1996.

Su, Yu-Sung, Gelman, Andrew, Hill, Jennifer, and Yajima, Masanao. Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software, 45(2):1-31, 2011.

Templ, M., Alfons, A., and Filzmoser, P. Exploring the multivariate structure of missing values using the R package VIM. The R User Conference, 2009.

Templ, Matthias, Kowarik, Alexander, and Filzmoser, Peter. Iterative stepwise regression imputation using standard and robust methods. Computational Statistics & Data Analysis, 55(10):2793-2806, 2011.

van Buuren, Stef. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3):219-242, 2007.

van Buuren, Stef and Groothuis-Oudshoorn, Karin. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3):1-67, 2011.

Winn, John and Bishop, Christopher M. Variational message passing. JMLR, 6:661-694, 2005.

Zaremba, Wojciech, Gretton, Arthur, and Blaschko, Matthew. B-test: A non-parametric, low variance kernel two-sample test. In NIPS, 2013.
