regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response.
IMPUTATION OF COMPLEX DATA WITH R-PACKAGE VIM: TRADITIONAL AND NEW METHODS BASED ON ROBUST ESTIMATION. Key Invited Paper
Matthias Templ
2
1
, , Alexander Kowarik1 , Peter Filzmoser2
1 2
Department of Methodology, Statistics Austria
Department of Statistics and Probability Theory, TU WIEN, Austria
UNECE Worksession on data editing Ljubljana, Mai 10, 2011
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
1 / 50
Content 1
Overview
2
Visualisation Tools
3
Robust Imputation: Motivation
4
Challenges
5
IRMI
6
Simulation results
7
Results from real-world data
8
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
2 / 50
Overview
R-package VIM
VIM
=
Visualization and Imputation of Missings
Univariate, bivariate, multiple and multivariate plot methods to highlight missing values in complex data sets to learn about their structure (MCAR, MAR, MNAR). Comes with a GUI as well. Hot-deck, k -NN and EM-based (robust) imputation methods for complex data sets. Due to time reasons we mostly concentrate on EM-based imputation. For hot-deck and k -NN, please have a look at the paper. VIM-book in 2012
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
3 / 50
Visualisation Tools
The GUI . . .
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
4 / 50
Visualisation Tools
The GUI . . .
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
5 / 50
Visualisation Tools
The GUI . . .
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
6 / 50
Visualisation Tools
The GUI . . .
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
7 / 50
Visualisation Tools
The GUI . . .
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
8 / 50
Visualisation Tools
The GUI . . .
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
9 / 50
Visualisation Tools
The GUI . . .
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
10 / 50
Visualisation Tools
The GUI . . .
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
11 / 50
Templ, et al. (STAT, TU) Robust Imputation py140n
py130n
py120n
py110n
py100n
py090n
py080n
py070n
py050n
py035n
py010n
py140n
py130n
py120n
py110n
py100n
py090n
py080n
py070n
py050n
py035n
py010n
0.00
0.02
0.06
Combinations
0.04
Proportion of missings 0.08
0.10
Visualisation Tools
The GUI . . .
1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 4 4 5 6 8 9 11 11 13 23 32 48 53 119 425 3827
Ljubljana, Mai 10, 2011 12 / 50
Visualisation Tools
0.6 0.4 0
5
10
15
P033000
Templ, et al. (STAT, TU)
Robust Imputation
20
25
35
missing
0.0
0.2
missing/observed in py010n
0.8
1.0
The GUI . . .
Ljubljana, Mai 10, 2011
13 / 50
Visualisation Tools
The GUI . . .
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
14 / 50
Visualisation Tools
The GUI - Imputation
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
15 / 50
Visualisation Tools
The GUI - Imputation
kNN ( data , variable = colnames ( data ) , metric = NULL , k =5 , dist _ var = colnames ( data ) , weights = NULL , numFun = median , catFun = maxCat , makeNA = NULL , NAcond = NULL , impNA = TRUE , donorcond = NULL , mixed = vector () , trace = FALSE , imp _ var = TRUE , imp _ suffix = " imp " , addRandom = FALSE ) Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
16 / 50
Visualisation Tools
The GUI - Imputation
hotdeck ( data , variable = colnames ( data ) , ord _ var = NULL , domain _ var = NULL , makeNA = NULL , NAcond = NULL , impNA = TRUE , donorcond = NULL , imp _ var = TRUE , imp _ suffix = " imp " ) Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
17 / 50
Visualisation Tools
The GUI - Imputation
irmi (x , eps = 0 .01 , maxit = 100 , mixed = NULL , step = FALSE , robust = FALSE , takeAll = TRUE , noise = TRUE , noise.factor = 1 , force = FALSE , robMethod = " lmrob " , force.mixed = TRUE , mi = 1 , trace = FALSE )
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
18 / 50
Robust Imputation: Motivation
Median Imputation fully observed include missings in x include missings in y ● imputed values
●
3
● ● ●
●
2
● ● ● ● ● ●
● ● ●
● ● ●
y
● ● ●
●
●
●
1 −1
0
●●
● ●● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●
●
●
●
●
●
●
●
−2
● ●
● ●
● ● ●
−3
● ●
−2
0
2
4
6
8
10
x
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
19 / 50
Robust Imputation: Motivation
k NN
Imputation ●
fully observed include missings in x include missings in y ● imputed values
3
● ● ●
●
2
● ● ● ● ● ●
● ● ●
● ● ●
y
● ● ●
●
●
●
1 −1
0
●●
● ●● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●
●
●
●
●
●
●
●
−2
● ●
● ●
● ● ●
−3
● ●
−2
0
2
4
6
8
10
x
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
20 / 50
Robust Imputation: Motivation
IVEWARE ● ● ●
fully observed include missings in x include missings in y imputed values ●
●
●
● ●
● ●
● ● ●
● ●
● ● ● ●●
●
●● ● ● ● ●● ● ● ● ● ●
● ● ●
●
●
●
●
●
● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●
●
●
●
● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●
●
●
● ●
● ●
● ● ●
● ●
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
21 / 50
Robust Imputation: Motivation
IRMI ● ● ●
fully observed include missings in x include missings in y imputed values ●
●
●
● ●
● ●
● ● ●
● ●
● ● ● ●●
●
●● ● ● ● ●● ● ● ● ● ●
● ● ●
●
●
●
●
●
● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●
●
●
●
● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●
●
●
● ●
● ●
● ● ●
● ●
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
22 / 50
Robust Imputation: Motivation
Median Imputation ●
●
●
●
2
● original data imputed data values
● ● ● ●
● ●
−1
y
0
●●
●●
●
●
●
1
● ●
●
●
● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●●● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ●
●
●
●
●
−2
● ●
● ●
● ● ●
−3
● tolerance ellipse from original data ● imputed data tolerance ellipse from
−2
0
2
4
6
8
10
x
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
23 / 50
Robust Imputation: Motivation
k NN
Imputation ●
●
●
●
2
● original data imputed data values
● ● ● ●
● ●
−1
y
0
●●
●
●
●
1
● ●
●
●
● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●●● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ●● ●
●
●
●
●
−2
● ●
● ●
● ● ●
−3
● tolerance ellipse from original data ● imputed data tolerance ellipse from
−2
0
2
4
6
8
10
x
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
24 / 50
Robust Imputation: Motivation
IVEWARE original data imputed data
●
original data values imputed data values
●
● ●
● ● ●
● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ●● ●
● ●● ● ● ● ●● ● ●
● ● ●
●
●
●
● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●
Templ, et al. (STAT, TU)
●
● ● ●●
● ●
●
● ●●
●●
●●
Robust Imputation
●
●
●
Ljubljana, Mai 10, 2011
25 / 50
Robust Imputation: Motivation
IRMI original data imputed data
original data values imputed data values
●
●
● ●
● ● ●
● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ● ● ●
● ●● ● ● ● ● ● ● ●
● ● ●
●
Templ, et al. (STAT, TU)
● ● ●
●
●
● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●
●
●
● ● ●●
● ●●
●●
●●
Robust Imputation
●
●
Ljubljana, Mai 10, 2011
26 / 50
Robust Imputation: Motivation
25
Median Imputation
●
10
y
15
20
tolerance ellipse from original data tolerance ellipse from imputed data
● ● ●
● ● ●● ● ● ● ●● ● ● ● ●●● ● ● ●●●●●● ● ● ● ●● ●● ●● ● ● ● ●●●
●
●
● ●
●
●
● ●
● ●
●
● ● ●●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●●● ● ● ●●
● ● ●●
● ●●
●●
●●
●
●
−5
0
5
●
−2
0
2
4
6
8
x
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
27 / 50
Robust Imputation: Motivation
Imputation tolerance ellipse from original data tolerance ellipse from imputed data
15
20
25
k NN
10
y
●
● ● ●
● ●●● ● ●●●● ●●● ● ● ● ●● ●● ●● ● ● ● ●●●
● ●● ●● ● ●● ●● ●
●
●
●
●
●
●
● ●
● ● ●
0
5
●
● ● ●●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●●● ● ● ●●
● ● ●●
● ●●
●●
●●
●
●
−5
●
−2
0
2
4
6
8
x
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
28 / 50
Robust Imputation: Motivation
IVEWARE original data imputed data
original data values imputed data values
●
●
● ● ●
● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ●● ●
● ●● ● ● ● ● ● ● ● ●
●
●
● ●
●
● ●
●
● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●
● ● ●●
● ●●
●●
● ●
●●
●
●
Templ, et al. (STAT, TU)
●
●
Robust Imputation
●
●
Ljubljana, Mai 10, 2011
29 / 50
Robust Imputation: Motivation
IRMI original data imputed data
original data values imputed data values
●
●
●
●
● ● ●
● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ● ● ●
● ●● ● ● ● ● ● ● ●
● ● ●
●
●
●
● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●
Templ, et al. (STAT, TU)
●
● ● ●●
● ●
●
● ●●
●●
●●
Robust Imputation
●
●
●
Ljubljana, Mai 10, 2011
30 / 50
Challenges
Some Challenges
Mixed type of variables: various variables being nominal scaled, some variables might be ordinal and some variables could be determined to be of continuous scale. Semi-continuous variables: semi-continuous distributions, i.e. a variable consisting of a continuous scaled part and a certain proportion of equal values. Far from normality: Virtually always outlying observations included in real-world data. multiple imputation: Imputated must be both, reect the multivariate structure of the data and including randomness.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
31 / 50
Challenges
Some Challenges
Mixed type of variables: various variables being nominal scaled, some variables might be ordinal and some variables could be determined to be of continuous scale. Semi-continuous variables: semi-continuous distributions, i.e. a variable consisting of a continuous scaled part and a certain proportion of equal values. Far from normality: Virtually always outlying observations included in real-world data. multiple imputation: Imputated must be both, reect the multivariate structure of the data and including randomness.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
31 / 50
Challenges
Some Challenges
Mixed type of variables: various variables being nominal scaled, some variables might be ordinal and some variables could be determined to be of continuous scale. Semi-continuous variables: semi-continuous distributions, i.e. a variable consisting of a continuous scaled part and a certain proportion of equal values. Far from normality: Virtually always outlying observations included in real-world data. multiple imputation: Imputated must be both, reect the multivariate structure of the data and including randomness.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
31 / 50
Challenges
MIX, MICE, MI, MITOOLS, . . .
All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.
−→
based on sequential regressions.
EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
32 / 50
Challenges
MIX, MICE, MI, MITOOLS, . . .
All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.
−→
based on sequential regressions.
EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
32 / 50
Challenges
MIX, MICE, MI, MITOOLS, . . .
All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.
−→
based on sequential regressions.
EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
32 / 50
Challenges
MIX, MICE, MI, MITOOLS, . . .
All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.
−→
based on sequential regressions.
EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
32 / 50
Challenges
MIX, MICE, MI, MITOOLS, . . .
All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.
−→
based on sequential regressions.
EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
32 / 50
Challenges
IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.
Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.
Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows
multiple imputation. Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
33 / 50
Challenges
IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.
Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.
Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows
multiple imputation. Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
33 / 50
Challenges
IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.
Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.
Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows
multiple imputation. Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
33 / 50
Challenges
IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.
Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.
Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows
multiple imputation. Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
33 / 50
Challenges
IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.
Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.
Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows
multiple imputation. Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
33 / 50
IRMI
IRMI
Only the second outer loop is used (missing values are initialised in an other manner) In contradiction to IVEWARE we use quite dierent regression
methods
→
Robust methods (Note: a lot of problems has to be
solved when using robust methods for complex data like EU-SILC). Alternatively, stepwise model selction tools are integrated using AIC or BIC.
Multiple imputation is provided.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
34 / 50
IRMI
IRMI
Only the second outer loop is used (missing values are initialised in an other manner) In contradiction to IVEWARE we use quite dierent regression
methods
→
Robust methods (Note: a lot of problems has to be
solved when using robust methods for complex data like EU-SILC). Alternatively, stepwise model selction tools are integrated using AIC or BIC.
Multiple imputation is provided.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
34 / 50
IRMI
IRMI
Only the second outer loop is used (missing values are initialised in an other manner) In contradiction to IVEWARE we use quite dierent regression
methods
→
Robust methods (Note: a lot of problems has to be
solved when using robust methods for complex data like EU-SILC). Alternatively, stepwise model selction tools are integrated using AIC or BIC.
Multiple imputation is provided.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
34 / 50
IRMI
Selection of Regression Models
If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
35 / 50
IRMI
Selection of Regression Models
If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
35 / 50
IRMI
Selection of Regression Models
If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
35 / 50
IRMI
Selection of Regression Models
If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
35 / 50
IRMI
Selection of Regression Models
If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
35 / 50
IRMI
Selection of Regression Models
If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
35 / 50
Simulation results
Measures of information loss
Errors from Categorical and Binary Variables
This error measure is dened as the proportion of imputed values taken from an incorrect category on all missing categorical or binary values:
errc
with
I
=
1 mc
pc n X X
I(xijorig 6= xijimp ) ,
(1)
j =1 i =1
the indicator function, mc the number of missing values in the pc
categorical variables, and n the number of observations.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
36 / 50
Simulation results
Measures of information loss
Errors from Continuous and Semi-continuous Variables
Here we assume that the constant part of the semi-continuous variable is zero. Then, the joint error measure is
errs
=
1 ms
" orig imp ps n (x X X − x ) ij ij orig · I(xij 6= 0 ∧ orig x j =1 i =1
I((xijorig = 0 ∧
ij
imp xij
6= 0) ∨ (xijorig 6= 0 ∧
imp xij imp xij
6= 0) +
i = 0))
(6 )
with ms the number of missing values in the ps continuous and semi-continuous variables.
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
37 / 50
Simulation results
Specic simulation results
0.30
0.5
Simulation Results: Varying the Correlation
●
0.4
0.25
●
●
● ●
0.20
●
●
0.3
●
0.1
IRMI IMI IVEWARE
0.15 ●
0.00
0.0
●
0.05
0.1
0.10
0.2
errs
errc
●
●
0.3
0.5
0.7
0.9
Correlation
Templ, et al. (STAT, TU)
0.1
IRMI IMI IVEWARE 0.3
0.5
0.7
0.9
Correlation
Robust Imputation
Ljubljana, Mai 10, 2011
38 / 50
Simulation results
Specic simulation results
●
●
0.20
●
● ●
0.30
●
20
28
36
0.15
0.25
errs
0.20 0.15
0.05
0.10
errc
●
●
4
IRMI IMI IVEWARE 12
●
0.00
●
0.00
0.05
● ●
0.10
0.35
Simulation Results: Varying the Amount of Variables
20
28
36
Number of Variables
Templ, et al. (STAT, TU)
4
IRMI IMI IVEWARE 12
Number of Variables
Robust Imputation
Ljubljana, Mai 10, 2011
39 / 50
Simulation results
Specic simulation results
Including (moderate) Outliers and Varying their Amount,
●
●
0.35
0.35
high Correlation
●
0.30 0.25
●
0.25
0.30
● ●
0.20
errs
●
0.15
0.20 0.15
errc
●
0
0.05
0.10
IRMI IMI IVEWARE 10
●
0.00
●
0.00
0.05
●
●
0.10
●
●
20
30
40
50
Outliers [%]
Templ, et al. (STAT, TU)
0
IRMI IMI IVEWARE 10
20
30
40
50
Outliers [%]
Robust Imputation
Ljubljana, Mai 10, 2011
40 / 50
Simulation results
Specic simulation results
Including (moderate) Outliers and Varying their Amount,
●
●
●
●
0.30
0.4
● ●
0.35
low Correlation
0.20
errs
●
●
●
20
30
0.15
●
0.10
0.2 0
IRMI IMI IVEWARE 10
●
0.00
0.0
●
0.05
0.1
errc
0.3
0.25
●
●
20
30
40
50
Outliers [%]
Templ, et al. (STAT, TU)
0
IRMI IMI IVEWARE 10
40
50
Outliers [%]
Robust Imputation
Ljubljana, Mai 10, 2011
41 / 50
Results from real-world data
Imputation in EU-SILC
We considered certain HH-components, but also some nominal variables, such as household size, region and htype3.
1
R
2
Set missing values in HH-components randomly (MCAR). R
3
Impute the missing values.
4
Evaluate the imputations using certain information loss measures.
5
Go to (2) until R
=0
Templ, et al. (STAT, TU)
++
= 100.
Robust Imputation
Ljubljana, Mai 10, 2011
42 / 50
Results from real-world data
Imputation in EU-SILC, Results
80
100
●
60 40
errs
●
● ● ● ●
IMI
IVEWARE
20
● ●
● ● ●
IRMI
Templ, et al. (STAT, TU)
Robust Imputation
Ljubljana, Mai 10, 2011
43 / 50
Results from real-world data
CENSUS Data - no outliers
●
0.8
● ●
0.6
● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ●
● ● ● ●
● ● ● ● ●
●
0.4
errs
● ● ● ● ●
IRMI
IMI
IVEWARE
●
●
IRMI
IMI
0.0
0.2
0.2
0.4
errs
0.6
● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ●
IVEWARE
(i) Error for categorical variables
Templ, et al. (STAT, TU)
(j) Error for numerical variables
Robust Imputation
Ljubljana, Mai 10, 2011
44 / 50
Results from real-world data
Airquality Data
● ● ●
● ● ●
IRMI
IMI
● ● ● ● ● ● ● ●
0.5
errs
● ● ● ● ●
1.0
1.5
● ● ● ● ● ● ● ●
● ● ●
Templ, et al. (STAT, TU)
Robust Imputation
IVE
Ljubljana, Mai 10, 2011
45 / 50
Templ, et al. (STAT, TU) Umsatz B31 B41 B23 a1 a2 a6 a25 e2 Besch
Umsatz B31 B41 B23 a1 a2 a6 a25 e2 Besch
0.00
0.01
0.03
Combinations
0.02
Proportion of missings 0.04
0.05
Small walkthrough to VIM
Example Data: SBS data
Robust Imputation Ljubljana, Mai 10, 2011 46 / 50
Small walkthrough to VIM
Most important functionality for imputation
Listing 1: Hotdeck imputation.
hotdeck (x , ord _ var = c ( " Besch " ," Umsatz " ) , imp _ var = FALSE )
Listing 2: k -nearest neighbor imputation.
kNN ( x )
Listing 3: Application of robust iterative model-based imputation.
imp