IMPUTATION OF COMPLEX DATA WITH R-PACKAGE VIM - unece

50 downloads 0 Views 1MB Size Report
regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response.
IMPUTATION OF COMPLEX DATA WITH R-PACKAGE VIM: TRADITIONAL AND NEW METHODS BASED ON ROBUST ESTIMATION. Key Invited Paper

Matthias Templ

2

1

, , Alexander Kowarik1 , Peter Filzmoser2

1 2

Department of Methodology, Statistics Austria

Department of Statistics and Probability Theory, TU WIEN, Austria

UNECE Worksession on data editing Ljubljana, Mai 10, 2011

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

1 / 50

Content 1

Overview

2

Visualisation Tools

3

Robust Imputation: Motivation

4

Challenges

5

IRMI

6

Simulation results

7

Results from real-world data

8

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

2 / 50

Overview

R-package VIM

VIM

=

Visualization and Imputation of Missings

Univariate, bivariate, multiple and multivariate plot methods to highlight missing values in complex data sets to learn about their structure (MCAR, MAR, MNAR). Comes with a GUI as well. Hot-deck, k -NN and EM-based (robust) imputation methods for complex data sets. Due to time reasons we mostly concentrate on EM-based imputation. For hot-deck and k -NN, please have a look at the paper. VIM-book in 2012

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

3 / 50

Visualisation Tools

The GUI . . .

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

4 / 50

Visualisation Tools

The GUI . . .

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

5 / 50

Visualisation Tools

The GUI . . .

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

6 / 50

Visualisation Tools

The GUI . . .

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

7 / 50

Visualisation Tools

The GUI . . .

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

8 / 50

Visualisation Tools

The GUI . . .

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

9 / 50

Visualisation Tools

The GUI . . .

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

10 / 50

Visualisation Tools

The GUI . . .

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

11 / 50

Templ, et al. (STAT, TU) Robust Imputation py140n

py130n

py120n

py110n

py100n

py090n

py080n

py070n

py050n

py035n

py010n

py140n

py130n

py120n

py110n

py100n

py090n

py080n

py070n

py050n

py035n

py010n

0.00

0.02

0.06

Combinations

0.04

Proportion of missings 0.08

0.10

Visualisation Tools

The GUI . . .

1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 4 4 5 6 8 9 11 11 13 23 32 48 53 119 425 3827

Ljubljana, Mai 10, 2011 12 / 50

Visualisation Tools

0.6 0.4 0

5

10

15

P033000

Templ, et al. (STAT, TU)

Robust Imputation

20

25

35

missing

0.0

0.2

missing/observed in py010n

0.8

1.0

The GUI . . .

Ljubljana, Mai 10, 2011

13 / 50

Visualisation Tools

The GUI . . .

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

14 / 50

Visualisation Tools

The GUI - Imputation

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

15 / 50

Visualisation Tools

The GUI - Imputation

kNN ( data , variable = colnames ( data ) , metric = NULL , k =5 , dist _ var = colnames ( data ) , weights = NULL , numFun = median , catFun = maxCat , makeNA = NULL , NAcond = NULL , impNA = TRUE , donorcond = NULL , mixed = vector () , trace = FALSE , imp _ var = TRUE , imp _ suffix = " imp " , addRandom = FALSE ) Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

16 / 50

Visualisation Tools

The GUI - Imputation

hotdeck ( data , variable = colnames ( data ) , ord _ var = NULL , domain _ var = NULL , makeNA = NULL , NAcond = NULL , impNA = TRUE , donorcond = NULL , imp _ var = TRUE , imp _ suffix = " imp " ) Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

17 / 50

Visualisation Tools

The GUI - Imputation

irmi (x , eps = 0 .01 , maxit = 100 , mixed = NULL , step = FALSE , robust = FALSE , takeAll = TRUE , noise = TRUE , noise.factor = 1 , force = FALSE , robMethod = " lmrob " , force.mixed = TRUE , mi = 1 , trace = FALSE )

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

18 / 50

Robust Imputation: Motivation

Median Imputation fully observed include missings in x include missings in y ● imputed values



3

● ● ●



2

● ● ● ● ● ●

● ● ●

● ● ●

y

● ● ●







1 −1

0

●●

● ●● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●















−2

● ●

● ●

● ● ●

−3

● ●

−2

0

2

4

6

8

10

x

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

19 / 50

Robust Imputation: Motivation

k NN

Imputation ●

fully observed include missings in x include missings in y ● imputed values

3

● ● ●



2

● ● ● ● ● ●

● ● ●

● ● ●

y

● ● ●







1 −1

0

●●

● ●● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●















−2

● ●

● ●

● ● ●

−3

● ●

−2

0

2

4

6

8

10

x

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

20 / 50

Robust Imputation: Motivation

IVEWARE ● ● ●

fully observed include missings in x include missings in y imputed values ●





● ●

● ●

● ● ●

● ●

● ● ● ●●



●● ● ● ● ●● ● ● ● ● ●

● ● ●











● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●







● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●





● ●

● ●

● ● ●

● ●

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

21 / 50

Robust Imputation: Motivation

IRMI ● ● ●

fully observed include missings in x include missings in y imputed values ●





● ●

● ●

● ● ●

● ●

● ● ● ●●



●● ● ● ● ●● ● ● ● ● ●

● ● ●











● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●







● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●





● ●

● ●

● ● ●

● ●

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

22 / 50

Robust Imputation: Motivation

Median Imputation ●







2

● original data imputed data values

● ● ● ●

● ●

−1

y

0

●●

●●







1

● ●





● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●●● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ●









−2

● ●

● ●

● ● ●

−3

● tolerance ellipse from original data ● imputed data tolerance ellipse from

−2

0

2

4

6

8

10

x

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

23 / 50

Robust Imputation: Motivation

k NN

Imputation ●







2

● original data imputed data values

● ● ● ●

● ●

−1

y

0

●●







1

● ●





● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●●● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ●● ●









−2

● ●

● ●

● ● ●

−3

● tolerance ellipse from original data ● imputed data tolerance ellipse from

−2

0

2

4

6

8

10

x

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

24 / 50

Robust Imputation: Motivation

IVEWARE original data imputed data



original data values imputed data values



● ●

● ● ●

● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ●● ●

● ●● ● ● ● ●● ● ●

● ● ●







● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●

Templ, et al. (STAT, TU)



● ● ●●

● ●



● ●●

●●

●●

Robust Imputation







Ljubljana, Mai 10, 2011

25 / 50

Robust Imputation: Motivation

IRMI original data imputed data

original data values imputed data values





● ●

● ● ●

● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ● ● ●

● ●● ● ● ● ● ● ● ●

● ● ●



Templ, et al. (STAT, TU)

● ● ●





● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●





● ● ●●

● ●●

●●

●●

Robust Imputation





Ljubljana, Mai 10, 2011

26 / 50

Robust Imputation: Motivation

25

Median Imputation



10

y

15

20

tolerance ellipse from original data tolerance ellipse from imputed data

● ● ●

● ● ●● ● ● ● ●● ● ● ● ●●● ● ● ●●●●●● ● ● ● ●● ●● ●● ● ● ● ●●●





● ●





● ●

● ●



● ● ●●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●●● ● ● ●●

● ● ●●

● ●●

●●

●●





−5

0

5



−2

0

2

4

6

8

x

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

27 / 50

Robust Imputation: Motivation

Imputation tolerance ellipse from original data tolerance ellipse from imputed data

15

20

25

k NN

10

y



● ● ●

● ●●● ● ●●●● ●●● ● ● ● ●● ●● ●● ● ● ● ●●●

● ●● ●● ● ●● ●● ●













● ●

● ● ●

0

5



● ● ●●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●●● ● ● ●●

● ● ●●

● ●●

●●

●●





−5



−2

0

2

4

6

8

x

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

28 / 50

Robust Imputation: Motivation

IVEWARE original data imputed data

original data values imputed data values





● ● ●

● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ●● ●

● ●● ● ● ● ● ● ● ● ●





● ●



● ●



● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●

● ● ●●

● ●●

●●

● ●

●●





Templ, et al. (STAT, TU)





Robust Imputation





Ljubljana, Mai 10, 2011

29 / 50

Robust Imputation: Motivation

IRMI original data imputed data

original data values imputed data values









● ● ●

● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ● ● ●

● ●● ● ● ● ● ● ● ●

● ● ●







● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●

Templ, et al. (STAT, TU)



● ● ●●

● ●



● ●●

●●

●●

Robust Imputation







Ljubljana, Mai 10, 2011

30 / 50

Challenges

Some Challenges

Mixed type of variables: various variables being nominal scaled, some variables might be ordinal and some variables could be determined to be of continuous scale. Semi-continuous variables: semi-continuous distributions, i.e. a variable consisting of a continuous scaled part and a certain proportion of equal values. Far from normality: Virtually always outlying observations included in real-world data. multiple imputation: Imputated must be both, reect the multivariate structure of the data and including randomness.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

31 / 50

Challenges

Some Challenges

Mixed type of variables: various variables being nominal scaled, some variables might be ordinal and some variables could be determined to be of continuous scale. Semi-continuous variables: semi-continuous distributions, i.e. a variable consisting of a continuous scaled part and a certain proportion of equal values. Far from normality: Virtually always outlying observations included in real-world data. multiple imputation: Imputated must be both, reect the multivariate structure of the data and including randomness.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

31 / 50

Challenges

Some Challenges

Mixed type of variables: various variables being nominal scaled, some variables might be ordinal and some variables could be determined to be of continuous scale. Semi-continuous variables: semi-continuous distributions, i.e. a variable consisting of a continuous scaled part and a certain proportion of equal values. Far from normality: Virtually always outlying observations included in real-world data. multiple imputation: Imputated must be both, reect the multivariate structure of the data and including randomness.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

31 / 50

Challenges

MIX, MICE, MI, MITOOLS, . . .

All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.

−→

based on sequential regressions.

EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

32 / 50

Challenges

MIX, MICE, MI, MITOOLS, . . .

All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.

−→

based on sequential regressions.

EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

32 / 50

Challenges

MIX, MICE, MI, MITOOLS, . . .

All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.

−→

based on sequential regressions.

EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

32 / 50

Challenges

MIX, MICE, MI, MITOOLS, . . .

All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.

−→

based on sequential regressions.

EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

32 / 50

Challenges

MIX, MICE, MI, MITOOLS, . . .

All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.

−→

based on sequential regressions.

EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

32 / 50

Challenges

IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.

Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.

Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows

multiple imputation. Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

33 / 50

Challenges

IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.

Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.

Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows

multiple imputation. Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

33 / 50

Challenges

IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.

Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.

Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows

multiple imputation. Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

33 / 50

Challenges

IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.

Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.

Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows

multiple imputation. Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

33 / 50

Challenges

IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.

Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.

Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows

multiple imputation. Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

33 / 50

IRMI

IRMI

Only the second outer loop is used (missing values are initialised in an other manner) In contradiction to IVEWARE we use quite dierent regression

methods



Robust methods (Note: a lot of problems has to be

solved when using robust methods for complex data like EU-SILC). Alternatively, stepwise model selction tools are integrated using AIC or BIC.

Multiple imputation is provided.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

34 / 50

IRMI

IRMI

Only the second outer loop is used (missing values are initialised in an other manner) In contradiction to IVEWARE we use quite dierent regression

methods



Robust methods (Note: a lot of problems has to be

solved when using robust methods for complex data like EU-SILC). Alternatively, stepwise model selction tools are integrated using AIC or BIC.

Multiple imputation is provided.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

34 / 50

IRMI

IRMI

Only the second outer loop is used (missing values are initialised in an other manner) In contradiction to IVEWARE we use quite dierent regression

methods



Robust methods (Note: a lot of problems has to be

solved when using robust methods for complex data like EU-SILC). Alternatively, stepwise model selction tools are integrated using AIC or BIC.

Multiple imputation is provided.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

34 / 50

IRMI

Selection of Regression Models

If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

35 / 50

IRMI

Selection of Regression Models

If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

35 / 50

IRMI

Selection of Regression Models

If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

35 / 50

IRMI

Selection of Regression Models

If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

35 / 50

IRMI

Selection of Regression Models

If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

35 / 50

IRMI

Selection of Regression Models

If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

35 / 50

Simulation results

Measures of information loss

Errors from Categorical and Binary Variables

This error measure is dened as the proportion of imputed values taken from an incorrect category on all missing categorical or binary values:

errc

with

I

=

1 mc

pc n X X

I(xijorig 6= xijimp ) ,

(1)

j =1 i =1

the indicator function, mc the number of missing values in the pc

categorical variables, and n the number of observations.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

36 / 50

Simulation results

Measures of information loss

Errors from Continuous and Semi-continuous Variables

Here we assume that the constant part of the semi-continuous variable is zero. Then, the joint error measure is

errs

=

1 ms

" orig imp ps n (x X X − x ) ij ij orig · I(xij 6= 0 ∧ orig x j =1 i =1

I((xijorig = 0 ∧

ij

imp xij

6= 0) ∨ (xijorig 6= 0 ∧

imp xij imp xij

6= 0) +

i = 0))

(6 )

with ms the number of missing values in the ps continuous and semi-continuous variables.

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

37 / 50

Simulation results

Specic simulation results

0.30

0.5

Simulation Results: Varying the Correlation



0.4

0.25





● ●

0.20





0.3



0.1

IRMI IMI IVEWARE

0.15 ●

0.00

0.0



0.05

0.1

0.10

0.2

errs

errc





0.3

0.5

0.7

0.9

Correlation

Templ, et al. (STAT, TU)

0.1

IRMI IMI IVEWARE 0.3

0.5

0.7

0.9

Correlation

Robust Imputation

Ljubljana, Mai 10, 2011

38 / 50

Simulation results

Specic simulation results





0.20



● ●

0.30



20

28

36

0.15

0.25

errs

0.20 0.15

0.05

0.10

errc





4

IRMI IMI IVEWARE 12



0.00



0.00

0.05

● ●

0.10

0.35

Simulation Results: Varying the Amount of Variables

20

28

36

Number of Variables

Templ, et al. (STAT, TU)

4

IRMI IMI IVEWARE 12

Number of Variables

Robust Imputation

Ljubljana, Mai 10, 2011

39 / 50

Simulation results

Specic simulation results

Including (moderate) Outliers and Varying their Amount,





0.35

0.35

high Correlation



0.30 0.25



0.25

0.30

● ●

0.20

errs



0.15

0.20 0.15

errc



0

0.05

0.10

IRMI IMI IVEWARE 10



0.00



0.00

0.05





0.10





20

30

40

50

Outliers [%]

Templ, et al. (STAT, TU)

0

IRMI IMI IVEWARE 10

20

30

40

50

Outliers [%]

Robust Imputation

Ljubljana, Mai 10, 2011

40 / 50

Simulation results

Specic simulation results

Including (moderate) Outliers and Varying their Amount,









0.30

0.4

● ●

0.35

low Correlation

0.20

errs







20

30

0.15



0.10

0.2 0

IRMI IMI IVEWARE 10



0.00

0.0



0.05

0.1

errc

0.3

0.25





20

30

40

50

Outliers [%]

Templ, et al. (STAT, TU)

0

IRMI IMI IVEWARE 10

40

50

Outliers [%]

Robust Imputation

Ljubljana, Mai 10, 2011

41 / 50

Results from real-world data

Imputation in EU-SILC

We considered certain HH-components, but also some nominal variables, such as household size, region and htype3.

1

R

2

Set missing values in HH-components randomly (MCAR). R

3

Impute the missing values.

4

Evaluate the imputations using certain information loss measures.

5

Go to (2) until R

=0

Templ, et al. (STAT, TU)

++

= 100.

Robust Imputation

Ljubljana, Mai 10, 2011

42 / 50

Results from real-world data

Imputation in EU-SILC, Results

80

100



60 40

errs



● ● ● ●

IMI

IVEWARE

20

● ●

● ● ●

IRMI

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

43 / 50

Results from real-world data

CENSUS Data - no outliers



0.8

● ●

0.6

● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

● ●

● ● ● ●

● ● ● ● ●



0.4

errs

● ● ● ● ●

IRMI

IMI

IVEWARE





IRMI

IMI

0.0

0.2

0.2

0.4

errs

0.6

● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ●

IVEWARE

(i) Error for categorical variables

Templ, et al. (STAT, TU)

(j) Error for numerical variables

Robust Imputation

Ljubljana, Mai 10, 2011

44 / 50

Results from real-world data

Airquality Data

● ● ●

● ● ●

IRMI

IMI

● ● ● ● ● ● ● ●

0.5

errs

● ● ● ● ●

1.0

1.5

● ● ● ● ● ● ● ●

● ● ●

Templ, et al. (STAT, TU)

Robust Imputation

IVE

Ljubljana, Mai 10, 2011

45 / 50

Templ, et al. (STAT, TU) Umsatz B31 B41 B23 a1 a2 a6 a25 e2 Besch

Umsatz B31 B41 B23 a1 a2 a6 a25 e2 Besch

0.00

0.01

0.03

Combinations

0.02

Proportion of missings 0.04

0.05

Small walkthrough to VIM

Example Data: SBS data

Robust Imputation Ljubljana, Mai 10, 2011 46 / 50

Small walkthrough to VIM

Most important functionality for imputation

Listing 1: Hotdeck imputation.

hotdeck (x , ord _ var = c ( " Besch " ," Umsatz " ) , imp _ var = FALSE )

Listing 2: k -nearest neighbor imputation.

kNN ( x )

Listing 3: Application of robust iterative model-based imputation.

imp

Suggest Documents