IMPUTATION OF COMPLEX DATA WITH R-PACKAGE VIM - unece

IMPUTATION OF COMPLEX DATA WITH R-PACKAGE VIM: TRADITIONAL AND NEW METHODS BASED ON ROBUST ESTIMATION. Key Invited Paper

Matthias Templ

2

1

, , Alexander Kowarik1 , Peter Filzmoser2

1 2

Department of Methodology, Statistics Austria

Department of Statistics and Probability Theory, TU WIEN, Austria

UNECE Worksession on data editing Ljubljana, Mai 10, 2011

Templ, et al. (STAT, TU)

Robust Imputation

Ljubljana, Mai 10, 2011

1 / 50

Content 1

Overview

2

Visualisation Tools

3

Robust Imputation: Motivation

4

Challenges

5

IRMI

6

Simulation results

7

Results from real-world data

8


Robust Imputation


2 / 50

Overview

R-package VIM

VIM

=

Visualization and Imputation of Missings

Univariate, bivariate, multiple and multivariate plot methods to highlight missing values in complex data sets to learn about their structure (MCAR, MAR, MNAR). Comes with a GUI as well. Hot-deck, k -NN and EM-based (robust) imputation methods for complex data sets. Due to time reasons we mostly concentrate on EM-based imputation. For hot-deck and k -NN, please have a look at the paper. VIM-book in 2012


Robust Imputation


3 / 50

Visualisation Tools

The GUI . . .


Robust Imputation


4 / 50

Visualisation Tools

The GUI . . .


Robust Imputation


5 / 50

Visualisation Tools

The GUI . . .


Robust Imputation


6 / 50

Visualisation Tools

The GUI . . .


Robust Imputation


7 / 50

Visualisation Tools

The GUI . . .


Robust Imputation


8 / 50

Visualisation Tools

The GUI . . .


Robust Imputation


9 / 50

Visualisation Tools

The GUI . . .


Robust Imputation


10 / 50

Visualisation Tools

The GUI . . .


Robust Imputation


11 / 50

Templ, et al. (STAT, TU) Robust Imputation py140n

py130n

py120n

py110n

py100n

py090n

py080n

py070n

py050n

py035n

py010n

py140n

py130n

py120n

py110n

py100n

py090n

py080n

py070n

py050n

py035n

py010n

0.00

0.02

0.06

Combinations

0.04

Proportion of missings 0.08

0.10

Visualisation Tools

The GUI . . .

1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 4 4 5 6 8 9 11 11 13 23 32 48 53 119 425 3827

Ljubljana, Mai 10, 2011 12 / 50

Visualisation Tools

0.6 0.4 0

5

10

15

P033000


Robust Imputation

20

25

35

missing

0.0

0.2

missing/observed in py010n

0.8

1.0

The GUI . . .


13 / 50

Visualisation Tools

The GUI . . .


Robust Imputation


14 / 50

Visualisation Tools

The GUI - Imputation


Robust Imputation


15 / 50

Visualisation Tools


kNN ( data , variable = colnames ( data ) , metric = NULL , k =5 , dist _ var = colnames ( data ) , weights = NULL , numFun = median , catFun = maxCat , makeNA = NULL , NAcond = NULL , impNA = TRUE , donorcond = NULL , mixed = vector () , trace = FALSE , imp _ var = TRUE , imp _ suffix = " imp " , addRandom = FALSE ) Templ, et al. (STAT, TU)

Robust Imputation


16 / 50

Visualisation Tools


hotdeck ( data , variable = colnames ( data ) , ord _ var = NULL , domain _ var = NULL , makeNA = NULL , NAcond = NULL , impNA = TRUE , donorcond = NULL , imp _ var = TRUE , imp _ suffix = " imp " ) Templ, et al. (STAT, TU)

Robust Imputation


17 / 50

Visualisation Tools


irmi (x , eps = 0 .01 , maxit = 100 , mixed = NULL , step = FALSE , robust = FALSE , takeAll = TRUE , noise = TRUE , noise.factor = 1 , force = FALSE , robMethod = " lmrob " , force.mixed = TRUE , mi = 1 , trace = FALSE )


Robust Imputation


18 / 50


Median Imputation fully observed include missings in x include missings in y ● imputed values

●

3

● ● ●

●

2

● ● ● ● ● ●

● ● ●

● ● ●

y

● ● ●

●

●

●

1 −1

0

●●

● ●● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●

●

●

●

●

●

●

●

−2

● ●

● ●

● ● ●

−3

● ●

−2

0

2

4

6

8

10

x


Robust Imputation


19 / 50


k NN

Imputation ●

fully observed include missings in x include missings in y ● imputed values

3

● ● ●

●

2

● ● ● ● ● ●

● ● ●

● ● ●

y

● ● ●

●

●

●

1 −1

0

●●

● ●● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●

●

●

●

●

●

●

●

−2

● ●

● ●

● ● ●

−3

● ●

−2

0

2

4

6

8

10

x


Robust Imputation


20 / 50


IVEWARE ● ● ●

fully observed include missings in x include missings in y imputed values ●

●

●

● ●

● ●

● ● ●

● ●

● ● ● ●●

●

●● ● ● ● ●● ● ● ● ● ●

● ● ●

●

●

●

●

●

● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●

●

●

●

● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●

●

●

● ●

● ●

● ● ●

● ●


Robust Imputation


21 / 50


IRMI ● ● ●

fully observed include missings in x include missings in y imputed values ●

●

●

● ●

● ●

● ● ●

● ●

● ● ● ●●

●

●● ● ● ● ●● ● ● ● ● ●

● ● ●

●

●

●

●

●

● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●

●

●

●

● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●

●

●

● ●

● ●

● ● ●

● ●


Robust Imputation


22 / 50


Median Imputation ●

●

●

●

2

● original data imputed data values

● ● ● ●

● ●

−1

y

0

●●

●●

●

●

●

1

● ●

●

●

● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●●● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ●

●

●

●

●

−2

● ●

● ●

● ● ●

−3

● tolerance ellipse from original data ● imputed data tolerance ellipse from

−2

0

2

4

6

8

10

x


Robust Imputation


23 / 50


k NN

Imputation ●

●

●

●

2

● original data imputed data values

● ● ● ●

● ●

−1

y

0

●●

●

●

●

1

● ●

●

●

● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●●● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ●● ●

●

●

●

●

−2

● ●

● ●

● ● ●

−3

● tolerance ellipse from original data ● imputed data tolerance ellipse from

−2

0

2

4

6

8

10

x


Robust Imputation


24 / 50


IVEWARE original data imputed data

●

original data values imputed data values

●

● ●

● ● ●

● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ●● ●

● ●● ● ● ● ●● ● ●

● ● ●

●

●

●

● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●


●

● ● ●●

● ●

●

● ●●

●●

●●

Robust Imputation

●

●

●


25 / 50


IRMI original data imputed data


●

●

● ●

● ● ●

● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ● ● ●

● ●● ● ● ● ● ● ● ●

● ● ●

●


● ● ●

●

●

● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●

●

●

● ● ●●

● ●●

●●

●●

Robust Imputation

●

●


26 / 50


25

Median Imputation

●

10

y

15

20

tolerance ellipse from original data tolerance ellipse from imputed data

● ● ●

● ● ●● ● ● ● ●● ● ● ● ●●● ● ● ●●●●●● ● ● ● ●● ●● ●● ● ● ● ●●●

●

●

● ●

●

●

● ●

● ●

●

● ● ●●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●●● ● ● ●●

● ● ●●

● ●●

●●

●●

●

●

−5

0

5

●

−2

0

2

4

6

8

x


Robust Imputation


27 / 50


Imputation tolerance ellipse from original data tolerance ellipse from imputed data

15

20

25

k NN

10

y

●

● ● ●

● ●●● ● ●●●● ●●● ● ● ● ●● ●● ●● ● ● ● ●●●

● ●● ●● ● ●● ●● ●

●

●

●

●

●

●

● ●

● ● ●

0

5

●

● ● ●●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●●● ● ● ●●

● ● ●●

● ●●

●●

●●

●

●

−5

●

−2

0

2

4

6

8

x


Robust Imputation


28 / 50


IVEWARE original data imputed data


●

●

● ● ●

● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ●● ●

● ●● ● ● ● ● ● ● ● ●

●

●

● ●

●

● ●

●

● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●

● ● ●●

● ●●

●●

● ●

●●

●

●


●

●

Robust Imputation

●

●


29 / 50


IRMI original data imputed data


●

●

●

●

● ● ●

● ●●● ● ● ●●●●●● ● ● ● ●● ●● ● ● ● ● ● ● ● ●

● ●● ● ● ● ● ● ● ●

● ● ●

●

●

●

● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ●● ●● ●●


●

● ● ●●

● ●

●

● ●●

●●

●●

Robust Imputation

●

●

●


30 / 50

Challenges

Some Challenges

Mixed type of variables: various variables being nominal scaled, some variables might be ordinal and some variables could be determined to be of continuous scale. Semi-continuous variables: semi-continuous distributions, i.e. a variable consisting of a continuous scaled part and a certain proportion of equal values. Far from normality: Virtually always outlying observations included in real-world data. multiple imputation: Imputated must be both, reect the multivariate structure of the data and including randomness.


Robust Imputation


31 / 50

Challenges

Some Challenges



Robust Imputation


31 / 50

Challenges

Some Challenges



Robust Imputation


31 / 50

Challenges

MIX, MICE, MI, MITOOLS, . . .

All missing values imputed with simulated values drawn from their predictive distribution given the observed data and the specied parameter.

−→

based on sequential regressions.

EM-based In general, there are often problems when applied to complex data sets. . . . and, of course, they are highly driven by inuencial points, representative and non-representative outliers.


Robust Imputation


32 / 50

Challenges



−→




Robust Imputation


32 / 50

Challenges



−→




Robust Imputation


32 / 50

Challenges



−→




Robust Imputation


32 / 50

Challenges



−→




Robust Imputation


32 / 50

Challenges

IVEWARE Very popular software used in many applications in Ocial Statistics. Similar to the previous mentioned methods. The imputations are obtained by tting a sequence of (Bayesian) regression models and drawing values from the corresponding predictive distributions.

Sequentially imputation: in each step, one variable serve as response and certain other variables serves as predictors. Fit a certain model using the observed part of the response and estimate (update) the (former) missing values in the response. Initialization loop: . . . Second outer loop: Estimates of missing values are updated sequentially using one variable as response and all other variables as predictors until convergency.

Since missing values are drawn from their predictive distribution given the observed data and the specied parameter, the procedure allows

multiple imputation. Templ, et al. (STAT, TU)

Robust Imputation


33 / 50

Challenges





Robust Imputation


33 / 50

Challenges





Robust Imputation


33 / 50

Challenges





Robust Imputation


33 / 50

Challenges





Robust Imputation


33 / 50

IRMI

IRMI

Only the second outer loop is used (missing values are initialised in an other manner) In contradiction to IVEWARE we use quite dierent regression

methods

→

Robust methods (Note: a lot of problems has to be

solved when using robust methods for complex data like EU-SILC). Alternatively, stepwise model selction tools are integrated using AIC or BIC.

Multiple imputation is provided.


Robust Imputation


34 / 50

IRMI

IRMI


methods

→





Robust Imputation


34 / 50

IRMI

IRMI


methods

→





Robust Imputation


34 / 50

IRMI

Selection of Regression Models

If the response is continuous , robust (IRMI) or ols (IMI, IVEWARE) regression methods are used. categorical , generalized linear regression is applied (IRMI: robust or non-robust). binary , logistic linear regression is applied (IRMI: robust but non-robust is prefered). mixed , a two-stage approach is used whereas in the rst stage logistic regression is applied in order to decide if a missing value is imputed with zero or by applying robust regression based on the continuous part of the response. count , robust generalized linear models (family: Poisson) is used.


Robust Imputation


35 / 50

IRMI




Robust Imputation


35 / 50

IRMI




Robust Imputation


35 / 50

IRMI




Robust Imputation


35 / 50

IRMI




Robust Imputation


35 / 50

IRMI




Robust Imputation


35 / 50

Simulation results

Measures of information loss

Errors from Categorical and Binary Variables

This error measure is dened as the proportion of imputed values taken from an incorrect category on all missing categorical or binary values:

errc

with

I

=

1 mc

pc n X X

I(xijorig 6= xijimp ) ,

(1)

j =1 i =1

the indicator function, mc the number of missing values in the pc

categorical variables, and n the number of observations.


Robust Imputation


36 / 50

Simulation results

Measures of information loss

Errors from Continuous and Semi-continuous Variables

Here we assume that the constant part of the semi-continuous variable is zero. Then, the joint error measure is

errs

=

1 ms

" orig imp ps n (x X X − x ) ij ij orig · I(xij 6= 0 ∧ orig x j =1 i =1

I((xijorig = 0 ∧

ij

imp xij

6= 0) ∨ (xijorig 6= 0 ∧

imp xij imp xij

6= 0) +

i = 0))

(6 )

with ms the number of missing values in the ps continuous and semi-continuous variables.


Robust Imputation


37 / 50

Simulation results

Specic simulation results

0.30

0.5

Simulation Results: Varying the Correlation

●

0.4

0.25

●

●

● ●

0.20

●

●

0.3

●

0.1

IRMI IMI IVEWARE

0.15 ●

0.00

0.0

●

0.05

0.1

0.10

0.2

errs

errc

●

●

0.3

0.5

0.7

0.9

Correlation


0.1

IRMI IMI IVEWARE 0.3

0.5

0.7

0.9

Correlation

Robust Imputation


38 / 50

Simulation results


●

●

0.20

●

● ●

0.30

●

20

28

36

0.15

0.25

errs

0.20 0.15

0.05

0.10

errc

●

●

4

IRMI IMI IVEWARE 12

●

0.00

●

0.00

0.05

● ●

0.10

0.35

Simulation Results: Varying the Amount of Variables

20

28

36

Number of Variables


4

IRMI IMI IVEWARE 12

Number of Variables

Robust Imputation


39 / 50

Simulation results


Including (moderate) Outliers and Varying their Amount,

●

●

0.35

0.35

high Correlation

●

0.30 0.25

●

0.25

0.30

● ●

0.20

errs

●

0.15

0.20 0.15

errc

●

0

0.05

0.10

IRMI IMI IVEWARE 10

●

0.00

●

0.00

0.05

●

●

0.10

●

●

20

30

40

50

Outliers [%]


0

IRMI IMI IVEWARE 10

20

30

40

50

Outliers [%]

Robust Imputation


40 / 50

Simulation results


Including (moderate) Outliers and Varying their Amount,

●

●

●

●

0.30

0.4

● ●

0.35

low Correlation

0.20

errs

●

●

●

20

30

0.15

●

0.10

0.2 0

IRMI IMI IVEWARE 10

●

0.00

0.0

●

0.05

0.1

errc

0.3

0.25

●

●

20

30

40

50

Outliers [%]


0

IRMI IMI IVEWARE 10

40

50

Outliers [%]

Robust Imputation


41 / 50


Imputation in EU-SILC

We considered certain HH-components, but also some nominal variables, such as household size, region and htype3.

1

R

2

Set missing values in HH-components randomly (MCAR). R

3

Impute the missing values.

4

Evaluate the imputations using certain information loss measures.

5

Go to (2) until R

=0


++

= 100.

Robust Imputation


42 / 50


Imputation in EU-SILC, Results

80

100

●

60 40

errs

●

● ● ● ●

IMI

IVEWARE

20

● ●

● ● ●

IRMI


Robust Imputation


43 / 50


CENSUS Data - no outliers

●

0.8

● ●

0.6

● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

● ●

● ● ● ●

● ● ● ● ●

●

0.4

errs

● ● ● ● ●

IRMI

IMI

IVEWARE

●

●

IRMI

IMI

0.0

0.2

0.2

0.4

errs

0.6

● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ●

IVEWARE

(i) Error for categorical variables


(j) Error for numerical variables

Robust Imputation


44 / 50


Airquality Data

● ● ●

● ● ●

IRMI

IMI

● ● ● ● ● ● ● ●

0.5

errs

● ● ● ● ●

1.0

1.5

● ● ● ● ● ● ● ●

● ● ●


Robust Imputation

IVE


45 / 50

Templ, et al. (STAT, TU) Umsatz B31 B41 B23 a1 a2 a6 a25 e2 Besch

Umsatz B31 B41 B23 a1 a2 a6 a25 e2 Besch

0.00

0.01

0.03

Combinations

0.02

Proportion of missings 0.04

0.05

Small walkthrough to VIM

Example Data: SBS data

Robust Imputation Ljubljana, Mai 10, 2011 46 / 50

Small walkthrough to VIM

Most important functionality for imputation

Listing 1: Hotdeck imputation.

hotdeck (x , ord _ var = c ( " Besch " ," Umsatz " ) , imp _ var = FALSE )

Listing 2: k -nearest neighbor imputation.

kNN ( x )

Listing 3: Application of robust iterative model-based imputation.

imp

IMPUTATION OF COMPLEX DATA WITH R-PACKAGE VIM - unece

IMPUTATION OF COMPLEX DATA WITH R-PACKAGE VIM - unece

Suggest Documents

Imputation with the R Package VIM - Journal of Statistical Software

Imputation with the R Package VIM - Journal of Statistical Software

Scripting Vim With Ruby

efficient random imputation for missing data in complex surveys

Automated Editing and Imputation System for Administrative ... - unece

Realizing the Potential of FFS1 with Contextual Data - unece

Large-Scale Imputation for Complex Surveys - CiteSeerX

J2.6 Imputation of missing data with nonlinear relationships

Realizing the Potential of FFS1 with Contextual Data - unece

Open Data Portal Austria - unece

Data Collection WP template - unece

Scripting Vim With Ruby [PDF]

spss multiple imputation imputation algorithm - Applied Missing Data ...

Multiple Imputation of Missing Data: A Simulation

IMPUTATION OF MISSING DATA USING BAYESIAN ...

Vim Reference Card - Vim Intellisense - SourceForge

Missing-data imputation - Department of Statistics - Columbia ...

Outlier Robust Imputation of Survey Data - CiteSeerX

Evaluating Performance of Missing Data Imputation

missing data imputation of climate datasets - SciELO

Imputation And Classification Of Missing Data

Package 'VIM'

A Data Imputation Method with Support Vector ... - Semantic Scholar

Genomic data imputation with variational auto-encoderswww.researchgate.net › publication › fulltext › Genomic-