Boosted Stump Algorithm for Missing Data Incremental Imputation
Roberta Siciliano, Massimo Aria, Antonio D’Ambrosio

Università di Napoli Federico II CLADAG 2005

Outline
• Tree Harvest Software (CLADAG ’03): recall of the last slide of conclusions and perspectives
• Novel algorithms for missing data imputation
• Benchmarking study: simulations and an application

The framework: Tree-based model
[Figure: a binary tree with root node N1 and descendant nodes N2–N13; the split s1 sends a case with x ∈ A to one child (s1 ≡ 0) and a case with x ∉ A to the other (s1 ≡ 1).]

“Tree Harvest” Software
• About the name “Tree Harvest”: it stands for getting fruits from tree-based models, in terms of “any useful, available and discovered information for decision-making”

• Non-standard methods in “Tree Harvest”:
  – Two-stage splitting and fast algorithms
  – Multi-class budget trees
  – Two-stage discriminant trees
  – Incremental imputation for missing data

TH Graphical User Interface

Conclusions of CLADAG ’03
• Novel routines for two-stage methods
• Matlab environment with GUI
• Enhancements in the exploratory trees

In progress
• Stand-alone software → never-ending research!
• Decision trees → pruning, decision stump, etc.
• Ensemble methods → bagging, boosting, etc.
• Incremental imputation → novel improved algorithms

Missing data mechanisms
• Data are missing completely at random (MCAR) when the probability that an observation is missing is unrelated to the value of that variable and to the values of any other variables
• Data are missing at random (MAR) when, after controlling for the other observed variables, missingness does not depend on the value of the variable itself

Model-based imputation
missing value = model function + error term
Examples:
• Linear regression (e.g. Little, 1992)
• Logistic regression (e.g. Vach, 1994)
• Generalized linear models (e.g. Ibrahim et al., 1999)
• Nonparametric regression (e.g. Chu & Cheng, 1995)
• Trees (Conversano et al., 2002)

Motivations
• Nonparametric approach
• Deals with numerical and categorical inputs
• Computational feasibility
• Considers conditional interactions among inputs
• Derives simple imputation rules

Incremental Approach (INPI): key idea
• Data pre-processing: rearrange the columns and rows of the original data matrix
• Missing data ranking: define a lexicographical ordering of the records, according to the number of missing values occurring in each record and to which variables are missing (a sketch of these first two steps follows below)
• Incremental imputation: iteratively impute the missing data using tree-based models
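A minimal sketch of the pre-processing and ranking steps, assuming a pandas data frame; the column names and the rule used to inject missing values are purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 10, size=(15, 6)), columns=list("ABCDEF"))
X[X > 9.0] = np.nan                              # inject a few missing values (illustrative rule)

# Data pre-processing: rearrange the columns by the number of missing values
X = X[X.isna().sum().sort_values().index]

# Missing data ranking: lexicographical ordering of the records by the
# number of missing values and by which variables are missing
pattern = X.isna().apply(lambda r: (int(r.sum()), tuple(X.columns[r.values])), axis=1)
X = X.loc[pattern.sort_values().index]           # complete records first, then 1 missing, ...
```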

The original data matrix
[Example: a 15 × 26 data matrix with columns A–Z and rows 1–15; the margins report the number of missing values in each column and in each row.]

Data re-arrangement
[The same matrix with the columns sorted by the number of missing values in each column (the twelve complete columns A, C, E, H, I, M, O, P, S, V, W, Y first; the columns N and F, with two missing values each, last) and the rows sorted by the number of missing values in each row.]

Missing Data Ranking: lexicographical ordering
[The rows are labelled by their missing pattern: 0_mis for complete records; 1_f and 1_l for records with one missing value (in F or L); 2_j_x, 2_u_f and 2_d_j for records with two missing values; 3_t_n_f, 3_b_l_n and 3_d_r_z for records with three missing values.]

The working matrices: first imputation
[The ordered matrix is partitioned into the blocks A, B, C and D; before the first imputation, D includes 8 missing data types.]

First iteration
[After the first imputation, the corresponding records join the observed part of the matrix and D includes 7 missing data types.]

Why Incremental?
The data matrix X_{n,p} is partitioned as

X_{n,p} = [ A_{m,d}     C_{m,p−d}   ]
          [ B_{n−m,d}   D_{n−m,p−d} ]

where A, B and C are blocks of observed and imputed data, and D is the block containing missing data.

The Imputation is incremental because, as it goes on, more and more information is added to the data matrix. In fact: • A, B and C are updated in each iteration • D shrinks after each set of records with missing inputs has been filled-in
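A minimal sketch of this partition, assuming the rows and columns of X have already been re-ordered as above; m and d are the block sizes of the slide.

```python
import pandas as pd

def partition(X: pd.DataFrame, m: int, d: int):
    """Split the re-ordered data matrix X(n, p) into the blocks A, B, C and D."""
    A = X.iloc[:m, :d]      # observed data
    C = X.iloc[:m, d:]      # observed data
    B = X.iloc[m:, :d]      # observed (or already imputed) data
    D = X.iloc[m:, d:]      # block still containing missing values
    return A, B, C, D
```

At each iteration the newly imputed entries move out of D, so the observed blocks grow while D shrinks.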

Enhancements to incremental imputation
• Computationally more efficient algorithm: INPI_var
• More robust imputation by using ensemble methods:
  – BINPI (Boosting Incremental Non Parametric Imputation)
  – BINPI_stump (BINPI with the use of a STUMP)

Computational enhancement of the INPI algorithm
• The INPI algorithm uses a lexicographic ordering of the data with respect to both rows and columns
• The INPI_var algorithm uses a lexicographic ordering of the variables only (with respect to the columns) and imputes all the values of each variable in turn; it therefore provides a suitable alternative that accelerates the imputation process

Recall: ensemble methods
[Diagram: the training set is re-sampled into sets 1, 2, …, T−1, T; a “weak learner” is fitted on each re-sampled set and the weak learners are combined into an “ensemble learner”.]

General idea of boosting
• Use a weak classifier (error rate only slightly better than random guessing)
• Apply the weak classifier sequentially to modified versions of the data
• Combine the predictions of these classifiers to produce a powerful classifier

AdaBoost for a two-class response (Freund & Schapire, 1997)
• Initialize the observation weights w_i = 1/N
• For m = 1 to M:
  – Fit a classifier G_m(x) to the training data using the weights w_i
  – Compute the weighted training error err_m = Σ_i w_i I(y_i ≠ G_m(x_i)) / Σ_i w_i
  – Compute α_m = log[(1 − err_m) / err_m]
  – Update the weights: w_i ← w_i · exp[α_m · I(y_i ≠ G_m(x_i))]
• Output G(x) = sign(Σ_{m=1}^{M} α_m G_m(x))
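A compact sketch of this scheme, assuming class labels coded as −1/+1 and a depth-one decision tree (a stump) from scikit-learn as the weak classifier G_m.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_stump(X, y, M=100):
    """Fit M boosted stumps; return the stumps and their weights alpha_m."""
    N = len(y)
    w = np.full(N, 1.0 / N)                    # initialize weights w_i = 1/N
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # fit G_m using the weights w_i
        miss = (stump.predict(X) != y).astype(float)
        err = np.clip(np.dot(w, miss) / w.sum(), 1e-10, 1 - 1e-10)
        alpha = np.log((1.0 - err) / err)      # alpha_m = log[(1 - err_m) / err_m]
        w *= np.exp(alpha * miss)              # up-weight the misclassified cases
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """G(x) = sign(sum_m alpha_m G_m(x))."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```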

Boosted Incremental Non Parametric Imputation
• Boosted algorithms to impute missing data:
  – standard AdaBoost
  – AdaBoost with STUMP
• BINPI algorithms: lexicographic ordering with respect to the columns (variables) and imputation by either AdaBoost or AdaBoost with STUMP

Suitable properties
Two results can be proved:
1. Boosted incremental imputation is always preferred to incremental imputation in terms of accuracy
2. A lexicographic ordering of the columns alone is sufficient to obtain at least the same results as ordering both rows and columns
→ Imputing all the missing data of each variable in turn considerably reduces the computational cost of the imputation algorithm

BINPI Algorithm
• Set r = 1
• Find y_{k*}^{(r)}, the variable (column) with the smallest non-zero number of missing data, i.e. k* such that #mis_{k*} ≤ #mis_k for every k with #mis_k > 0
• Sort the columns so that the first p variables are complete and the (p+1)-th is y_{k*}^{(r)}
• Sort the rows so that the first l rows are complete and the remaining N − l rows are missing in the (p+1)-th column
• Use a stump as weak learner for v AdaBoost iterations to impute the N − l missing data of y_{k*}^{(r)}, on the basis of the learning sample (y_{nk*}^{(r)}, x_n^{(r)}), with x_n = (x_{n1}, …, x_{np}), for n = 1, …, l
• Set r = r + 1 and go back to step 1 until all missing data are imputed
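A simplified sketch of this loop for categorical columns, assuming a pandas data frame in which at least some columns are complete; scikit-learn's AdaBoostClassifier (whose default weak learner is a depth-one stump) stands in for the boosted-stump classifier of the algorithm.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

def binpi_impute(X: pd.DataFrame, n_boost: int = 500) -> pd.DataFrame:
    """Iteratively impute the column with the fewest missing values first."""
    X = X.copy()
    while X.isna().any().any():
        n_mis = X.isna().sum()
        k_star = n_mis[n_mis > 0].idxmin()        # column with the fewest missing values
        predictors = X.columns[X.notna().all()]   # currently complete columns
        observed = X[X[k_star].notna()]
        missing = X[X[k_star].isna()]
        model = AdaBoostClassifier(n_estimators=n_boost)
        model.fit(observed[predictors], observed[k_star])
        X.loc[missing.index, k_star] = model.predict(missing[predictors])
    return X
```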

[Diagram: one pass of the algorithm, from the working matrix T^{(r)} to T^{(r+1)}; the selected column T_{k*}^{(r)} is imputed by fitting a boosted-stump classifier C = H_B^{stump}(X) on the observed part (steps 1–2) and applying it to the rows M with missing values, C(M), to obtain the imputed values of y (steps 3–4).]

Simulation Setting: binary missing case
• X1, …, X10 uniform in [0, 10]
• Data are missing with conditional probability
  π = [1 + exp(α − Xβ)]^{−1},
  α being a constant and β a vector of coefficients
• Goal: estimate the expected value π and the number of incorrect imputations for each variable under imputation
• Compared methods:
  – Unconditional Mean Imputation (UMI)
  – Incremental Non Parametric Imputation (INPI)
  – Boosted Incremental Non Parametric Imputation (BINPI), with 500 boosting iterations
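A small sketch of this missingness mechanism; the values of alpha and beta below are illustrative, not those used in the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 10
X = rng.uniform(0, 10, size=(n, p))              # X1, ..., X10 uniform in [0, 10]

alpha = 3.0                                      # illustrative constant
beta = np.zeros(p)
beta[:2] = 0.5                                   # illustrative coefficients (X1, X2 drive missingness)

pi = 1.0 / (1.0 + np.exp(alpha - X @ beta))      # conditional missingness probability
X_obs = X.copy()
X_obs[rng.uniform(size=n) < pi, 0] = np.nan      # delete X1 where the record is flagged as missing
```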

Simulation Study: binary missing case, linear relationships (Sims 1–3)
Response variables:
  Y1 ~ Bin(n, [2 + 0.35(X1 + X2)] / 10)
  Y2 ~ Bin(n, [4 + 0.35(X2 + X3)] / 10)  and  Y2 ~ Bin(n, [5 + 0.35(X2 + X3)] / 10)
  Y3 ~ Bin(n, [3 + 0.35(X3 + X4)] / 10)
Missingness probabilities:
  m1 = [1 + exp(3 − 0.5(X1 + X2))]^{−1}
  m2 = [1 + exp(3 − 0.5(X2 + X3))]^{−1}
  m3 = [1 + exp(3 − 0.5(X3 + X4))]^{−1}

Simulation Study: binary missing case, non-linear relationships (Sims 4–5)
Response variables:
  Y1 ~ Bin(n, sin(0.3 X1 + 0.9 X2))
  Y2 ~ Bin(n, sin(0.9 X2 + 0.3 X3))
  Y3 ~ Bin(n, sin(0.5 X3 + 0.5 X4))
Missingness probabilities:
  m1 = [1 + exp(1.5 − 0.5(0.3 X1 + 0.9 X2))]^{−1}
  m2 = [1 + exp(1.5 − 0.5(0.3 X2 + 0.9 X3))]^{−1}
  m3 = [1 + exp(1.5 − 0.5(0.5 X3 + 0.5 X4))]^{−1}

Simulation Study: binary missing case, main results (linear relationships)
(#err = number of incorrect imputations; πj = estimated expected value of Yj)

Sim 1          #err Y1  #err Y2   π1      π2
UMI              0        81     0.2130  0.1620
INPI             2         0     0.2150  0.2430
BINPI Stump      0         0     0.2130  0.2430
TRUE             -         -     0.2130  0.2430
# missings     203        81

Sim 2          #err Y1  #err Y2  #err Y3   π1      π2      π3
UMI              0        80      804     0.2120  0.1980  0.0150
INPI             1         0      432     0.2130  0.2780  0.3730
BINPI Stump      0         0       84     0.2120  0.2780  0.7350
TRUE             -         -        -     0.2120  0.2780  0.8190
# missings     169        80      808

Sim 3          #err Y1  #err Y2  #err Y3   π1      π2      π3
UMI            164        78      191     0.6260  0.4350  0.6470
INPI             2         1        1     0.4640  0.5120  0.4570
BINPI Stump      0         0        0     0.4620  0.5130  0.4560
TRUE             -         -        -     0.4620  0.5130  0.4560
# missings     164        78      191

Simulation Study: binary missing case, main results (non-linear relationships)

Sim 4          #err Y1  #err Y2   π1      π2
UMI             25        17     0.1630  0.1710
INPI            45        95     0.1870  0.2630
BINPI Stump     24        94     0.1640  0.2620
TRUE             -         -     0.1760  0.1880
# missings     151       138

Sim 5          #err Y1  #err Y2  #err Y3   π1      π2      π3
UMI             76        86       77     0.6070  0.6320  0.6190
INPI            70        86       84     0.4510  0.6320  0.5216
BINPI Stump     62        72       66     0.4670  0.6200  0.5600
TRUE             -         -        -     0.5310  0.5460  0.5420
# missings     180       170      180

Simulation Setting: numerical missing values
• X1, …, X10 uniform in [0, 10]
• Data are missing with conditional probability
  π = [1 + exp(α − Xβ)]^{−1},
  α being a constant and β a vector of coefficients
• Goal: estimate the mean and the standard deviation of the variable(s) under imputation
• Compared methods:
  – Unconditional Mean Imputation (UMI)
  – Parametric Imputation via Multiple Regression (PI)
  – Incremental Non Parametric Imputation by variable (INPI_var)
  – Boosted Incremental Non Parametric Imputation (BINPI, with and without stump), with 500 boosting iterations
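A small sketch of the evaluation criterion for the numerical case: the mean and standard deviation of a variable after imputation are compared with the complete-data values; unconditional mean imputation is shown as the simplest baseline.

```python
import numpy as np

def unconditional_mean_imputation(y_obs):
    """Fill the missing entries of y_obs with the mean of the observed ones."""
    y = y_obs.copy()
    y[np.isnan(y)] = np.nanmean(y_obs)
    return y

def mean_std_gap(y_complete, y_imputed):
    """Absolute gaps between complete-data and post-imputation mean and std."""
    return (abs(y_complete.mean() - y_imputed.mean()),
            abs(y_complete.std() - y_imputed.std()))
```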

Simulation Study: numerical missing values

Sim 1:
  Y1 ~ N(X1 + X2², exp(0.3 X1 + 0.1 X2))
  Y2 ~ N(X3² + X4, exp(1 + 0.5(0.3 X3 + 0.1 X4)))

Sim 2:
  Y1 ~ N(X1 + X2², exp(0.3 X1 + 0.1 X2))
  Y2 ~ N(X3² + X4, exp(1 + 0.5(0.3 X3 + 0.1 X4)))
  Y3 ~ N(X5² + X6, exp(1 + 0.5(0.2 X5 + 0.3 X6)))

Simulation Study: numerical missing values Main Results

A real dataset: Boston Housing
• 13 numerical variables and 506 instances
• 1 variable with missing data (“Median value of owner-occupied homes”)
• Number of missing values: 238 instances
• Pre-processing: standardization of the variables

Boston Housing: Main Result

Final remarks
• INPI provides a more accurate imputation than NPI, PI and UMI
• INPI_var is computationally preferred to INPI
• BINPI is always preferred to INPI_var in terms of accuracy
• BINPI_stump is preferred to BINPI in two-class problems in terms of computational efficiency
• BINPI is preferred to BINPI_stump for the imputation of numerical missing values in terms of accuracy

Conclusions
• Using trees for missing data imputation
• Novel algorithms: Boosted Incremental Non Parametric Imputation (BINPI), with or without STUMP
• Results of a simulation study and an application
• The implementation: Tree Harvest Software

References
• Aria, M. (2005). Multi-Class Budget Exploratory Trees. In M. Vichi, P. Monari, S. Mignani, A. Montanari (eds.), New Developments in Classification and Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, Springer-Verlag, 3-8.
• Aria, M., Siciliano, R. (2003). Learning from Trees: Two-Stage Enhancements. CLADAG 2003, Book of Short Papers (Bologna, September 22-24, 2003), CLUEB, Bologna, 21-24.
• Eibl, G., Pfeiffer, K. P. (2002). How to Make AdaBoost.M1 Work for Weak Base Classifiers by Changing Only One Line of the Code. Machine Learning: ECML 2002, Lecture Notes in Artificial Intelligence, Springer.
• Freund, Y., Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1).
• Friedman, J. H., Popescu, B. E. (2005). Predictive Learning via Rule Ensembles. Technical Report, Stanford University.
• Hastie, T., Tibshirani, R., Friedman, J. (2002). The Elements of Statistical Learning. Springer-Verlag, New York.
• Petrakos, G., Conversano, C., Farmakis, G., Mola, F., Siciliano, R., Stavropoulos, P. (2004). New Ways to Specify Data Edits. Journal of the Royal Statistical Society, Series A, 167(2), 249-274.
• Siciliano, R., Conversano, C. (2002). Tree-Based Classifiers for Conditional Missing Data Incremental Imputation. Proceedings of the International Conference on Data Clean (Jyväskylä, May 29-31, 2002), University of Jyväskylä.
• Siciliano, R., Aria, M., Conversano, C. (2004). Tree Harvest: Methods, Software and Some Applications. In J. Antoch (ed.), Proceedings in Computational Statistics, COMPSTAT 2004, Physica-Verlag, 1807-1814.
