New Mathematics and Natural Computation  World Scientific Publishing Company

NOVEL ENSEMBLE TECHNIQUES FOR REGRESSION WITH MISSING DATA

MOSTAFA M. HASSAN
Computer Engineering, Cairo University, Giza, Egypt. [email protected]

AMIR F. ATIYA
Computer Engineering, Cairo University, Giza, Egypt. [email protected]

NEAMAT EL GAYAR
Faculty of Computer and Information Technology, Cairo University, Giza, Egypt. [email protected]

RAAFAT EL-FOULY
Computer Engineering, Cairo University, Giza, Egypt. [email protected]

In this paper we consider the problem of missing data and develop an ensemble-network model for handling it. The proposed method is based on utilizing the inherent uncertainty of the missing records to generate diverse training sets for the ensemble's networks. Specifically, we generate the missing values using their probability distribution function and repeat this procedure many times, thereby creating a number of complete data sets. A network is trained on each of these data sets, yielding an ensemble of networks. Several variants are proposed, and we show analytically that one of these variants is superior to the conventional mean-substitution approach in the limit of a large training set. Simulation results confirm the general superiority of the proposed methods compared to the conventional approaches.

Keywords: Missing values, missing value imputation, ensemble networks, regression.

1. Introduction

When confronted with real-world data, one usually finds that the integrity of the data is far from ideal. For example, one typically encounters missing records, erroneous data, outliers, etc., that confound the successful application of data analysis methods. When applying machine learning methods such as neural networks, this can present a problem, as some of the developed algorithms assume all training data are available and "clean". It is imperative to develop ways to mitigate the effects of these imperfections as a parallel effort to algorithm development. In this paper we consider the problem of missing data. Missing data can be substantial in some datasets. One cannot simply delete the training data points having missing values, as this will waste a lot of useful data.


A better strategy is to make the most of all the data, that is, to find intelligent ways to use all available records while alleviating the effect of the missing data 1. The missing data problem has been studied in the traditional statistics literature (see Ref. 1 and Ref. 2). There are some well-known methods that handle missing data. The most trivial is the casewise deletion method (CWD), which simply ignores the training patterns that have some missing values. Problems have been documented with this method, particularly for linear regression problems, such as an observed bias in the parameters, large standard errors, etc. 3. A better approach is the so-called mean substitution method (MS). In this approach we fill in the missing values by calculating the mean value of the given input or feature component, and use this mean to replace all missing values for that particular input. This method is simple, efficient and generally gives good results 3. Another well-known method for handling missing values is the EM (expectation maximization) algorithm. This algorithm alternates between two steps. In the expectation (E) step it computes the expectation of the likelihood by including the latent (or missing) variables as if they were observed. In the maximization (M) step it computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step. The parameters here are the covariance matrix and the mean vector. Because the method rests on a solid probability-theory footing, it has established itself as among the best and most widely used methods for filling in missing data. The EM approach is considered a type of single imputation method. A more refined approach is the multiple imputation approach (Rubin 4). In this approach the missing values are imputed conditional on the non-missing values. This is repeated k times (typically 3-5 times). Each time we perform the analysis on the resulting complete dataset. Then we obtain the average parameter estimates over these k trials, as well as their standard errors. For example, we average the linear regression coefficients and obtain their confidence intervals using these repeated trials 4. Some studies have considered the missing value problem in the context of neural networks. Tresp et al 5 proposed a solution for the missing data problem using a maximum likelihood framework that requires a weighted integration over the missing inputs. They applied it to feedforward neural networks, and certain networks with Gaussian basis functions yielded a closed-form solution. The same authors 6 proposed another approach for handling missing data in neural networks by taking into account the distribution of the missing values. For the training phase they showed that the backpropagation step for an incomplete pattern can be approximated by a weighted average backpropagation step. For the recall phase they obtained a closed-form solution. Viharos et al 7 proposed a new method for handling missing data in neural network models. The idea of their work is to adapt the neural network structure according to the missing components and perform a kind of projection operation. Ramoni et al 8 introduced a new method called the robust Bayesian estimator (RBE) for estimating the conditional probability of missing data.
Instead of estimating the values of missing data as in the EM algorithm, they estimate an interval for the missing values using EM concepts and Gibbs sampling. Ghahramani and Jordan 9 also considered the problem of estimating the distribution of datasets having missing values. They used a mixture model, and obtained its parameters using an EM algorithm. Other related methods that use a combination of graphical models and the EM algorithm have been proposed in Ref. 10, Ref. 11 and Ref. 12. Some methods have also considered the concept of ensembles of neural networks. Twala and Cartwright 13 introduced a new ensemble method that deals with missing data. The method is based on applying two imputation methods: Bayesian multiple imputation and K-nearest neighbor single imputation. An ensemble of the two methods is created and the results are combined using voting.


Jiang et al 14 proposed an ensemble method for handling missing values. They delete any cases that have missing values and create several complete datasets obtained as subsets of the training set. An ensemble is created from the models trained on the different datasets. Twala et al 15 presented a comparison between seven different missing data methods. They showed that a combination of missing data methods leads to a significant improvement in prediction performance, up to 50%, and is significantly better than using any single method. In this paper we present two new methods for handling missing data. The goal is not to estimate the missing record values per se, but to make as full use as possible of these records in the given regression or classification task. In the proposed method we marry the concept of multiple imputation with ensemble networks. So we make use of the concept of repeated generation of missing values, and at the same time utilize the power and success of the ensemble network concept. Ensemble networks have been proposed by various researchers and in various forms. Most notably, Breiman developed the bagging approach, which is based on creating a training set for each of the ensemble networks by sampling with replacement from the original training set 16. Another ensemble approach, boosting, is based on placing an adaptively learned higher sampling probability on harder samples. A group of methods is based on injecting some generated noise into the inputs, the outputs or the weights of the constituent networks 17. Ensemble networks have been shown in many studies to consistently beat "single networks", i.e. the case of using one neural network as a model. The main reason for the consistent improvement is that ensemble models such as bagging tend to reduce the prediction variance (due to averaging several networks' outputs). Of course, for ensemble networks to be effective, they have to be diverse. The two proposed methods are based on utilizing the inherent uncertainty of the missing records in generating diverse training sets for the ensemble's networks. Specifically, the approaches are based on generating different training sets by keeping clean records as is while generating the missing records according to their distribution. The two methods are the univariate approach (which uses the unconditional distribution) and the multivariate approach (which uses the distribution conditioned on the other data). Unlike ensemble methods based on injecting noise, which tend to increase the variance of each constituent model due to the extra added noise, the proposed methods do not add any extra variance or uncertainty. They utilize the uncertainty that is already in the data to create the extra copies (of the training sets). To help select between the two methods, we prove that under certain conditions the univariate approach yields a better mean square error than the mean substitution method. The proposed methods apply to the "missing completely at random" (MCAR) situation. This means that whether a record is missing does not depend on the value of any of the data; it is purely by chance. The organization of this work is as follows. In Section 2 we describe the first proposed method (the univariate ensemble method), detailing the steps of the algorithm. In Section 3 we describe the second proposed method (the multivariate ensemble method).
In Section 4 we present the simulation results, including a comparison with other competing methods. Finally, Section 5 gives the conclusions of this work.

2. Univariate Ensemble Method (UVE)


2.1. Main Idea

The main idea of the proposed methods is to utilize the uncertainty in the missing records to create different versions of the dataset, which we use for training the different networks of an ensemble. This is done by modeling the uncertainty of a missing record in terms of its probability distribution. We then fill in the missing values in the training set by generating numbers from the missing values' probability distributions. We repeat this procedure many times, creating many versions of the dataset. Each dataset is fed to a prediction model (or network). After training each network we have an ensemble of networks, whose outputs we combine by averaging to produce the ensemble's final output. This approach provides a solution to the problem of handling missing data, while at the same time making use of the power and efficiency of ensemble networks. In the case of the univariate ensemble method, we consider each input variable separately and estimate its distribution from the available training data. The advantage of this approach over, say, the mean substitution method is that the filled-in data do not affect the mean or the variance of the data, thus achieving some kind of consistency. In contrast, while the mean substitution method does not affect the mean, it does reduce the variance of the data. The details of the algorithm are given as follows. (We assume the variables in this work are continuous; the extension to categorical variables is straightforward.)

2.2. Steps of the Algorithm:

For each input variable X that has some missing values (a brief sketch in code is given after the steps):

(1) Estimate the probability distribution of X from the given data, for example using a Parzen window density estimator.
(2) Generate values from this estimated distribution to fill in the missing values for input variable X.
(3) Repeat the last two steps for all input variables.
(4) Repeat the last loop k times.
(5) Supply these k different versions of the filled datasets to an ensemble of k networks. Train these k networks.
(6) Depending on the problem type (classification or regression), use voting or averaging to obtain the ensemble output.
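The following sketch illustrates steps (4)-(6) for a regression problem. It is only a minimal illustration, not the authors' original Matlab implementation: it assumes NumPy and scikit-learn's MLPRegressor (ten tanh hidden units, mirroring the setup of Section 4.2) and a user-supplied fill_missing routine such as the Parzen-based sampler sketched in Section 2.3.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_uve_ensemble(X, y, fill_missing, k=50, seed=0):
    """Train k networks, each on a differently imputed copy of (X, y).

    X is an (n, m) array with np.nan marking missing entries;
    fill_missing(X, rng) must return a complete copy of X.
    """
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        X_filled = fill_missing(X, rng)        # steps (2)-(4): one imputed copy
        net = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                           max_iter=1000)
        net.fit(X_filled, y)                   # step (5): train one ensemble member
        models.append(net)
    return models

def predict_uve_ensemble(models, X_test):
    """Step (6): average the member predictions (regression case)."""
    preds = np.stack([m.predict(X_test) for m in models], axis=0)
    return preds.mean(axis=0)
```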

2.3. Generating the Missing Values:

A simple approach to generating the missing values is to assume a Gaussian density for the input variable, with mean and variance equal to those estimated from the data, and to use this density to generate the data. This is, however, too simple to yield adequate results. Another way, and this is the approach we followed, is to estimate the density of the input variable using the Parzen window estimation method 18, and then generate the missing values for that input variable from the estimated density. The Parzen window estimate is given by

p(X) = \frac{1}{nh} \sum_{i=1}^{n} \varphi\!\left( \frac{X - X(i)}{h} \right)        (2.1)

where X is the input variable for which we estimate the density, the X(i)'s are the training set values of this input variable, n is the size of the training set, h is the window width, and \varphi(u) is a kernel function, defined as

\varphi(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}        (2.2)

There has been much research on the optimal window size h: some researchers prove that it should scale as n^{-1/5}, while others choose a value proportional to n^{-1/2} that works well in practice 18. In the simulations we used the latter. Once we have estimated the density, we use the following method to generate a random variate from the Parzen window density estimate:

• Generate a uniform random number R from the set {1, ..., n}. This serves to select which Gaussian function in the summation of Eq. (2.1) we will generate from.
• Generate a random variate X from a Gaussian distribution with mean equal to the center of the selected Gaussian function (i.e. X(R)) and standard deviation h.

Then X is distributed according to the input variable density estimate of Eq. (2.1).
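A minimal NumPy sketch of this sampling scheme is given below; it can serve as the fill_missing routine assumed in the ensemble sketch of Section 2.2. The per-column treatment and the n^{-1/2} window-width rule follow the description above; the proportionality constant for h (here the sample standard deviation), the function names, and the use of np.nan to mark missing entries are illustrative assumptions.

```python
import numpy as np

def parzen_sample(values, size, rng):
    """Draw `size` samples from the Parzen estimate (Eq. 2.1) built on `values`."""
    n = len(values)
    h = np.std(values) * n ** (-0.5)          # window width proportional to n^(-1/2)
    centers = rng.choice(values, size=size)   # pick a Gaussian component uniformly (the index R)
    return centers + h * rng.standard_normal(size)  # sample N(X(R), h^2)

def fill_missing(X, rng):
    """Fill np.nan entries column by column using the univariate Parzen sampler."""
    X_filled = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if miss.any():
            observed = X[~miss, j]
            X_filled[miss, j] = parzen_sample(observed, miss.sum(), rng)
    return X_filled
```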

2.4. An Approximate Mathematical Analysis of UVE

For simplicity, assume the value of only one variable is missing for a particular training pattern. Without loss of generality, assume that the missing value is that of input variable X_1, and that the values of the remaining variables X_2, ..., X_m are not missing. Let X ≡ (X_2, ..., X_m). Assume that the training set is large enough and that the network has perfectly learned the function. Consider the performance on a test pattern (with only the value of X_1 missing), and consider the mean substitution method (MS), the UVE method, and the ideal situation where the value of X_1 is non-missing (NM). Define

E_{MS} = E_{X_1}\left[ (Y_{MS} - Y_{NM})^2 \right]        (2.3)

where E_{X_1} is the expectation with respect to the variable X_1, and Y_{MS} and Y_{NM} are the prediction outputs for the MS and NM cases, respectively, with

Y_{MS} = f[\bar{X}_1, X]        (2.4)

where

\bar{X}_1 = \mathrm{mean}(X_1) \cong \frac{1}{n} \sum_{i=1}^{n} X_1(i)        (2.5)

E_{MS} = \int \left( f(\bar{X}_1, X) - f(X_1, X) \right)^2 p(X_1)\, dX_1
       \cong \frac{1}{n} \sum_{i=1}^{n} \left( f(\bar{X}_1, X) - f(X_1(i), X) \right)^2        (2.6)

where the index i in X_1(i) denotes the ith training pattern. In the following analysis we will frequently exchange the summation with the expectation, as we assume a large training set. Assume f is mildly nonlinear. Expanding f(X_1(i), X) in a Taylor series around \bar{X}_1, we get

E_{MS} \cong \frac{1}{n} \sum_{i=1}^{n} \left[ f(\bar{X}_1, X) - \left( f(\bar{X}_1, X) + (X_1(i) - \bar{X}_1)\, \frac{\partial f(\bar{X}_1, X)}{\partial X_1} + \frac{1}{2} (X_1(i) - \bar{X}_1)^2\, \frac{\partial^2 f(\bar{X}_1, X)}{\partial X_1^2} \right) \right]^2

where \partial^i f(\bar{X}_1, X) / \partial X_1^i denotes the ith derivative of f with respect to its first component, i.e. the component X_1. For short, let us denote

f' = \frac{\partial f(\bar{X}_1, X)}{\partial X_1}, \qquad f'' = \frac{\partial^2 f(\bar{X}_1, X)}{\partial X_1^2}, \qquad \Delta X_{1i} = X_1(i) - \bar{X}_1

Then

E_{MS} \cong \frac{1}{n} \sum_{i=1}^{n} \left[ \Delta X_{1i}\, f' + \frac{1}{2} (\Delta X_{1i})^2 f'' \right]^2
      = \frac{1}{n} \sum_{i=1}^{n} \left[ (\Delta X_{1i})^2 f'^2 + \frac{1}{4} (\Delta X_{1i})^4 f''^2 + (\Delta X_{1i})^3 f' f'' \right]
      \cong \sigma_{X_1}^2 f'^2 + \frac{1}{4} E\left[ (\Delta X_{1i})^4 \right] f''^2 + E\left[ (\Delta X_{1i})^3 \right] f' f''        (2.7)

Considering E_{UVE}:

E_{UVE} \cong \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{1}{n} \sum_{j=1}^{n} f(X_1(j), X) - f(X_1(i), X) \right]^2
       = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{1}{n} \sum_{j=1}^{n} \left( f(\bar{X}_1, X) + \Delta X_{1j} f' + \frac{1}{2} (\Delta X_{1j})^2 f'' \right) - \left( f(\bar{X}_1, X) + \Delta X_{1i} f' + \frac{1}{2} (\Delta X_{1i})^2 f'' \right) \right]^2
       = \frac{1}{n} \sum_{i=1}^{n} \left[ \left( f(\bar{X}_1, X) + \frac{1}{2} \sigma_{X_1}^2 f'' \right) - \left( f(\bar{X}_1, X) + \Delta X_{1i} f' + \frac{1}{2} (\Delta X_{1i})^2 f'' \right) \right]^2

where we used the fact that \sum_{j=1}^{n} \Delta X_{1j} = 0. We get

E_{UVE} \cong \sigma_{X_1}^2 f'^2 + \frac{1}{4} E\left[ (\Delta X_{1i})^4 \right] f''^2 - \frac{1}{4} \sigma_{X_1}^4 f''^2 + E\left[ (\Delta X_{1i})^3 \right] f' f''        (2.8)

From (2.7) and (2.8) one can see that


E_{UVE} \cong E_{MS} - \frac{1}{4} \sigma_{X_1}^4 f''^2

which means that the expected error for UVE is smaller than that of MS.
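This gap can be checked numerically. The following small Monte Carlo sketch compares the two imputation rules in the idealized setting of the analysis (known f, only X_1 missing at test time); the particular choice f(x_1, x_2) = x_1^2 + x_2 with standard normal X_1 is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Idealized setting of the analysis: f is known, only X1 of a test pattern is missing.
f = lambda x1, x2: x1 ** 2 + x2           # mildly nonlinear in x1 (f'' = 2)

n = 100_000
x1_train = rng.normal(0.0, 1.0, n)        # observed training values of X1
x2 = 0.5                                  # fixed non-missing part X of the test pattern

x1_test = rng.normal(0.0, 1.0, n)         # true (unobserved) values of X1 at test time
y_true = f(x1_test, x2)

y_ms = f(x1_train.mean(), x2)             # mean substitution: plug in the mean of X1
y_uve = f(x1_train, x2).mean()            # UVE with ideal f: average f over imputed X1 values

e_ms = np.mean((y_ms - y_true) ** 2)
e_uve = np.mean((y_uve - y_true) ** 2)
print(e_ms, e_uve, e_ms - e_uve)          # difference close to (1/4) * sigma^4 * f''^2 = 1
```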

3. Multivariate Ensemble Method (MVE)

Taking the univariate approach one step further, the multivariate approach is based on utilizing the information in the other input variables to obtain a more accurate estimate of the density of the missing records. Specifically, we partition each training pattern into two groups: the "missing group" and the "non-missing group". We then compute the probability density of the missing group conditioned on the values of the non-missing group, and generate the values of the missing group from this conditional density. We generate values for all the missing entries of the training set this way, and repeat the procedure many times to obtain many versions of a "clean" training set. We then train a neural network on each version of the training set to obtain an ensemble of networks, whose outputs are combined to obtain the final output. To obtain the conditional probability density described above, we assume the data to be distributed as a multivariate Gaussian. Due to the curse of dimensionality, we did not consider a more sophisticated estimate such as a Parzen window estimate of the multivariate density. Below are the steps of the algorithm.

3.1. Algorithm Steps:

(1) Obtain an estimate of the mean vector \mu and the covariance matrix \Sigma of the multivariate distribution of the vector of variables X (see the next subsection for a method to estimate these).
(2) For each record X that has some missing values do the following:
(a) Separate X into two vectors U (missing) and V (non-missing).
(b) Partition the mean vector and covariance matrix into the missing and non-missing components. The joint probability density can then be written as (where N stands for the normal density with the arguments being the mean vector and the covariance matrix):

\begin{pmatrix} U \\ V \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_U \\ \mu_V \end{pmatrix}, \begin{pmatrix} \Sigma_{UU} & \Sigma_{UV} \\ \Sigma_{UV}^T & \Sigma_{VV} \end{pmatrix} \right)

(c) Obtain the conditional probability density U|V (for the purpose of generating the missing values in U given the known values in V), using the following:

U \mid V \sim N(\mu_{U|V}, \Sigma_{U|V})

\mu_{U|V} = \mu_U + \Sigma_{UV} \Sigma_{VV}^{-1} (V - \mu_V)        (3.1)

\Sigma_{U|V} = \Sigma_{UU} - \Sigma_{UV} \Sigma_{VV}^{-1} \Sigma_{UV}^T

(d) Use this conditional density to generate values that fill in all missing entries of U (a sketch of this step in code is given after the list).
(3) Repeat the last loop k times.
(4) Supply these k different versions of the filled datasets to an ensemble of k different neural networks.
(5) Depending on the problem type (classification or regression), use averaging to obtain the ensemble output.
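A minimal NumPy sketch of steps (2)(b)-(d), under the multivariate Gaussian assumption of Eq. (3.1); the function name and the use of np.nan to mark missing entries are illustrative assumptions.

```python
import numpy as np

def mve_fill_record(x, mu, cov, rng):
    """Fill the missing entries of one record x (np.nan marks missing) by
    sampling from the conditional Gaussian of Eq. (3.1)."""
    miss = np.isnan(x)
    if not miss.any():
        return x
    obs = ~miss
    mu_U, mu_V = mu[miss], mu[obs]
    S_UU = cov[np.ix_(miss, miss)]
    S_UV = cov[np.ix_(miss, obs)]
    S_VV = cov[np.ix_(obs, obs)]

    # Conditional mean and covariance of U given V (Eq. 3.1).
    S_VV_inv = np.linalg.inv(S_VV)
    mu_cond = mu_U + S_UV @ S_VV_inv @ (x[obs] - mu_V)
    S_cond = S_UU - S_UV @ S_VV_inv @ S_UV.T

    x_filled = x.copy()
    x_filled[miss] = rng.multivariate_normal(mu_cond, S_cond)
    return x_filled
```

Applying this record by record, k times, yields the k filled datasets of steps (3)-(4); in practice the conditional covariance may require the positive-semidefiniteness repair described in Section 3.2.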


3.2. Estimation of the Covariance Matrix:

For the multivariate method we need to estimate the covariance matrix \Sigma using the data available in the presence of missing values. Let X(l) \in R^M denote the lth training pattern vector.

Let I_i = \{ l \mid X_i(l) \ \text{is not missing} \}. Then we estimate the (i, j)th element of \Sigma as

\Sigma_{ij} = \frac{1}{N_{ij} - 1} \sum_{l \in I_i \cap I_j} (X_i(l) - \hat{\mu}_i)(X_j(l) - \hat{\mu}_j)        (3.2)

where N_{ij} \equiv |I_i \cap I_j| denotes the number of training patterns for which both input variable i and input variable j are present (non-missing), and \hat{\mu}_i is the estimated mean of input variable i. The problem with such an estimate is that, due to the presence of missing data, the resulting matrix need not be positive semidefinite. As a result, one can encounter an estimated negative variance in the conditional distribution of Eq. (3.1). To alleviate this situation, we restore the covariance matrix to positive semidefiniteness, as detailed below. Let \Sigma be the measured M x M covariance matrix (possibly not positive semidefinite). Define the set of all positive semidefinite M x M matrices:

S = \{ A \mid X^T A X \geq 0 \ \text{for all} \ X \in R^M \}        (3.3)

It is well known that the set S is a closed convex cone (called the PSD cone for short) 19. The restored covariance matrix \tilde{\Sigma} is defined as the closest positive semidefinite matrix to \Sigma:

\tilde{\Sigma} = \arg\min_{A \in S} \| A - \Sigma \|        (3.4)

where the matrix norm \| \cdot \| is the Frobenius norm. The solution of the problem of Eq. (3.4) is given as follows.

Theorem 3.1. Consider the following steps.
• Obtain the eigenvalue decomposition

\Sigma = U \Lambda U^T        (3.5)

where U is the matrix of eigenvectors and \Lambda is the diagonal matrix of eigenvalues of the covariance matrix \Sigma. Let \tilde{\Lambda} be the same as \Lambda except that the negative elements on the diagonal are replaced by zero. Then the approximated covariance matrix \tilde{\Sigma} that solves the problem of Eq. (3.4) is given by

\tilde{\Sigma} = U \tilde{\Lambda} U^T        (3.6)

Proof. This theorem is given in Dattorro 19 (p. 126, Section 2.9.2.6). The proof is given there with little detail, so we give a detailed proof here. The solution of the problem of Eq. (3.4) is given by the projection of \Sigma onto the cone S. The projection of any point y outside a cone onto the cone satisfies (see Ref. 19, p. 566, Theorem E.9.2.0.1)

(y - Py) \perp Py        (3.7)

where Py denotes the projection of y onto the cone. Applying this to our problem, the solution of the problem of Eq. (3.4) satisfies

(\Sigma - \tilde{\Sigma}) \cdot \tilde{\Sigma} = 0        (3.8)

The dot product is given by taking the sum after element-wise multiplication:

(\Sigma - \tilde{\Sigma}) \cdot \tilde{\Sigma} = \mathrm{Trace}\left( (\Sigma - \tilde{\Sigma})\, \tilde{\Sigma}^T \right)
                                              = \mathrm{Trace}\left( U (\Lambda - \tilde{\Lambda}) \tilde{\Lambda} U^T \right)
                                              = 0        (3.9)

The last equality follows from the fact that for the diagonal indices i where \Lambda_{ii} is negative we have \tilde{\Lambda}_{ii} = 0, while for the diagonal indices i where \Lambda_{ii} is positive we have \Lambda_{ii} - \tilde{\Lambda}_{ii} = 0. This completes the proof of the theorem.

3.3. A Variant of the Multivariate Method:

The target output provides another useful piece of evidence that can help in obtaining a more accurate conditional density for the missing components. In this variant, we utilize the target output, in addition to the non-missing variables, to determine the missing-component density. Thus, in point (c) of Step 2 of the previous algorithm, we evaluate the conditional probability density p(U|V, d) instead of p(U|V), where d is the target output of the considered record. The advantages of this approach are that the target output can be a useful piece of information, and that the target output is available for all records (otherwise the record is an unlabeled point and will typically be removed in a supervised setting). The disadvantage is that using the target output to obtain a missing input variable could introduce some bias due to the circular relationship between input and output. We call this method the Multivariate Ensemble method with Output Augmentation.

3.3.1. Two More Variants Using Different Ensemble Methods:

We also experimented with another ensemble-type method by Navone, Granitto, Verdes and Ceccatto 20 (henceforth called the NGVC method), which we used in conjunction with the proposed methods. Navone et al pointed out that it is beneficial to be selective about the networks added to the ensemble. In other words, their method adds a (trained) prediction model to the ensemble only if it improves the performance of the combined ensemble on a validation set. Otherwise, the prediction model is discarded, as its presence could be detrimental to overall performance. The detailed steps of Navone et al's method are as follows (a code sketch is given after the steps):

Step 1: Generate a bootstrap sample S1 from the training set Z, and a validation set V1 by collecting all instances in Z that are not included in S1. Generate a model f1 by training a network on S1 until the error e1 on the validation set V1 reaches a minimum.

Step 2: Generate a new training set S2 and a validation set V2 using the procedure described in Step 1. Produce a model f2 by training a network until the validation error eF2(V2) of the aggregate predictor F2 = (f1 + f2)/2 reaches a minimum. In this step the parameters of model f1 remain constant and only the model f2 is trained.

Step 3: If eF2(V2) > e1, then disregard the model f2 and start again with Step 2 using new random sets S2 and V2.

Step 4: If a model f2 is found, incorporate it into the ensemble and proceed again with Steps 2 and 3, seeking a model f3 such that eF3(V3), the minimum validation error on V3 of the aggregate F3 = (f1 + f2 + f3)/3, becomes smaller than eF2(V2).

Step 5: Iterate the process until NA models have been trained (including accepted and rejected ones). The ensemble then consists of the accepted networks.

We also implemented another ensemble variant called the Select Best Predictors method (SBP for short). In this method we train a number of prediction models and select only the best 80% of them based on their performance on the validation set.
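The following is a compact sketch of the NGVC acceptance rule and the SBP selection rule. It assumes NumPy, a generic train_model(X, y) factory returning objects with a predict method, and mean-squared validation error; these are illustrative assumptions rather than the exact setup of Navone et al.

```python
import numpy as np

def val_error(models, X_val, y_val):
    """Validation error of the equally weighted aggregate of `models`."""
    agg = np.mean([m.predict(X_val) for m in models], axis=0)
    return np.mean((agg - y_val) ** 2)

def ngvc_ensemble(X, y, train_model, n_attempts=50, seed=0):
    """Greedy NGVC-style selection: keep a candidate only if the aggregate
    validation error improves on the previously accepted error."""
    rng = np.random.default_rng(seed)
    ensemble, best_err = [], np.inf
    for _ in range(n_attempts):
        idx = rng.integers(0, len(y), len(y))          # bootstrap sample S
        oob = np.setdiff1d(np.arange(len(y)), idx)     # validation set V (out-of-bag)
        candidate = train_model(X[idx], y[idx])
        err = val_error(ensemble + [candidate], X[oob], y[oob])
        if err < best_err:                             # accept only if the aggregate improves
            ensemble.append(candidate)
            best_err = err
    return ensemble

def sbp_ensemble(models, errors, keep=0.8):
    """SBP: keep the best fraction of models according to their validation errors."""
    k = max(1, int(round(keep * len(models))))
    order = np.argsort(errors)
    return [models[i] for i in order[:k]]
```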


4. Implementation and results

4.1. Introduction

To test the performance of the proposed methods, we have conducted a comparative study using a number of real-world benchmark problems. In our experiments the final goal is to obtain better output prediction, rather than better estimation of the missing values per se (even though better prediction of the missing values generally helps in obtaining more accurate output prediction). So the criterion we use is the error in output prediction. We have compared the proposed methods with the case-wise deletion method (CWD), the mean substitution method (MS), and the EM algorithm for missing value estimation.

4.2. Experiments setup

We have used a multilayer neural network as the prediction model. Specifically, we used a network with one hidden layer containing ten hidden nodes. The hidden nodes use a tansig transfer function, while the output node uses a linear transfer function. The training algorithm is gradient descent with momentum and adaptive learning rate (the traingdx training method of Matlab). The number of training iterations is 1000 epochs. We used the Matlab default values for the initial learning rate and the momentum coefficient. In all ensemble-type methods we used an ensemble size of 50. The EM algorithm used is the implementation of Ref. 21, which implements the version of the EM algorithm described in Ref. 22. The criterion function is a normalized version of the mean square error, defined as

ERR = \frac{ \sum_i (\hat{y}_i - y_i)^2 }{ \sum_i \hat{y}_i^2 }

where \hat{y}_i is the target value and y_i is the predicted value. To gauge the effect of the training sample size on the performance difference between the methods, we tested the methods with different training set sizes, specifically 200, 500, and 1000. All the rest of the data is used as a test set. The datasets used have quite a lot of data, so the test set performance will reflect the true or expected performance quite faithfully.
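In code, the criterion is a one-liner (NumPy assumed; note that, following the paper's notation, \hat{y} denotes the target and y the prediction):

```python
import numpy as np

def err(y_hat, y_pred):
    """Normalized mean square error of Section 4.2: y_hat is the target, y_pred the prediction."""
    return np.sum((y_hat - y_pred) ** 2) / np.sum(y_hat ** 2)
```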


We have used five datasets from Internet repositories 23. The first and second datasets are two different versions of the bank dataset, a family of datasets synthetically generated from a simulation of how bank customers choose their banks. The task is to predict the fraction of bank customers who quit the bank because of full queues; the bank family of datasets is generated from a simplistic simulator of the queues in a group of banks 23. The third and fourth datasets are different versions of the housing dataset, where the task is to predict housing values in areas of the city of Boston 23. The fifth dataset is the robot kinematics dataset, which relates link angles to the torques needed to control the robot. For each dataset five independent runs have been performed. In each run we perform a random train/test partition, and after running all methods we average each method's test error over the 5 runs. For each run we randomly selected the missing entries from the training set (by uniformly drawing the to-be-missing entries from the training set matrix). We tested the methods on different fractions of missing data: low missing rates of 5% and 10%, medium missing rates of 20% and 30%, and high missing rates of 40% and 50%. Hence we have 18 different combinations: three different training sizes and six different missing ratios. We did not inject any missing values into the test set in this first set of experiments. Tables 1 to 5 show the ERR error measure for datasets 1 to 5, respectively. The columns indicate the different methods: the casewise deletion method (CWD), the mean substitution method (MS), the expectation maximization method (EM), the univariate ensemble method (UVE), the multivariate ensemble method (MVE) and the multivariate ensemble method with output augmentation (MVO). Each row shows the results at a given missing ratio and training size. In some cases for the CWD method the high missing ratios lead to the deletion of the data altogether; we indicate these cases in the tables by NA.


Table 1: Results for Bank Dataset 1 Using Different Training Sizes and Different Missing Ratios.

Training Size | Missing Ratio | CWD    | MS     | EM     | UVE    | MVE    | MVO
200  |  5% | 0.6989 | 0.3914 | 0.6347 | 0.2521 | 0.2524 | 0.2514
200  | 10% | 3.6221 | 0.5888 | 0.4155 | 0.2559 | 0.2546 | 0.2508
200  | 20% | NA     | 0.5352 | 0.4826 | 0.2701 | 0.2682 | 0.2577
200  | 30% | NA     | 0.5485 | 0.5358 | 0.2921 | 0.2941 | 0.2672
200  | 40% | NA     | 0.7007 | 0.542  | 0.2915 | 0.317  | 0.2869
200  | 50% | NA     | 0.9142 | 0.6619 | 0.331  | 0.3347 | 0.2918
500  |  5% | 0.4628 | 0.3598 | 0.3298 | 0.2365 | 0.2362 | 0.2324
500  | 10% | 2.8513 | 0.3744 | 0.3909 | 0.2395 | 0.2378 | 0.2318
500  | 20% | 0.7176 | 0.3634 | 0.347  | 0.2555 | 0.252  | 0.2357
500  | 30% | NA     | 0.4348 | 0.4512 | 0.2763 | 0.2722 | 0.2408
500  | 40% | NA     | 0.4005 | 0.5124 | 0.3022 | 0.2977 | 0.249
500  | 50% | NA     | 0.6779 | 0.5375 | 0.3327 | 0.332  | 0.2672
1000 |  5% | 0.3542 | 0.2895 | 0.3062 | 0.2378 | 0.2359 | 0.2339
1000 | 10% | 1.0563 | 0.3333 | 0.2847 | 0.2427 | 0.2408 | 0.2336
1000 | 20% | 1.1492 | 0.295  | 0.3265 | 0.2644 | 0.2602 | 0.2415
1000 | 30% | NA     | 0.3857 | 0.3317 | 0.2813 | 0.2789 | 0.2532
1000 | 40% | NA     | 0.3257 | 0.3185 | 0.3092 | 0.306  | 0.2606
1000 | 50% | NA     | 0.4475 | 0.3861 | 0.3465 | 0.3373 | 0.2735

Table 2: Results for Bank Dataset 2 Using Different Training Sizes and Different Missing Ratios.

Training Size | Missing Ratio | CWD      | MS     | EM     | UVE    | MVE    | MVO
200  |  5% | 0.27352  | 0.1137 | 0.1487 | 0.0307 | 0.0304 | 0.0325
200  | 10% | 1.7244   | 0.1298 | 0.1983 | 0.0352 | 0.0359 | 0.0329
200  | 20% | NA       | 0.1363 | 0.1505 | 0.0441 | 0.0531 | 0.0448
200  | 30% | NA       | 0.3338 | 0.2076 | 0.0592 | 0.074  | 0.0566
200  | 40% | NA       | 0.312  | 0.2861 | 0.0781 | 0.0998 | 0.074
200  | 50% | NA       | 0.4168 | 0.2854 | 0.1119 | 0.1405 | 0.094
500  |  5% | 0.14313  | 0.0676 | 0.114  | 0.0344 | 0.0337 | 0.0348
500  | 10% | 0.94172  | 0.0721 | 0.0804 | 0.0389 | 0.0381 | 0.0376
500  | 20% | NA       | 0.1025 | 0.1158 | 0.0534 | 0.0512 | 0.0493
500  | 30% | NA       | 0.1808 | 0.1195 | 0.0748 | 0.0725 | 0.0614
500  | 40% | NA       | 0.1704 | 0.1196 | 0.0997 | 0.098  | 0.0762
500  | 50% | NA       | 0.2672 | 0.2718 | 0.1312 | 0.1262 | 0.0844
1000 |  5% | 0.069202 | 0.0686 | 0.0618 | 0.0307 | 0.0295 | 0.0291
1000 | 10% | 0.36696  | 0.062  | 0.0866 | 0.0368 | 0.0333 | 0.0337
1000 | 20% | 0.5412   | 0.0721 | 0.0905 | 0.0506 | 0.049  | 0.0463
1000 | 30% | NA       | 0.0851 | 0.0953 | 0.0705 | 0.0673 | 0.0579
1000 | 40% | NA       | 0.1028 | 0.124  | 0.0995 | 0.0976 | 0.0761
1000 | 50% | NA       | 0.1945 | 0.1941 | 0.1314 | 0.1234 | 0.0924


Table 3: Results for Housing Dataset 1 Using Different Training Sizes and Different Missing Ratios.

Training Size | Missing Ratio | CWD    | MS     | EM     | UVE    | MVE    | MVO
200  |  5% | 0.4219 | 0.4141 | 0.4121 | 0.3261 | 0.325  | 0.3232
200  | 10% | 0.4518 | 0.446  | 0.4249 | 0.3341 | 0.3278 | 0.3247
200  | 20% | 0.7927 | 0.5136 | 0.4727 | 0.3309 | 0.3382 | 0.3231
200  | 30% | 3.9554 | 0.4872 | 0.6794 | 0.3516 | 0.3372 | 0.332
200  | 40% | NA     | 0.5293 | 0.7459 | 0.3695 | 0.3535 | 0.3525
200  | 50% | NA     | 0.6437 | 0.6305 | 0.3969 | 0.383  | 0.3702
500  |  5% | 0.3552 | 0.4058 | 0.4018 | 0.3198 | 0.314  | 0.3178
500  | 10% | 0.3757 | 0.3924 | 0.3632 | 0.3246 | 0.319  | 0.3188
500  | 20% | 0.4599 | 0.4102 | 0.3633 | 0.3346 | 0.3285 | 0.3205
500  | 30% | 0.5973 | 0.3916 | 0.4007 | 0.3506 | 0.3437 | 0.3269
500  | 40% | 5.2010 | 0.5578 | 0.6948 | 0.3579 | 0.342  | 0.3157
500  | 50% | 5.6858 | 0.4587 | 0.5776 | 0.3897 | 0.3851 | 0.3416
1000 |  5% | 0.3848 | 0.4007 | 0.385  | 0.3037 | 0.2982 | 0.2953
1000 | 10% | 0.3888 | 0.4071 | 0.379  | 0.3097 | 0.3003 | 0.2971
1000 | 20% | 0.4400 | 0.3787 | 0.3944 | 0.3214 | 0.3125 | 0.298
1000 | 30% | 0.4938 | 0.4058 | 0.4645 | 0.3349 | 0.3214 | 0.3074
1000 | 40% | 1.3177 | 0.4122 | 0.4293 | 0.3521 | 0.3364 | 0.3085
1000 | 50% | NA     | 0.4609 | 0.4223 | 0.3627 | 0.3476 | 0.313

Table 4: Results for Housing Dataset 2 Using Different Training Sizes and Different Missing Ratios.

Training Size | Missing Ratio | CWD    | MS     | EM     | UVE    | MVE    | MVO
200  |  5% | 0.4957 | 0.5038 | 0.6294 | 0.4378 | 0.4443 | 0.4385
200  | 10% | 0.5276 | 0.5579 | 0.5343 | 0.4464 | 0.4418 | 0.4435
200  | 20% | 0.8140 | 0.5694 | 0.5217 | 0.4496 | 0.4494 | 0.4547
200  | 30% | 2.9916 | 0.7569 | 0.5949 | 0.457  | 0.4568 | 0.4482
200  | 40% | NA     | 0.8159 | 0.6763 | 0.4577 | 0.4602 | 0.4621
200  | 50% | NA     | 0.6823 | 0.6824 | 0.4709 | 0.4729 | 0.463
500  |  5% | 0.4424 | 0.4896 | 0.4632 | 0.4067 | 0.408  | 0.4055
500  | 10% | 0.4362 | 0.4522 | 0.4988 | 0.4171 | 0.4164 | 0.4134
500  | 20% | 0.7487 | 0.4623 | 0.4978 | 0.4275 | 0.4277 | 0.4196
500  | 30% | 1.3649 | 0.5183 | 0.4891 | 0.4367 | 0.4365 | 0.4225
500  | 40% | NA     | 0.4767 | 0.5775 | 0.4469 | 0.4476 | 0.4289
500  | 50% | 3.8381 | 0.5358 | 0.6369 | 0.4633 | 0.4675 | 0.4397
1000 |  5% | 0.4471 | 0.451  | 0.442  | 0.3968 | 0.3955 | 0.3935
1000 | 10% | 0.4642 | 0.4812 | 0.4823 | 0.4025 | 0.4053 | 0.397
1000 | 20% | 0.4993 | 0.4803 | 0.438  | 0.4186 | 0.4156 | 0.4061
1000 | 30% | 0.6402 | 0.4637 | 0.4864 | 0.4333 | 0.4318 | 0.4129
1000 | 40% | 1.3333 | 0.5456 | 0.4581 | 0.444  | 0.4391 | 0.4185
1000 | 50% | NA     | 0.5482 | 0.5416 | 0.4554 | 0.4542 | 0.4225



Table 5: Results for the Kinematics Dataset Using Different Training Sizes and Different Missing Ratios.

Training Size | Missing Ratio | CWD    | MS     | EM     | UVE    | MVE    | MVO
200  |  5% | 0.0900 | 0.0826 | 0.0771 | 0.0628 | 0.0621 | 0.0619
200  | 10% | 0.0959 | 0.0816 | 0.0807 | 0.0654 | 0.0654 | 0.0653
200  | 20% | 0.2650 | 0.0846 | 0.0889 | 0.0718 | 0.0708 | 0.0704
200  | 30% | 0.5429 | 0.0956 | 0.0952 | 0.077  | 0.076  | 0.0729
200  | 40% | 0.7947 | 0.0951 | 0.0979 | 0.084  | 0.0825 | 0.077
200  | 50% | NA     | 0.1087 | 0.1143 | 0.0906 | 0.0898 | 0.0821
500  |  5% | 0.0775 | 0.0768 | 0.0712 | 0.0632 | 0.0638 | 0.064
500  | 10% | 0.0810 | 0.0808 | 0.078  | 0.067  | 0.0657 | 0.0647
500  | 20% | 0.1141 | 0.0834 | 0.0791 | 0.0715 | 0.0705 | 0.0703
500  | 30% | 0.2633 | 0.0842 | 0.0831 | 0.0777 | 0.0767 | 0.074
500  | 40% | 0.7428 | 0.0835 | 0.0829 | 0.0825 | 0.0812 | 0.0757
500  | 50% | 0.7913 | 0.0886 | 0.088  | 0.0912 | 0.0893 | 0.0801
1000 |  5% | 0.0765 | 0.0723 | 0.0649 | 0.0627 | 0.0614 | 0.0605
1000 | 10% | 0.0749 | 0.0724 | 0.0703 | 0.0655 | 0.0644 | 0.0634
1000 | 20% | 0.0833 | 0.0797 | 0.0737 | 0.0726 | 0.0706 | 0.0683
1000 | 30% | 0.1477 | 0.0781 | 0.0813 | 0.0789 | 0.077  | 0.0728
1000 | 40% | 0.3589 | 0.0836 | 0.0823 | 0.0849 | 0.0831 | 0.0758
1000 | 50% | 0.6706 | 0.0849 | 0.0902 | 0.0919 | 0.0899 | 0.0782

4.3. Discussion

From the figures and tables one can observe that CWD is an inferior method, while MS and EM give promising results. However, the three proposed methods (UVE, MVE, and MVO) considerably outperform these three traditional methods. It is also clear that the amount of outperformance is higher for smaller training sets and for larger missing ratios. This is intuitive, because for smaller training sets and larger missing ratios the data are insufficient, and efficient methods that make up for this insufficiency will be very effective. On the other hand, for large training sets the large amount of data makes up for the information lost due to missing data. Even though the EM approach is based on optimal estimation of the missing values, it fared worse than our proposed methods. The reason is that it gives only point estimates of the missing values, whereas our proposed approaches benefit from the robustness of the ensemble concept. From the figures and tables we also observe that MVO consistently beats UVE and MVE. This is very interesting; it seems the output holds a significant amount of information about the input variables. Another observation is that the proposed methods do not deteriorate as much as the other methods as the missing ratio increases; they are more robust in that sense. We believe that the network organizes itself during training to utilize the correlations among the input variables (in addition to the generated values) to obtain better function estimates. This is not the case, though, when missing values exist in the test data. As we will see, performance is then bound to deteriorate with increasing missing ratios, which is an expected result.

4.4. Algorithm variants

We tested the ensemble variants NGVC and SBP described in the last section in conjunction with the UVE and MVE methods. To summarize the results for all datasets in a single table (Table 6), we took the liberty of averaging the ERR values over the different datasets. ERR is a normalized (scale-independent) measure, so it is reasonable to average it across datasets. For example, in Table 6 the ERR entry corresponding to UVE for missing ratio 5% and training set size 200 is the average of the corresponding ERR values for the 5 datasets (with the same missing ratio and training set size).


Table 6: Average Results for the Datasets Using Different Ensemble Variants.

Training Size | Missing Ratio | UVE (Basic Ensemble) | NGVC UVE | SBP UVE | MVE (Basic Ensemble) | NGVC MVE | SBP MVE
200  |  5% | 0.1585 | 0.1626 | 0.1576 | 0.1592 | 0.1631 | 0.1581
200  | 10% | 0.1625 | 0.1655 | 0.161  | 0.1608 | 0.1646 | 0.1593
200  | 20% | 0.1666 | 0.1751 | 0.1654 | 0.1685 | 0.1744 | 0.1686
200  | 30% | 0.1767 | 0.1843 | 0.1756 | 0.1769 | 0.1848 | 0.1775
200  | 40% | 0.183  | 0.1893 | 0.1821 | 0.1876 | 0.1918 | 0.1868
200  | 50% | 0.2002 | 0.2065 | 0.201  | 0.203  | 0.2064 | 0.204
500  |  5% | 0.1515 | 0.1454 | 0.1484 | 0.1508 | 0.146  | 0.1485
500  | 10% | 0.1553 | 0.1495 | 0.1525 | 0.1538 | 0.1497 | 0.1514
500  | 20% | 0.1632 | 0.1591 | 0.1603 | 0.1614 | 0.1569 | 0.1589
500  | 30% | 0.1737 | 0.1694 | 0.1715 | 0.1717 | 0.1684 | 0.1697
500  | 40% | 0.1842 | 0.1838 | 0.1826 | 0.1809 | 0.1802 | 0.18
500  | 50% | 0.2012 | 0.1987 | 0.1999 | 0.2    | 0.1987 | 0.1991
1000 |  5% | 0.1474 | 0.1436 | 0.1445 | 0.1458 | 0.1403 | 0.143
1000 | 10% | 0.151  | 0.1447 | 0.1479 | 0.1492 | 0.1444 | 0.1465
1000 | 20% | 0.1611 | 0.1549 | 0.1577 | 0.1583 | 0.1527 | 0.1553
1000 | 30% | 0.1713 | 0.1649 | 0.1682 | 0.1681 | 0.1643 | 0.1654
1000 | 40% | 0.1842 | 0.1793 | 0.1814 | 0.1803 | 0.1766 | 0.1779
1000 | 50% | 0.1983 | 0.1919 | 0.1954 | 0.1932 | 0.1891 | 0.1909

From the table we observe that all variants are comparable in performance on average, with SBP and NGVC a little better than the basic ensemble method.

4.5. Missing data in the test set

We performed another experiment to study the effect of missing data in the test set. The experimental setup is as follows. Five independent runs were performed for each dataset. In each run we select a different train/test partition, and we average the resulting errors over the 5 runs. We used training sets of sizes 200, 500 and 1000, and a test set of size 1000. We tested different missing ratios: 5%, 10%, 20%, 30%, 40%, and 50%, always using the same missing ratio in the test set as in the training set. Again, as in the previous subsection, we computed the average ERR values over the 5 datasets for each missing ratio and training set size. For the MVO method, since the output of a test pattern is of course unknown, we use only the input variables to generate the missing values (as in MVE); the target output is thus used only in the training phase. Table 7 shows the results. One can again see that the proposed methods (UVE, MVE and MVO) outperform the conventional methods, and again the outperformance is more significant for smaller training sets. However, unlike in the previous experiments with no missing data in the test set, the outperformance is more significant at smaller missing ratios. We believe that this is because the new methods gain an edge during training: when tested on clean data they show this edge, while when tested on more corrupt data this edge gets buried in the "noise" introduced by the missing test data.


Table 7: Average Results for the Datasets for the Case of Missing Data in the Test Set.

Training Size | Missing Ratio | CWD     | MS     | EM     | UVE    | MVE    | MVO
200  |  5% | 0.6247  | 0.3052 | 0.3791 | 0.238  | 0.2375 | 0.2368
200  | 10% | 16.7073 | 0.375  | 0.3445 | 0.264  | 0.2561 | 0.2556
200  | 20% | NA      | 0.3799 | 0.3429 | 0.3007 | 0.2944 | 0.2949
200  | 30% | NA      | 0.4018 | 0.4023 | 0.3404 | 0.3301 | 0.3308
200  | 40% | NA      | 0.4292 | 0.4012 | 0.3621 | 0.3501 | 0.3611
200  | 50% | NA      | 0.4294 | 0.3983 | 0.397  | 0.3799 | 0.3933
500  |  5% | 0.3694  | 0.3044 | 0.291  | 0.2443 | 0.2418 | 0.2411
500  | 10% | 3.1897  | 0.3091 | 0.2941 | 0.2649 | 0.2661 | 0.2665
500  | 20% | NA      | 0.3197 | 0.3147 | 0.3037 | 0.3002 | 0.3002
500  | 30% | NA      | 0.3551 | 0.3436 | 0.339  | 0.3326 | 0.3326
500  | 40% | NA      | 0.3799 | 0.4039 | 0.3645 | 0.3569 | 0.3622
500  | 50% | NA      | 0.3802 | 0.3927 | 0.3869 | 0.3818 | 0.3907
1000 |  5% | 0.283   | 0.2617 | 0.2576 | 0.2323 | 0.2297 | 0.2282
1000 | 10% | 0.9614  | 0.28   | 0.2689 | 0.2531 | 0.2485 | 0.2454
1000 | 20% | NA      | 0.2946 | 0.2866 | 0.296  | 0.2876 | 0.285
1000 | 30% | NA      | 0.3219 | 0.3196 | 0.327  | 0.32   | 0.3196
1000 | 40% | NA      | 0.3218 | 0.3225 | 0.3523 | 0.341  | 0.3418
1000 | 50% | NA      | 0.3494 | 0.3458 | 0.3783 | 0.3643 | 0.3694

5. Conclusions

In this work we presented novel methods for using ensemble networks to handle the missing data problem. The idea is to generate the missing values repeatedly to create several clean datasets, which are then used to train several neural networks forming an ensemble. We proposed three variants of the method, which differ in the way the missing values are generated. We presented an approximate mathematical analysis that indicates the efficiency of one of the proposed methods, and the simulation tests show the outperformance of the proposed methods compared to the existing methods.

Acknowledgement

The authors would like to acknowledge the support of the Egyptian Ministry of Communications & Information Technology's Center of Excellence.

References

1. F. Harrell, Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (Springer, New York, 2001).
2. P. D. Allison, Missing Data (Sage Publications, Inc., 2001).
3. J. L. Schafer and J. W. Graham, Missing Data: Our View of the State of the Art, Psychol Methods, Vol. 7 (2002) pp. 147–177.
4. D. B. Rubin, An overview of multiple imputation, in Survey Research Section, American Statistical Association (1988).


5. V. Tresp, S. Ahmad and R. Neuneier, Training neural networks with deficient data, in J. D. Cowan, G. Tesauro and J. Alspector (eds.), Advances in Neural Information Processing Systems, Vol. 6 (Morgan Kaufmann Publishers, Inc., 1994) pp. 128–135.
6. V. Tresp, R. Neuneier and S. Ahmad, Efficient methods for dealing with missing data in supervised learning, in G. Tesauro, D. Touretzky and T. Leen (eds.), Advances in Neural Information Processing Systems, Vol. 7 (The MIT Press, 1995) pp. 689–696.
7. Z. J. Viharos, L. Monostori and T. Vincze, Training and application of artificial neural networks with incomplete data, in IEA/AIE '02: Proceedings of the 15th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (Springer-Verlag, London, UK, 2002) pp. 649–659.
8. M. Ramoni and P. Sebastiani, Robust Learning with Missing Data, Machine Learning, Vol. 45(2) (2001) pp. 147–170.
9. Z. Ghahramani and M. I. Jordan, Supervised learning from incomplete data via an EM approach, in J. D. Cowan, G. Tesauro and J. Alspector (eds.), Advances in Neural Information Processing Systems, Vol. 6 (Morgan Kaufmann, San Mateo, CA, 1994).
10. A. Weiss and Y. Weiss, Multibody factorization with uncertainty and missing data using the EM algorithm, in CVPR '04: Proceedings of the International Conference on Computer Vision and Pattern Recognition (2004).
11. Z. Geng, K. Wan and F. Tao, Mixed graphical models with missing data and the partial imputation EM algorithm, Scandinavian Journal of Statistics, Vol. 27(3) (2000) pp. 433–444.
12. Z. Geng, K. Li and W. Ma, Graphical models with missing data and related algorithms, Bulletin of the Computational Statistics of Japan, Vol. 13(2) (2001) p. 135.
13. B. Twala and M. Cartwright, Ensemble Imputation Methods for Missing Software Engineering Data, in Software Metrics, 11th IEEE International Symposium (2005).
14. K. Jiang, H. Chen and S. Yuan, Classification for incomplete data using classifier ensembles, in International Conference on Neural Networks and Brain, Vol. 1 (2001) pp. 559–563.
15. B. Twala, M. Cartwright and M. Shepperd, Ensemble of missing data techniques to improve software prediction accuracy, in ICSE '06: Proceedings of the 28th International Conference on Software Engineering (ACM Press, New York, NY, USA, 2006) pp. 909–912.
16. L. Breiman, Bagging predictors, Technical Report 421, Department of Statistics, University of California at Berkeley (1994).
17. Y. Raviv and N. Intrator, Bootstrapping with noise: an effective regularization technique, Connection Science, Special issue on Combining Estimators, Vol. 8 (1996) pp. 356–372.
18. B. W. Silverman, Density Estimation for Statistics and Data Analysis (Chapman & Hall/CRC, 1986).
19. J. Dattorro, Convex Optimization & Euclidean Distance Geometry (Meboo Publishers, 2006).
20. H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto, A learning algorithm for neural network ensembles, Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial, Vol. 12 (2001) pp. 70–74.
21. http://www.gps.caltech.edu/~tapio/imputation/index.html
22. T. Schneider, Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values, Journal of Climate, Vol. 14 (2001) pp. 853–871.
23. http://www.niaad.liacc.up.pt/~ltorgo/Regression/DataSets.html
