Machine Learning Techniques for Solving Classification Problems with Missing Input Data
Pedro J. García-Laencina, Aníbal R. Figueiras-Vidal and José-Luis Sancho-Gómez
Abstract— Missing input data is a common drawback that pattern classification techniques need to deal with when solving real-life classification tasks. The ability to handle missing data has become a fundamental requirement because inappropriate treatment may cause large errors or wrong classification results. The goal of this paper is to analyze the missing data problem in pattern classification, and to present a descriptive state of the art of machine learning solutions for performing classification tasks with missing values.
I. INTRODUCTION

Pattern classification is the discipline of building machines to classify data (patterns) based either on a priori knowledge or on statistical information extracted from the patterns [1]–[3]. In general [3], a classical definition of a pattern† is an entity that can be represented by a set of properties and variables (feature vector). For example, a pattern could be an audio signal, the corresponding feature vector being its frequency spectral components; or a patient, the feature vector being the results of his/her medical tests. Many real-world applications of this interdisciplinary field suffer from a common drawback: missing or unknown data (incomplete feature vectors). In medical diagnosis [4], [5], some tests cannot be performed either because the hospital lacks the necessary medical equipment or because some medical tests may not be appropriate for certain patients. In control-based applications [6], [7], missing values appear due to failures of monitoring or data-collection equipment, disruption of the communication between data collectors and the central management system, etc. Wireless sensor networks [8] also suffer from incomplete data for different reasons, such as power outages at the sensor nodes, random occurrences of local interference, or a higher bit error rate of the wireless radio transmissions. Another example is incomplete observations in business applications such as credit assignment (unknown information about the credit petitioner) [9]. The ability to handle missing data has become a fundamental requirement for pattern classification because inappropriate treatment of missing data may cause large errors or false classification results. Missing data is a subject that has been treated extensively in the literature on statistical analysis [10]–[12], and also, although with less effort, in the machine learning literature.

P. J. García-Laencina and J. L. Sancho-Gómez are with the Departamento de Tecnologías de la Información y las Comunicaciones, Universidad Politécnica de Cartagena, 30202, Cartagena-Murcia, Spain (email: [email protected]). A. R. Figueiras-Vidal is with the Departamento de Teoría de Señal y Comunicaciones, Universidad Carlos III de Madrid, 28911, Leganés-Madrid, Spain.
† In the following, the terms pattern, case, observation, input vector, instance, sample and example are used as synonyms.

In this work,
we will focus on machine learning solutions for incomplete pattern classification. In general, pattern classification with missing data concerns two different problems: handling missing values and pattern classification. Most machine learning techniques can be grouped into four different types of approaches depending on how both problems are solved:
• Deletion of missing values (complete-case and available-data analysis), and classifier design using only the complete data portion.
• Imputation of missing input values; after that, another machine learns the classification task using the edited complete set, i.e., complete cases and incomplete patterns with imputed values.
• Use of Maximum Likelihood (ML) approaches, where the input data distribution is modeled by the Expectation-Maximization (EM) algorithm, and the classification is performed by means of the Bayes rule.
• Use of machine learning procedures able to handle missing data without an explicit imputation.
Figure 1 summarizes the different approaches in machine learning for incomplete pattern classification. In the first two types of approaches (data deletion and imputation), the two problems, handling missing values and pattern classification, are solved separately; in contrast, the third type of approach models the Probability Density Function (PDF) of the input data (complete and incomplete cases), which is then used to classify by means of Bayes decision theory. Finally, in the last kind of approach, the classifier is designed to handle missing values without a previous missing data imputation. This work makes a descriptive survey of machine learning techniques for solving incomplete pattern classification tasks, trying to highlight their advantages and disadvantages. Due to space constraints, we are not able to provide a complete and detailed study of the proposed solutions for missing data in pattern classification. The remainder of this paper is structured as follows. Section II describes both real applications where missing values appear and the different missing data mechanisms. In the rest of the paper, several methods for dealing with missing data are discussed. Finally, the conclusions end this work.

II. MISSING DATA MECHANISMS

The appropriate way to handle incomplete input data depends in most cases on how data attributes became missing. Little and Rubin [10] define several distinct types of missing data mechanisms.
Fig. 1. Methods for pattern classification with missing data. This scheme shows the different machine learning procedures that are analyzed in this work.
• Missing Completely At Random (MCAR). The MCAR situation occurs when the probability that a variable is missing is independent of the variable itself and of any other external influence. The available variables contain all the information needed to make the inferences. A typical example of MCAR is when a tube containing the blood sample of a study subject is broken by accident (such that the blood parameters cannot be measured). The probability that an input value is missing is not related to any other feature.
• Missing At Random (MAR). The missingness is independent of the missing variables, but the pattern of data missingness is traceable or predictable from other variables in the database. An example is a sensor that fails occasionally during the data acquisition process due to a power outage. The cause of the incomplete data is some other external influence.
• Not Missing At Random (NMAR). The pattern of data missingness is non-random and depends on the missing variable. In contrast to the MAR situation, the missing variable in the NMAR case cannot be predicted only from the available variables in the database. If a sensor cannot acquire information outside a certain range, its data are missing due to NMAR factors. If missing data are NMAR, valuable information is lost from the data, and there is no general method of handling missing data properly.
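To make the three mechanisms concrete, the following minimal Python sketch (numpy is assumed; the feature names, probabilities and thresholds are purely illustrative and not taken from any of the cited works) generates a missing-data mask for one feature under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)   # feature that may become missing
x2 = rng.normal(size=n)   # fully observed companion feature

# MCAR: missingness is independent of everything (e.g., a broken sample tube).
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed feature x2
# (e.g., a sensor failure whose cause is recorded elsewhere).
mar_mask = rng.random(n) < 0.4 * (x2 > 0)

# NMAR: missingness depends on the unobserved value itself
# (e.g., a sensor that cannot report values outside its range).
nmar_mask = x1 > 1.5

x1_mcar = np.where(mcar_mask, np.nan, x1)
x1_mar = np.where(mar_mask, np.nan, x1)
x1_nmar = np.where(nmar_mask, np.nan, x1)
```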
When data are MCAR or MAR, the missing data mechanism is termed ignorable. Ignorable mechanisms are important because, when they occur, a researcher can ignore the reasons for missing data in the analysis of the data, and thus simplify the model-based methods used for missing data analysis [11]. For this reason, the majority of the research covers the cases where missing data are of the MAR or MCAR type. For the focus of this work, we can also divide the missingness mechanisms according to the target to be predicted:
• Informative. The fact that a value is missing provides information about the class target.
• Non-Informative. The distribution of missing values is the same for all classes.
An example of informatively missing values would be a customer database where fewer values are missing for active customers than for passive ones, because the company knows its active customers better. These categories of missingness mechanisms make a difference for prediction, because informatively missing values should be treated as extra values for the respective feature, as this information provides hints about the target that can be learned. For non-informatively missing values, missing data treatment methods, such as imputation, may be used instead. Also, most missing value handling methods embedded in learning algorithms are only appropriate for the case of non-informatively missing values.
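One common way to expose informative missingness to the learner is to append a binary indicator per incomplete feature. The sketch below is a minimal illustration of this idea (numpy is assumed; the placeholder fill and the function name are choices of the example, not a method from the cited literature).

```python
import numpy as np

def add_missing_indicators(X):
    """Append one binary column per feature marking where values are missing,
    so that a classifier can exploit informative missingness as an extra input."""
    mask = np.isnan(X).astype(float)          # 1.0 where the value is missing
    X_filled = np.where(np.isnan(X), 0.0, X)  # placeholder fill; any imputation could be used
    return np.hstack([X_filled, mask])

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0]])
X_aug = add_missing_indicators(X)   # shape (3, 4): original features plus indicators
```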
When discussing the application of pattern classification algorithms to data sets including missing values, two scenarios can be distinguished:
• It may be the case that the data used for training are complete, and values are missing only in the test data.
• Values may be missing both in the training and the test data.
The first case is not a common situation, because the classifier has been designed using complete cases, and no assumption about missing data has been made during its training. In this situation, there are two possible solutions: incomplete cases have to be excluded from the data set, or unknown values have to be filled in by imputation. This preprocessing step is often necessary to enable the training of models, e.g., when using standard feed-forward neural networks, which cannot deal with missing values. The second case is more natural, since training and test data are gathered and processed in a highly similar way, and thus the classifier is trained considering that input vectors may be incomplete.
III. COMPLETE CASE AND AVAILABLE DATA ANALYSIS

Complete-case analysis omits the cases that have missing data for any variable used in a particular analysis. This procedure can be justified whenever a large quantity of data is available. Its advantage is the possibility of directly using the standard pattern classification methods for complete data, but a test pattern with missing values cannot be classified because it will be omitted by the deletion process. In available-case analysis, only the cases with available features are used. In general, available-case analysis makes better use of the data than complete-case analysis (this method uses all the available values); but its disadvantage is that the number of attributes is not the same for all patterns (it depends on which features are incomplete for each pattern), and so the standard classification methods cannot be directly applied. These approaches are simple, but their major drawback is the loss of information that occurs in the deletion process.

IV. MISSING DATA IMPUTATION METHODS BASED ON MACHINE LEARNING

A. Artificial Neural Network Imputation: ANNimpute

The imputation approaches based on Artificial Neural Networks (ANNs) generally consist of creating an ANN model to estimate the values that will substitute the missing data; e.g., a Multi-Layer Perceptron (MLP) is used as a regression model in order to estimate missing values [13], [14]. After a complete edited set is obtained, i.e., complete cases and imputed cases, a classifier system must be trained using this edited set to solve the classification task. ANNimpute can be a useful tool for reconstructing missing values. However, its main disadvantage is that when missing items appear in several attributes, several MLP models have to be designed, one per combination of missing variables. Other MLP methods have been proposed. In [15], Yoon and Lee propose the TEST (Training-EStimation-Training) algorithm as a way of using an MLP to impute missing data. It consists of three steps. Firstly, the network is trained with all complete patterns. Secondly, the parameters (weights) of the network are used to estimate the missing data in all incomplete patterns by means of back-propagation in the inputs. Thirdly, the network is trained again using the whole data set, i.e., complete and imputed patterns. However, this procedure cannot estimate missing values in the test set. Recently, a useful neural network approach based on Multi-Task Learning (MTL) has been proposed [16], [17]. This procedure (MTLimpute) combines missing data imputation and pattern classification in one network. MTLimpute uses the incomplete features as extra tasks and learns them in parallel with the main classification task. The outputs that learn the incomplete features are used to estimate missing values during the learning process. The imputed values are those that contribute to improving the classification accuracy, because the learning of the imputation tasks is oriented by the learning of the main task. In this way, missing data imputation is oriented to solve the classification task. Although not strictly an ANN, Support Vector Machines (SVMs) have also been used for missing data imputation, where the predicted values generated by the SVM model are used as imputed ones [18].
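As a rough illustration of the ANNimpute idea, the following sketch (scikit-learn is assumed; the use of MLPRegressor, its hyperparameters, and the restriction to a single incomplete attribute are simplifications of the example, not details of the cited works) fits a regression model on the complete cases and uses its predictions to fill one incomplete attribute.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def ann_impute(X, missing_col):
    """Impute one incomplete column with an MLP regressor trained on complete cases.
    Assumes that missing values occur only in column `missing_col`."""
    miss = np.isnan(X[:, missing_col])
    other = np.delete(X, missing_col, axis=1)          # remaining attributes as inputs
    complete = ~miss & ~np.isnan(other).any(axis=1)    # rows usable for training
    model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
    model.fit(other[complete], X[complete, missing_col])
    X_imp = X.copy()
    X_imp[miss, missing_col] = model.predict(other[miss])
    return X_imp
```

A classifier would then be trained on the edited complete set returned by this procedure; when several attributes are incomplete, one such model per combination of missing variables would be needed, which is precisely the drawback noted above.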
B. Recurrent Neural Network Imputation: RNNimpute

A Recurrent Neural Network (RNN) is an architecture with feedback connections among its units. Bengio and Gingras [19] propose an RNN with feedback into the input units for estimating missing data. Firstly, missing values are initialized with mean imputation, and these values are then updated using the feedback connections while the network is trained to learn the classification task. Missing values are modified as a function of the missing input in the past iteration and the weighted sum of a set of recurrent links from the other units (hidden and missing) to the missing unit with a unit delay. Parveen [20] has extended RNNimpute using the true values of the incomplete features as extra targets during training.

C. Auto-Associative Neural Network Imputation: AANNimpute

An Auto-Associative Neural Network (AANN), also known as an autoencoder, is a set of neurons that are completely connected in the sense that each neuron receives input from, and sends output to, all the other neurons. Some works tackle missing data imputation by means of this kind of network [21]–[24]. In [21], [22], an AANN learns from complete cases in order to duplicate all of the inputs as outputs. After that, when unknown values are detected, the weights are not updated. Instead, missing values are replaced by the network outputs. Marwala et al. [23], [24] propose a method that combines the use of AANNs with Genetic Algorithms (GAs) to approximate missing data. The GA is used to estimate the missing values by optimizing a cost function between the input vector and the outputs estimated by the AANN.

D. K-Nearest Neighbour Imputation: KNNimpute

The K-nearest neighbour method selects K patterns from the complete neighbours (i.e., the complete cases) such that they minimize some similarity measure. The nearest, most similar, neighbours are found by minimizing a distance function [25]. Once the K nearest neighbours (donors) have been found, a replacement value to substitute for the missing attribute value must be estimated. How the replacement value is calculated depends on the type of data; the mode can be used for discrete data and the mean for continuous data. An improved alternative is to weight the contribution of each of the K neighbours according to their distance to the incomplete pattern whose values will be imputed, giving a greater contribution to closer donors [26]. An advantage over mean imputation and the simple hot deck method (in fact, KNNimpute with K = 1) is that the replacement values are influenced only by the most similar cases rather than by all cases or by the single most similar one, respectively. The main drawback of this approach is that whenever KNNimpute looks for the most similar patterns, the algorithm searches through the whole data set (in the complete data portion), which implies a high computational cost.
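The distance-weighted variant of KNNimpute can be sketched in a few lines of Python (numpy is assumed; the Euclidean distance over the observed attributes and the small constant that avoids division by zero are choices of the example).

```python
import numpy as np

def knn_impute(x, complete_X, k=5):
    """Fill the missing entries of pattern x from its k nearest complete donors,
    weighting each donor by the inverse of its distance to x."""
    miss = np.isnan(x)
    # distances are computed only over the attributes observed in x
    d = np.sqrt(((complete_X[:, ~miss] - x[~miss]) ** 2).sum(axis=1))
    donors = np.argsort(d)[:k]
    w = 1.0 / (d[donors] + 1e-8)              # closer donors contribute more
    x_imp = x.copy()
    x_imp[miss] = (w[:, None] * complete_X[donors][:, miss]).sum(axis=0) / w.sum()
    return x_imp
```

Note that every call scans the whole complete data portion to rank the donors, which is the computational cost mentioned above.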
E. Self-Organizing Map Imputation: SOMimpute

The basic Self-Organizing Map (SOM) consists of a two-dimensional array of nodes that defines a mapping from the input data space onto a latent space, using weight vectors that connect each node of the SOM with the input data (these weights have the same dimensionality as the input patterns) [27]. Samad and Harp [28] implement SOM approaches that handle missing values by changing how the input data are processed. In particular, when an observation with missing features is given as input to the map, the missing variables are simply ignored when the distances between the observation and the nodes are computed. This principle is applied both for selecting the image-node and for updating the weights. Fessant and Midenet [29] extend this procedure to missing data imputation. Firstly, when an incomplete pattern is presented to the SOM, its image-node is chosen ignoring the distances in the missing variables; secondly, an activation group composed of the image-node's neighbours is selected; and finally, each imputed value is computed based on the weights of the activation group's nodes in the missing dimensions. Following this idea, Piela [30] implements missing data imputation in a Tree Structured SOM (TS-SOM), which is made of several SOMs arranged in a tree structure. The major advantages of this approach over the basic SOM are its faster convergence and its computational benefit when the number of instances is large.
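A minimal sketch of this SOM-based strategy is given below, assuming a codebook of weight vectors that has already been trained; as a simplification, the imputed values are read off the single image-node rather than an activation group of neighbours.

```python
import numpy as np

def som_impute(x, som_weights):
    """som_weights: array of shape (n_nodes, d) holding the trained SOM codebook.
    The image-node is selected ignoring the missing dimensions of x, and the
    missing entries are taken from that node's weight vector."""
    miss = np.isnan(x)
    # distance to every node computed only over the observed dimensions
    d = ((som_weights[:, ~miss] - x[~miss]) ** 2).sum(axis=1)
    bmu = np.argmin(d)                 # best-matching unit (image-node)
    x_imp = x.copy()
    x_imp[miss] = som_weights[bmu, miss]
    return x_imp
```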
Finally, most of these imputation methods can be used following the Multiple Imputation (MI) approach [10]–[12]. MI is a general paradigm for the analysis of incomplete data. Each missing datum is replaced by m simulated values, producing m simulated versions of the complete data. Each version is analyzed by standard complete-data methods, and the results are combined using simple rules to produce inferential statements that incorporate the missing data uncertainty.

V. MAXIMUM LIKELIHOOD BASED APPROACHES

The methods described in this section make assumptions about the joint distribution of all variables in the model, and the model parameters are estimated using a Maximum Likelihood (ML) approach. The Expectation-Maximization (EM) algorithm is an efficient iterative procedure to compute the ML estimate in the presence of missing or hidden data. In ML estimation, we wish to estimate the model parameter(s) for which the observed data are the most likely. In addition, mixture models provide a general semi-parametric model for arbitrary densities [1], [2], where the density functions are modeled as a linear combination of component densities. In particular, real-valued data can be modeled as a mixture of Gaussians (Gaussian Mixture Model, GMM), and discrete-valued data can be modeled as a mixture of Bernoulli densities or a mixture of multinomial densities [31]. EM associates a given incomplete-data problem with a simpler complete-data problem, and iteratively finds the maximum likelihood estimates of the missing data. In a typical situation, EM converges monotonically to a fixed point in the state space, usually a local maximum. Each iteration of the EM algorithm consists of two processes: the E-step and the M-step. The Expectation or E-step computes the log likelihood of the data, and the Maximization or M-step finds the parameters that maximize this likelihood [31]. In the case of GMMs [31], the state space consists of all the possible assignments to the means, covariance matrices, and prior probabilities of the Gaussian distributions. Missing features can be treated naturally in this framework. In the Expectation, or E-step, the missing data are estimated given the observed data and the current estimate of the model parameters. This is achieved using the conditional expectation. In the M-step, the likelihood function is maximized under the assumption that the missing data are known; the estimates of the missing data from the E-step are used instead of the actual missing data. In a classification problem, the mixture modeling framework can model the class label as a multinomial variable, i.e., the mixture model estimates the joint probability that an input vector has certain attribute values and belongs to a certain class. Once the model is obtained, the most likely label for a particular input pattern may be obtained by computing the class-posterior probabilities using Bayes' theorem [1], [2], [31]. In addition, efficient neural network approaches for handling missing data have been developed that incorporate the modelling of the input data distribution into the network [32]–[35]. Ahmad and Tresp discuss Bayesian techniques for extracting class probabilities given partial data [32]. Considering that the classifier outputs are good estimates of the class-posterior probabilities given the input vector, and splitting the input into a vector of complete features and a vector of unknown features, they demonstrate that the optimal solution involves integrating over the missing dimensions weighted by the local probability densities. Moreover, in [32], closed-form approximations to the class probabilities are obtained using GMMs, and the authors extend the approach to an MLP classifier by calculating the integrals with Monte Carlo techniques. Instead of performing a numerical approximation of these integrals, Tresp et al. [33], [34] propose an efficient solution using GMMs and Parzen windows to estimate the conditional probability densities that appear in the integral (for more details, see [32]–[34]). Finally, Williams et al. [35] have recently proposed a classification scheme with missing values based on logistic regression. In this approach, the conditional density functions are estimated with a GMM, whose parameters are obtained using both EM and Variational Bayesian EM (VB-EM). According to [35], this method outperforms the GMM trained with EM, as well as mean and multiple imputation, but its largest drawback is the restriction to a linear classifier.
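To make the Bayes-rule route concrete, the following sketch (numpy and scipy are assumed; a single Gaussian per class is used instead of a full mixture, and the function name is illustrative, so this is a simplification rather than any of the cited methods) classifies an incomplete pattern by evaluating each class-conditional density only over the observed dimensions, which for Gaussians amounts to dropping the rows and columns corresponding to the missing features.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_incomplete(x, means, covs, priors):
    """means[c], covs[c]: per-class Gaussian parameters (e.g. fitted with EM);
    priors[c]: class priors. Marginalizing a Gaussian over the missing
    dimensions reduces to using the sub-mean and sub-covariance of the
    observed dimensions."""
    obs = ~np.isnan(x)
    post = []
    for m, S, p in zip(means, covs, priors):
        lik = multivariate_normal.pdf(x[obs], mean=m[obs], cov=S[np.ix_(obs, obs)])
        post.append(p * lik)
    post = np.asarray(post)
    return post / post.sum()   # class-posterior probabilities via the Bayes rule
```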
VI. MACHINE LEARNING METHODS FOR HANDLING MISSING DATA

Up to now, this work has discussed several machine learning procedures where missing values are replaced by a plausible value, i.e., missing values are imputed. In contrast, other approaches have been proposed for handling missing data in classification problems while avoiding explicit imputations.

A. Ensemble Approaches

Neural network ensemble models have also been used for the classification of incomplete data [13], [36], [37]. Sharpe and Solly [13] propose a procedure named Network Reduction. In this method, a set of MLPs is created, and each MLP performs classification on a different possible combination of complete features, in order to cover the complete range of attributes with missing values. The main drawback of this method is that it requires a huge number of neurons when multiple combinations of incomplete attributes are present. Krause and Polikar [36] develop an ensemble of classifiers trained with random subsets of features for the missing feature problem. As each classifier component of the ensemble is generated according to a weighted vector of features, it cannot be guaranteed that all the instances have been used to train the ensemble; therefore, useful information may be ignored. Juszczak and Duin [37] propose to form an ensemble of one-class classifiers trained on each feature. Thus, when any input values are missing for a data point to be labeled, the ensemble can still make a reasonable decision based on the remaining classifiers.

B. Decision Tree Algorithms

Among these methods, two well-known approaches stand out: ID3 and C4.5. These procedures can handle missing values in any attribute for both the training and the test sets [38], [39]. ID3 is a basic top-down decision tree algorithm that handles an unknown attribute by generating an additional edge for the unknown value. Thus, "unknown" is taken as a new possible value for each attribute and treated in the same way as the other values. C4.5 is an extension of ID3 proposed by Quinlan [38]. It uses a probabilistic approach to handle missing values in the training and test data sets, dealing with missing values differently in the training and testing stages. During training, each value of an attribute is assigned a weight [38], [39]. If the value of an attribute is known, then its weight is set equal to one; otherwise, the weight of any other value for that attribute is the relative frequency of that value. In the testing phase, if a test case is incomplete, the algorithm explores all available branches (below the current node) and decides the class label by the most probable value.
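The weighting idea used during C4.5 training can be sketched as follows (a minimal illustration in Python; the data structures, function name and toy attribute values are assumptions of the example, not part of the C4.5 implementation itself).

```python
import numpy as np

def c45_value_weights(attribute_column, value=None):
    """Return the weight that a training case contributes to each branch of a
    split on this attribute: weight 1 for its observed value, or the relative
    frequencies of the observed values when the value is missing."""
    observed = [v for v in attribute_column if v is not None]
    values, counts = np.unique(observed, return_counts=True)
    freqs = counts / counts.sum()
    if value is not None:
        return {v: (1.0 if v == value else 0.0) for v in values}
    return dict(zip(values, freqs))    # fractional weights for a missing value

col = ['sunny', 'rain', None, 'sunny', 'overcast', 'sunny']
print(c45_value_weights(col, 'rain'))  # known value: full weight on one branch
print(c45_value_weights(col, None))    # missing value: weights = relative frequencies
```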
C. Fuzzy Approaches

Several fuzzy procedures have been developed in order to handle missing data [40]–[42]. Ishibuchi et al. [40] propose an MLP classifier where unknown values are represented by interval inputs. For example, if the pattern space of a particular classification problem is the d-dimensional unit cube [0, 1]^d, each unknown feature value is represented by the interval input [0, 1], which includes all the possible values of that attribute. When the attribute value is completely known, it is also represented by an interval (e.g., 0.3 is represented by [0.3, 0.3]). This network is trained by means of a back-propagation algorithm for fuzzy input vectors [40]. Gabrys [41] develops a General Fuzzy Min-Max (GFMM) neural network using hyperbox fuzzy sets. A hyperbox defines a region of the d-dimensional pattern space by its min-point and its max-point, and all patterns contained within the hyperbox have full class membership. Learning in the GFMM neural network for classification consists of creating and adjusting hyperboxes in pattern space [41]. This procedure handles missing values in a way similar to the interval inputs, i.e., the missing attribute is modeled by a real-valued interval spanning the whole range of values. Another solution is a fuzzy rule-based classifier that consists of a set of rules for each possible category, where each rule can be decomposed into individual one-dimensional membership functions corresponding to the fuzzy sets (see [42] for more details). When an incomplete pattern has to be classified, the rules are evaluated using only the one-dimensional membership functions of the known attributes, which is computationally very efficient.
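The interval encoding shared by these fuzzy approaches is simple to state in code; the sketch below assumes features normalized to [0, 1], with a known value x mapped to the degenerate interval [x, x] and a missing value mapped to the full range [0, 1].

```python
import numpy as np

def to_intervals(x):
    """Encode a (possibly incomplete) pattern as per-feature intervals [lo, hi]:
    known values become [x, x]; missing values span the whole unit range."""
    lo = np.where(np.isnan(x), 0.0, x)
    hi = np.where(np.isnan(x), 1.0, x)
    return np.stack([lo, hi], axis=1)   # shape (d, 2)

x = np.array([0.3, np.nan, 0.7])
print(to_intervals(x))  # [[0.3, 0.3], [0.0, 1.0], [0.7, 0.7]]
```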
D. Support Vector Machines

In recent years, some works have extended the standard formulation of Support Vector Machines (SVMs) to handle missing values and uncertainty in the input data [43]–[45]. Bhattacharyya et al. [43] propose a mathematical programming method to deal with uncertainty in the observations of a classification problem. They extend the standard SVM classifier to handle missing values by replacing the linear classification constraints with probabilistic ones. This is done by modeling the missing variables as random variables by means of a Gaussian mixture [43]. Moreover, the model parameters (means and covariance matrices) are estimated by means of the EM algorithm. Pelckmans et al. [44] propose a modified risk function that takes into account the uncertainty of the predicted outputs when missing values are involved. This is done by incorporating a probabilistic model for the missing data. The method generalizes the approach of mean imputation in the linear case, and the proposed kernel machine reduces to the standard SVM when no input values are missing [44]. Bi and Zhang [45] develop a novel formulation of support vector classification that allows uncertainty in the input data by assuming that the inputs are subject to an additive noise that follows a certain distribution.

VII. CONCLUSIONS

Missing or incomplete data is a common drawback in many real-world applications of pattern classification. Data may contain unknown features for different reasons, e.g., sensor failures producing distorted or unmeasurable values, data occlusion by noise, or non-response in surveys. This work reviews several machine learning methods for incomplete pattern classification. Research on this topic is currently growing, and most of the solutions based on machine learning have proven useful and efficient. The correct choice of a missing data treatment is a hard and complex task, i.e., a method can work well in some problems while its results are not good in other applications. Thus, a previous analysis of the classification problem to be solved is very important in order to select the most suitable missing data treatment.

ACKNOWLEDGMENTS

This work is partially supported by the Ministerio de Educación y Ciencia (MEC) under grant TEC2006-13338/TCM, and by the Consejería de Educación y Cultura de Murcia (Fundación Séneca) under grant 03122/PI/05.
REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley-Interscience, 2000.
[2] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press, 1995.
[3] S. Watanabe, Pattern Recognition: Human and Mechanical. Wiley, 1985.
[4] M. K. Markey and A. Patel, "Impact of missing data in training artificial neural networks for computer-aided diagnosis," in International Conference on Machine Learning and Applications, December 2004, pp. 351–354.
[5] J. M. Jerez, I. Molina, J. L. Subirats, and L. Franco, "Missing data imputation in breast cancer prognosis," in BioMed'06: Proceedings of the 24th IASTED International Conference on Biomedical Engineering. Anaheim, CA, USA: ACTA Press, 2006, pp. 323–328.
[6] K. Lakshminarayan, S. A. Harp, and T. Samad, "Imputation of missing data in industrial databases," Applied Intelligence, vol. 11, pp. 259–275, 2004.
[7] L. N. Nguyen and W. T. Scherer, "Imputation techniques to account for missing data in support of intelligent transportation systems applications," University of Virginia, Tech. Rep. UVACTS-13-0-78, May 2003.
[8] M. Halatchev and L. Gruenwald, "Estimating missing values in related sensor data streams," in COMAD, J. R. Haritsa and T. M. Vijayaraman, Eds. Computer Society of India, 2005, pp. 83–94.
[9] G. DiCesare, "Imputation, estimation and missing data in finance," Ph.D. dissertation, University of Waterloo, 2006.
[10] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, 2nd ed. New Jersey, USA: John Wiley & Sons, 2002.
[11] J. L. Schafer, Analysis of Incomplete Multivariate Data. Florida, USA: Chapman & Hall, 1997.
[12] P. D. Allison, Missing Data. Thousand Oaks, California, USA: Sage University Papers Series on Quantitative Applications in the Social Sciences, 2001.
[13] P. K. Sharpe and R. J. Solly, "Dealing with missing values in neural network-based diagnostic systems," Neural Computing and Applications, vol. 3, no. 2, pp. 73–77, 1995.
[14] S. Nordbotten, "Neural network imputation applied to the Norwegian 1990 population census data," Journal of Official Statistics, vol. 12, pp. 385–401, 1996.
[15] S.-Y. Yoon and S.-Y. Lee, "Training algorithm with incomplete data for feed-forward neural networks," Neural Processing Letters, vol. 10, pp. 171–179, 1999.
[16] P. J. García-Laencina, J.-L. Sancho-Gómez, and A. R. Figueiras-Vidal, "Pattern classification with missing values using multitask learning," in International Joint Conference on Neural Networks (IJCNN 2006). Vancouver, BC, Canada: IEEE Computer Society, 16-21 July 2006, pp. 3594–3601.
[17] P. J. García-Laencina, A. R. Figueiras-Vidal, J. Serrano-García, and J. L. Sancho-Gómez, "Multi-task neural networks for dealing with missing inputs," in 2nd International Work-Conference on the Interplay between Natural and Artificial Computation (IWINAC 2007), LNCS, vol. 4527, 2007, pp. 282–291.
[18] H. Mallinson and A. Gammerman, "Support vector machine imputation methodology," University of London, Euredit Project WP 5.6, 2003.
[19] Y. Bengio and F. Gingras, "Recurrent neural networks for missing or asynchronous data," in NIPS, D. S. Touretzky, M. Mozer, and M. E. Hasselmo, Eds. MIT Press, 1995, pp. 395–401.
[20] S. Parveen, "Connectionist approaches to the deployment of prior knowledge for improving robustness in automatic speech recognition," Ph.D. dissertation, University of Sheffield, April 2003.
[21] S. Narayanan, R. M. II, J. L. Vian, J. Choi, M. El-Sharkawi, and B. B. Thompson, "Set constraint discovery: Missing sensor data restoration using auto-associative regression machines," in Proceedings of the 2002 International Joint Conference on Neural Networks, Honolulu, May 2002, pp. 2872–2877.
[22] M. Marseguerra and A. Zoia, "The autoassociative neural network in signal analysis: II. Application to on-line monitoring of a simulated BWR component," Annals of Nuclear Energy, vol. 32, no. 11, pp. 1207–1223, July 2005.
[23] T. Marwala and S. Chakraverty, "Fault classification in structures with incomplete measured data using autoassociative neural networks and genetic algorithm," Current Science, vol. 90, no. 4, pp. 542–548, 2006.
[24] F. V. Nelwamondo, S. Mohamed, and T. Marwala, "Missing data: A comparison of neural network and expectation maximization techniques," Current Science, vol. 93, no. 11, pp. 1514–1521, 2007.
[25] G. Batista and M. C. Monard, "A study of k-nearest neighbour as an imputation method," in HIS, ser. Frontiers in Artificial Intelligence and Applications, A. Abraham, J. R. del Solar, and M. Köppen, Eds., vol. 87. IOS Press, 2002, pp. 251–260.
[26] O. Troyanskaya, M. Cantor, O. Alter, G. Sherlock, P. Brown, D. Botstein, R. Tibshirani, T. Hastie, and R. Altman, "Missing value estimation methods for DNA microarrays," Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001.
[27] T. Kohonen, Self-Organizing Maps, 3rd ed. Springer, May 2006.
[28] T. Samad and S. A. Harp, "Self-organization with partial data," Network, vol. 3, pp. 205–212, 1992.
[29] F. Fessant and S. Midenet, "Self-organising map for data imputation and correction in surveys," Neural Computing and Applications, vol. 10, no. 4, pp. 300–310, 2002.
[30] P. Piela, "Introduction to self-organizing maps modelling for imputation - techniques and technology," Research in Official Statistics, vol. 2, pp. 5–19, 2002.
[31] Z. Ghahramani and M. I. Jordan, "Learning from incomplete data," Massachusetts Institute of Technology, Cambridge, MA, USA, Tech. Rep. AIM-1509, 1994.
[32] S. Ahmad and V. Tresp, "Some solutions to the missing feature problem in vision," in NIPS. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993, pp. 393–400.
[33] V. Tresp, S. Ahmad, and R. Neuneier, "Training neural networks with deficient data," in NIPS, J. D. Cowan, G. Tesauro, and J. Alspector, Eds. Morgan Kaufmann, 1993, pp. 128–135.
[34] V. Tresp, R. Neuneier, and S. Ahmad, "Efficient methods for dealing with missing data in supervised learning," in NIPS, G. Tesauro, D. S. Touretzky, and T. K. Leen, Eds. MIT Press, 1994, pp. 689–696.
[35] D. Williams, X. Liao, Y. Xue, L. Carin, and B. Krishnapuram, "On classification with incomplete data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 427–436, 2007.
[36] S. Krause and R. Polikar, "An ensemble of classifiers for the missing feature problem," in International Joint Conference on Neural Networks (IJCNN 2003), vol. 1, Portland, USA, 2003, pp. 553–558.
[37] P. Juszczak and R. P. W. Duin, "Combining one-class classifiers to classify missing data," in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, F. Roli, J. Kittler, and T. Windeatt, Eds., vol. 3077. Springer, 2004, pp. 92–101.
[38] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, January 1993.
[39] G. I. Webb, "The problem of missing values in decision tree grafting," in AI '98: Selected Papers from the 11th Australian Joint Conference on Artificial Intelligence on Advanced Topics in Artificial Intelligence. London, UK: Springer-Verlag, 1998, pp. 273–283.
[40] H. Ishibuchi and K. Moriola, "Classification of fuzzy input patterns by neural networks," in Proceedings of the IEEE International Conference on Neural Networks, Perth, WA, Australia, 1995, pp. 3118–3123.
[41] B. Gabrys, "Neuro-fuzzy approach to processing inputs with missing values in pattern recognition problems," International Journal of Approximate Reasoning, vol. 30, no. 3, pp. 149–179, 2002.
[42] M. R. Berthold and K.-P. Huber, "Missing values and learning of fuzzy rules," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 171–178, 1998.
[43] C. Bhattacharyya, P. K. Shivaswamy, and A. J. Smola, "A second order cone programming formulation for classifying missing data," in NIPS, 2004.
[44] K. Pelckmans, J. D. Brabanter, J. A. K. Suykens, and B. D. Moor, "Handling missing values in support vector machine classifiers," Neural Networks, vol. 18, no. 5-6, pp. 684–692, 2005.
[45] J. Bi and T. Zhang, "Support vector classification with input data uncertainty," in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge, MA: MIT Press, 2005, pp. 161–168.