Handling Data Irregularities in Classification: Foundations, Trends, and Future Challenges

Swagatam Das^a, Shounak Datta^a, Bidyut B. Chaudhuri^b,∗

^a Electronics and Communication Sciences Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata-700 108, India.
^b Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata-700 108, India.

∗ Corresponding author. Email addresses: [email protected] (Swagatam Das), [email protected] (Shounak Datta), [email protected] (Bidyut B. Chaudhuri).

Abstract

Most traditional pattern classifiers assume their input data to be well-behaved in terms of similar underlying class distributions, balanced class sizes, the presence of a full set of observed features in all data instances, and so on. Practical datasets, however, exhibit various forms of irregularities that are, very often, sufficient to confuse a classifier, thus degrading its ability to learn from the data. In this article, we provide a bird's eye view of such data irregularities, beginning with a taxonomy and characterization of various distribution-based and feature-based irregularities. Subsequently, we discuss the notable and recent approaches that have been taken to make existing stand-alone as well as ensemble classifiers robust against such irregularities. We also discuss the interrelations and co-occurrences of the data irregularities, including class imbalance, small disjuncts, class skew, missing features, and absent (non-existing or undefined) features. Finally, we uncover a number of interesting future research avenues that are equally contextual with respect to the regular as well as deep machine learning paradigms.

Keywords: Data Irregularities, Class Imbalance, Small Disjuncts, Class-Distribution Skew, Missing Features, Absent Features

1. Introduction

We are living in a time when machine learning is continually transforming the daily lives of people from nearly all social strata, and has become more of a household term than a mere piece of research jargon. Thanks to machine learning, over the past few years we have witnessed a quantum leap in the quality of several everyday technologies: for example, the speech recognition functions in our smart phones, machine translation and other forms of language processing (Amazon's Alexa, Apple's Siri, Microsoft's Cortana, just to
name a few), image recognition and scene interpretation based applications, and so on. Most of these machine learning based systems include an efficient classifier. Given a set of labeled examples (the training dataset) $S = \{(\mathbf{x}_i, c_i)\,|\,\mathbf{x}_i \in P \subset \mathbb{R}^d;\ c_i \in \mathcal{C} = \{1, 2, \cdots, C\}\}$, consisting of data points $\mathbf{x}_i$ in the training set $P$ and their corresponding labels $c_i$, the classification problem is to identify the mapping $f : \mathbb{R}^d \to \mathcal{C}$ such that $f(\mathbf{x}_i) = c_i\ \forall\, \mathbf{x}_i \in P$. Then, the label for a new data point $\mathbf{y} \in Q \subset \mathbb{R}^d$ ($Q$ being the test set) can be inferred to be $f(\mathbf{y})$.

To cope with the dramatic increase in volume and complexity of real-world data, and owing to the no free lunch theorem in machine learning [1], a significant proportion of machine learning research over the last seven decades has been devoted to the design of robust, efficient, and adaptive classifiers, as well as to finding the most useful set of features (variables) to represent each object or data point that is to be classified. Such research spans from the celebrated works of R. A. Fisher in the 1930s [2] to the latest innovations in deep learning [3]. The performance of most well-known classifiers can degrade considerably if the data to be handled (as well as the training examples) contain irregularities of various types. By data irregularity we essentially mean situations where the distribution of data points, the sampling of the data space for generating the training set, or the features describing each data point deviate from the ideal, being biased, skewed, incomplete, and/or misleading. Traditionally, classification algorithms make a few assumptions about the training data [4], such as the following.

1. All the classes are equally represented.
2. All the sub-concepts within a given class are equally represented.
3. All the classes have similar class-conditional distributions.
4. The values of all the features are defined for all the data instances in the dataset.
5. The values of all the features are known for all the data instances in the dataset.

Violations of such ideal conditions, which hinder the normal learning process of a classifier, are categorized as data irregularities. Violation of each of these assumptions corresponds to a well-known learning problem: violations of assumptions 1-5, listed above, respectively give rise to class imbalance, small disjuncts, class distribution skew, absent features, and missing features. However, more than one of these assumptions may be violated together by a given dataset. Moreover, traditional classifiers are often sensitive to violations of more than one of these assumptions, as itemized below.

• Max-margin Classifiers - sensitive to class imbalance, small disjuncts, class distribution skew, absent features, missing features.

• Neural Networks - sensitive to class imbalance, small disjuncts, absent features, missing features.

• k-Nearest Neighbours - sensitive to class imbalance, small disjuncts, absent features, missing features; immune to class distribution skew, as it does not make any assumptions regarding the class-conditional distributions.

• Bayesian Inference - sensitive to class imbalance, small disjuncts, class distribution skew, absent features, missing features.

• Decision Trees - sensitive to class imbalance, small disjuncts, class distribution skew; inherently immune to feature missingness, as branching is based only on the observed features.

Therefore, it is important to study the interrelations between these problems. It is easy to see that assumptions 1-3 are concerned with the distributions of the classes in the dataset, while assumptions 4-5 are concerned with the features of the data. Assumptions 1-3 are often observed to be violated together by datasets. These assumptions are interrelated in the sense that the violation of one may have important effects on that of another. For example, the long-standing problem of cost tuning for class imbalanced learning occurs as a side-effect of the small disjuncts and/or class distribution skew problems. Hence, it is important to realize that assumptions 1-3 (and violations thereof) are different aspects of the same phenomenon, together referred to as the distribution-based data irregularities. Similarly, violations of assumptions 4-5 can be collectively referred to as the feature-based data irregularities. In essence, data irregularities are a collection of peculiarities of datasets that are likely to render traditional classifiers either biased or even inapplicable.

We would also like to mention here that label noise, when present in a dataset, is likely to make learning difficult for the classifier. Hence, much research has been undertaken to devise noise-immune classifier variants (see, for example, [5] and the references therein). However, noise (assuming that it equally affects all the classes), due to its inherently random nature, is neither likely to bias the classifier towards a particular class nor to hinder the direct application of the classifier to the data. Moreover, most modern classifiers can learn effectively in the presence of noise, without having to undergo major changes. Thus, we refrain from placing noise (label or otherwise) under the umbrella of data irregularities.

Before proceeding further, let us dwell a bit more on the issue of data irregularity. In Figure 1, we show a taxonomy of the different types of data irregularities. Class imbalance is a very common form of distribution-based data irregularity, where one or more classes remain under-represented in the training set. This can happen when the representative examples of some classes appear more frequently than those of the others. For example, in an automated credit card fraud detection system, the number of representatives of fraudulent applications is significantly smaller than that of genuine applications, although the system is expected to stress classifying the fraudulent applications correctly, thus making the 'minority' class very important.

Figure 1: A taxonomy for different kinds of data irregularities

In the past few years, several nice surveys have been written on the classification of imbalanced datasets and related areas; see, for example, works like [6, 7, 8, 9, 10, 11]. The small disjunct problem occurs when there are small (i.e. under-represented) sub-concepts within classes. Having originally arisen within the rule-based learning literature, small disjuncts were considered to correspond to rules covering a small group of data points [12]. However, outside the rule-based learning literature, small disjuncts are now understood to be sub-clusters within the classes. The problem of class distribution skew arises when the different classes possess grossly disparate class-conditional distributions [13]. Many classifiers, such as k-NN, SVM (Support Vector Machine), RBFNN (Radial Basis Function Neural Network), etc., assume some form of symmetry between the class distributions. This situation may pose serious problems, especially in the presence of overlap between the classes, as, depending on the local properties of the distributions, one of the classes will be over-regularized. Unlike imbalanced classification, there does not exist a large volume of work devoted to studying the performance of classifiers in the face of skewed class distributions, especially when the classes may have an equal number of representatives. The problems of small disjuncts and skewed class distributions can also arise while learning from imbalanced data, where they pose even more serious challenges to an imbalanced classifier. Sometimes, skewed class distributions can induce a locally imbalanced structure in the region of overlap between the classes. In Figure 2, we illustrate examples of the three kinds of distribution-based data irregularities.

Coming to the feature-based irregularities, while handling real-world data we very often face the problem of missing features, where unstructured missingness is caused by corruption of feature values due to noise, equipment malfunction, secrecy, transmission error, etc. [14]. Missing feature problems are usually handled by marginalization (discarding points with missing features), imputation (filling in missing feature values) [15], or the more recent dissimilarity-based schemes (such as the Penalized Dissimilarity Measure (PDM)) [16]; a minimal sketch of the two baseline strategies follows Table 1.

On the other hand, the problem of absent features is caused by structured missingness and occurs when some of the features are undefined for some of the data points, due to the very nature of those data points [17]. The existence of missingness may itself be a significant feature in these cases. In Table 1, we show a few examples of various data analysis scenarios and the related data irregularities.

Figure 2: Examples of distribution-based data irregularities for two-class classification problems: (a) class imbalanced data distribution; (b) small disjuncts (under-represented sub-concepts) within each class; (c) skewed class distribution; (d) legend. The ideal decision boundary corresponds to an optimal compromise between the two classes, while the actual decision boundary corresponds to the one likely to be learnt by a linear classifier.

Table 1: Examples of data irregularities in real-world applications

Scenario                         Type of Data Irregularity
Credit card fraud detection      class imbalance, class skew [18]
Breast cancer diagnosis          class imbalance, class skew, small disjuncts [19, 20]
Market segmentation              class imbalance, class skew [21]
Facial and emotion recognition   small disjuncts [22]
Survey data                      unstructured missingness [23]
Phylogeny problem                unstructured missingness [24]
Gene expression data             unstructured missingness [25]
Visual object recognition        structural missingness or absent features [17]
Software effort prediction       unstructured and structural missingness [26]
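To make the two baseline treatments of unstructured missingness concrete, the following is a minimal numpy sketch (the function names are ours, and simple mean imputation stands in for the many imputation variants surveyed in [15]):

```python
import numpy as np

def marginalize(X):
    """Marginalization: discard every instance with at least one missing feature."""
    return X[~np.isnan(X).any(axis=1)]

def mean_impute(X):
    """Imputation: fill each missing entry with the mean of its feature column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)        # per-feature means, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])
print(marginalize(X))    # only the fully observed first row survives
print(mean_impute(X))    # NaNs replaced by the column means 3.0 and 3.0
```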

Due to the independent status enjoyed in the literature by most of the data irregularities, we proceed to separately review the state-of-the-art for each of the problems, while also making important observations concerning the connections with the other problems, whenever relevant. Thereafter, we explicitly discuss the interrelations between each of these problems and the consequences thereof.

Thus, the current article will not only lead the reader to understand all of these problems in a common light, but will also pave the way for more effective joint solutions which exploit the interrelations between the problems. Finally, we uncover some important open research problems that need the urgent attention of researchers and practitioners in machine learning. The organization of the paper is as follows. In Section 2, we review the most recent developments in designing classifiers resilient to class imbalance. Besides reviewing the most prominent conventional approaches, we also focus on the very recent deep learning based and multi-objective approaches which are not yet covered in the existing surveys. Similarly, in the subsequent Sections 3 and 4, we discuss various approaches to handle small disjuncts and class distribution skew. Among the feature-based irregularities, classification approaches for data with missing and absent features are comprehensively outlined in Sections 5 and 6, respectively. Section 7 elaborates on the combination of two or more irregularities (like imbalanced data coming with missing features) and the methods to address them. Section 7 also presents some of the pertinent future research directions. Finally, the paper is concluded in Section 8.

2. Imbalanced Data Classification: Overview and Recent Approaches

2.1. Characterization of the Problem

Class imbalance refers to the problem where not all classes in a dataset are equally represented. There are three principal properties of imbalanced classification problems which together determine the difficulty of correctly classifying the minority class, viz. the Imbalance Ratio (IR), i.e. the ratio of the number of representatives from the majority class to that from the minority class, the overlap between the classes, and the size of the dataset. A high IR value indicates that a large fraction of the representatives belongs to the majority class. This generally poses a problem for the correct classification of the minority class, as the error to be minimized by the classifier is overwhelmed by the majority instances [27]. That said, high IR alone (for a sufficiently large dataset with no overlap between classes) may not pose a problem for classifiers which make use of error functions, such as the max-margin criterion, that are independent of the number of representatives from the classes. However, in the presence of overlap between the classes, even such classifiers tend to wrongly classify the minority class instances [28]. This is because the max-margin criterion can only be applied after regularizing the overlapping instances by using a regularization error which is sensitive to the fractions of representation from the classes. Hence, a combination of high IR and significant overlap between the classes generally results in a high misclassification rate for the minority class, irrespective of the type of classifier employed.
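To fix the notation, the IR of a labeled sample can be read directly off the label vector; the following minimal sketch (ours) follows the majority-to-minority convention used above:

```python
import numpy as np

def imbalance_ratio(y):
    """IR = (# majority class instances) / (# minority class instances)."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()

y = np.array([0] * 950 + [1] * 50)   # 950 majority vs. 50 minority labels
print(imbalance_ratio(y))             # 19.0
```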


If the size of the dataset is large enough, both classes are well-represented and hence the boundaries learned are likely to generalize well to new instances [29]. As the size of the dataset decreases, the generalization error increases, with the minority class suffering more compared to the majority class. When coupled with high IR, a small dataset size may mean that there is not enough representation from the minority class to enable proper learning, resulting in complete miscalibration for the said class. This has been termed the rare class problem by He and Garcia [6]. Such miscalibration over the minority class may also occur, despite the minority class not being rare, if there is substantial overlap between the classes. Therefore, the various combinations of the three factors noted above may give rise to a gamut of class imbalanced learning scenarios with varying degrees of difficulty. In Table 2, we summarize such scenarios in a structured form.

Table 2: Various types of class imbalanced cases (combinations of IR, overlap, and dataset size, with the nature of the learned decision)

Low IR, non-overlapping, small dataset: Both classes are generally correctly classified. New minority instances may be misclassified.
Low IR, non-overlapping, large dataset: Both classes are generally correctly classified. Good generalization to new instances.
Low IR, slight overlap, small dataset: Slight misclassification of minority training instances. Generalization to new minority instances is poorer.
Low IR, slight overlap, large dataset: Slight misclassification of minority instances. Similar performance on new instances.
Low IR, high overlap, small dataset: High misclassification from both classes. The learned boundary is unstable and aligned in favor of the majority class. Generalization to new minority instances is poorer than to new majority instances.
Low IR, high overlap, large dataset: High misclassification from both classes. Slightly better performance on majority instances. Similar generalization to new instances.
High IR, non-overlapping, small dataset: Performance varies with the type of classifier used. Max-margin classifiers are likely to properly classify both classes, while other classifiers like decision trees, k-Nearest Neighbor (k-NN), etc. are likely to misclassify some of the minority instances near the boundary. However, the boundary learned by a max-margin classifier may still misclassify some new minority instances.
High IR, non-overlapping, large dataset: Performance varies with the type of classifier used, as noted above. Generalization to new instances is superior to the above case.
High IR, slight overlap, small dataset: Most of the minority overlapping instances are misclassified. Some new minority instances from the adjoining non-overlapping areas are also misclassified.
High IR, slight overlap, large dataset: Most of the minority overlapping instances are misclassified. Similar performance on new instances.
High IR, high overlap, small dataset: Most of the minority instances are misclassified.
High IR, high overlap, large dataset: Most of the minority instances are misclassified.
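The interplay of these factors is easy to reproduce synthetically. The sketch below is our construction (the parameterization of overlap via the distance between class means is only one of many possibilities); it generates two-class Gaussian data whose difficulty can be dialed along the IR and overlap axes of Table 2:

```python
import numpy as np

def make_imbalanced_gaussians(n_min=50, ir=10.0, overlap=1.0, d=2, seed=0):
    """Two spherical Gaussian classes; `ir` controls the class-size ratio and
    `overlap` shrinks the separation between the class means."""
    rng = np.random.default_rng(seed)
    n_maj = int(ir * n_min)
    sep = 4.0 / max(overlap, 1e-6)    # larger `overlap` => closer class means
    X = np.vstack([rng.normal(0.0, 1.0, (n_maj, d)),
                   rng.normal(sep, 1.0, (n_min, d))])
    y = np.concatenate([np.zeros(n_maj, int), np.ones(n_min, int)])
    return X, y

# A "high IR, high overlap, small dataset" scenario from Table 2:
X, y = make_imbalanced_gaussians(n_min=20, ir=19.0, overlap=4.0)
```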


2.2. Traditional Approaches to Handle Class Imbalance

Krawczyk [11] distinguished three main directions of learning from imbalanced data, as follows:

• Data-level methods - which change the training set to compensate for the imbalanced distribution between the majority and minority classes, alongside removing the examples that may confuse the classifiers.

• Algorithm-level methods - which modify conventional learning algorithms to reduce bias towards the majority class.

• Hybrid methods - which combine the advantages of the two earlier groups.

In what follows, we also outline the approaches to handle imbalanced data distributions under these three heads.

2.2.1. Data-level Methods

The majority of these methods pre-process a dataset by under-sampling examples from the majority class or over-sampling examples from the minority class such that, finally, the numbers of labeled examples from the two classes become comparable. However, such under- or over-sampling may lead to the removal of better discriminating points from the majority class and the inclusion of meaningless new points in the minority class. Thus, improved extensions of these pre-processing techniques are being proposed regularly.

• Under-sampling techniques - A very simple and (consequently) popular under-sampling method is Random Under-Sampling (RUS) [30], which, as the name suggests, randomly discards examples from the majority classes until the effect of imbalance is significantly mitigated (a minimal sketch of RUS appears after this list). CoNN (Condensed Nearest Neighbor) [31] is another baseline under-sampling technique; it discards the majority class examples which are away from the decision boundary (using a 1-NN rule), since such examples can be considered less relevant for learning. Another popular baseline under-sampling technique is Tomek links (TL) [32], which, in contrast to CoNN, discards the noisy and borderline examples of the majority class. In fact, TL treats the borderline samples as unsafe, since slight changes in the decision boundary can cause them to be assigned to the wrong class. Yen and Lee [33] proposed a clustering-based under-sampling method to preserve the data distributions in the majority and minority classes after pre-processing. García and Herrera [34] introduced a set of efficient under-sampling methods based on Evolutionary Algorithms (EAs) for the selection of prototypes for the majority class, based on fitness functions related to classification accuracy and reduction rates. Among the more recent approaches, Wong et al. [35] put forward a fuzzy rule-based system to select the majority class examples for under-sampling on large imbalanced datasets. Ng et al. [36] proposed a diversified sensitivity-based under-sampling method where the majority class examples are clustered to capture the within-class distribution, thus enhancing the diversity of resampling; samples from clusters of both majority and minority classes are then selected on the basis of a stochastic sensitivity measure. Fu et al. [37] proposed a PCA (Principal Component Analysis) guided under-sampling method that also uses a comprehensive evaluation model to discard the redundant examples from the majority class while preserving the examples that best capture the majority class characteristics. Ha and Lee [38] presented a Genetic Algorithm (GA) based under-sampling technique which attempts to maximize the performance of a prototype classifier such that the prototypes minimize a loss function between the distributions of the original and under-sampled examples from the majority class. Devi et al. [39] extended the TL under-sampling procedure by incorporating the detection of outliers and of redundant and noisy instances having the least contribution to estimating accurate class labels. Recently, Bunkhumpornpat and Sinapiromsaran [40] used a cluster graph to determine a density function based on the distance along the shortest path between each example from the majority class and a minority-cluster "pseudo-centroid"; they thus proposed a new under-sampling algorithm that eliminates majority examples with low distances, as such examples can be negligible and can obscure the class boundary in the overlapping region.

• Over-sampling techniques - Arguably the most popular over-sampling technique is the Synthetic Minority Over-sampling Technique (SMOTE) [41], which randomly creates new minority class points by interpolating between the existing minority points and their neighbors (the sketch after this list also illustrates SMOTE's interpolation step). The intuition behind the construction of the synthetic samples in SMOTE is that over-sampling by repeating instances causes over-fitting by tightening the decision boundary. Instead, SMOTE creates similar examples which, to a learning algorithm, are not exact copies, thus softening the decision boundary as a result. However, SMOTE neglects the positions of the neighboring examples from the other classes while creating the synthetic examples. Some of its prominent extensions that attempt to mitigate the tendency to increase the overlap between the classes are Borderline-SMOTE [42], ADASYN [43], LN-SMOTE [44], and Safe-Level SMOTE [45]. Sáez et al. [46] recently undertook a detailed study of the over-sampling techniques for multi-class imbalanced datasets and also proposed a methodology for using over-sampling techniques on such classification problems that relies on extracted knowledge about the class and imbalance distribution types. More recently, Abdi and Hashemi [47] proposed a Mahalanobis distance-based over-sampling procedure. This technique, while generating the synthetic samples, keeps their Mahalanobis distance from the considered class mean at the same value as that of the other minority class examples. The method compensates for the risk of overlap between the classes by taking into account the mean of each class and generating synthetic samples in dense areas of the minority class regions. Bunkhumpornpat et al. [48] designed an over-sampling method called DBSMOTE, based on a density-based concept of clusters, to over-sample an arbitrarily shaped cluster discovered by DBSCAN (by generating synthetic instances along the shortest path from each minority example to a pseudo-centroid of the minority-class cluster).

• Hybrids between under-sampling and over-sampling - Hybrid approaches involving both over- and under-sampling have also been proposed, for example, by Ramentol et al. (over-sampling with SMOTE and cleaning based on rough set based editing) [49], Wang (under-sampling based on the distance of an example from the hyperplane generated by a trained SVM, combined with SMOTE) [50], Prachuabsupakij (k-means clustering based under-sampling of the majority class combined with SMOTE) [51], and Jian et al. (SMOTE to re-sample the SVs (Support Vectors) in the minority class and RUS to re-sample the NSVs (Non-Support Vectors) in the majority class) [52]. A slightly different approach is to over-sample the minority class while removing the noisy instances from both classes. For example, besides over-sampling the borderline minority points, the SMOTE-Iterative-Partitioning Filter (IPF), proposed in [53], also incorporates a noise filter to remove the noisy examples in the majority and minority classes. In a similar spirit, Kang et al. [54] integrated a k-NN noise filter with under-sampling of the majority classes. A combination of neighbor cleaning rule-based under-sampling and SMOTE was suggested in [55] to compensate for class imbalance in medical data analysis. A slightly different hybrid paradigm, called class switching, functions by removing points from the majority class and placing them in the minority class (the former step is akin to under-sampling and the latter to over-sampling) [56].
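As referenced in the first two items above, the following is a minimal numpy sketch of RUS and of the neighbor-interpolation step at the heart of SMOTE [41]; it is illustrative only (no edge-case handling, and k is assumed to be smaller than the minority class size):

```python
import numpy as np

def random_under_sample(X_maj, n_keep, rng):
    """RUS: retain a random subset of the majority class."""
    idx = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[idx]

def smote(X_min, n_new, k=5, rng=None):
    """SMOTE-style interpolation: each synthetic point lies on the segment
    joining a random minority point and one of its k nearest minority neighbours."""
    rng = rng or np.random.default_rng(0)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # exclude self-neighbourhood
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest minority neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                # pick a minority seed point
        j = rng.choice(neighbours[i])               # pick one of its neighbours
        gap = rng.random()                          # uniform interpolation weight
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.default_rng(42)
X_maj, X_min = rng.normal(0, 1, (900, 2)), rng.normal(3, 1, (100, 2))
X_maj_red = random_under_sample(X_maj, 300, rng)            # 900 -> 300 majority
X_min_aug = np.vstack([X_min, smote(X_min, 200, rng=rng)])  # 100 -> 300 minority
```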

2.2.2. Algorithm-level methods

These methods attempt to reduce the bias of existing learning methods towards the majority classes. Below we outline a few representative subsets of these methods.

• Cost-sensitive learning - In these methods, the minority class is assigned a higher cost of misclassification compared to the majority class [57, 58, 59, 60, 61, 62]. Depending on the type of classifier used, the cost set may either be absolute (the exact weights for all classes must be specified, C degrees of freedom, e.g. SVMs) or relative (only the ratios of the class-wise costs must be specified, (C − 1) degrees of freedom, e.g. the kNN classifier). Since the parameter tuning space increases exponentially with the number of classes in the dataset, such approaches are usually not suitable for multi-class imbalanced cases. Among the notable recent works in this direction, taking a cue from the analysis of the constraints on the cost-sensitive parameters, Cheng et al. [63] proposed a Large Cost-Sensitive margin Distribution Machine (LCSDM) where, to enlarge the margin distribution of the minority class, the cost-sensitive parameters of the margin mean of the minority class are enhanced and the cost-sensitive parameters of the margin variance of the minority class are reduced, while increasing the misclassification penalty of the minority class. In [64], a class-specific cost regulation scheme and its kernel extensions were investigated for imbalanced classification with Extreme Learning Machine (ELM) classifiers. For the cost-sensitive boosting algorithms, Nikolaou et al. [65] recently made an in-depth investigation and presented some unifying views, identifying 15 distinct algorithmic variants from the span of 1997 to 2016 on the basis of four theoretical frameworks, in the context of imbalanced learning problems. Unlike the conventional cost-sensitive learning methods, Ohsaki et al. [66] proposed a confusion-matrix based kernel logistic regression classifier which can directly enhance the values of the evaluation criteria by including them in an objective function, without any user intervention. The objective function actually equals the harmonic mean of various evaluation criteria derived from a confusion matrix.

• Boundary shifting methods - Such methods attempt to artificially move the decision boundary towards the majority class by using disparate costs. This is distinct from simple cost set tuning in that the decision boundary is modified post-learning, using costs. For example, in the z-SVM classifier [67], the traditional SVM decision function

$$f(\mathbf{x}) = \sum_{i} \alpha_i c_i K(\mathbf{x}, \mathbf{x}_i) + b,$$

with $\alpha_i$ and $c_i$ respectively being the Lagrange multiplier and the class label corresponding to the data point $\mathbf{x}_i$, $K(\cdot, \cdot)$ being the concerned kernel function, and $b$ being the bias term, is modified to attach greater weight to the minority class terms as

$$f_z(\mathbf{x}) = \sum_{i} z\,\big(\alpha_i c_i K(\mathbf{x}, \mathbf{x}_i)\big) + \sum_{k} \alpha_k c_k K(\mathbf{x}, \mathbf{x}_k) + b,$$

where $i$ runs over the minority point indices and $k$ over the majority class indices (a minimal sketch of this weighted decision function follows this list). The same idea can be implemented in the kNN classifier by assigning greater weight to the neighbors from the minority class. These strategies suffer from the same drawback as the cost-sensitive learning strategies, in that they require cost tuning. More recently, Datta and Das [68] proposed a Near-Bayesian Support Vector Machine (NBSVM) by combining the philosophies of decision boundary shift and unequal regularization costs. They also extended this approach to the multi-class scenario and adapted it for cases with unequal misclassification costs for the different classes.

• Single class learning - In these methods, the classifier is trained on the minority class only and the majority class data is ignored [69]. These methods work much like a novelty-detection mechanism. They also go by the name of one-class classifiers and tend to recognize one specific class, known as the target concept, from a larger set of examples. Krawczyk et al. [70] proposed a weighted one-class SVM classifier which assigns weights to minority class examples depending on their type (safe/borderline/rare/outlier). A similar approach was taken in [71] to classify imbalanced breast cancer malignancy data. However, these methods are only suitable if the imbalance ratio is very high, or if a single minority class is present in the dataset.

• Active learning - Active learning forms a part of the semi-supervised machine learning paradigm where the learner is allowed to interact with the user (or some equivalent source of information) to obtain the desired outputs at new data points, under the assumption that labeling can be expensive for large unlabelled datasets. Active learning strategies have been used for imbalanced classification, laying more stress on the misclassified minority class instances; see, for example, works like [72, 73, 74, 75]. The classifier is first trained using the entire dataset, and then the misclassified minority class instances are used to update the classifier iteratively. However, various issues like cost set tuning or the extent of resampling may crop up here as well. Ferdowsi et al. [76] proposed an online active learning framework that alternates between different example selection strategies. Instead of the serial mode of active learning, where instances are selected one at a time, You et al. [77] suggested a batch-mode active learning framework which aims at forming a diverse set of queries. The authors also incorporated a heuristic approach to address multi-class imbalanced distributions and applied the method to the semantic understanding of images and videos. Guo and Zhang [78] combined active learning with SVM to develop a multi-class imbalanced classifier, which starts by selecting a number of the most informative unlabeled data points through active learning. The method then determines the labels of these samples by computing the differences between the labeled and unlabeled data points. The active learning component also helps to detect uncertain, rejected, and compatible samples. Zhang et al. [79] proposed an active learning-based multiple noisy labeling framework for imbalanced classification of crowdsourcing data, where both labeled and unlabeled instances can be selected to obtain more labels. The authors also combined the label integration and instance selection procedures into a single method. Zhang et al. [80] studied different active querying strategies for classification of imbalanced streaming data under a query budget. A GP (Genetic Programming, where small computer programs are encoded as genes and then evolved using the principles of evolutionary computation) based active learning framework was proposed in [81] for classification of imbalanced streaming data.

• Kernel perturbation techniques - These methods are specific to classifiers which learn using the well-known Gaussian RBF (Radial Basis Function) kernel, such as SVMs with RBF kernels. The idea is to perturb the kernel so that the minority class instances are drawn close together, mitigating the effect of imbalance by according greater density to the minority class. However, these methods may require absolute resolution coefficient tuning, much like absolute cost set tuning. The approach of Li et al. [82] suggests that kernel perturbation techniques like those of [83, 84, 85, 86, 87, 88, 89, 90] can be used to diversify SVMs for ensemble learning. Incidentally, there is no guarantee that the performance improvements achieved by these kernel perturbation methods will be monotonic. Therefore, for such techniques, it is more useful to combine (using ensemble approaches like boosting) the SVMs trained in the different iterations, rather than to only use the SVM trained in the final iteration. Furthermore, since the existing perturbation techniques either perturb the entire kernel [82, 83, 87] or apply class-specific perturbation [86, 87, 88, 89, 90] aimed at handling class imbalance, they are not equipped to handle local data irregularities like small disjuncts.

• Discriminative regression based supervised learning models - In the recent works by Cheng et al. [91] and Peng et al. [92], the authors proposed new supervised learning models which, though not directly designed to handle class imbalance, can offer sufficient robustness to disparate sizes of the training classes. In particular, these papers attempt to approximate the test example with a potentially nonlinear combination of the training examples, and then predict the label of the test example based on the approximation coefficients. These approaches can be considered improved k-NN classifiers, in that a standard k-NN uses 0/1 weights for the training examples, while these approaches use more refined weights. In addition, Peng et al. [92] use group-level sparse regularizations to penalize the approximation to the test example, and the sizes of the groups do not appear to affect the classification performance much. Thus, the approaches described in [91] and [92] are less afflicted by the class imbalance issue.
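The following sketch instantiates the boundary-shifting decision function f_z(x) given above for an already-trained RBF-kernel SVM; the multipliers alpha, labels c, and bias b are assumed to come from a standard SVM solver, and the magnification factor z would be tuned on validation data, as in [67]:

```python
import numpy as np

def rbf(x, xi, gamma=1.0):
    """Gaussian RBF kernel K(x, x_i)."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def f_z(x, X_sv, c, alpha, b, minority_mask, z=2.0, gamma=1.0):
    """Boundary-shifted decision function f_z(x): the minority-class terms of
    the usual SVM kernel expansion are magnified by z > 1, pushing the
    decision boundary towards the majority class."""
    k = np.array([rbf(x, xi, gamma) for xi in X_sv])
    terms = alpha * c * k            # alpha_i * c_i * K(x, x_i) for every SV
    terms[minority_mask] *= z        # attach greater weight to minority terms
    return float(terms.sum() + b)    # classify by the sign of the returned value
```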

2.2.3. Hybrid methods

These methods mostly combine various sampling-based approaches with the algorithm-level methods, aiming to reinforce their advantages while curbing their downsides. Below we outline some typical ways of hybridization, as applied to imbalanced data classification.

• Sampling and data balancing approaches with classifier ensembles - These approaches integrate various under-sampling, over-sampling, and other data balancing techniques with classifier ensembles (mostly bagging and boosting). Some classic examples are SMOTEBoost [93], JOUS-Boost [94], RUSBoost [95], RAMOBoost [96], UnderBagging [97], and SMOTEBagging [98]. More recent approaches include an evolutionary ensemble construction approach based on RUSBoost [99], generalized imbalance ratio based ensemble (over- and under-) sampling approaches [100], a parallel selective sampling approach to select examples from the majority class to reduce imbalance in large datasets [101], bootstrap resampling (to generate synthetic data points near class boundaries) combined with an AdaBoost neural network for imbalanced classification [102], RUS and SMOTE adapted for imbalanced big data using MapReduce, aided by the Random Forest (RF) classifier [103], and clustering and random splitting based data balancing with an ensemble classifier with distance-based weighting [104].

• Sampling based approaches with cost-sensitive learning - In such methods, the imbalanced data is first pre-processed using under- or over-sampling, and then a classifier with cost-sensitive tuning is used. Some notable examples of this trend can be found in the works of Akbani et al. [105] (SMOTE combined with cost-sensitive SVM), López et al. [60] (over-sampling and under-sampling combined with cost-sensitive learning classifiers like C4.5 and SVM), and Hsu et al. [106] (pre-processing the data with under-sampling followed by classification with a cost-sensitive random forest algorithm). Through empirical experiments on a number of datasets, López et al. [60] observed that the performance of the sampling techniques is not significantly different from that of the cost-sensitive approaches over a wide variety of imbalanced datasets. The hybrid approaches are fairly competitive against the individual ones only in some isolated cases, thus necessitating more work on designing synergistic techniques where re-sampling and cost-sensitive learning can significantly reinforce each other.

2.3. Multi-objective Optimization Approach to Imbalanced Learning

Multi-objective Optimization (MO) approaches [107] form an important part of multi-criteria decision making; they attempt to find an optimal trade-off among two or more objectives (defined on a common domain) which are usually in conflict. An MO algorithm yields a (possibly infinite) number of Pareto optimal solutions. A solution is called nondominated or Pareto optimal if none of the objective functions can be further improved in value without degrading some of the other objective values (a minimal sketch of this dominance test follows the list below). Recently, there have been some attempts to address imbalanced data classification problems using MO approaches. Below we briefly outline these approaches.


• Bi-objective optimization of accuracy and gmean: This approach [108] tries to simultaneously maximize both the overall classification accuracy acc and the geometric mean of the class-wise accuracies gmean. Evidently, for sufficiently imbalanced datasets, acc and gmean will be in conflict, as gmean increases only if the accuracies of all classes increase while remaining balanced. Irrespective of the number of classes, the number of objectives here is restricted to 2.

• Bi-objective optimization of precision and recall: This approach was presented in [109] and, in its current form, is only applicable to two-class problems. It is likely to perform badly on highly imbalanced datasets, as the precision index starts favoring the majority class with an increase in IR.

• Multi-objective optimization of the accuracies of each class, i.e. the i-th objective f_i = acc on class C_i: This approach was taken in works like [110, 111, 112]; the idea is to simultaneously maximize (minimize) the individual class-wise accuracies (misclassification errors). Although the number of objectives here scales linearly with the number of classes in the dataset, this approach is likely to be more accurate, as it lends equal importance to all classes.

• Multi-objective GP: Bhowan et al. [113] proposed a Multi-Objective GP (MOGP) approach for classifying imbalanced data, where the accuracies of the minority and majority classes are treated as the conflicting objectives. The authors also used two measures in the fitness function, namely Negative Correlation Learning (NCL) and Pairwise Failure Crediting (PFC), to enhance the diversity of the evolved ensembles. In a follow-up work, the same authors [114] combined the MOGP approach with an ensemble selection procedure that employs a GP to automatically pick the best individuals for the ensemble.
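Whichever objectives are chosen, the common computational core is the dominance test defined in the introduction to this subsection. The following is a minimal sketch (ours) that filters a set of candidate classifiers, each scored on two maximization objectives such as (acc, gmean), down to its Pareto front:

```python
import numpy as np

def pareto_front(F):
    """Indices of the nondominated rows of F; each row holds the objective
    values (to be maximized) of one candidate, e.g. columns (acc, gmean)."""
    n = len(F)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            # j dominates i: no worse in every objective, better in at least one
            if i != j and np.all(F[j] >= F[i]) and np.any(F[j] > F[i]):
                keep[i] = False
                break
    return np.flatnonzero(keep)

F = np.array([[0.95, 0.60], [0.90, 0.85], [0.88, 0.80], [0.97, 0.55]])
print(pareto_front(F))   # [0 1 3]: candidate 2 is dominated by candidate 1
```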

2.4. Deep Learning Approaches to Imbalanced Classification

Unlike the traditional task-specific learning algorithms, deep learning methods stem from the concept of learning data representations. They gained huge momentum from 2006 onwards, when a fast learning algorithm for deep belief networks was developed. A brief but thorough survey of the well-known deep learning architectures, including the autoencoder, Convolutional Neural Network (CNN), deep belief network, and restricted Boltzmann machine, can be found in [115]. In this section, we outline some recent deep learning approaches to the problem of class imbalance in challenging datasets. To the best of our knowledge, the earlier surveys on imbalanced classification did not cover these approaches. The main approaches to handle class imbalance in a deep learning framework can be itemized as follows:

• Learning a large margin representation of the data to minimize the effect of class imbalance: To mitigate the problem of unbalanced class distributions in vision data, Huang et al. [116] took a deep representation approach which assumes that the minority class examples hardly exhibit a high degree of visual variation. Based on instance selection using a quintuplet sampling scheme and the corresponding triple-header hinge loss, the authors trained a CNN to learn an embedding which could generate features that significantly discriminate between classes. The quintuplet loss imposes a strict constraint to reduce the effect of imbalance in the local proximities of the data points.

• Cost-sensitive learning: The idea of introducing class-specific costs in deep neural networks has recently been explored (a minimal sketch of such a cost-weighted loss follows this list). Replacing the conventional softmax with a regression loss, Chung et al. [117] designed a loss function for training cost-sensitive deep nets. The same loss function can also be used during the pre-training stage to conduct cost-sensitive feature extraction more effectively. Loss functions which impose an equal emphasis on the misclassification of examples from both minority and majority classes were proposed in works like [118, 119]. Unlike Chung et al. [117], Khan et al. extended the conventional cost functions (based on surrogate losses like the SVM hinge loss, MSE loss, and cross-entropy loss) for CNNs in a cost-sensitive framework. The resulting cost-sensitive deep neural network is able to learn robust feature representations for both majority and minority classes and is directly applicable to multi-class scenarios. Zhang et al. proposed a cost-sensitive Deep Belief Network (DBN) with unequal misclassification costs between the classes; a powerful continuous-parameter evolutionary algorithm, well known as DE (Differential Evolution), was used to optimize the cost matrix of the DBN. A cost-sensitive CNN architecture was proposed in [120] for vehicle localization and categorization in high-resolution aerial images. In this approach, the loss function for training the CNN demarcates the back-propagated values for the side network of the proposed architecture between the majority and minority classes.

• Bootstrapping: Yan et al. [121] proposed a CNN architecture integrated with a bootstrapping-based resampling approach for multimedia data classification. In this approach, the data is partitioned into balanced subsets and a learner is separately trained on each of the subsets. Finally, the results are arrived at by a kind of voting among the trained deep learners.
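For the cost-sensitive family above, the essential device is a loss whose per-example contribution is scaled by a class-dependent cost. The following numpy sketch is ours (the specific losses of [117, 118, 119] differ in detail); it shows the idea for cross-entropy:

```python
import numpy as np

def cost_weighted_cross_entropy(probs, y, class_costs):
    """Cross-entropy in which each example's log-loss is scaled by the
    misclassification cost of its true class, so gradients from minority
    examples are not drowned out by the majority class."""
    eps = 1e-12                                   # guards against log(0)
    nll = -np.log(probs[np.arange(len(y)), y] + eps)
    return float(np.mean(class_costs[y] * nll))

probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])  # softmax outputs
y = np.array([0, 1, 1])                                  # class 1 is the minority
costs = np.array([1.0, 5.0])                             # penalize minority errors more
print(cost_weighted_cross_entropy(probs, y, costs))
```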

2.5. Summary of existing empirical results on Imbalanced Classification

To acquaint the reader with the existing results reported in the current literature on imbalanced classification, we present a summary of some empirical results in Table 3. The reported results are quoted from recent and/or popular articles, chosen so as to cover all the major approaches currently used to handle class imbalance. The quoted results are all concerned with two-class datasets and do not cover multi-class tasks, for the sake of simplicity and consistency. Since achieving good classification accuracy on both the majority and the minority class is important for class imbalanced classification tasks, the results are presented in terms of the Geometric Mean (GMean) of the individual class-wise accuracies and the Area Under the ROC Curve (AUC), which essentially amounts to the arithmetic mean of the same quantities.

Since the quoted articles do not all make use of the same datasets, the reported figures should not be compared with each other. However, the results are individually relevant in that they indicate the level of performance that new research in the field of class imbalanced classification must strive to outdo.
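For reference, the two summary measures of Table 3 can be computed as in the minimal sketch below (for the AUC proper one integrates over all operating points, whereas the balanced accuracy here is the single-operating-point analogue mentioned above):

```python
import numpy as np

def classwise_accuracies(y_true, y_pred):
    """Accuracy of the predictions restricted to each class in turn."""
    classes = np.unique(y_true)
    return np.array([np.mean(y_pred[y_true == c] == c) for c in classes])

def gmean(y_true, y_pred):
    """Geometric mean of the class-wise accuracies."""
    acc = classwise_accuracies(y_true, y_pred)
    return float(np.prod(acc) ** (1.0 / len(acc)))

def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of the class-wise accuracies."""
    return float(np.mean(classwise_accuracies(y_true, y_pred)))

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 95 + [1, 1, 0, 0, 0])  # 2 of the 5 minority points correct
print(gmean(y_true, y_pred))              # ~0.632
print(balanced_accuracy(y_true, y_pred))  # 0.7
```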

Table 3: Summary of existing empirical results for Imbalanced Classification

Approach | Reference | Base Classifier | Datasets used | Average GMean† | Average AUC‡
Oversampling | He et al. [43] | unspecified decision trees | 5 datasets from the University of California, Irvine (UCI) repository [122] | 0.9010 | 0.9896
Under-sampling | Ng et al. [36] | RBF-Net [123] | 14 UCI datasets | 0.8047 | N/A*
Cost-sensitive learning | Xiao et al. [64] | ELM [124] | 17 UCI datasets | 0.8999 | N/A
Boundary shifting | Datta and Das [68] | SVM [125] | 10 UCI datasets | 0.8884 | N/A
Kernel scaling | Wu and Chang [86] | SVM | 6 UCI datasets | 0.8772 | 0.9670
Active Learning | Ertekin et al. [72] | LASVM [126] | 3 datasets from UCI and CiteSeer [127], USPS [128] and MNIST [129] datasets | 0.7858 | 0.8358
Single Class Classifiers | Krawczyk et al. [69] | SVM | 10 UCI datasets | 0.7819 | N/A
Hybrid of Over- and Under-sampling | Ramentol et al. [49] | C4.5 [130] | 44 datasets from the KEEL repository [131] | N/A | 0.8402
Hybrid of Over-sampling and Cost-sensitive learning | Akbani et al. [105] | SVM | 10 UCI datasets | 0.9226 | N/A
Hybrid of Over-sampling and Boosting | Chen et al. [96] | MLP [132] | 19 datasets from UCI and ELENA Project [133] | 0.8488 | 0.9333
Hybrid of Under-sampling and Boosting | Seiffert et al. [95] | C4.5 | 15 UCI datasets | N/A | 0.8729
Multi-Objective Optimization (MOO) with classification accuracy and GMean | Soda [108] | SVM | 10 UCI datasets | 0.7088 | N/A
MOO with Precision and Recall | Chira and Lemnaru [109] | C4.5 | 8 UCI datasets | N/A | 0.7778
MOO with individual class-wise accuracies | Bhowan et al. [113] | Genetic programming based classifiers | 6 UCI datasets | 0.8217 | 0.8233
Learning a Large-margin representation of the data using Deep Learning | Huang et al. [116] | ConvNet [129] | MNIST-rot-back-image [134] and CelebA [135] | N/A | 0.7684
Cost-sensitive Deep Learning | Khan et al. [136] | ConvNet | MNIST [129], CIFAR-100 [137], CalTech-101 [138], DIL [139], MLC [140], and MIT-67 [141] | 0.8285 | N/A

† Geometric mean of individual class-wise accuracies. ‡ Area under the ROC curve. * Not available in the referenced article.

3. Small Disjunct Problem: Overview and Extant Approaches

It is evident from the discussion in Section 2.1 that the two principal factors contributing to the difficulty of proper learning in the presence of class imbalance are 1) the rarity of data (arising from small dataset size and high IR) and 2) the overlap between classes. Irrespective of whether there exists global imbalance between the classes in a dataset, certain factors may give rise to the rarity of some of the sub-concepts within a class, or to the dominance of one of the classes in the region of overlap. The former situation gives rise to the well-known small disjuncts problem in classification and is typical of cases where the classes are constituted of smaller sub-concepts (either due to the class-conditional distributions consisting of smaller sub-clusters, or due to the rules learned by rule-based learners corresponding to few instances). The small disjunct problem was formally introduced by Holte et al. [12]. Small disjuncts, as the name suggests, refer to the rules which account for a small fraction of the data points in a rule-based learner. Alternatively, small disjuncts are also understood to be under-represented sub-concepts within classes [6, 142]. Holte et al. [12] observed that the small disjuncts, though individually covering only a small fraction of the data points, collectively covered a large portion of the dataset (for rule-based learners). Moreover, due to the small amount of data available for each of the small disjuncts, the sub-concept definitions learned were not specific enough. This resulted in a disproportionately large fraction of errors occurring due to the small disjuncts. Inspired by the taxonomy presented in [143], we categorize the existing research on small disjuncts into two broad subsets.

3.1. Understanding and measuring the effect of small disjuncts in learning

• Identification of small disjuncts - Rules corresponding to small disjuncts are usually identified as the rules pertaining to a small number of training instances (below a certain threshold) [12, 144, 145, 146]. Ting [146] suggested additional methods of identifying small disjuncts based on the relative size of the disjuncts (as opposed to the absolute size) and on the error rate of the disjuncts. There do not seem to exist any methods to identify small disjuncts for classifiers which do not make use of rules (such as SVM, kNN, etc.).

• Measuring the performance on small disjuncts - Quinlan [147] described a new measure to improve the accuracy estimates of small disjuncts by taking into account the prior probabilities of the classes in

the vicinity of the disjunct. The vicinity of the disjunct is taken to be the set of data instances which violate at most one of the conditions pertaining to the concerned disjunct. Weiss [144] proposed the Error Concentration (EC) index to measure the classification performance achieved by a classifier on the small disjuncts (a minimal sketch of this computation follows the list).

EC is the area under the curve formed by plotting the percentage of the errors suffered by a disjunct against the percentage of instances covered by it (the disjuncts being sorted in increasing order of size). A high EC value indicates that most of the errors are concentrated within the small disjuncts, whereas a low EC suggests that most of the errors occur within the larger disjuncts. However, one should note that EC is not an absolute measure of performance and must be interpreted in conjunction with other indices such as accuracy, Area Under the ROC curve (AUROC), etc.

• Noise and small disjuncts - Danyluk and Provost [148] studied NYNEX MAX, an expert system that diagnoses the local loop in a telephone network, and found that a large number of small disjuncts were critical for the said system. They also investigated the effects of the different types of noise peculiar to their problem on the small disjuncts, and concluded that the presence of the different sources of noise for their particular problem makes it quite hard to properly learn the small disjuncts. Theirs was the first in a series of studies on the effects of noise (attribute noise as well as class/label noise) on small disjuncts [20, 143, 149, 144]. All these studies agree that the presence of attribute noise, class noise, or some combination thereof makes it difficult for the classifier to learn the small disjuncts. This is due to the formation of faux disjuncts as a result of the presence of noise. Weiss and Hirsh [20] also observed that low amounts of class noise only affect the accuracy of the small disjuncts, while the presence of high degrees of noise also affects the larger disjuncts.

• Training set size and small disjuncts - The effect of varying training set size on the small disjuncts has been studied in [143, 144, 149]. The investigations show that the EC values for most datasets increase as the training set size increases. This may seem surprising at first glance, as it is well known that an increase in training set size leads to the formation of better learners. However, the increase in EC values is indeed consistent with this fact [143]. As the training set size is increased, the larger disjuncts are learned much more accurately, resulting in a lower overall error. The performance on small as well as large disjuncts is improved. However, due to the large disjuncts becoming almost error-free, most of the errors that do occur are restricted to the small disjuncts, resulting in the higher EC values.

• Pruning and small disjuncts - Holte et al. [12], Prati et al. [28], and Weiss [143, 144] have studied the effects of pruning on small disjuncts. The consensus seems to be that pruning results in the removal of the rules corresponding to the small disjuncts. As a result, the data points corresponding to the small disjuncts are classified by the larger disjuncts or by a default rule. Due to the removal of the error-prone small disjunct rules, EC values increase. However, such an approach would be ill-advised in applications where the small disjuncts are of significance (such as that of [148]).

• Missing attributes and small disjuncts - Like noise, the missingness of attribute values also aggravates the performance on small disjuncts, as observed in [149].

• Class imbalance and small disjuncts - Quinlan [147] showed that the small disjuncts belonging to the minority class are more error-prone compared to those of the majority class. Prati et al. [28] as well as Weiss [143] have studied the effect of oversampling the smaller classes (so that all classes have an equal number of representatives in the training set) on small disjuncts. Prati et al. [28] are of the opinion that oversampling leads to an increase in the number of error-prone small disjuncts, while Weiss [143] is of the opposite opinion that oversampling improves the performance on small disjuncts and lowers the EC values.
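As referenced in the list above, the following is a minimal sketch of an EC-style computation (ours, in the spirit of Weiss [144]; the published index additionally rescales the area relative to the diagonal of the plot):

```python
import numpy as np

def error_concentration(disjunct_sizes, disjunct_errors):
    """Sort disjuncts by increasing size, plot the cumulative fraction of
    errors against the cumulative fraction of covered instances, and return
    the area under that curve; values near 1 mean that the errors are
    concentrated within the small disjuncts."""
    order = np.argsort(disjunct_sizes)
    sizes = np.asarray(disjunct_sizes, dtype=float)[order]
    errors = np.asarray(disjunct_errors, dtype=float)[order]
    x = np.concatenate([[0.0], np.cumsum(sizes) / sizes.sum()])
    y = np.concatenate([[0.0], np.cumsum(errors) / errors.sum()])
    return float(np.trapz(y, x))

# Three small disjuncts hold most of the errors despite covering few instances:
print(error_concentration([5, 6, 8, 120, 150], [4, 5, 5, 3, 2]))
```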

• Place small disjuncts into separate classes - Weiss opined in [142] that labelling the small disjuncts as separate classes, along the lines of Japkowicz [156], could benefit the learning on small disjuncts. This would essentially transform the small disjunct problem into a multi-class imbalanced problem, thus enabling the use of established methods from that domain to handle small disjuncts. However, no such practical attempt seems to exist to date; a minimal sketch of how the relabelling could be realized is given below.
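Since no such practical attempt exists in the literature, the following is only an illustrative sketch, under the assumption that small disjuncts can be approximated by small within-class clusters; the function name, the cluster count n_clusters, and the size threshold min_size are our own choices, not values from any cited study.

```python
import numpy as np
from sklearn.cluster import KMeans

def relabel_small_disjuncts(X, y, n_clusters=5, min_size=10, seed=0):
    """Give each small within-class cluster its own pseudo-class label."""
    pseudo = y.copy()
    mapping = {int(c): int(c) for c in np.unique(y)}  # pseudo label -> original class
    next_label = int(y.max()) + 1
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        k = min(n_clusters, len(idx))
        clusters = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx])
        for g in np.unique(clusters):
            members = idx[clusters == g]
            if len(members) < min_size:        # candidate small disjunct
                pseudo[members] = next_label   # relabel as a separate class
                mapping[next_label] = int(c)
                next_label += 1
    return pseudo, mapping
```

A classifier trained on the pseudo-labels can then be combined with standard multi-class imbalance handling techniques, and its predictions mapped back to the original classes via the returned dictionary.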

3.3. Summary of existing empirical results on Classification with Small Disjuncts

It follows from the above discussion that the two principal approaches to handle small disjuncts in classification tasks are to use a maximum specificity bias for the small disjuncts and to employ global search techniques like Genetic Algorithm (GA) to formulate the rules corresponding to the small disjuncts. Therefore, we quote the results for one representative of each of the two approaches in Table 4. For the sake of consistency, for both methods, we only quote the results corresponding to the experiments where disjuncts covering up to 5 instances are considered to be small. The average classification accuracies on the small disjuncts and the large disjuncts are reported separately. The results for the two approaches should not be compared with each other due to the use of different datasets. However, the quoted results do show that the small disjuncts are indeed prone to greater misclassification and also provide a benchmark that new research in the field must strive to outperform.

Table 4: Summary of existing empirical results for Classification with Small Disjuncts

Approach | Reference | Base Classifier | Datasets used | Average accuracy on small disjuncts† | Average accuracy on large disjuncts
Maximum specificity for small disjunct rules | Holte et al. [12] | CN2 decision trees | Chess endgame dataset [157] | 0.9000 | 0.9210
Train small disjunct rules using global search | Carvalho and Freitas [145] | C4.5 and Genetic Algorithm | 2 UCI datasets | 0.6150 | 0.7925

† Disjuncts covering up to 5 instances are considered to be small.

4. Class Distribution Skew Problem in Classification: Some Perspectives

As explained in Section 3, the problem of the dominance of one class in the region of overlap can arise in learning problems irrespective of the presence of global class imbalance. In the absence of class imbalance, such a situation may arise when the class distributions are disparate, so that one class is sparse in the region of overlap while the other is abundant. The term ‘class skew’ has been used in the class imbalance literature as a less common synonym for class imbalance [6, 11, 13, 19]. However, the technical meaning of the term ‘skew’ is a structural bias, obliqueness or deformity. Therefore, we use the slightly different yet apt term ‘class distribution skew’ to denote the situation where the structural peculiarities of the individual class distributions and/or the structural disparities between them cause one of the classes to dominate the other in and around the region of overlap. Let us revisit the illustration in Figure 2c to gain a better understanding of the class distribution skew problem. The disparity in the orientation of the two elliptical classes is evident from the figure. As a result of the disparate orientations, the star class (despite being the minority class) has a greater number of representatives around the region of overlap, resulting in the borderline instances from the circle class being misclassified. Such situations can occur irrespective of the presence of global class imbalance. This results in poor learning in the vicinity of the region of overlap for one or more of the classes. However, this phenomenon has received relatively little attention in the literature, except the discussion in [100] about the ‘generalized imbalance’ between classes due to disparate class structures. Perhaps this is due to the difficulty of visualizing and quantifying the structural peculiarities of datasets having more than three attributes. Yet, the effects of this phenomenon have been evident in the literature [27, 158, 159], especially in relation to the proper classification of borderline instances. Of particular interest are the findings of Weiss and Provost [158] concerning the effects of oversampling on class-imbalanced datasets having varying degrees of imbalance. Ideally, the best performance should be achieved when all the classes are equally represented in the training set. However, the results in [158] show that the best performance is often achieved at some other level of oversampling, where either the majority or the minority class outnumbers the other. This finding hints towards the presence of class distribution skew in most real-world datasets. Though there has been no investigation solely concerned with class distribution skew in the absence of global imbalance, some possible solutions to the problem have been proposed in the context of class imbalance problems. However, the reader should note that these solutions are generally aimed towards better learning on the minority class, and there exists a need to generalize these methods to cases where there is no global imbalance and to cases where the majority class suffers due to distribution skew.

• Adaptive kNN variants - Due to the local nature of the imbalance resulting from class distribution skew, adaptive variants of the kNN classifier which are tailored to deal with the local imbalance of classes are likely to be effective. Zhang et al. [160, 161] proposed classifiers which determine a dynamic

neighborhood for each query instance and then compensate for the difference between the global and local prior probabilities of the classes. Wang et al. [162] estimate the disparity between the local and global IR to compensate for the local variations in the imbalance in an evidence theory based variant of kNN. The work of Dubey and Pudi [163] also accounts for local variations in class imbalance. A minimal sketch of the common idea underlying such methods is given after this list.

• Adaptive oversampling methods - Adaptive oversampling methods which account for the local IR in the vicinity of data points, or which are tailored to benefit the borderline instances, such as Borderline-SMOTE [42], ADASYN [43], LN-SMOTE [44], and safe-level SMOTE [45], may also prove to be useful for handling class distribution skew, if either the minority or the majority class is oversampled based on the local IR of the neighborhood under consideration.
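The cited methods differ considerably in their details; the following is only a minimal sketch of one simple instantiation of such compensation, in which each class's neighbourhood vote is normalized by the class's global size, so that a locally sparse class is not automatically out-voted near the region of overlap (the function name and the choice of k are illustrative).

```python
import numpy as np

def prior_compensated_knn_predict(X_train, y_train, x, k=11):
    """Predict the label of x with class-size-normalized kNN votes.

    Plain kNN picks the class with the largest raw neighbour count, which
    favours whichever class is locally abundant. Dividing each count by the
    class's global size compares class-conditional density estimates
    instead, compensating for the local dominance caused by skewed and
    disparate class structures.
    """
    dist = np.linalg.norm(X_train - x, axis=1)
    neighbours = y_train[np.argsort(dist)[:k]]
    classes, sizes = np.unique(y_train, return_counts=True)
    scores = {c: np.sum(neighbours == c) / n_c for c, n_c in zip(classes, sizes)}
    return max(scores, key=scores.get)
```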

5. Classification with Missing Features: Summary and Recent Approaches

5.1. Characterization of the missing features problem

Missing features, variables or attributes have always been a challenge for researchers, because most traditional learning methods (specifically, the ones which assume all data instances to be fully observed, i.e. all the features are observed) cannot be directly applied to such incomplete data without suitable pre-processing. The initial models for feature missingness are due to Rubin and Little [164]. Schafer [165] also provides theories and analyses of missing data. The three principal types of mechanisms [164] giving rise to the missing features problem are as follows (a sketch simulating all three mechanisms is given after this list):

• Missing Completely At Random (MCAR): MCAR refers to the case where missingness is entirely random, i.e. the likelihood of a feature being unobserved for a certain data instance depends neither on the observed nor on the unobserved characteristics of the instance. For example, in an annual income survey, a citizen may be unable to participate due to unrelated reasons such as traffic or schedule problems.

• Missing At Random (MAR): MAR refers to the cases where the missingness is conditional on the observed features of an instance but is independent of the unobserved features. Suppose college-goers are less likely to report their income than office-goers. But whether a college-goer will report his/her income is independent of the actual income.

• Missing Not At Random (MNAR): MNAR is characterized by the dependence of the missingness on the unobserved features. For example, people who earn less are less likely to report their incomes in the annual income survey.
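To make the three mechanisms concrete, the following sketch injects each of them into a synthetic income/age survey table; the variable names, missingness rates, and thresholds are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(18, 70, n)                    # always observed
income = 1000.0 * age + rng.normal(0, 5000, n)  # subject to missingness

def mcar(x, rate=0.2):
    """Missingness independent of both observed and unobserved values."""
    out = x.copy()
    out[rng.random(n) < rate] = np.nan
    return out

def mar(x, age, young_rate=0.4, old_rate=0.1):
    """Missingness depends only on the observed covariate (age)."""
    out = x.copy()
    out[rng.random(n) < np.where(age < 30, young_rate, old_rate)] = np.nan
    return out

def mnar(x, low_rate=0.4, high_rate=0.1):
    """Missingness depends on the very value that goes missing (income)."""
    out = x.copy()
    out[rng.random(n) < np.where(x < np.median(x), low_rate, high_rate)] = np.nan
    return out
```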

23

Schafer and Graham [166] and Zhang et al. [167] have observed that MCAR is a special case of MAR, and that MNAR can also be converted to MAR by appending a sufficient number of additional features. Therefore, most learning techniques are based on the validity of the MAR assumption.

5.2. Traditional methods of handling missing features

García-Laencina et al. [15], in a somewhat older survey, distinguished four main types of learning with missing features, as shown below.

• Marginalization - which discards the data points having missing features.

• Imputation - which attempts to fill in the missing features by making reasonable estimates based on the values observed for the corresponding features over the rest of the dataset.

• Model-based methods - which make parametric estimates of the distributions of each feature based on the corresponding observed values; the estimated distributions are then used to draw inferences.

• Direct methods - which consist of a subgroup of machine learning methods that can be directly applied to datasets having missing features.

In what follows, we outline the traditional approaches to handle classification with missing features under these heads.

5.2.1. Marginalization methods

Marginalization methods consist of removing the incomplete data points from the dataset. García-Laencina et al. [15] mention two distinct methods of marginalization, viz.

1. Complete case analysis - When the rate of missingness is low, such as 1–5%, all the incomplete data instances can be ignored. This approach, which enables the use of traditional machine learning methods, is known as complete case analysis. However, it suffers from the drawback that incomplete test points cannot be classified and have to be discarded.

2. Available case analysis - An alternative is to only use those points for learning which are completely observed in the observed subspace of the test point. However, such methods require the system to be trained separately for different patterns of missingness. One such approach is that of Sharpe and Solly [168].
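Both marginalization strategies are straightforward to express in code; below is a minimal pandas sketch, where the toy data frame and column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"f1": [1.0, np.nan, 3.0, 4.0],
                   "f2": [2.0, 5.0, np.nan, 8.0],
                   "label": [0, 1, 0, 1]})

# 1. Complete case analysis: discard every instance with any missing feature.
complete_cases = df.dropna()

# 2. Available case analysis: for a test point observed only on f1, train on
#    the instances that are complete in that observed subspace.
observed = ["f1"]
available_cases = df.dropna(subset=observed)[observed + ["label"]]
```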

24

5.2.2. Imputation methods

Marginalization cannot be applied to data having 5–15% missing values, as it may lead to the loss of a sizable amount of information. Therefore, an alternative is to fill in or impute the vacancies in the data, so that traditional learning methods can be applied thereafter. Common imputation methods [169] involve filling the missing features of data instances with zeros (Zero Imputation (ZI)), or with the means of the corresponding features over the entire dataset (Average Imputation (AI)). Class Mean Imputation or Concept Mean Imputation (CMI) is a slight modification of AI that involves filling the missing features with the average of all observations having the same label as the instance being filled. Yet another common imputation method is k-Nearest Neighbour Imputation (kNNI) [170, 171], where the missing features of a data instance are filled in with the averages of the corresponding features over its k-Nearest Neighbours (kNN), the kNN being identified on the observed subspace. Grzymala-Busse and Hu [172] suggested several other approaches for imputing the missing feature values, viz. selecting the most common feature value, selecting the most common value of the feature within the same class or concept, C4.5 based imputation, assigning all possible values of the feature, assigning all possible values of the feature restricted to the given concept or class, the event-covering method, etc. Rubin's book [173] on Multiple Imputation (MI) proposes a technique where the missing values are imputed by a typically small (e.g. 5–10) number of simulated versions, depending on the percentage of missing data. This method of repeated imputation incorporates the uncertainty inherent in imputation. Techniques such as Markov Chain Monte Carlo (MCMC) [174] (which simulates random draws from non-standard distributions via Markov chains) have been used for MI by making a few independent estimates of the missing data from a predictive distribution; these estimates are then used for MI [175, 176]. Some more sophisticated techniques have been developed, especially by the bioinformatics community, to impute the missing values by exploiting the correlations within the data. Troyanskaya et al. [177] proposed a weighted variant of kNNI and also put forth the Singular Value Decomposition based Imputation (SVDI) technique, which performs regression-based estimation of the missing values using the k most significant eigenvectors of the dataset. Two variants of the Least Squares Imputation (LSI) technique were proposed by Bo et al. [178]. Sehgal et al. [179] further combined LSI with Non-Negative LSI (NNLSI) in the Collateral Missing Value Estimation (CMVE) technique. These methods have also been shown to vastly improve the imputation results [180, 181]. Meng et al. [182] proposed a bi-clustering based Bayesian PCA approach for missing variable estimation on microarray data. More recently, sophisticated imputation methods like those based on Self-Organizing Maps (SOM) [183], Multi-Layer Perceptrons (MLP) [168], Recurrent Neural Networks (RNN) [184], auto-associative neural networks [185], and multi-task learning approaches [186] have been proposed. However, imputation methods may introduce noise and create more problems than they solve, as documented by Little and Rubin [164] and others [187, 188, 189].
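The simple schemes above (ZI, AI, CMI and kNNI) can be stated precisely in a few lines; the following is a minimal NumPy/scikit-learn sketch with np.nan marking missing entries, applied to an illustrative toy matrix.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

def zero_impute(X):                                    # ZI
    return np.nan_to_num(X, nan=0.0)

def average_impute(X):                                 # AI
    out = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = col_means[cols]
    return out

def class_mean_impute(X, y):                           # CMI
    out = X.copy()
    for c in np.unique(y):
        in_class = (y == c)
        col_means = np.nanmean(X[in_class], axis=0)    # per-class feature means
        rows, cols = np.where(np.isnan(out) & in_class[:, None])
        out[rows, cols] = col_means[cols]
    return out

# kNNI: neighbours are identified on the mutually observed subspace.
X_knni = KNNImputer(n_neighbors=2).fit_transform(X)
```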

25

5.2.3. Model-based methods

Inferences drawn from data having more than 15% missingness may be severely warped, despite the use of such sophisticated imputation methods [190]. Model-based methods, which make parametric estimates of the joint distribution of all the features in the data (using the inter-relationships among the features), have shown vast improvements over traditional imputation approaches [191, 192, 193]. These procedures are more efficient than imputation because they often achieve better estimates of the missing feature values. A detailed description of the general philosophy underlying these methods can be found in [15]. Generally, in incomplete datasets, Maximum Likelihood Estimation (MLE) is used when the likelihood function can be maximized directly. The likelihoods are separately calculated for cases with unobserved features and for cases with complete data on all features. Then, these two likelihoods are maximized together to obtain the estimates. Dempster and Rubin [189] proposed the use of an iterative solution, based on the Expectation Maximization (EM) algorithm, when closed-form solutions to the maximization of the likelihoods are not possible. Ghahramani and Jordan [194] trained Gaussian Mixture Models (GMM) on incomplete data using the EM algorithm. Ahmad and Tresp [191] proposed Bayesian techniques for estimating class probabilities from incomplete data using neural networks. Tresp et al. [195] further improved this method using GMM. Williams et al. [196] used both GMM and EM to estimate the class-conditional density functions from datasets with missing features. Ramoni and Sebastiani [197] proposed the Robust Bayesian Estimator (RBE) to learn class-conditional distributions from incomplete data using probability intervals spanning all possible values of a missing feature. Other interesting model-based approaches for handling missing features are those of Bhattacharyya et al. [198] and Smola et al. [199], who train SVMs using probabilistic constraints. Pelckmans et al. [200], on the other hand, used a probabilistic risk function to train SVMs on incomplete data.

5.2.4. Direct methods

Model-based approaches are often computationally expensive. Moreover, most imputation and model-based methods assume that the pattern of missingness is ignorable. When data are MCAR or MAR, the reasons for missing data may be ignored and simpler models can be used for missing data analysis. Heitjan and Basu [201] provided a thorough discussion of this topic. On the other hand, MNAR has a non-ignorable response mechanism, because the mechanism governing the missingness of data itself has to be modeled to deal with the missing data. In the analysis of incomplete data, the mechanism and extent of missingness are both crucial in determining the methods to process them. Hence, other methods have been developed to tackle incomplete data due to MNAR [202]. Krause and Polikar [203] proposed a modification of the Learn++ incremental learning algorithm which can work around the need for imputation by using an ensemble of multiple classifiers learned on random subspaces of the dataset. Juszczak and Duin [204] trained single class classifiers on each of the features (or combinations thereof), so that an inference about the class to which a

particular data instance belongs can be drawn even when some of the features are missing, by combining the individual inferences drawn by each of the classifiers pertaining to the observed features. Random subspace learning was also used by Nanni et al. [205] and compared with other approaches such as MI, MtI, etc. It is also worth reiterating here the observation made in Section 1 that traditional decision tree learners such as ID3 [206], C4.5 [130] and CN2 [207] can be readily applied to incomplete data. Recently, Datta et al. [16] proposed a Feature-Weighted Penalized Dissimilarity (FWPD) measure to enable the kNN classifier to handle data with missing variables. FWPD augments a weighted sum of the distances between two data instances, computed on the basis of the observed (non-missing) features, with a penalty term. The penalty term is proportionately higher if an instance misses those features which are observed for a majority of the data instances (a rough sketch of this idea is given after Table 5). Struski et al. [208] represented incomplete data as affine subspaces with a distinguished base point. Such a representation allows various affine transformations of the incomplete data, including whitening and dimensionality reduction. Hazan et al. [209] devised an SVM-like classifier for incomplete data residing in lower-dimensional manifolds, which they claim performs as well as a traditional classifier on the fully observed data.

5.3. Deep learning methods for handling missing features

Due to the growing popularity of deep learning paradigms, there have been recent attempts to handle missingness in deep learners as well. Che et al. [210] proposed a deep learning framework based on the Gated Recurrent Unit (GRU) (an RNN architecture) which informs the learner about the missing (or observed) inputs through masking, and captures the patterns of missingness through the time intervals between observations. Gondara and Wang [211] suggested the use of deep denoising autoencoders for multiple imputation of continuous, categorical and mixed-type data with various missingness patterns. Denoising autoencoders incorporate noise into the input data and compel the network to reconstruct a clean output, thus forcing the hidden layers to learn more robust features. Zhong et al. [212] proposed a deep learning model called Field Effect Bilinear Deep Networks (FEBDN) for image recognition with missing features. Interestingly, their deep architecture uses a Restricted Boltzmann Machine (RBM) in a three-stage learning framework emulating the operational characteristics of a Field Effect Transistor (FET), and is able to optimize the class boundaries while simultaneously estimating the missing features. A few other recent approaches involving deep learning based estimation of missing features can be found in [213, 214].

5.4. Summary of existing empirical results on Classification with Missing Features

In this section, we quote empirical results from recent and/or popular representative methods from each of the major types of approaches used to handle missing features in classification tasks.

27

One representative each is chosen from the classes of marginalization, model-based methods, and direct methods. However, three representatives are chosen from the class of imputation methods, due to the diverse range of imputation techniques prevalent in the literature. We choose one simple imputation method and two machine learning based imputation methods, one based on a shallow network and another based on a state-of-the-art deep autoencoder network. Since the purpose of handling missingness of features in classification tasks is to achieve good classification performance despite the missingness, we summarize the results in Table 5 in terms of the classification accuracies achieved by handling the missingness of features using the respective approaches. Future research on classification in the presence of missing features must strive to improve upon these results by devising better ways to handle the missingness of features.

Table 5: Summary of existing empirical results for Classification with Missing Features

Approach | Reference | Base Classifier | Datasets used | Type of missingness | Average accuracy
Marginalization (Available case) | Sharpe and Solly [168] | MLP | Thyroid dataset [215] | unknown | 0.9480
k Nearest Neighbor Imputation | Dixon [170] | weighted kNN [216] | 5 datasets from UCI and other sources | MCAR | 0.6079
Multi-task Neural Network Imputation | García-Laencina et al. [186] | MLP | Pima dataset from UCI | unknown | 0.8042
Deep Autoencoder Imputation | Gondara and Wang [211] | Random Forest | 12 datasets from MLBench [217] | MCAR, MNAR | 0.8672
Model-based (Robust Bayesian Estimator) | Ramoni and Sebastiani [197] | Naive Bayes [218] | Congressional Voting Records dataset from UCI | MCAR, MAR, MNAR | 0.9021
Direct method (Penalized Dissimilarity based) | Datta et al. [16] | kNN | 17 UCI datasets | MCAR, MAR, MNAR | 0.8031
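To give the flavour of the penalized dissimilarity idea of Datta et al. [16] summarized in Section 5.2.4, the following is a rough sketch rather than the paper's exact formulation; the mixing weight alpha and the normalization of the penalty are our own illustrative assumptions.

```python
import numpy as np

def penalized_dissimilarity(x, z, obs_freq, alpha=0.25):
    """Distance on mutually observed features plus a penalty for missing ones.

    x, z     : 1-d arrays with np.nan marking missing features.
    obs_freq : fraction of training instances observing each feature, so that
               missing a commonly observed feature is penalized more.
    """
    both = ~np.isnan(x) & ~np.isnan(z)
    dist = np.linalg.norm(x[both] - z[both]) if both.any() else 0.0
    penalty = obs_freq[~both].sum() / obs_freq.sum()
    return (1.0 - alpha) * dist + alpha * penalty
```

Plugging such a dissimilarity into a standard kNN rule allows incomplete training and test instances to be compared without any imputation.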

6. Classification with Absent Features: An Outline

In contrast to the missing feature problem, classification tasks often face situations where certain features can be simply undefined or non-existing for certain data instances, rather than having an unobserved or unrecorded value. In such cases, estimating the values of the undefined features is not really meaningful, and this motivates the development of approaches for classification with such absent features. The first work that systematically addressed such a classification task with non-existing ("structurally absent") features for some of the samples was due to Chechik et al. [17], who showed how such incomplete data can be classified without filling in the structurally missing attributes with imputed values, by using a max-margin learning system. The authors hypothesized that each data instance exists in a lower-dimensional

subspace of the full feature space, defined by its own existing features. Taking a cue from the geometrical interpretation of max-margin classifiers, they tried to maximize the margin of the separating hyperplane in the worst case, where the margin for each data instance is determined in its own lower-dimensional subspace. For the linearly separable case, the authors used a second-order cone programming approach, while for the non-separable case, where the objective turns out to be non-convex, they proposed an iterative procedure to converge to a local optimum of the objective. Zhaowei et al. [219] used Chechik et al.'s approach for forecasting incomplete time series. However, their application was essentially one of missing feature classification, where, instead of filling in the missing variables in advance, they solved the related classification task more directly. Following a similar line of thought, Zhang et al. [26] proposed a max-margin regression procedure to estimate the real effort of software projects from data with structural missingness or absent features, without using the conventional imputation methods. Multiple Kernel Learning (MKL) assumes that all the base kernels for each data instance are fully observed. However, this can be untrue when the data comes with missing or absent features. Motivated by the approach of Chechik et al., Liu et al. [220] developed an MKL approach by directly classifying the samples with absent base kernels. As can be perceived, unlike the plethora of works devoted to solving unstructured missing feature problems by imputation, not much research has been undertaken to directly address the absent feature problem for various stand-alone and ensemble classifiers, and the field remains quite open to date.
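The geometric notion at the heart of Chechik et al.'s formulation can be sketched as follows: given a linear classifier (w, b), each instance's margin is measured only in the subspace of its own existing features (np.nan marks the absent ones). The actual optimization of the worst-case margin, which the authors solve via second-order cone programming and an iterative procedure, is omitted; this sketch only evaluates the instance-specific margins.

```python
import numpy as np

def instance_subspace_margins(X, y, w, b):
    """Margin of each instance computed in its own observed subspace.

    X : (n, d) array with np.nan for absent features; y : labels in {-1, +1}.
    """
    margins = np.empty(len(y), dtype=float)
    for i, (x, yi) in enumerate(zip(X, y)):
        obs = ~np.isnan(x)                  # the instance's own subspace
        w_obs = w[obs]
        norm = np.linalg.norm(w_obs)
        margins[i] = yi * (x[obs] @ w_obs + b) / (norm if norm > 0 else 1.0)
    return margins

# The max-margin objective of [17] then maximizes margins.min() over (w, b).
```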

7. Interrelations among Data Irregularities and Open Issues

7.1. Interrelations among Distribution-based Irregularities

The interrelations among the three types of distribution-based irregularities are summarized in Table 6 and elucidated below.

Table 6: Current state of the investigations on the interrelations among distribution-based irregularities

Combination | Current state of research
Class imbalance & small disjuncts | Jo and Japkowicz [221] showed that severe class imbalance can give rise to small disjuncts. Quinlan [147] showed that the small disjuncts belonging to the minority class are more error-prone compared to those of the majority class.
Class imbalance & class distribution skew | Present investigations have only concentrated on making the minority class immune to the effects of class distribution skew.
Small disjuncts & class distribution skew | Yet to be investigated.
Class imbalance, small disjuncts & class distribution skew | Yet to be investigated.

1. Class imbalance and small disjuncts - The interrelationship between the class imbalance problem and the small disjuncts problem is apparent from the investigations conducted in [147] and [221]. Quinlan

[147] demonstrated that the small disjuncts belonging to the minority class are more error-prone than those belonging to the majority class. Jo and Japkowicz [221], on the other hand, found that an increase in class imbalance is responsible for the creation of a greater number of small disjuncts from the minority class, resulting in poorer performance on the minority class. They seem to be of the opinion that class imbalance, without the presence of small disjuncts, does not result in lower performance on the minority class. However, their experiments are incomplete in the sense that they do not study the effects of overlap between classes. Yet, both studies suggest that achieving good performance on the small disjuncts may automatically mitigate the class imbalance problem.

2. Class imbalance and class distribution skew - As noted in Section 4, all the research effort has been concentrated on tackling the effects of class distribution skew on the minority class, despite the experiments in [158] suggesting that class distribution skew can potentially affect both classes. This is likely what gives rise to the cost-tuning problem in class imbalance problems. Therefore, there exists a need to study the interrelationships between class distribution skew and the cost-tuning problem in class-imbalanced problems, so as to benefit both classes.

3. Small disjuncts and class distribution skew - In situations where class distribution skew gives rise to fine structures (such as a narrow tail) in one or more of the classes, the existence of these fine structures may give rise to small disjuncts if rule-based learners are used for classification. Thus, there seems to be an interesting relationship between these two phenomena.

4. Class imbalance, small disjuncts and class distribution skew - It is now clear that the three types of distribution-based irregularities are indeed closely interrelated, and there is a possibility that tackling one may help handle the others. Class imbalance, as well as class distribution skew, may give rise to the creation of small disjuncts. On the other hand, small disjuncts seem to be responsible for much of the poor performance on the minority class. Additionally, class distribution skew seems to be responsible for the cost-tuning problem in class-imbalanced classification tasks.

7.2. Interrelations among Feature-based Irregularities

Unstructured missingness or the missing features problem is essentially independent of structured missingness or the absent features problem, in that the former is a phenomenon extrinsic to the dataset (resulting from external mechanisms such as data corruption, instrument failure, etc.) while the latter is intrinsic to the dataset. However, there may be cases where both forms of missingness occur simultaneously. In such a situation, the principal challenge would be to distinguish between the two forms of missingness.


7.3. Interrelations between Distribution-based and Feature-based Irregularities

The interrelations between distribution-based and feature-based irregularities are listed in the following.

1. Class imbalance and missing features - Classification tasks, such as medical diagnosis, which suffer from class imbalance often also suffer from the missing features problem. Hence, a few studies have been conducted on handling both irregularities together in classification tasks. Most of these studies, such as [222, 223, 224], deal with class imbalance separately after imputing the missing features.

2. Class imbalance and absent features - Chen and Mani [225] dealt with a problem characterized by both class imbalance and missingness. They considered missingness of a feature to be a legitimate state of the feature, making the approach suitable for handling structural missingness.

3. Class-specific missingness - Datta et al. [16] defined biased missingness as the phenomenon where not all classes suffer from the same extent of missingness. Such biased missingness, when encountered together with class imbalance, may make the learning task very difficult. Takum and Bunkhumpornpat [226] studied the scenario where only minority class instances have missing feature values.

4. Disjunct-specific missingness - Like class-specific missingness, the extent or type of missingness may also be unique to the disjuncts in a dataset. This will especially pose a problem for handling the missing or absent features of small disjuncts.

7.4. Open avenues leading to possible future research

Having acquainted the reader with the rich variety of research taking place in the field of data irregularities, and having noted the sub-fields and niches which require further attention, we now highlight the following open issues concerning the individual data irregularities and combinations thereof. We expect that this will motivate the reader to conduct future studies to further enrich the research on data irregularities.

1. Class imbalance - (a) Since deep learning has significantly advanced the state-of-the-art for learning feature representations from data, an interesting area of research can be to study the effects of class imbalance on the deep representations learned by neural networks. Are the learned feature representations favourable for both the classes or just the majority class(es)? (b) The multi-objective optimization based class imbalance handling techniques are mostly restricted to two-class tasks. It may turn out to be very rewarding to extend these methods to multi-class scenarios, as these methods are inherently capable of handling the trade-off between the classes without having to resort to tuning procedures.

(c) Yet another interesting direction of research is to investigate (theoretically or empirically) how the generalization performance of classifiers is affected by the presence of class imbalance. To the best of our knowledge, there is no significant work that extends the theoretical studies on the bounds of the testing error of popular classifiers like SVM [227] or kNN [228] to see how the bounds vary with the degree of imbalance.

2. Small disjuncts - (a) As noted in Section 3, the methods used to measure the presence of small disjuncts in datasets are restricted to rule-based classifiers. An important question then is: how does one identify the small disjuncts in a dataset when the classifier used is not rule-based? The knowledge that, except in the case of rule-based learning techniques, small disjuncts essentially amount to small sub-clusters within the classes may help in this regard. (b) Huang et al. [116] recently proposed a technique to preserve the inter-cluster and inter-class margins while learning deep representations of data. It seems that such an approach will make it easier to learn the smaller disjuncts. Hence, it will be interesting to investigate what effect such a representation actually has on the small disjuncts. (c) Since the small disjunct problem can be thought of as a trade-off between the various disjuncts belonging to the different classes, perhaps the problem can be solved by formulating the classification task as a many-objective optimization problem where the performance on each of the disjuncts must be simultaneously maximized.

3. Class distribution skew - (a) As noted in Section 4, all the past as well as current research effort has been focused on making the minority class immune to the effects of skewed class distributions. Therefore, investigations may be conducted to develop adaptive learning methods which can mitigate the effects of class distribution skew on all classes, irrespective of the presence of class imbalance. (b) Since datasets having more than three dimensions cannot be readily visualized, it is a challenge to devise indices or visualization techniques for detecting the presence of class distribution skew in such datasets. (c) Since the class distribution skew problem is characterized by different shapes and orientations of the distributions of the individual classes, it may be useful to employ class-wise feature learning (where features are generated or selected separately for each of the classes) to help mitigate the class distribution skew problem.

4. Distribution-based irregularities - (a) It may so happen that, in the presence of class distribution skew, the peculiarities of the class distributions are learned by a rule-based classifier as disjuncts separate


from the rest of the corresponding classes. It may prove to be useful to study the extent to which this is the case in real-world applications. (b) An interesting area of research is to automatically deduce the amount of oversampling, undersampling or the ratio of relative costs required to properly learn the classes from class-imbalanced datasets. These issues generally arise as a side-effect of the small disjuncts and class distribution skew. Hence, can the extent of oversampling, undersampling or the relative costs be determined based on the number of small disjuncts present in the dataset or the type of class distribution skew characterizing the dataset? (c) A large gamut of indices has been proposed over the past few years to measure the performance of classifiers in the face of class imbalance. However, there is a dearth of indices to properly measure the performance of classifiers in the presence of other forms of distribution-based irregularities. Research endeavors to develop such performance indices can indeed prove to be very useful.

5. Missing features - (a) The complex feature representations learned by deep neural networks seem to be largely responsible for the success of deep learning methods. However, the presence of missing features in the dataset may affect the efficacy of the learned representations. Hence, researchers must gain an understanding of the effects that the presence of missing features has on deep learning.

6. Absent features - (a) The most important question to advance the field of learning with absent features is whether there can be ways to handle structural missingness other than the direction explored in [17]. For example, will it help to treat the absence of features as legitimate values for the features, à la Chen et al. [225]?

7. Missing and absent features - (a) The most important question that arises when a dataset is plagued simultaneously by both forms of missingness is: how to distinguish between structural and unstructured missingness? Perhaps the correlation between features can provide some hints to enable such an endeavor. (b) Another interesting way to address the same problem is to devise a multi-objective framework to deal simultaneously with missing as well as absent features. The inherently different nature of the two types of feature-based irregularities may make it possible to devise two different performance measures. Both performance measures may be simultaneously maximized thereafter, using a multi-objective optimizer.

8. Distribution-based and feature-based irregularities - (a) To the best of our knowledge, there has not been any experimental study to date investigating the effect of missingness on the performance of classifiers tailored for handling distribution-based irregularities, and vice versa. Such investigations will enrich the research community's understanding of both types of irregularities.

(b) Yet another interesting question is how class imbalance is related to the biased missingness problem discussed in [16]. (c) As observed in Section 5, subspace learning methods are often used to learn in the presence of missingness. Therefore, it is important to understand when, and to what extent, missingness gives rise to distribution-based irregularities like small disjuncts and/or class distribution skew in the learnable subspaces. Such understanding will help researchers address these irregularities as and when they arise.

8. Conclusions

More often than not, real-world data is plagued with various distribution-based and feature-based data irregularities. Although some of these irregularities (for example, class imbalance and classification with missing features) have been discussed in the past in more than one dedicated survey and book chapter, to the best of our knowledge, this is the first article to systematically condense the diversified notions of data irregularities under one umbrella. We qualitatively explored the effects of such irregularities on the performance of classifiers and also provided a brief but comprehensive outline of the most recent and notable approaches to tackle them. Extending beyond the coverage of the existing surveys of imbalanced classification, we discussed various deep learning and multi-objective optimization approaches to address data irregularities. We distinguished the notion of class distribution skew from the usual way the term is used in the literature, namely as a synonym for class imbalance. We also emphasized the co-occurrence of such irregularities, like imbalanced classification along with missing/absent features, which can significantly deteriorate the performance of a conventional classifier. We sincerely hope that the future research directions provided in Section 7 will help machine learning researchers design better classifiers with more robustness and generalizability in the face of one or more of the data irregularities discussed in this article.

References

[1] D. H. Wolpert, The lack of a priori distinctions between learning algorithms, Neural Computation 8 (1996) 1341–1390.
[2] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179–188.
[3] L. Deng, D. Yu, Deep learning: Methods and applications, Foundations and Trends in Signal Processing 7 (2014) 197–387.
[4] S. B. Kotsiantis, Supervised machine learning: A review of classification techniques, Informatica (Slovenia) 31 (2007) 249–268.
[5] B. Frénay, M. Verleysen, Classification in the presence of label noise: A survey, IEEE Transactions on Neural Networks and Learning Systems 25 (2014) 845–869.
[6] H. He, E. A. Garcia, Learning from imbalanced data, IEEE Trans. on Knowl. and Data Eng. 21 (2009) 1263–1284.
[7] H. He, Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications, Wiley-IEEE Press, 1st edition, 2013.
[8] S. García, J. Luengo, F. Herrera, Data Preprocessing in Data Mining, Springer Publishing Company, Incorporated, 2014.
[9] P. Branco, L. Torgo, R. P. Ribeiro, A survey of predictive modeling on imbalanced domains, ACM Comput. Surv. 49 (2016) 31:1–31:50.


[10] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (2012) 463–484. [11] B. Krawczyk, Learning from imbalanced data: open challenges and future directions, Progress in Artificial Intelligence 5 (2016) 221–232. [12] R. C. Holte, L. E. Acker, B. W. Porter, Concept learning and the problem of small disjuncts, in: Proceedings of the 11th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’89, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1989, pp. 813–818. [13] M. C. Monard, G. E. Batista, Learning with skewed class distributions, Advances in Logic, Artificial Intelligence, and Robotics: LAPTEC 2002 85 (2002) 173. [14] M. Saar-Tsechansky, F. Provost, Handling missing values when applying classification models, J. Mach. Learn. Res. 8 (2007) 1623–1657. [15] P. J. Garc´ıa-Laencina, J.-L. Sancho-G´ omez, A. R. Figueiras-Vidal, Pattern classification with missing data: a review, Neural Computing and Applications 19 (2010) 263–282. [16] S. Datta, D. Misra, S. Das, A feature weighted penalty based dissimilarity measure for k-nearest neighbor classification with missing features, Pattern Recognition Letters 80 (2016) 231 – 237. [17] G. Chechik, G. Heitz, G. Elidan, P. Abbeel, D. Koller, Max-margin classification of data with absent features, J. Mach. Learn. Res. 9 (2008) 1–21. [18] A. D. Pozzolo, O. Caelen, Y.-A. L. Borgne, S. Waterschoot, G. Bontempi, Learned lessons in credit card fraud detection from a practitioner perspective, Expert Systems with Applications 41 (2014) 4915 – 4928. [19] N. Wahab, A. Khan, Y. S. Lee, Two-phase deep convolutional neural network for reducing class skewness in histopathological images based breast cancer detection, Computers in Biology and Medicine 85 (2017) 86 – 97. [20] G. M. Weiss, H. Hirsh, The problem with noise and small disjuncts, in: ICML 1998, 1998, p. 574. [21] V. Nikulin, G. J. McLachlan, Classification of imbalanced marketing data with balanced random sets, in: Proceedings of the 2009 International Conference on KDD-Cup 2009 - Volume 7, KDD-CUP’09, JMLR.org, 2009, pp. 89–100. [22] Z. Liu, H. Wang, Y. Yan, G. Guo, Effective facial expression recognition via the boosted convolutional neural network, in: H. Zha, X. Chen, L. Wang, Q. Miao (Eds.), Computer Vision: CCF Chinese Conference, CCCV 2015, Xi’an, China, September 18-20, 2015, Proceedings, Part I, Springer Berlin Heidelberg, Berlin, Heidelberg, 2015, pp. 179–188. [23] R. Young, D. R. Johnson, Handling missing values in longitudinal panel data with multiple imputation, Journal of Marriage and Family 77 (2015) 277–294. [24] B. Kirkpatrick, K. Stevens, Perfect phylogeny problems with missing values, IEEE/ACM Transactions on Computational Biology and Bioinformatics 11 (2014) 928–941. [25] Q. Xiang, X. Dai, Y. Deng, C. He, J. Wang, J. Feng, Z. Dai, Missing value imputation for microarray gene expression data using histone acetylation information, BMC Bioinformatics 9 (2008) 252. [26] W. Zhang, Y. Yang, Q. Wang, A comparative study of absent features and unobserved values in software effort data, International Journal of Software Engineering and Knowledge Engineering 22 (2012) 185–202. [27] V. L´ opez, A. Fern´ andez, S. Garc´ıa, V. Palade, F. 
Herrera, An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Information Sciences 250 (2013) 113 – 141. [28] R. C. Prati, G. E. A. P. A. Batista, M. C. Monard, Class imbalances versus class overlapping: An analysis of a learning system behavior, in: R. Monroy, G. Arroyo-Figueroa, L. E. Sucar, H. Sossa (Eds.), MICAI 2004: Advances in Artificial Intelligence: Third Mexican International Conference on Artificial Intelligence, Mexico City, Mexico, April 26-30, 2004. Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004, pp. 312–321. [29] A. Fern´ andez, S. del R´ıo, N. V. Chawla, F. Herrera, An insight into imbalanced big data classification: outcomes and challenges, Complex & Intelligent Systems 3 (2017) 105–120. [30] N. Japkowicz, The class imbalance problem: Significance and strategies, in: In Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI, 2000, pp. 111–117. [31] P. Hart, The condensed nearest neighbor rule (corresp.), IEEE Transactions on Information Theory 14 (1968) 515–516. [32] I. Tomek, Two Modifications of CNN, IEEE Transactions on Systems, Man, and Cybernetics 7(2) (1976) 679–772. [33] S.-J. Yen, Y.-S. Lee, Cluster-based under-sampling approaches for imbalanced data distributions, Expert Systems with Applications 36 (2009) 5718 – 5727. [34] S. Garc´ıa, F. Herrera, Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy, Evol. Comput. 17 (2009) 275–306. [35] G. Y. Wong, F. H. F. Leung, S. H. Ling, An under-sampling method based on fuzzy logic for large imbalanced dataset, in: 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2014, pp. 1248–1252. [36] W. W. Y. Ng, J. Hu, D. S. Yeung, S. Yin, F. Roli, Diversified sensitivity-based undersampling for imbalance classification problems, IEEE Transactions on Cybernetics 45 (2015) 2402–2412.


[37] Y. Fu, H. Zhang, Y. Bai, W. Sun, An under-sampling method: Based on principal component analysis and comprehensive evaluation model, in: 2016 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C), 2016, pp. 414–415. [38] J. Ha, J.-S. Lee, A new under-sampling method using genetic algorithm for imbalanced data classification, in: Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication, IMCOM ’16, ACM, New York, NY, USA, 2016, pp. 95:1–95:6. [39] D. Devi, S. K. Biswas, B. Purkayastha, Redundancy-driven modified tomek-link based undersampling: A solution to class imbalance, Pattern Recognition Letters 93 (2017) 3 – 12. Pattern Recognition Techniques in Data Mining. [40] C. Bunkhumpornpat, K. Sinapiromsaran, Dbmute: Density-based majority under-sampling technique, Knowl. Inf. Syst. 50 (2017) 827–850. [41] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: Synthetic minority over-sampling technique, J. Artif. Int. Res. 16 (2002) 321–357. [42] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-smote: A new over-sampling method in imbalanced data sets learning, in: D.-S. Huang, X.-P. Zhang, G.-B. Huang (Eds.), Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005, Hefei, China, August 23-26, 2005, Proceedings, Part I, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 878–887. [43] H. He, Y. Bai, E. A. Garcia, S. Li, Adasyn: Adaptive synthetic sampling approach for imbalanced learning, in: IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IJCNN 2008, 2008, pp. 1322–1328. [44] T. Maciejewski, J. Stefanowski, Local neighbourhood extension of smote for mining imbalanced data, in: 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 2011, pp. 104–111. [45] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem, in: T. Theeramunkong, B. Kijsirikul, N. Cercone, T.-B. Ho (Eds.), Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, PAKDD 2009, Bangkok, Thailand, April 27-30, 2009, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 475–482. [46] J. A. Sáez, B. Krawczyk, M. Woźniak, Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets, Pattern Recognition 57 (2016) 164 – 178. [47] L. Abdi, S. Hashemi, To combat multi-class imbalanced problems by means of over-sampling techniques, IEEE Transactions on Knowledge and Data Engineering 28 (2016) 238–251. [48] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, Dbsmote: Density-based synthetic minority over-sampling technique, Applied Intelligence 36 (2012) 664–684.

[49] E. Ramentol, Y. Caballero, R. Bello, F. Herrera, Smote-rsb *: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using smote and rough sets theory., Knowl. Inf. Syst. 33 (2012) 245–265. [50] Q. Wang, A hybrid sampling svm approach to imbalanced data classification, Abstract and Applied Analysis 2014 (2014) 33–44. [51] W. Prachuabsupakij, Clus: A new hybrid sampling classification for imbalanced data, in: 2015 12th International Joint Conference on Computer Science and Software Engineering (JCSSE), 2015, pp. 281–286. [52] C. Jian, J. Gao, Y. Ao, A new sampling method for classifying imbalanced data based on support vector machine ensemble, Neurocomputing 193 (2016) 115 – 122. [53] J. A. S´ aez, J. Luengo, J. Stefanowski, F. Herrera, Smoteipf: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Information Sciences 291 (2015) 184 – 203. [54] Q. Kang, X. Chen, S. Li, M. Zhou, A noise-filtered under-sampling scheme for imbalanced classification, IEEE Transactions on Cybernetics PP (2017) 1–12. [55] N. Junsomboon, T. Phienthrakul, Combining over-sampling and under-sampling techniques for imbalance dataset, in: Proceedings of the 9th International Conference on Machine Learning and Computing, ICMLC 2017, ACM, New York, NY, USA, 2017, pp. 243–247. [56] S. G´ onzalez, S. Garc´ıa, M. L´ azaro, A. R. Figueiras-Vidal, F. Herrera, Class switching according to nearest enemy distance for learning from highly imbalanced data-sets, Pattern Recognition 70 (2017) 12 – 24. [57] K. Veropoulos, C. Campbell, N. Cristianini, et al., Controlling the sensitivity of support vector machines, in: Proceedings of the International Joint Conference on AI, IJCAI, pp. 55–60. [58] C. X. Ling, V. S. Sheng, Cost-sensitive Learning and the Class Imbalanced Problem, in: C. Sammut (Ed.), Encyclopedia of Machine Learning, 2007. [59] Z.-H. Zhou, X.-Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Trans. on Knowl. and Data Eng. 18 (2006) 63–77. [60] V. L´ opez, A. Fern´ andez, J. G. Moreno-Torres, F. Herrera, Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. open problems on intrinsic data characteristics, Expert Systems with Applications 39 (2012) 6585 – 6608.


[61] Y. Sun, M. S. Kamel, A. K. Wong, Y. Wang, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition 40 (2007) 3358 – 3378. [62] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, G. Bing, Learning from class-imbalanced data: Review of methods and applications, Expert Systems with Applications 73 (2017) 220 – 239. [63] F. Cheng, J. Zhang, C. Wen, Z. Liu, Z. Li, Large cost-sensitive margin distribution machine for imbalanced data classification, Neurocomputing 224 (2017) 45 – 57. [64] W. Xiao, J. Zhang, Y. Li, S. Zhang, W. Yang, Class-specific cost regulation extreme learning machine for imbalanced classification, Neurocomputing 261 (2017) 70 – 82. [65] N. Nikolaou, N. Edakunni, M. Kull, P. Flach, G. Brown, Cost-sensitive boosting algorithms: Do we really need them?, Machine Learning 104 (2016) 359–384. [66] M. Ohsaki, P. Wang, K. Matsuda, S. Katagiri, H. Watanabe, A. Ralescu, Confusion-matrix-based kernel logistic regression for imbalanced data classification, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 1806–1819. [67] T. Imam, K. M. Ting, J. Kamruzzaman, z-svm: An svm for improved classification of imbalanced data, in: Proceedings of the 19th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 264–273. [68] S. Datta, S. Das, Near-bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs, Neural Networks 70 (2015) 39 – 52. [69] B. Krawczyk, M. Wo´ zniak, Diversity measures for one-class classifier ensembles, Neurocomputing 126 (2014) 36 – 44. Recent trends in Intelligent Data Analysis Online Data Processing. [70] B. Krawczyk, M. Wo´ zniak, F. Herrera, Weighted one-class classification for different types of minority class examples in imbalanced data, in: 2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 2014, pp. 337–344. [71] B. Krawczyk, L. Jele´ n, A. Krzy˙zak, T. Fevens, One-class classification decomposition for imbalanced classification of breast cancer malignancy data, in: L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, J. M. Zurada (Eds.), Artificial Intelligence and Soft Computing: 13th International Conference, ICAISC 2014, Zakopane, Poland, June 1-5, 2014, Proceedings, Part I, Springer International Publishing, Cham, 2014, pp. 539–550. [72] S. Ertekin, J. Huang, L. Bottou, L. Giles, Learning on the border: active learning in imbalanced data classification, in: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, CIKM ’07, ACM, New York, NY, USA, 2007, pp. 127–136. [73] S. Doyle, J. Monaco, M. Feldman, J. Tomaszewski, A. Madabhushi, An active learning based classification strategy for the minority class problem: application to histopathology annotation, BMC Bioinformatics 12 (2011). [74] Y. Chen, S. Mani, Active learning for unbalanced data in the challenge with multiple models and biasing, in: I. Guyon, G. Cawley, G. Dror, V. Lemaire, A. Statnikov (Eds.), Active Learning and Experimental Design workshop In conjunction with AISTATS 2010, volume 16 of Proceedings of Machine Learning Research, PMLR, Sardinia, Italy, 2011, pp. 113–126. [75] J. Attenberg, . Ertekin, Class Imbalance and Active Learning, John Wiley & Sons, Inc., pp. 101–149. [76] Z. Ferdowsi, R. Ghani, R. Settimi, Online active learning with imbalanced classes, in: 2013 IEEE 13th International Conference on Data Mining, 2013, pp. 
1043–1048. [77] X. You, R. Wang, D. Tao, Diverse expected gradient active learning for relative attributes, IEEE Transactions on Image Processing 23 (2014) 3203–3217. [78] H. Guo, W. Wang, An active learning-based svm multi-class classification model, Pattern Recognition 48 (2015) 1577 – 1597. [79] J. Zhang, X. Wu, V. S. Sheng, Active learning with imbalanced multiple noisy labeling, IEEE Transactions on Cybernetics 45 (2015) 1095–1107.

[80] X. Zhang, T. Yang, P. Srinivasan, Online asymmetric active learning with imbalanced data, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, ACM, New York, NY, USA, 2016, pp. 2055–2064. [81] S. Khanchi, M. I. Heywood, A. N. Zincir-Heywood, Properties of a gp active learning framework for streaming data with class imbalance, in: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’17, ACM, New York, NY, USA, 2017, pp. 945–952. [82] X. Li, L. Wang, E. Sung, Adaboost with svm-based component classifiers, Engineering Applications of Artificial Intelligence 21 (2008) 785–795.

[83] S. Wu, S.-I. Amari, Conformal transformation of kernel functions: A data-dependent way to improve support vector machine classifiers, Neural Processing Letters 15 (2002) 59–67. [84] G. Wu, E. Y. Chang, Adaptive feature-space conformal transformation for imbalanced-data learning, in: ICML, 2003, pp. 816–823. [85] G. Wu, E. Y. Chang, Aligning boundary in kernel space for learning imbalanced dataset, in: Data Mining, 2004. ICDM’04. Fourth IEEE International Conference on, 2004, pp. 265–272.


[86] G. Wu, E. Y. Chang, KBA: Kernel boundary alignment considering imbalanced data distribution, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 786–795.
[87] P. Williams, S. Li, J. Feng, S. Wu, Scaling the kernel function to improve performance of the support vector machine, in: International Symposium on Neural Networks, 2005, pp. 831–836.
[88] A. Maratea, A. Petrosino, Asymmetric kernel scaling for imbalanced data classification, in: International Workshop on Fuzzy Logic and Applications, 2011, pp. 196–203.
[89] A. Maratea, A. Petrosino, M. Manzo, Adjusted F-measure and kernel scaling for imbalanced data learning, Information Sciences 257 (2014) 331–341.
[90] Y. Zhang, P. Fu, W. Liu, G. Chen, Imbalanced data classification based on scaling kernel-based support vector machine, Neural Computing and Applications 25 (2014) 927–935.
[91] Q. Cheng, H. Zhou, J. Cheng, H. Li, A minimax framework for classification with applications to images and high dimensional data, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014) 2117–2130.
[92] C. Peng, J. Cheng, Q. Cheng, A supervised learning model for high-dimensional and large-scale data, ACM Transactions on Intelligent Systems and Technology 8 (2016) 30:1–30:23.
[93] N. V. Chawla, A. Lazarevic, L. O. Hall, K. W. Bowyer, SMOTEBoost: Improving prediction of the minority class in boosting, in: N. Lavrač, D. Gamberger, L. Todorovski, H. Blockeel (Eds.), Knowledge Discovery in Databases: PKDD 2003: 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, September 22-26, 2003. Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, pp. 107–119.
[94] D. Mease, A. Wyner, A. Buja, Cost-weighted boosting with jittering and over/under-sampling: JOUS-Boost, Journal of Machine Learning Research 8 (2007) 409–439.
[95] C. Seiffert, T. M. Khoshgoftaar, J. V. Hulse, A. Napolitano, RUSBoost: A hybrid approach to alleviating class imbalance, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 40 (2010) 185–197.
[96] S. Chen, H. He, E. A. Garcia, RAMOBoost: Ranked minority oversampling in boosting, IEEE Transactions on Neural Networks 21 (2010) 1624–1642.
[97] R. Barandela, R. Valdovinos, J. Sánchez, New applications of ensembles of classifiers, Pattern Analysis & Applications 6 (2003) 245–256.
[98] S. Wang, X. Yao, Diversity analysis on imbalanced data sets by using ensemble models, in: 2009 IEEE Symposium on Computational Intelligence and Data Mining, 2009, pp. 324–331.
[99] M. Galar, A. Fernández, E. Barrenechea, F. Herrera, EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognition 46 (2013) 3460–3471.
[100] B. Tang, H. He, GIR-based ensemble sampling approaches for imbalanced learning, Pattern Recognition 71 (2017) 306–319.
[101] A. D'Addabbo, R. Maglietta, Parallel selective sampling method for imbalanced and large data classification, Pattern Recognition Letters 62 (2015) 61–67.
[102] P. Thanathamathee, C. Lursinsap, Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques, Pattern Recognition Letters 34 (2013) 1339–1347.
[103] S. del Río, V. López, J. M. Benítez, F. Herrera, On the use of MapReduce for imbalanced big data using random forest, Information Sciences 285 (2014) 112–137.
[104] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, Y. Zhou, A novel ensemble method for classifying imbalanced data, Pattern Recognition 48 (2015) 1623–1637.
[105] R. Akbani, S. Kwek, N. Japkowicz, Applying support vector machines to imbalanced datasets, in: J.-F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.), Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004. Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004, pp. 39–50.
[106] J.-L. Hsu, P.-C. Hung, H.-Y. Lin, C.-H. Hsieh, Applying under-sampling techniques and cost-sensitive learning methods on risk assessment of breast cancer, Journal of Medical Systems 39 (2015) 40.
[107] A. Zhou, B.-Y. Qu, H. Li, S.-Z. Zhao, P. N. Suganthan, Q. Zhang, Multiobjective evolutionary algorithms: A survey of the state of the art, Swarm and Evolutionary Computation 1 (2011) 32–49.
[108] P. Soda, A multi-objective optimisation approach for class imbalance learning, Pattern Recognition 44 (2011) 1801–1810.
[109] C. Chira, C. Lemnaru, A multi-objective evolutionary approach to imbalanced classification problems, in: 2015 IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), 2015, pp. 149–154.
[110] S. García, R. Aler, I. M. Galván, Using evolutionary multiobjective techniques for imbalanced classification data, in: Proceedings of the 20th International Conference on Artificial Neural Networks: Part I, ICANN'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 422–427.
[111] A. Aşkan, S. Sayın, SVM classification for imbalanced data sets using a multiobjective optimization framework, Annals of Operations Research 216 (2014) 191–203.

[112] H. H. Maheta, V. K. Dabhi, Classification of imbalanced data sets using multi objective genetic programming, in: 2015 International Conference on Computer Communication and Informatics (ICCCI), 2015, pp. 1–6.
[113] U. Bhowan, M. Johnston, M. Zhang, X. Yao, Evolving diverse ensembles using genetic programming for classification with unbalanced data, IEEE Transactions on Evolutionary Computation 17 (2013) 368–386.
[114] U. Bhowan, M. Johnston, M. Zhang, X. Yao, Reusing genetic programming for ensemble selection in classification of unbalanced data, IEEE Transactions on Evolutionary Computation 18 (2014) 893–908.
[115] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F. E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
[116] C. Huang, Y. Li, C. C. Loy, X. Tang, Learning deep representation for imbalanced classification, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5375–5384.
[117] Y.-A. Chung, H.-T. Lin, S.-W. Yang, Cost-aware pre-training for multiclass cost-sensitive deep learning, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, AAAI Press, 2016, pp. 1411–1417.
[118] S. Wang, W. Liu, J. Wu, L. Cao, Q. Meng, P. J. Kennedy, Training deep neural networks on imbalanced data sets, in: 2016 International Joint Conference on Neural Networks (IJCNN), 2016, pp. 4368–4374.
[119] V. Raj, S. Magg, S. Wermter, Towards effective classification of imbalanced data with convolutional neural networks, in: F. Schwenker, H. M. Abbas, N. El Gayar, E. Trentin (Eds.), Artificial Neural Networks in Pattern Recognition: 7th IAPR TC3 Workshop, ANNPR 2016, Ulm, Germany, September 28–30, 2016, Proceedings, Springer International Publishing, Cham, 2016, pp. 150–162.
[120] F. Li, S. Li, C. Zhu, X. Lan, H. Chang, Cost-effective class-imbalance aware CNN for vehicle localization and categorization in high resolution aerial images, Remote Sensing 9 (2017).
[121] Y. Yan, M. Chen, M. L. Shyu, S. C. Chen, Deep learning for imbalanced multimedia data classification, in: 2015 IEEE International Symposium on Multimedia (ISM), 2015, pp. 483–488.
[122] M. Lichman, UCI Machine Learning Repository, 2013.
[123] D. Broomhead, D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Systems 2 (1988) 321–355.
[124] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: Theory and applications, Neurocomputing 70 (2006) 489–501.
[125] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.
[126] A. Bordes, S. Ertekin, J. Weston, L. Bottou, Fast kernel classifiers with online and active learning, Journal of Machine Learning Research 6 (2005) 1579–1619.
[127] CiteSeer data, http://citeseer.ist.psu.edu (accessed 09-January-2018).
[128] USPS OCR data, https://cs.nyu.edu/~roweis/data/usps_all.mat (accessed 09-January-2018).
[129] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324.
[130] J. R. Quinlan, C4.5: Programs for Machine Learning, Elsevier, 2014.
[131] I. Triguero, S. González, J. M. Moyano, S. García, J. Alcalá-Fdez, J. Luengo, A. Fernández, M. J. del Jesus, L. Sánchez, F. Herrera, KEEL 3.0: An open source software for multi-stage analysis in data mining, International Journal of Computational Intelligence Systems 10 (2017) 1238–1249.
[132] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533–536.
[133] J. A. Lee, ELENA Project, https://www.elen.ucl.ac.be/neural-nets/Research/Projects/ELENA/elena.htm, 2000 (accessed 09-January-2018).
[134] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 473–480.
[135] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
[136] S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, R. Togneri, Cost-sensitive learning of deep feature representations from imbalanced data, IEEE Transactions on Neural Networks and Learning Systems PP (2017) 1–15.
[137] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Technical Report, University of Toronto, 2009.
[138] L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories, Computer Vision and Image Understanding 106 (2007) 59–70.
[139] L. Ballerini, R. B. Fisher, B. Aldridge, J. Rees, A color and texture based hierarchical k-NN approach to the classification of non-melanoma skin lesions, in: Color Medical Image Analysis, Springer, 2013, pp. 63–86.

[140] O. Beijbom, P. J. Edmunds, D. I. Kline, B. G. Mitchell, D. Kriegman, Automated annotation of coral reef survey images, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1170–1177.
[141] A. Quattoni, A. Torralba, Recognizing indoor scenes, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 413–420.
[142] G. M. Weiss, Mining with rarity: A unifying framework, ACM SIGKDD Explorations Newsletter 6 (2004) 7–19.
[143] G. M. Weiss, The impact of small disjuncts on classifier learning, in: Data Mining, volume 8, 2010, pp. 193–226.
[144] G. M. Weiss, H. Hirsh, A quantitative study of small disjuncts, in: AAAI/IAAI 2000, 2000, pp. 665–670.
[145] D. R. Carvalho, A. A. Freitas, A genetic algorithm-based solution for the problem of small disjuncts, in: European Conference on Principles of Data Mining and Knowledge Discovery, 2000, pp. 345–352.
[146] K. M. Ting, The problem of small disjuncts: Its remedy in decision trees, in: Proceedings of the Tenth Canadian Conference on Artificial Intelligence, 1994, pp. 91–97.
[147] J. R. Quinlan, Improved estimates for the accuracy of small disjuncts, Machine Learning 6 (1991) 93–98.
[148] A. Danyluk, F. Provost, Small disjuncts in action: Learning to diagnose errors in the local loop of the telephone network, in: Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 81–88.
[149] G. M. Weiss, Learning with rare cases and small disjuncts, in: ICML 1995, 1995, pp. 558–565.
[150] A. Van Den Bosch, A. Weijters, H. J. Van Den Herik, W. Daelemans, When small disjuncts abound, try lazy learning: A case study, in: Proceedings of the Seventh Belgian-Dutch Conference on Machine Learning, 1997, pp. 109–118.
[151] H. Déjean, Learning rules and their exceptions, Journal of Machine Learning Research 2 (2002) 669–693.
[152] V. García, J. S. Sánchez, H. O. Domínguez, L. Cleofas-Sánchez, Dissimilarity-based learning from imbalanced data with small disjuncts and noise, in: Iberian Conference on Pattern Recognition and Image Analysis, 2015, pp. 370–378.
[153] D. R. Carvalho, A. A. Freitas, A genetic-algorithm for discovering small-disjunct rules in data mining, Applied Soft Computing 2 (2002) 75–88.
[154] D. R. Carvalho, A. A. Freitas, Evaluating six candidate solutions for the small-disjunct problem and choosing the best solution via meta-learning, Artificial Intelligence Review 24 (2005) 61–98.
[155] K. M. Ali, M. J. Pazzani, Reducing the small disjuncts problem by learning probabilistic concept descriptions, in: T. Petsche (Ed.), Computational Learning Theory and Natural Learning Systems, volume 3, 1992.
[156] N. Japkowicz, Supervised learning with unsupervised output separation, in: International Conference on Artificial Intelligence and Soft Computing, volume 3, 2002, pp. 321–325.
[157] A. D. Shapiro, Structured Induction in Expert Systems, Addison-Wesley Longman Publishing Co., Inc., 1987.
[158] G. M. Weiss, F. Provost, Learning when training data are costly: The effect of class distribution on tree induction, Journal of Artificial Intelligence Research 19 (2003) 315–354.
[159] K. Napierala, J. Stefanowski, S. Wilk, Learning from imbalanced data in presence of noisy and borderline examples, in: Rough Sets and Current Trends in Computing, 2010, pp. 158–167.
[160] X. Zhang, Y. Li, R. Kotagiri, L. Wu, Z. Tari, M. Cheriet, KRNN: k rare-class nearest neighbour classification, Pattern Recognition 62 (2017) 33–44.
[161] X. Zhang, Y. Li, A positive-biased nearest neighbour algorithm for imbalanced classification, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2013, pp. 293–304.
[162] L. Wang, L. Khan, B. Thuraisingham, An effective evidence theory based k-nearest neighbor (KNN) classification, in: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, 2008, pp. 797–801.
[163] H. Dubey, V. Pudi, Class based weighted k-nearest neighbor over imbalance dataset, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2013, pp. 305–316.
[164] R. J. A. Little, D. B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, Inc., New York, 1987.
[165] J. L. Schafer, Analysis of Incomplete Multivariate Data, CRC Press, 1997.
[166] J. L. Schafer, J. W. Graham, Missing data: Our view of the state of the art, Psychological Methods 7 (2002) 147–177.
[167] W. Zhang, Y. Yang, Q. Wang, A comparative study of absent features and unobserved values in software effort data, International Journal of Software Engineering and Knowledge Engineering 22 (2012) 185–202.
[168] P. K. Sharpe, R. J. Solly, Dealing with missing values in neural network-based diagnostic systems, Neural Computing & Applications 3 (1995) 73–77.
[169] A. R. T. Donders, G. J. M. G. van der Heijden, T. Stijnen, K. G. M. Moons, Review: A gentle introduction to imputation of missing values, Journal of Clinical Epidemiology 59 (2006) 1087–1091.
[170] J. K. Dixon, Pattern recognition with partly missing data, IEEE Transactions on Systems, Man and Cybernetics 9 (1979) 617–621.
[171] G. Tutz, S. Ramzan, Improved methods for the imputation of missing data by nearest neighbor methods, Computational Statistics & Data Analysis 90 (2015) 84–99.

[172] J. W. Grzymala-Busse, M. Hu, A comparison of several approaches to missing attribute values in data mining, in: Rough Sets and Current Trends in Computing, 2001, pp. 378–385.
[173] D. B. Rubin, Multiple Imputation for Nonresponse in Surveys, John Wiley & Sons, 1987.
[174] W. R. Gilks, S. Richardson, D. J. Spiegelhalter, Introducing Markov chain Monte Carlo, Markov Chain Monte Carlo in Practice 1 (1996) 1–19.
[175] F. Chen, Missing no more: Using the MCMC procedure to model missing data, in: Proceedings of the SAS Global Forum 2013 Conference, SAS Institute Inc., 2013, pp. 1–23.
[176] N. J. Horton, S. R. Lipsitz, Multiple imputation in practice: Comparison of software packages for regression models with missing variables, The American Statistician 55 (2001) 244–254.
[177] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics 17 (2001) 520–525.
[178] T. H. Bø, B. Dysvik, I. Jonassen, LSimpute: Accurate estimation of missing values in microarray data with least squares methods, Nucleic Acids Research 32 (2004).
[179] M. S. B. Sehgal, I. Gondal, L. S. Dooley, Collateral missing value imputation: A new robust missing value estimation algorithm for microarray data, Bioinformatics 21 (2005) 2417–2423.
[180] M. S. B. Sehgal, I. Gondal, L. S. Dooley, k-ranked covariance based missing values estimation for microarray data classification, in: Hybrid Intelligent Systems, 2004. HIS'04. Fourth International Conference on, IEEE, 2004, pp. 274–279.
[181] M. Ouyang, W. J. Welsh, P. Georgopoulos, Gaussian mixture clustering and imputation of microarray data, Bioinformatics 20 (2004) 917–923.
[182] F. Meng, C. Cai, H. Yan, A bicluster-based Bayesian principal component analysis method for microarray missing value estimation, IEEE Journal of Biomedical and Health Informatics 18 (2014) 863–871.
[183] F. Fessant, S. Midenet, Self-organising map for data imputation and correction in surveys, Neural Computing & Applications 10 (2002) 300–310.
[184] Y. Bengio, F. Gingras, Recurrent neural networks for missing or asynchronous data, in: Advances in Neural Information Processing Systems (NIPS) 1996, 1996, pp. 395–401.
[185] S. Narayanan, R. J. Marks, J. L. Vian, J. J. Choi, M. A. El-Sharkawi, B. B. Thompson, Set constraint discovery: Missing sensor data restoration using autoassociative regression machines, in: Neural Networks, 2002. IJCNN '02. Proceedings of the 2002 International Joint Conference on, volume 3, 2002, pp. 2872–2877.
[186] P. J. García-Laencina, J. Serrano, A. R. Figueiras-Vidal, J.-L. Sancho-Gómez, Multi-task neural networks for dealing with missing inputs, in: J. Mira, J. R. Álvarez (Eds.), Bio-inspired Modeling of Cognitive Tasks: Second International Work-Conference on the Interplay Between Natural and Artificial Computation, IWINAC 2007, La Manga del Mar Menor, Spain, June 18-21, 2007, Proceedings, Part I, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 282–291.
[187] C. Barceló, The impact of alternative imputation methods on the measurement of income and wealth: Evidence from the Spanish survey of household finances, in: Working Paper Series, Banco de España, 2008.
[188] I. Myrtveit, E. Stensrud, U. H. Olsson, Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods, IEEE Transactions on Software Engineering 27 (2001) 999–1013.
[189] A. P. Dempster, D. B. Rubin, Part I: Introduction, in: W. G. Madow, I. Olkin, D. B. Rubin (Eds.), Incomplete Data in Sample Surveys, volume 2, New York: Academic Press, 1983, pp. 3–10.
[190] E. Acuña, C. Rodriguez, The treatment of missing values and its effect on classifier accuracy, in: D. Banks, F. R. McMorris, P. Arabie, W. Gaul (Eds.), Classification, Clustering, and Data Mining Applications, Studies in Classification, Data Analysis, and Knowledge Organisation, Springer Berlin Heidelberg, 2004, pp. 639–647.
[191] S. Ahmad, V. Tresp, Some solutions to the missing feature problem in vision, in: S. Hanson, J. Cowan, C. Giles (Eds.), Advances in Neural Information Processing Systems 5, Morgan-Kaufmann, 1993, pp. 393–400.
[192] Q. Wang, J. N. K. Rao, Empirical likelihood-based inference in linear models with missing data, Scandinavian Journal of Statistics 29 (2002) 563–576.
[193] Q. Wang, J. N. K. Rao, Empirical likelihood-based inference under imputation for missing response data, The Annals of Statistics 30 (2002) 896–924.
[194] Z. Ghahramani, M. I. Jordan, Supervised learning from incomplete data via an EM approach, in: Advances in Neural Information Processing Systems, 1994, pp. 120–127.
[195] V. Tresp, S. Ahmad, R. Neuneier, Training neural networks with deficient data, in: Advances in Neural Information Processing Systems, 1994, pp. 128–135.
[196] D. Williams, X. Liao, Y. Xue, L. Carin, B. Krishnapuram, On classification with incomplete data, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 427–436.
[197] M. Ramoni, P. Sebastiani, Robust learning with missing data, Machine Learning 45 (2001) 147–170.

[198] C. Bhattacharyya, P. K. Shivaswamy, A. J. Smola, A second order cone programming formulation for classifying missing data, in: Advances in Neural Information Processing Systems, 2005, pp. 153–160.
[199] A. J. Smola, S. Vishwanathan, T. Hofmann, Kernel methods for missing variables, in: AISTATS, 2005.
[200] K. Pelckmans, J. De Brabanter, J. A. Suykens, B. De Moor, Handling missing values in support vector machine classifiers, Neural Networks 18 (2005) 684–692.
[201] D. F. Heitjan, S. Basu, Distinguishing "missing at random" and "missing completely at random", The American Statistician 50 (1996) 207–213.
[202] B. M. Marlin, Missing Data Problems in Machine Learning, Ph.D. thesis, University of Toronto, 2008.
[203] S. Krause, R. Polikar, An ensemble of classifiers approach for the missing feature problem, in: Proceedings of the International Joint Conference on Neural Networks, 2003, volume 1, 2003, pp. 553–558.
[204] P. Juszczak, R. P. W. Duin, Combining one-class classifiers to classify missing data, in: Multiple Classifier Systems, Springer, 2004, pp. 92–101.
[205] L. Nanni, A. Lumini, S. Brahnam, A classifier ensemble approach for the missing feature problem, Artificial Intelligence in Medicine 55 (2012) 37–50.
[206] J. R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.
[207] P. Clark, T. Niblett, The CN2 induction algorithm, Machine Learning 3 (1989) 261–283.
[208] Ł. Struski, M. Śmieja, J. Tabor, Incomplete data representation for SVM classification, CoRR abs/1612.01480 (2016).
[209] E. Hazan, R. Livni, Y. Mansour, Classification with low rank and missing data, in: ICML, 2015, pp. 257–266.
[210] Z. Che, S. Purushotham, K. Cho, D. Sontag, Y. Liu, Recurrent neural networks for multivariate time series with missing values, CoRR abs/1606.01865 (2016).
[211] L. Gondara, K. Wang, Multiple imputation using deep denoising autoencoders, CoRR abs/1705.02737 (2017).
[212] S.-H. Zhong, Y. Liu, K. A. Hua, Field effect deep networks for image recognition with incomplete data, ACM Transactions on Multimedia Computing, Communications, and Applications 12 (2016) 52:1–52:22.
[213] Y. Duan, Y. Lv, W. Kang, Y. Zhao, A deep learning based approach for traffic data imputation, in: 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2014, pp. 912–917.
[214] C. Leke, T. Marwala, Missing data estimation in high-dimensional datasets: A swarm intelligence-deep neural network approach, in: Y. Tan, Y. Shi, B. Niu (Eds.), Advances in Swarm Intelligence: 7th International Conference, ICSI 2016, Bali, Indonesia, June 25-30, 2016, Proceedings, Part I, Springer International Publishing, Cham, 2016, pp. 259–270.
[215] T. Schioler, J. Nolan, P. McNair, Transferability of knowledge based systems, in: Medical Informatics Europe 1991, Springer, 1991, pp. 394–398.
[216] S. A. Dudani, The distance-weighted k-nearest-neighbor rule, IEEE Transactions on Systems, Man, and Cybernetics (1976) 325–327.
[217] F. Leisch, E. Dimitriadou, Machine learning benchmark problems, http://ftp.auckland.ac.nz/software/CRAN/doc/packages/mlbench.pdf, 2006 (accessed 09-January-2018).
[218] D. J. Hand, K. Yu, Idiot's Bayes - not so stupid after all?, International Statistical Review 69 (2001) 385–398.
[219] S. Zhaowei, Z. Lingfeng, M. Shangjun, F. Bin, Z. Taiping, Incomplete time series prediction using max-margin classification of data with absent features, Mathematical Problems in Engineering 2010 (2010).
[220] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, Journal of Machine Learning Research 12 (2011) 2211–2268.
[221] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explorations Newsletter 6 (2004) 40–49.
[222] L. E. Zarate, B. M. Nogueira, T. R. Santos, M. A. Song, Techniques for missing value recovering in imbalanced databases: Application in a marketing database with massive missing data, in: Systems, Man and Cybernetics, 2006. SMC'06. IEEE International Conference on, volume 3, 2006, pp. 2658–2664.
[223] D. Davis, M. Rahman, Missing value imputation using stratified supervised learning for cardiovascular data, Journal of Informatics and Data Mining (2016).
[224] N. Poolsawad, C. Kambhampati, J. Cleland, Balancing class for performance of classification with a clinical dataset, in: Proceedings of the World Congress on Engineering, 2014, volume 1, 2014.
[225] Y. Chen, S. Mani, Active learning for unbalanced data in the challenge with multiple models and biasing, in: Active Learning and Experimental Design Workshop, in conjunction with AISTATS 2010, 2011, pp. 113–126.
[226] J. Takum, C. Bunkhumpornpat, Parameter-free imputation for imbalance datasets, in: International Conference on Asian Digital Libraries, 2014, pp. 260–267.
[227] V. Vapnik, O. Chapelle, Bounds on error expectation for support vector machines, Neural Computation 12 (2000) 2013–2036.
[228] E. Bax, Validation of k-nearest neighbor classifiers, IEEE Transactions on Information Theory 58 (2012) 3225–3234.
