Modelling a Stable Classifier for Handling Large Scale Data with Noise and Imbalance

Akila Somasundaram
Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India 620015
[email protected]

U. Srinivasulu Reddy
Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India 620015
[email protected]
Abstract— Classifier performance is often impaired by anomalies such as noisy and borderline samples, and by the inherent imbalance in data. This is because classifier models are usually constructed under ideal data conditions, which rarely hold in practice. In reality, these anomalies occur at varying intensities and, in most cases, are an integral part of the problem domain. Classifier models must therefore be fine-tuned to accommodate such anomalies, resulting in data dependent models. This work analyses the effectiveness of various classifier models in handling noisy, borderline and imbalanced data. This requires that the right set of metrics first be identified, as most of the usual metrics are insensitive to such anomalies, even though the anomalies affect the reliability, robustness and practical efficacy of the classifiers. To ensure the scalability of the resulting models, the classifiers were implemented using Spark. A characterized examination of the results elucidates the effective prediction zones of each model, facilitating the identification of stable classifier models. It is found that a single model is inadequate in real time scenarios, owing to the complex interplay among the various anomalies. The work concludes by modelling a heterogeneous cost based ensemble model for domain based prediction.

Keywords—Data Imbalance; Noise; Borderline Data; Classification; Performance Metrics; Ensemble Models; Boosting; Stacking; Big Data
I. INTRODUCTION

Machine Learning is the process of examining, cleaning, transforming and applying models to data to uncover hidden patterns, correlations and other useful insights of business value. The performance of machine learning models, especially supervised ones, is impeded by intrinsic properties of the data arising from its varied distribution. Intrinsic properties such as imbalance and noise influence the operational nature of the algorithmic models, thereby affecting their performance levels [1]. The literature describes several modified forms of algorithms conceived to counter these issues. However, the effects of such properties are multidimensional: imbalance and noise create undesirable effects that increasingly extend to several metrics, with varying levels of interplay between them. Borderline samples are considered variants of noisy data and suffer from similar side effects. This leads to data driven modelling, where models are fine-tuned specifically to suit the needs of the problem at hand. Such models are suitable for
processing machine generated data or data without much variance; however, they fail on human generated data due to varying behavioral patterns. Data centric models are limited in their practical efficacy and are not suitable for real time systems. Domain centric models tend to be more robust and reliable than data centric models and are hence desirable to a greater extent. This work concentrates on creating a scalable and stable domain centric model that handles noise and imbalance effectively.

This paper discusses three of the major issues in real-time data, namely data imbalance, noise and the presence of borderline samples. Although these issues are handled independently in the literature, contributions on handling them in large scale composite data (with multiple issues) are sparse. Since such properties are typical of real-time scenarios, they cannot be avoided. Data elimination happens to be the “go to” solution for handling these issues. In certain domains, however, eliminating data leads to the loss of precious training entities, which is not desirable. In the literature, data preservation irrespective of such issues is treated as a special case, whereas in real-time settings it is a common condition that is often overlooked during research. This paper aims to draw attention to this context by explicating the significance of such conditions through a domain based analysis. Further, models and metrics are analyzed, and their behavior towards large scale data containing such anomalies is highlighted. The research directions enunciate strategies, in the form of ensemble classifier models, that can best be used in parallel distributed environments to tackle such issues effectively without the need for data elimination.

II. MODELS AND DATA CHARACTERISTICS: AN OVERVIEW

The classifier models used in this work are divided into six broad categories based on their prediction process: probability based (Naïve Bayes), function based (Multinomial Logistic Regression), tree based (Decision Tree), neural network based (Multi-Layer Perceptron/ANN), bagging (Random Forest) and boosting algorithms (Gradient Boosted Trees). These models are implemented using PySpark and executed in a distributed environment to ensure scalability. Datasets containing varied levels of imbalance, noise and borderline enhanced samples are used for model analysis. The proposed work concentrates on binary classification; hence, the datasets are intentionally selected with two classes.
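As a minimal, hypothetical illustration of how the imbalance level of such a binary dataset can be quantified, and why plain accuracy is an unreliable metric under imbalance, consider the following sketch (pure Python for clarity, not the paper's Spark pipeline; the label list is invented for illustration):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-to-minority class ratio for a binary label set."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical binary dataset: 90 majority-class and 10 minority-class samples
labels = [0] * 90 + [1] * 10
print(imbalance_ratio(labels))  # 9.0

# A trivial classifier that always predicts the majority class already
# reaches 90% accuracy here while detecting no minority samples at all,
# which is why accuracy alone masks the effect of imbalance.
accuracy = sum(1 for y in labels if y == 0) / len(labels)
print(accuracy)  # 0.9
```

This is why the work emphasizes selecting metrics that remain sensitive to such anomalies rather than relying on overall accuracy.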
Data imbalance refers to the level of dominance one class (the majority class) has over the other classes (the minority classes) in a dataset. This difference in ratio between the available classes leads to class imbalance. Individual entries from one class occurring in the safe zones of other classes are labelled as noise. Such entries cannot be eliminated from data in which human behavior is a major component. According to a study on data imbalance by Somasundaram et al. [2], performance degradations are not observed in data with low imbalance levels (