Journal of Emerging Trends in Engineering and Applied Sciences (JETEAS) 4(2): 311-316
© Scholarlink Research Institute Journals, 2013 (ISSN: 2141-7016)
jeteas.scholarlinkresearch.org

Effect of Missing Values on Data Classification

Tapas Ranjan Baitharu and Subhendu Kumar Pani
Department of CSE, Orissa Engineering College, Odisha, India
Corresponding Author: Tapas Ranjan Baitharu
___________________________________________________________________________
Abstract
Data classification is an important task in the KDD (knowledge discovery in databases) process and has several potential applications. The performance of classifiers is strongly dependent on the data set used for learning. In practice, a data set may contain noisy or redundant data items and a large number of features, many of which may not be relevant to the objective function at hand. Such noisy data may degrade the accuracy and performance of classification models. Dealing with missing values during data pre-processing is therefore an important step in building an effective and efficient classifier. It is a process by which missing values are replaced by suitable values according to an objective function, or by which noisy data are filtered out. It leads to better performance of the classification models in terms of their predictive or descriptive accuracy, shorter computing time needed to build the models as they learn faster, and better understanding of the models. In this paper, the effect of missing values on data classification is studied and a comparative analysis of data classification accuracy in different scenarios is presented. Several search techniques are considered for feature selection and are applied to pre-process the dataset. The predictive performances of popular classifiers are compared quantitatively. After analysing the experimental results, the paper establishes the general concept of improved classification accuracy using missing-value replacement. The purpose of this research is to maintain the highest classification accuracy in the presence of missing values.
__________________________________________________________________________________________
Keywords: data mining, feature selection, missing values, knowledge discovery in databases.

INTRODUCTION
Data and information have become major assets for most organizations. The success of any organisation depends largely on the extent to which the data acquired from business operations is utilised. In other words, the data serve as an input into a strategic decision-making process that can put the business ahead of its competitors. Also, in this era where businesses are driven by customers, having a customer database enables management in any organisation to determine customer behaviour and preferences in order to offer better services and to prevent losing customers, resulting in better business (Klosgen et al., 2002; Berry et al., 2004; Delmater et al., 2002). The data that serve as an input to the organizational decision-making process are generated and warehoused. They are collected via many sources, such as point-of-sale transactions, surveys, and internet logs (cookies). This has resulted in huge databases with valuable knowledge hidden in them that may be difficult to extract. Data mining, as stated by Larose et al. (2005), has been identified as the technology that offers the possibility of discovering this hidden knowledge from the accumulated databases. Techniques such as pattern recognition and classification are among the most important in data mining (Kantardzic et al., 2003).

The task of recognition and classification (Provost et al., 2001) is one of the most frequently encountered decision-making problems in daily activities. A classification problem occurs when an object needs to be assigned into a predefined group or class based on a number of observed attributes, or features, related to that object. Humans constantly receive information in the form of patterns of interrelated facts and have to make decisions based on them. When confronted with a pattern recognition problem, stored knowledge and past experience can be used to assist in making the correct decision (SAS Institute Inc., 2002). Indeed, many problems in various domains, such as the financial, industrial, technological, and medical sectors, can be cast as classification problems (Fuchs et al., 2004; Langdell; Mobasher et al., 2002). Examples include bankruptcy prediction, credit scoring, machine fault detection, medical diagnosis, quality control, handwritten character recognition, and speech recognition (Goldschmidt et al., 2006; Bace et al., 2000). Pattern recognition and classification have been studied extensively in the literature. In general, the problem of pattern recognition can be posed as a two-stage process:
- Feature selection, which involves selecting the significant features from an input pattern
- Classification, which involves devising a procedure for discriminating the measurements taken from the selected features and assigning the input pattern to one of the possible target classes according to some decision rule.
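To make the first stage concrete, the following is a small illustrative sketch, not taken from the paper, of feature selection with WEKA's attribute-selection API. The choice of a correlation-based subset evaluator (CfsSubsetEval) with best-first search is an assumption made only for illustration; the paper states merely that several search techniques were considered. The ARFF path is supplied as a command-line argument.

// Illustrative sketch: feature selection with WEKA's attribute-selection API.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectFeatures {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource(args[0]).getDataSet();   // path to an ARFF file
        data.setClassIndex(data.numAttributes() - 1);            // last attribute is the class

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());   // scores candidate feature subsets
        selector.setSearch(new BestFirst());          // searches the space of subsets
        selector.SelectAttributes(data);

        // Indices of the retained attributes (the class index is appended as the last element)
        for (int idx : selector.selectedAttributes()) {
            System.out.println(data.attribute(idx).name());
        }
    }
}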


Research efforts dedicated to data mining, focused on improving classification and prediction accuracy, have recently been undergoing a tremendous change (Smyth et al., 2001). The continuous development of ever more sophisticated classification models in commercial software packages has turned out to provide benefits only in specific problem domains where some prior background knowledge or new evidence can be exploited to further improve classification performance. In general, however, related research shows that no individual data mining technique deals well with all kinds of classification problems. Awareness of these imperfections of individual classifiers has called for the careful development and evaluation of data mining classification models. The rest of the paper is organized as follows: Section 2 presents the methodological approach proposed in the paper. Section 3 briefly describes the dataset used and the pre-processing undertaken. Section 4 presents the design of the experiment. The results are discussed in Section 5. Finally, the paper concludes.

The Proposed Approach
Sometimes attributes are incomplete or missing. A common way of representing missing data is to insert values that cannot otherwise occur in the data, e.g. representing missing entries as "-1" or "?". If an attribute value is empty, one may assume that the case is less useful than the rest of the cases in the data set. This is not true, as each of the other attributes still contributes useful information. When there are missing values, instead of leaving them as missing, a number of methods can be used to fill them in. Having efficient methods for filling in missing values extends the applicability, in terms of accuracy, of many data mining methods. The accuracy of the tool is increased, and with a larger training set better rules and decision trees can be developed, which contributes towards better classification of the data (Wang et al., 2005). The most common method of filling the attributes quickly and without too much computation is to replace all the missing values with the arithmetic mean or the mode of that attribute. Another method is to run a clustering algorithm and replace the missing values with those of cases that lie close in an n-dimensional space. A set of commonly used classifiers will be selected for comparative study based on qualitative considerations. A dataset will be taken from a well-known public repository for machine learning. A number of data scenarios will be designed considering different methods of dealing with missing values. Quantitative data will be generated for the selected classifiers on their predictive performance in the different scenarios, keeping in mind appropriate evaluation metrics. A popular open-source data mining tool will be used for the experimental study. In our approach, we have used the ReplaceMissingValues unsupervised filter in WEKA to replace all missing values using means and modes.

Description of Dataset
We performed computer simulations on the breast-cancer dataset available in the UCI Machine Learning Repository (UCI Machine Learning Repository). It contains 286 samples with 9 input features and 1 output feature. The features describe different factors related to breast-cancer recurrence. The output feature is the decision class, which takes the values no-recurrence-events and recurrence-events. The dataset contains 201 instances labelled no-recurrence-events and 85 instances labelled recurrence-events. There are eight instances having missing values. A snapshot of the dataset is shown in Figure-1.
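The figures quoted above can be checked programmatically. Below is a minimal sketch, not taken from the paper, that loads the data with WEKA's Java API and reports the per-attribute missing-value counts and the class distribution. The file name breast-cancer.arff is an assumption and should point to a local copy of the UCI dataset.

// Minimal sketch: inspect missing values and the class distribution with WEKA.
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectMissing {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer.arff").getDataSet();  // assumed path
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        System.out.println("Instances: " + data.numInstances());
        for (int i = 0; i < data.numAttributes(); i++) {
            AttributeStats stats = data.attributeStats(i);
            System.out.printf("%-25s missing: %d%n",
                    data.attribute(i).name(), stats.missingCount);
        }
        // Class distribution (no-recurrence-events vs recurrence-events)
        int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
        for (int c = 0; c < counts.length; c++) {
            System.out.println(data.classAttribute().value(c) + ": " + counts[c]);
        }
    }
}

The same statistics are also displayed in WEKA Explorer's Preprocess panel when the file is opened there.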

Figure-1: A snapshot of the dataset

EXPERIMENT DESIGN
We use WEKA [6], an open-source software tool, for our experiment. It contains a large number of algorithms for data mining applications.

Classification Algorithms
There are a large number of data mining algorithms available for different tasks. We select five candidate algorithms for the classification task based on their popularity, all available in WEKA (see Table-1); an illustrative sketch of instantiating them follows the table.

Table-1: WEKA names of selected classifiers

Generic Name                 WEKA Name
Bayesian Network             Naïve Bayes (NB)
Neural Network (NN)          Multilayer Perceptron
Support Vector Machine       SMO
C4.5 Decision Tree           J48
K-Nearest Neighbour          IBk
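As an illustration of Table-1, the sketch below, which is not the paper's own code, instantiates the five classifiers through WEKA's Java API and scores each with 10-fold cross-validation on whatever ARFF file is passed on the command line. Default parameters and the cross-validation protocol are assumptions, since the excerpt does not report the exact settings; the printed metrics are the weighted Precision, Recall and F-Measure used later as evaluation measures.

// Sketch: run the five Table-1 classifiers and report weighted P/R/F.
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource(args[0]).getDataSet();   // path to an ARFF file
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = {
                new NaiveBayes(),            // "Bayesian Network" entry in Table-1
                new MultilayerPerceptron(),  // neural network
                new SMO(),                   // support vector machine
                new J48(),                   // C4.5 decision tree
                new IBk(1)                   // nearest neighbour with k = 1
        };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));   // 10-fold CV, fixed seed
            System.out.printf("%-22s P=%.3f  R=%.3f  F=%.3f%n",
                    model.getClass().getSimpleName(),
                    eval.weightedPrecision(),
                    eval.weightedRecall(),
                    eval.weightedFMeasure());
        }
    }
}

Compiled against weka.jar, it can be run as, for example, java -cp weka.jar:. CompareClassifiers breast-cancer.arff.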

We formulate different data scenarios to analyse the performance of the classifiers in the presence of missing values:

Scenario-1: {The original dataset with missing values}
Scenario-2: {The dataset after removing instances having missing values}
Scenario-3: {The dataset after removing features having missing values}
Scenario-4: {The dataset after filling in the missing values}

We then apply the selected classifiers to these scenarios and analyse their performance, using Recall, Precision and F-Measure to evaluate the classifiers quantitatively.

EXPERIMENTAL RESULT
We have used the ReplaceMissingValues unsupervised filter in WEKA to replace all missing values when generating Scenario-4. We run the selected classifiers on the different scenarios of the dataset and record their performance in Table-2, Table-3, Table-4 and Table-5. A sample snapshot of a classifier in WEKA is shown in Figure-2. A sketch of how the scenario datasets can be derived programmatically follows Table-2.

Figure-2: A sample snapshot of a classifier in WEKA

Table-2: Classifier Performance in Scenario-1
(Rows: NB, NN, SMO, IBk, J48; columns: Confusion Matrix (a, b) and Avg. Precision)
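As an illustration of how the four scenarios can be derived, the sketch below, again not the paper's own code, builds each variant of the breast-cancer data with WEKA's Java API: Scenario-2 deletes instances with missing values, Scenario-3 deletes attributes that contain missing values, and Scenario-4 applies the ReplaceMissingValues filter (attribute means for numeric and modes for nominal attributes), matching the description above. The file path, and the reading of "features having missing values" as any attribute with at least one missing entry, are assumptions.

// Sketch: derive the four scenario datasets from the original breast-cancer data.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class BuildScenarios {
    public static void main(String[] args) throws Exception {
        Instances original = new DataSource("breast-cancer.arff").getDataSet(); // assumed path
        original.setClassIndex(original.numAttributes() - 1);

        // Scenario-1: the original dataset, missing values left untouched.
        Instances s1 = new Instances(original);

        // Scenario-2: remove every instance that has at least one missing value.
        Instances s2 = new Instances(original);
        for (int a = 0; a < s2.numAttributes(); a++) {
            s2.deleteWithMissing(a);
        }

        // Scenario-3: remove every attribute (except the class) that has missing values.
        Instances s3 = new Instances(original);
        for (int a = s3.numAttributes() - 1; a >= 0; a--) {
            if (a != s3.classIndex() && s3.attributeStats(a).missingCount > 0) {
                s3.deleteAttributeAt(a);
            }
        }

        // Scenario-4: impute missing values with attribute means/modes.
        ReplaceMissingValues fill = new ReplaceMissingValues();
        fill.setInputFormat(original);
        Instances s4 = Filter.useFilter(original, fill);

        System.out.println("Scenario-1: " + s1.numInstances() + " x " + s1.numAttributes());
        System.out.println("Scenario-2: " + s2.numInstances() + " x " + s2.numAttributes());
        System.out.println("Scenario-3: " + s3.numInstances() + " x " + s3.numAttributes());
        System.out.println("Scenario-4: " + s4.numInstances() + " x " + s4.numAttributes());
    }
}

Each scenario dataset can then be passed to an evaluation routine such as the one sketched after Table-1.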
