THE IMPACT OF CLASSIFICATION EVALUATION METHODS ON ROUGH SETS BASED CLASSIFIERS

QASEM A. AL-RADAIDEH
Department of Computer Information Systems, Faculty of Information Technology and Computer Sciences, Yarmouk University, Irbid 21163, Jordan, [email protected]

ABSTRACT

It has been our experience that, in order to obtain a fair comparison between supervised learning approaches, it is necessary to perform the comparison using a unified classification approach. This paper presents an experimental study of classification algorithm evaluation approaches, in which two approaches are evaluated: the Holdout method and the Cross Validation method. Rough Set Theory based classification is used as the classification technique. The impact of the evaluation approach on the classification results is discussed and, at the end, some guidelines for comparing classification algorithms are recommended.

Keywords: Data Mining, Rough Set Theory, Classification Evaluation Methods, Classifier.

1. INTRODUCTION

Among researchers, it is common to compare a proposed algorithm with other known algorithms. Two main approaches are usually used to conduct the comparison: the Holdout approach and the K-fold Cross Validation approach. A close survey of the classification methodologies used in the literature shows that there is no single methodology that can serve as a recipe for all comparisons of computational classification methods. This has also been noticed and argued by [1] [2] [3]. This paper presents an experimental study of classification algorithm evaluation approaches, mainly for Rough Set Theory [4] based classification. The impact of the evaluation approach on the classification results is discussed, and some guidelines for comparing classification algorithms are recommended.

2. CLASSIFICATION IN DATA MINING

There are several tasks in data mining, and the most common in the literature is classification, which is a form of data analysis that can be used to extract models describing important data classes. The classification task concentrates on predicting the value of the decision class for an object, among a predefined set of class values, given the values of some attributes of the object. In the literature, many classification approaches have been proposed and implemented by researchers, such as decision tree based classification, statistical classification, neural network based classification, genetic algorithm classifiers, and rough set based classification [5].

In general, data classification is a two-step process. In the first step, which is called the learning step, a model that describes a predetermined set of classes or concepts is built by analyzing a set of training database objects, where each object is assumed to belong to a predefined class. In the second step, the model is tested using a different data set, and the classification accuracy is estimated using one of several proposed techniques. If the accuracy of the model is considered acceptable, the model can be used to classify future data objects for which the class label is not known; the model then acts as a classifier in the decision making process.

Before classification starts, some preprocessing steps may be applied to the data in order to improve the accuracy and efficiency of the classification model (a brief illustrative sketch of these steps is given after this list):

(1) Data cleaning: This step includes removing or reducing noisy data and treating missing data values. In the literature, several approaches have been proposed for the purpose of data cleaning [7].

(2) Feature Selection: In many practical situations there are far too many attributes for the learning step to handle, and some of them are irrelevant or redundant. In this step, irrelevant and redundant attributes are removed from the data set, and a reduced version of the data set, which contains only the relevant attributes, is used to build the classifier. For this purpose, several feature selection approaches have been proposed and implemented [6].

(3) Data Discretization: Some classification algorithms can only deal with nominal or discrete attributes and cannot handle attributes measured on a numerical scale. In this step, data can be generalized or transformed to higher-level concepts. More on discretization techniques can be found in [11].
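As a rough illustration only (not the procedures actually used in this paper, whose experiments rely on ROSETTA), the following Python sketch shows what these three preprocessing steps might look like on a tabular dataset; the file name, the attribute names, and the use of pandas are all assumptions made for the example.

```python
# Illustrative preprocessing sketch; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical input file

# (1) Data cleaning: fill missing values of each column with that column's mode.
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# (2) Feature selection: drop an attribute judged irrelevant beforehand.
df = df.drop(columns=["irrelevant_attr"])  # hypothetical attribute name

# (3) Discretization: bin a numeric attribute into three nominal intervals.
df["age"] = pd.cut(df["age"], bins=3, labels=["low", "mid", "high"])
```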

3. ROUGH SET PRELIMINARIES

Rough set theory [4] was developed in Poland in the early 1980s as a mathematical tool for knowledge discovery and data analysis, and concerns itself with the classificatory analysis of imprecise, uncertain, or incomplete information expressed in terms of data acquired from experience. The notion of classification is central to the approach: the ability to distinguish between objects and, consequently, to reason about partitions of the universe. Rough set theory has been adopted in many studies for several data mining tasks, including classification [5] [13].

In rough set theory, objects are perceived through the information that is available about them, that is, through their values for a predetermined set of attributes. In the case of inexact information, one has to be able to give and reason about rough classifications of objects. The structure of the data is represented in the form of a Decision System/Table (DS). The decision system is a pair of the form DS = (U, A∪{d}), where U is a nonempty finite set of objects called the Universe and A is a nonempty finite set of attributes. Every attribute a∈A is a total function a:U→Va, where Va is the set of allowable values for the attribute a (i.e., its value range). The attributes belonging to A are called conditional attributes, while d is called the decision attribute.

For each possible subset of attributes B ⊆ A, a decision table gives rise to an equivalence relation called an Indiscernibility Relation IND(B), where two objects (x, y) are members of the same equivalence class if and only if they cannot be discerned from each other on the basis of the set of attributes B. The formal definition of IND(B) can be expressed as: IND(B) = {(x, y) ∈ U×U : a(x) = a(y) ∀ a ∈ B}. The discernibility matrix (M) of a decision system is a symmetric |U| × |U| matrix with entries cij defined as cij = {a ∈ A | a(xi) ≠ a(xj)} if d(xi) ≠ d(xj), and cij = Φ otherwise.

A Reduct (R) of A is a minimal selection of attributes that can be used to represent all classes of the decision system. A reduct has two main properties: (1) C(R) = C(A), i.e., R produces the same classification (C) of objects as the collection A of all attributes; (2) for any attribute a ∈ R, C(R−{a}) ≠ C(R), i.e., a reduct is a minimal subset with respect to property (1).
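To make these definitions concrete, the following small Python sketch computes the IND(B) equivalence classes and the discernibility matrix for a toy decision table; the table, the attribute names, and the list-of-dictionaries representation are illustrative assumptions, not part of the paper.

```python
from itertools import combinations

# Toy decision table: each object maps attribute name -> value; "d" is the decision.
objects = [
    {"a1": 1, "a2": 0, "d": "yes"},
    {"a1": 1, "a2": 1, "d": "no"},
    {"a1": 0, "a2": 1, "d": "no"},
    {"a1": 1, "a2": 0, "d": "yes"},
]
A = ["a1", "a2"]

def ind_classes(B):
    """Partition the universe into IND(B) equivalence classes."""
    classes = {}
    for i, x in enumerate(objects):
        key = tuple(x[a] for a in B)          # objects agreeing on all attributes in B
        classes.setdefault(key, []).append(i)
    return list(classes.values())

def discernibility_matrix():
    """Entry (i, j): attributes discerning x_i and x_j when their decisions differ."""
    n = len(objects)
    M = [[set() for _ in range(n)] for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if objects[i]["d"] != objects[j]["d"]:
            M[i][j] = {a for a in A if objects[i][a] != objects[j][a]}
            M[j][i] = M[i][j]                 # the matrix is symmetric
    return M

print(ind_classes(["a1"]))   # -> [[0, 1, 3], [2]]
print(discernibility_matrix())
```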

4. APPROACHES OF EVALUATING CLASSIFICATION ALGORITHMS

Among researchers, it is common to compare a proposed algorithm with other known algorithms. Two approaches are usually used to conduct the comparison: the Holdout method and the Cross Validation method.

4.1 HOLDOUT APPROACH

In this approach, the database is randomly split into two disjoint datasets. The first set, from which the data mining system tries to extract knowledge, is called the Training Set. By examining the data in this set, the system tries to create general rules and descriptions of the patterns and relations in the data. The goal is to gain knowledge that is valid not only for the specific dataset considered but also for other similar datasets. The extracted knowledge is then tested against the second set, which is called the Test Set. If the knowledge gained from the training set is general knowledge, it will be correct for most of the test set as well. In machine learning, it is common to randomly split the dataset under the mining task into two parts to create the training and test sets: typically, 70% of the instances or objects of the original database form the training set, and the remaining objects are used as a test set to check whether the knowledge extracted from the training set is of a general nature or not. The process is depicted in Figure 1. To evaluate the algorithm at hand, some measure must be used; for classification algorithms it is common to use the classification accuracy as an evaluation measure.
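A minimal sketch of the holdout procedure follows, assuming scikit-learn and a standard benchmark dataset; a decision tree stands in for the rough set classifier used later in the paper, since the paper's own experiments were carried out in ROSETTA rather than in Python.

```python
# Holdout evaluation sketch: 70% training, 30% testing (assumptions noted above).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Randomly split the objects: 70% for training, the remaining 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier().fit(X_train, y_train)  # stand-in classifier
print("holdout accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```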

Figure 1: Classifier Evaluation using Training and Testing Sets (figure omitted; it shows a Random Splitter dividing the Dataset into a Training DS and a Test DS, the data mining task extracting patterns from the training set, and a pattern evaluation step applying them to the test set)

4.2 CROSS VALIDATION

Cross-Validation (CV) [2] [3] [6] refers to a widely used experimental testing procedure. CV is a way of getting more reliable estimates and more mileage out of possibly scarce data. In the k-fold CV approach, the database is randomly divided into k disjoint blocks of objects, usually of equal size. The data mining algorithm is then trained using k−1 blocks, and the remaining block is used to test the performance of the algorithm. This process is repeated for each of the k blocks, and a measure is recorded for each iteration; the measure depends on the data mining task, and for the classification task it is common to use the classification accuracy. At the end, the recorded measures are averaged. By this process, each object is guaranteed to be in the test set once and in the training set k−1 times. It is common to choose k = 10 or any other value depending on the size of the original dataset.
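The k-fold procedure described above can be sketched as follows, again assuming scikit-learn with a decision tree standing in for the rough set classifier; the script also records the minimum, maximum, and average fold accuracy, which is how results are reported in Tables 2-4 of this paper.

```python
# k-fold cross-validation sketch (assumptions noted above).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
k = 10
scores = []

for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=1).split(X):
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])   # train on k-1 blocks
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))  # test on the held-out block

print("min/max/avg accuracy:", min(scores), max(scores), np.mean(scores))
```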

An extreme variant of selecting k is to choose k = |U|, i.e., letting each test set consist of a single example. This is called leave-one-out CV and, although potentially extremely computer intensive, it may be intuitively pleasing as it most closely mimics the true size of the training set. The process of mining a dataset using the 4-fold cross validation approach is depicted in Figure 2; the objects chosen for training are not necessarily adjacent.

Figure 2: Classifier Evaluation using 4-Fold Cross Validation (figure omitted; it shows the dataset with attributes a1 ... an divided into four blocks, four mining runs (Fold 1 to Fold 4) each training on three blocks and testing on the remaining one, and the results of the 4 folds being averaged)
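For completeness, a leave-one-out sketch under the same assumptions (scikit-learn, decision tree stand-in): each object is held out in turn and the per-object hits are averaged.

```python
# Leave-one-out CV sketch (k = |U|): each test set holds a single object.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
hits = []
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    hits.append(clf.predict(X[test_idx])[0] == y[test_idx][0])  # correct on the single held-out object?

print("leave-one-out accuracy:", np.mean(hits))
```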


5. EXPERIMENTS AND DISCUSSIONS

To compare the two approaches, some experiments have been conducted using a rough set based classification approach. The ROSETTA [7] rough set based knowledge discovery tool has been used for experimentation. In the experiments, two reduct computation approaches, the Johnson Reducer and the Genetic Algorithm Reducer, have been investigated; using both algorithms, object related reducts have been generated. For comparison purposes, 10 standard datasets have been used. The datasets are obtained from the machine learning data repository at the University of California at Irvine [8]. A brief description of the datasets is presented in Table 1.

Table 1: Datasets features and preprocessing.

No  Dataset     C. Attr  Objs  Class  Disc.  Fill
1   Australian  14       690   2      Yes    No
2   Cleveland   13       313   2      Yes    Yes
3   Heart Dis   13       270   2      Yes    No
4   Hepatitis   19       155   2      Yes    Yes
5   Iris        4        150   3      No     No
6   Lung Can    56       32    2      No     Yes
7   Lymph       18       148   4      No     No
8   Soyabean    35       307   19     No     Yes
9   Vehicle     18       856   4      Yes    No
10  Zoo         16       101   7      No     No

For the Training and Test approach experiment, the datasets are split randomly into two subsets: the training set contains 70% of the objects and the test set contains the other 30%. Before the training phase, some datasets are discretized using the Boolean Reasoning Scaling method [9], and the missing values in some of the datasets are filled using the fill-mode method implemented in ROSETTA. The values (Yes and No) of the Disc. and Fill columns of Table 1 indicate whether the discretization and filling processes have been applied to a dataset or not.

To evaluate the classification algorithm, we adopt the classification accuracy based on rough set classification as an evaluation measure. The classification accuracy measure used in these experiments is computed using the confusion matrix. A confusion matrix contains information about the actual and predicted classifications made by a classification algorithm, and the accuracy (AC) is the proportion of the total number of predictions that were correct. The results of the experiments are summarized in Table 2 and Table 3: for the Training and Testing approach (TT), the total classification accuracy has been recorded, and for the CV approach, the minimum (Min), maximum (Max), and average (Avg) classification accuracy over the k folds have been recorded.
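As a small worked example (the counts below are made up for illustration, not results from the paper), the accuracy can be read off a confusion matrix as the sum of its diagonal divided by the sum of all its entries:

```python
# Accuracy from a confusion matrix: correct predictions / all predictions.
import numpy as np

confusion = np.array([[50,  4],
                      [ 6, 40]])   # rows: actual class, columns: predicted class (illustrative values)

accuracy = np.trace(confusion) / confusion.sum()
print(f"AC = {accuracy:.4f}")      # 90 / 100 = 0.9000
```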

The same experiment has been repeated over the same datasets using 5-fold CV. The results of this experiment are summarized in Table 4. It can be noticed that the average classification accuracy for most datasets has been preserved. In addition, the accuracy has been enhanced for the datasets with a small number of objects. This is because using a large number of folds, such as 10 or more, can make each block too small to represent the entire dataset; in this case the generated reducts, and hence the classification rules, will not be able to classify the test set well.

Table 2: Using Johnson Reducer and 10-Fold CV

No  DataSet        Holdout  Min    Max     Avg
1   Australian     78.26    66.67  84.06   77.97
2   Cleveland      46.15    33.33  66.67   47.55
3   Heart Disease  80.25    62.97  92.59   78.89
4   Hepatitis      70.21    43.75  87.50   66.67
5   Iris           75.56    60.00  86.67   74.00
6   Lung Cancer    70.00    33.33  100.00  61.33
7   Lymphography   81.82    60.00  100.00  80.56
8   Soyabean       80.44    64.52  89.29   77.96
9   Vehicle        62.60    56.47  66.67   62.55
10  Zoo            100.00   80.00  100.00  95.00

From the results presented in Table 2, it is noticed that the classification accuracy was better for most of the experimented datasets. The Johnson algorithm generally produces small reduct sets and hence a smaller number of classification rules. In contrast, the Genetic Algorithm approach produces larger rule sets that are more difficult to read and manage. From the experiments, it could be noticed that the classifier performance generally remained the same.

Table 3: Using Genetic Algorithm based Reducer and 10-Fold CV

No  DataSet        Holdout  Min    Max     Avg
1   Australian     85.99    81.16  92.75   86.52
2   Cleveland      78.02    63.33  93.33   81.30
3   Heart Disease  80.00    70.37  96.30   80.74
4   Hepatitis      85.11    68.75  93.75   80.06
5   Iris           88.89    80.00  100.00  89.33
6   Lung Cancer    70.00    60.00  100.00  72.67
7   Lymphography   81.82    60.00  93.33   80.56
8   Soyabean       85.87    80.65  92.86   86.71
9   Vehicle        65.35    57.65  69.14   64.56
10  Zoo            96.67    80.00  100.00  92.00

6. CONCLUSION AND RECOMMENDATIONS

In this paper we analyze two classification evaluation techniques in the context of the rough set methodology. The experimental study presented in this paper showed that using different comparison approaches yields different classification accuracies for the same classification approach. Moreover, it could be observed that no single reduct generation method can be guaranteed to be the best method over all datasets.

Table 4: Using Johnson Reducer and 5-Fold CV

No  DataSet        Min    Max     Avg
1   Australian     71.74  83.33   77.68
2   Cleveland      34.43  65.57   43.86
3   Heart Disease  75.93  90.74   82.59
4   Hepatitis      45.16  70.97   63.87
5   Iris           70.00  83.33   76.00
6   Lung Cancer    62.50  83.33   72.50
7   Lymphography   60.00  90.00   79.14
8   Soyabean       72.13  85.71   77.47
9   Vehicle        50.89  62.94   59.57
10  Zoo            90.00  100.00  95.00

As a recommendation out of this research, the same comparison approach must be used for all classifiers being compared. It is not fair to use different approaches to compare the results and then claim that a given classification approach performs better or worse than another. The experimental comparative study must be consistent and use the same classification framework. We also recommend, for comparison purposes, using the cross validation approach by dividing the dataset into k subsets (folds). Most typical experiments use k = 10; other values of k could also be used depending on the dataset size. As mentioned earlier, for small datasets we recommend choosing a smaller k, which leaves a suitable number of objects in each fold. Another recommendation is to repeat the same experiment several times, say 10 or 100 times, and then average the results of all experiments.
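A sketch of this repetition recommendation is given below: the same k-fold CV experiment is run several times with different random shuffles and the resulting accuracies are averaged. As before, scikit-learn and the decision tree stand-in are assumptions for illustration only.

```python
# Repeating a k-fold CV experiment several times and averaging the results.
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
repeats, k = 10, 5
run_means = []

for r in range(repeats):
    cv = KFold(n_splits=k, shuffle=True, random_state=r)   # new random shuffle each repeat
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)  # accuracy per fold
    run_means.append(scores.mean())

print("accuracy averaged over all repeats:", np.mean(run_means))
```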

REFERENCES

[1] Al-Radaideh Q., Sulaiman M. N., Selamat M. H., and Ibrahim H., "Comparison of Reduct Generation Approaches in the Context of Rough Set Based Classification", Proc. of the International Conference on Information Technology and Natural Sciences (ICITNS 2003), Amman, Jordan, 2003.

[2] Salzberg S. L., "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach", Journal of Data Mining and Knowledge Discovery, Vol. 1, pp. 313-327, Kluwer Academic Publishers, Boston, 1998.

[3] Kohavi R., "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection", Proc. of the 15th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1137-1143, 1995.

[4] Pawlak Z., "Rough Sets: Theoretical Aspects of Reasoning about Data", Kluwer, Dordrecht, 1991.

[5] Bazan J., Nguyen H. S., Nguyen S. H., Synak P., and Wróblewski J., "Rough Set Algorithms in Classification Problem", in L. Polkowski et al. (Eds.), Rough Set Methods and Applications, Physica-Verlag, Heidelberg, pp. 49-88, 2000.

[6] Liu H. and Yu L., "Feature Selection for Data Mining", 2002. Collected from /citeseers.nj.nec.com/ on 10 Jan 2003.

[7] Han J. and Kamber M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001.

[8] Stone M., "Cross-Validatory Choice and Assessment of Statistical Predictions", Journal of the Royal Statistical Society, Vol. 36, pp. 111-147, 1974.

[9] Ohrn A., Komorowski J., Skowron A., and Synak P., "The Design and Implementation of a Knowledge Discovery Toolkit Based on Rough Sets: The ROSETTA System", in L. Polkowski and A. Skowron (Eds.), Rough Sets in Knowledge Discovery 1: Methodology and Applications, Physica-Verlag, pp. 376-399, 1998.

[10] Choubey S. K., Deogun J. S., Raghavan V. V., and Sever H., "A Comparison of Feature Selection Algorithms in the Context of Rough Classifiers", in Proceedings of the 5th IEEE International Conference on Fuzzy Systems, 1996.

[11] Nguyen S. H. and Nguyen H. S., "Discretization Methods in Data Mining", in L. Polkowski and A. Skowron (Eds.), Rough Sets in Knowledge Discovery, Physica-Verlag, Heidelberg, pp. 451-482, 1998.

[12] Blake C., Keogh E., and Merz C. J., "UCI Repository of Machine Learning Databases", University of California, Irvine, Department of Information and Computer Sciences, 1998.

[13] Tripathy H. K., Tripathy B. K., and Das P. K., "An Intelligent Approach of Rough Set in Knowledge Discovery Databases", Proceedings of World Academy of Science, Engineering and Technology, Vol. 26, December 2007.
