Recent Researches in Computers and Computing

A Genetic Algorithm for Classification

RAUL ROBU, ŞTEFAN HOLBAN
Department of Automation and Applied Informatics, Department of Computers
“Politehnica” University of Timisoara
Vasile Parvan Blvd. 2, 300223, Timisoara
ROMANIA
[email protected], [email protected]

Abstract: - The paper presents aspects regarding genetic algorithms, their use in data mining and, in particular, their use in the discovery of classification rules. The fitness functions of the genetic algorithms used for mining classification rules are presented synthetically. A genetic algorithm with a new fitness function for mining classification rules is proposed. The proposed algorithm was tested on the classic datasets Car, Zoo and Mushroom. The same datasets were also classified with the classic algorithms NaiveBayes and J48. The results obtained by applying the three algorithms are presented.

Key-Words: - genetic algorithm, data mining, classification

1 Introduction

Genetic algorithms are adaptive techniques that can be successfully used to solve complex search and optimization problems [1]. They are based on the principles of genetics and on Darwin’s theory of natural selection (“survival of the fittest”). Genetic algorithms have been used successfully in data mining to determine classification rules [2], to search for appropriate cluster centers [1], to select the attributes of interest in predicting the value of a target attribute [3], etc. Classification of instances has also been performed with hybrid algorithms that combine genetic algorithms with particle swarm optimization [4] and with Naïve Bayes and k-Nearest Neighbors [5], respectively. A few applications in which genetic algorithms have been successfully applied to classification problems are print classification, heart disease classification and the classification of emotions on the human face.

The fitness functions of the genetic algorithms used for mining classification rules may contain metrics concerning predictive accuracy, rule comprehensibility and rule interestingness [2][3]. Various studies propose genetic algorithms whose fitness functions combine these metrics in different ways. This paper synthetically presents the fitness functions suggested for mining classification rules and proposes a new fitness function. The genetic algorithm that incorporates the suggested function was implemented in Java and tested on three classic datasets: Zoo, Car and Mushroom. The paper also presents the steps of the proposed algorithm. The three datasets were additionally classified with the Naive Bayes and J48 algorithms from WEKA. The results obtained with the three algorithms are comparable and are presented in Section 4.
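Since the paper does not include source code, the listing below is only a minimal, generic sketch of the selection, crossover and mutation cycle on which genetic algorithms rely; it is not the algorithm proposed in this paper, and the Individual interface, the tournament selection and all other names are assumptions made for the illustration.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Generic GA skeleton (illustrative only, not the algorithm proposed in this paper). */
public class SimpleGeneticAlgorithm {

    interface Individual {
        double fitness();                                  // problem-specific evaluation
        Individual crossover(Individual other, Random rnd);
        Individual mutate(double rate, Random rnd);
    }

    static Individual tournamentSelect(List<Individual> pop, Random rnd) {
        // Pick the better of two randomly chosen individuals (tournament of size 2).
        Individual a = pop.get(rnd.nextInt(pop.size()));
        Individual b = pop.get(rnd.nextInt(pop.size()));
        return a.fitness() >= b.fitness() ? a : b;
    }

    static Individual evolve(List<Individual> population, int generations,
                             double mutationRate, Random rnd) {
        for (int g = 0; g < generations; g++) {
            List<Individual> next = new ArrayList<>();
            while (next.size() < population.size()) {
                Individual p1 = tournamentSelect(population, rnd);
                Individual p2 = tournamentSelect(population, rnd);
                next.add(p1.crossover(p2, rnd).mutate(mutationRate, rnd));
            }
            population = next;                             // replace the old generation
        }
        // "Survival of the fittest": return the best individual found.
        return population.stream()
                .max(Comparator.comparingDouble(Individual::fitness))
                .orElseThrow(IllegalStateException::new);
    }
}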

2 Fitness functions used in mining classification rules

Genetic algorithms are used to discover classification rules for data, rules that are then used for prediction. The discovered rules must have high prediction accuracy and they must be comprehensible and interesting [2][3]. The rules have the form IF X THEN Y, where X is the antecedent of the rule, formed from a conjunction of conditions, and Y is the consequent of the rule, that is, the predicted class. The fitness functions used for the discovery of classification rules often use the following factors from the confusion matrix:
• True positive (tp): the actual class is Y and the predicted class is also Y.
• False positive (fp): the actual class is not Y, but the predicted class is Y.
• True negative (tn): the actual class is not Y and the predicted class is also not Y.
• False negative (fn): the actual class is Y, but the predicted class is not Y.

In [3] the following fitness function is suggested:

Fitness = w1 · (CF · Comp) + w2 · Simp                                    (1)

The first term of relation (1) measures prediction accuracy and the second one measures rule comprehensibility. CF = tp / (tp + fp) is the confidence factor and Comp = tp / (tp + fn) is the completeness factor. Simp represents the simplicity of the rule and is inversely proportional to the number of conditions in the antecedent of the rule. w1 and w2 are weights defined by the user. In [6] the following fitness function is used:

Fitness = (tp / (tp + A·fn)) · (tn / (tn + B·fp))                         (2)

Where 0.2
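As an illustration of how the confusion-matrix counts and the two fitness functions above can be evaluated for one candidate rule, a minimal sketch follows. It is not the authors' implementation: the representation of a rule as a list of attribute-value conditions, the choice Simp = 1 / (number of conditions), and the handling of the weights w1, w2 and of the constants A and B as plain parameters are assumptions made only for this example (zero-denominator guards are omitted for brevity).

import java.util.List;
import java.util.Map;

/** Sketch of confusion-matrix counts and fitness functions (1) and (2) for one rule. */
public class RuleFitness {

    /** One attribute-value condition from the rule antecedent. */
    public static class Condition {
        final String attribute;
        final String value;
        Condition(String attribute, String value) {
            this.attribute = attribute;
            this.value = value;
        }
        boolean covers(Map<String, String> instance) {
            return value.equals(instance.get(attribute));
        }
    }

    /** Counts {tp, fp, tn, fn} for a rule "IF antecedent THEN predictedClass". */
    public static int[] confusionCounts(List<Condition> antecedent,
                                        String predictedClass,
                                        List<Map<String, String>> instances,
                                        String classAttribute) {
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (Map<String, String> instance : instances) {
            boolean covered = antecedent.stream().allMatch(c -> c.covers(instance));
            boolean isClass = predictedClass.equals(instance.get(classAttribute));
            if (covered && isClass) tp++;
            else if (covered && !isClass) fp++;   // predicted Y, actual class is not Y
            else if (!covered && !isClass) tn++;
            else fn++;                            // actual class is Y, predicted is not Y
        }
        return new int[] {tp, fp, tn, fn};
    }

    /** Fitness (1): w1 * (CF * Comp) + w2 * Simp, with user-defined weights. */
    public static double fitness1(int tp, int fp, int fn,
                                  int numConditions, double w1, double w2) {
        double cf = tp / (double) (tp + fp);      // confidence factor
        double comp = tp / (double) (tp + fn);    // completeness factor
        double simp = 1.0 / numConditions;        // one simple choice for Simp
        return w1 * (cf * comp) + w2 * simp;
    }

    /** Fitness (2): tp/(tp + A*fn) * tn/(tn + B*fp), with constants A and B from [6]. */
    public static double fitness2(int tp, int fp, int tn, int fn, double a, double b) {
        return (tp / (tp + a * fn)) * (tn / (tn + b * fp));
    }
}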