A Niched Genetic Programming Algorithm for Classification Rules Discovery in Geographic Databases

Marconi de Arruda Pereira1,3, Clodoveu Augusto Davis Júnior2, and João Antônio de Vasconcelos3

1 Centro Federal de Educação Tecnológica de Minas Gerais, Av. Amazonas, 7675, Belo Horizonte, Brazil
[email protected]
2 Laboratório de Banco de Dados, Universidade Federal de Minas Gerais, Av. Antônio Carlos, 6627, Belo Horizonte, Brazil
[email protected]
3 Evolutionary Computation Laboratory, Universidade Federal de Minas Gerais, Av. Antônio Carlos, 6627, Belo Horizonte, Brazil
[email protected]
Abstract. This paper presents a niched genetic programming tool, called DMGeo, which uses elitism and other techniques designed to efficiently perform classification rule mining in geographic databases. The main contribution of this algorithm is a way to work with geographic and conventional data together in data mining tasks. In our approach, each individual in the genetic programming represents a classification rule using a Boolean predicate. The adequacy of the individual to the problem is assessed using a fitness function, which determines its chances of selection. In each individual, the predicate combines conventional attributes (Boolean, numeric) and geographic characteristics, evaluated using geometric and topological functions. Our prototype implementation of the tool compared favorably with classical classification tools. We show that the proposed niched genetic programming algorithm works efficiently with databases that contain geographic objects, opening up new possibilities for the use of genetic programming in geographic data mining problems.

Keywords: Classification rules, data mining, knowledge discovery in geographic databases.
K. Deb et al. (Eds.): SEAL 2010, LNCS 6457, pp. 260–269, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction
Classification rule mining is one of the most important tasks for knowledge discovery in databases (KDD). Recently, techniques and algorithms for classification rule mining have been intensively studied due to the large variety of practical applications for them. For instance, commercial firms want to know more about the behavior of their customers. Governments need to prioritize resource allocation and decide about
public policies. Educators want to find out the factors that lead students to fail. In all of these situations, specialists are seeking unexpected patterns within data, and such patterns can emerge in the form of classification rules. Mining classification rules usually relies on supervised learning techniques, which discover patterns in training data so that the resulting rules can be applied to classify other data.

Goldberg [7] shows that genetic algorithms have been applied successfully to machine learning problems since the 1970s. The growth of interest in data mining has motivated the evolutionary algorithms community. Freitas [5] shows that genetic algorithms, genetic programming and, more recently, artificial immune systems, ant colony algorithms [12] and particle swarm optimization [9] have been successfully used in various data mining problems. The main advantage of evolutionary algorithms is their robustness: once the problem is correctly modeled, the algorithm is able to explore the feasible region of the solution space, looking for the global best solution. Greedy algorithms can be applied, but they usually return a local solution, not the global one [2].

The popularization of access to and use of geographic data, sponsored by big companies like Google (with Google Earth and Google Maps) and governmental agencies like NASA (in the USA) and INPE (in Brazil), has brought up a new challenge: developing good data mining algorithms that work well with geographic tools. Several good algorithms for geographic data mining have been presented in the literature, and Section 2 reviews recent ones. Nevertheless, none of them is capable of manipulating conventional and geographic data at the same time while exploring topological relations.
In this paper, a niched genetic programming algorithm with elitism, called DMGeo, is presented. DMGeo has been designed to work with conventional and geographic data, which makes it suitable for a new range of applications. This paper is organized as follows. Section 2 presents related work. Section 3 introduces DMGeo and presents the proposed algorithm. Section 4 presents a case study developed to demonstrate DMGeo and shows the results. Finally, Section 5 concludes the paper.
2 Related Works
Whigham [17] proposes a first genetic programming algorithm that uses a context-free grammar to predict the density of Australian marsupials. This algorithm identifies the spatial behavior patterns of these marsupials with a confidence of 99% against all the non-spatial methods. However, this algorithm does not use topological relations (such as contains, covers, crosses and others [3]).

Bogorny et al. [1] show a tool that permits integration between a classical data mining toolkit (Weka [16]) and a geographic information system. The tool, implemented as an extension of Weka, is used as follows. The user selects a geographic database and chooses a set of feature types or instances. The tool preprocesses the geographic relations between the elements of the set. These relations include topology and distance. A Weka ARFF file is then generated, encoding a representation of the geographic relations among the selected items. Weka is then used to mine the data using conventional algorithms.
Pappa et al. [13] show a multi-objective genetic algorithm (MOGA) that is used to select attributes from a database for data mining purposes. In particular, this genetic algorithm looks for the best set of attributes to create a decision tree to be used in a classification algorithm, namely C4.5 [11].

Just one of the mentioned papers presents a tool that is able to work with geographic and conventional data at the same time, but it cannot deal with topological relations. Moreover, the data mining specialist has to decide which kind of data is more important; that is, the tools found in the literature are not capable of telling which set of attributes (combining geographic and conventional ones) is the most relevant to the target problem. The algorithm we propose and present in the next section has been designed specifically to allow the incorporation of spatial functions and operators in the genetic programming approach to rule mining, thus avoiding complex and costly preprocessing.
3 The Proposed Algorithm - DMGeo
We propose a niched genetic programming algorithm, called DMGeo, to perform classification tasks in the presence of geographic attributes. DMGeo is capable of incorporating geographic constraints and operations in its rules, combining them with conventional operators and functions in each individual. In our approach, an individual represents a Boolean predicate, defined in the same manner as a SQL WHERE clause. The adequacy of the individual to the problem is assessed using a fitness function, which determines its chances of selection. These elements of the algorithm are presented in the next subsections.

3.1 Individual
The individual was modeled to represent a rule that is applied to select patterns in a database, as in the WHERE clause of a SQL query. Fig. 1 shows an example of an individual’s tree, in which logical operators, attribute names and constants are combined to form a filter clause. It is important to note that the tree represents the rule that identifies features of one class.
Fig. 1. Representation of an individual and its class
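The rule-as-tree representation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `Node` and `Individual`, the attribute `population` and the constant 200000 are hypothetical, and the infix rendering of `crosses` simply mirrors the notation of the rule fragment quoted in the paper, not the syntax of any particular spatial DBMS.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """One node of an individual's tree (illustrative names only)."""
    kind: str                     # data type: "boolean", "numeric" or "geographic"
    body: Union[str, int, float]  # operator/function name, attribute name or constant
    children: List["Node"] = field(default_factory=list)  # two for functions, none for terminals

    def to_sql(self) -> str:
        """Render this subtree as a fragment of a WHERE clause."""
        if not self.children:          # terminal node: constant or attribute
            return str(self.body)
        left, right = (child.to_sql() for child in self.children)
        return f"({left} {self.body} {right})"


@dataclass
class Individual:
    tree: Node              # Boolean predicate
    predicted_class: str    # class that the rule is meant to select


# Hypothetical rule for class 'A': a conventional test AND a topological test.
rule = Individual(
    tree=Node("boolean", "AND", [
        Node("boolean", ">", [Node("numeric", "population"), Node("numeric", 200000)]),
        Node("boolean", "crosses", [Node("geographic", "rail.geom"),
                                    Node("geographic", "city.geom")]),
    ]),
    predicted_class="A",
)
where_clause = rule.tree.to_sql()
```

Keeping the data type on every node is what later allows mutation and crossover to exchange only nodes of compatible types.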
Tree nodes include the following information:
• Type. The types implemented in DMGeo are Boolean, numeric, and geographic (point, line or polygon).
• Node body. The body can be a constant or a geographic function call. This will be detailed next.
• Parameters. If the body is a function call, it must receive parameters from other nodes in the tree.

Nodes of the tree can be of the following types:
• Function Nodes: functions that, in this paper, have exactly two parameters and form the set of possible inner nodes. They can be divided into two groups: conventional and geographic;
• Terminal Nodes: the set of terminal nodes is formed by the leaves of the tree. These nodes represent constants or database attributes.

The set of function nodes, as described previously, is composed of conventional functions (or operators) like =, [...], >=, [...] 200.000) AND (rail.geom crosses city.geom)” whose class is different from ‘A’. The False Negative value is the number of items that are not selected by the rule and whose class matches the expected one.

3.2.2 Fitness Function Evaluation
A classical way to measure the effectiveness of a classifier is to obtain and compare indicators such as Accuracy, Sensitivity and Specificity. These indicators are calculated from the confusion matrix, as follows:
\[
\mathrm{Accuracy}(A) = \frac{TP + TN}{TP + FP + TN + FN}, \qquad
\mathrm{Sensitivity}(A) = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity}(A) = \frac{TN}{TN + FP},
\]
where A denotes the class. Thus, after the confusion matrix coefficient evaluation, the fitness function is calculated using (1):
\[
F(I, X) = \mathrm{Accuracy}(I, X) \times \mathrm{Sensitivity}(I, X) \times \mathrm{Specificity}(I, X) \tag{1}
\]
where F(I, X) is the fitness function that evaluates the individual I for classifying items of class X.

3.3 Mutation
Unlike in genetic algorithms, the mutation operator in genetic programming is not so simple. First, it is necessary to make sure that the individual’s tree remains valid after the mutation, i.e., the mutation operation cannot replace a node (or a subtree) by a node of a different type. There are four possible outcomes of the mutation process [5]:
• Point Mutation: a terminal node is replaced by another terminal node.
• Collapse Mutation: a terminal node replaces a function node (subtree). In this case, the individual’s tree decreases in size.
• Expansion Mutation: a function node replaces a terminal node. In this case, the individual’s tree increases in size.
• Subtree Mutation: a function node replaces another function node. In this case, the size of the individual’s tree can increase or decrease.

We also implemented a mechanism that changes the class during the mutation process. The change happens when another class leads to a greater value of the fitness function. Section 3.5 shows that the predicted class of an individual is used as a niche, so the class change mechanism works as a niche migration. In all four situations listed above, the mutation works as follows:
1. Randomly select a node (terminal or function);
2. Generate another node (terminal or function) of the same data type as the node selected in step 1;
3. Replace the selected node by the newly generated one.

3.4 Crossover
DMGeo’s crossover is based on the classical crossover of genetic programming [10]. The operation is implemented as follows: select two individuals of the same class using roulette wheel selection; clone these individuals; swap a randomly selected subtree of the first clone with a randomly selected subtree of the second one.
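The three mutation steps and the crossover procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not DMGeo's code: the `Node` class, the generator functions and the attribute names are hypothetical, and restricting the crossover partner to a node of the same data type is an assumption made here to keep both offspring valid (the paper only states that subtrees are selected randomly). The toy generators cover point and subtree mutation only.

```python
import copy
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    kind: str                                  # data type returned by the node
    body: str                                  # operator, attribute name or constant
    children: List["Node"] = field(default_factory=list)


def all_nodes(root: Node) -> List[Node]:
    """Collect every node of the tree in pre-order."""
    found = [root]
    for child in root.children:
        found.extend(all_nodes(child))
    return found


def mutate(root: Node, generators: Dict[str, Callable[[random.Random], Node]],
           rng: random.Random) -> Node:
    """Return a mutated copy of the tree, following steps 1 to 3."""
    mutant = copy.deepcopy(root)
    target = rng.choice(all_nodes(mutant))          # step 1: pick any node
    replacement = generators[target.kind](rng)      # step 2: new node of the same type
    target.body, target.children = replacement.body, replacement.children  # step 3
    return mutant


def roulette(population: List[Node], fitness: List[float], rng: random.Random) -> Node:
    """Fitness-proportional (roulette wheel) selection; weights must not all be zero."""
    return rng.choices(population, weights=fitness, k=1)[0]


def crossover(parent_a: Node, parent_b: Node, rng: random.Random) -> Tuple[Node, Node]:
    """Clone both parents and swap one randomly chosen subtree of each clone."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    node_a = rng.choice(all_nodes(child_a))
    # Assumption: only same-type subtrees are exchanged, so the trees stay valid.
    candidates = [n for n in all_nodes(child_b) if n.kind == node_a.kind]
    if candidates:
        node_b = rng.choice(candidates)
        node_a.body, node_b.body = node_b.body, node_a.body
        node_a.children, node_b.children = node_b.children, node_a.children
    return child_a, child_b


# Hypothetical generators for a tiny grammar with numeric terminals and comparisons.
def random_numeric(rng: random.Random) -> Node:
    return Node("numeric", rng.choice(["population", "schools", "1000"]))


def random_boolean(rng: random.Random) -> Node:
    return Node("boolean", rng.choice([">", "<"]), [random_numeric(rng), random_numeric(rng)])


GENERATORS = {"numeric": random_numeric, "boolean": random_boolean}
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when comparing operators.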
3.5 Niches and Elitism
The number of niches equals the number of classes, so that at the end of the process each class has just one rule (individual), which is expected to be the best one for selecting samples that belong to that class. To ensure that the best individual of each niche is preserved in the next generation, elitism is applied.

3.6 Storage of the Fitness Value in Cache Memory
As seen previously, the calculation of the fitness function for each individual is based on SQL queries executed against the database. As these queries demand expensive disk access, we implemented a simple cache memory to store the fitness values of some individuals. This cache consists of a hash table in which the WHERE clause that represents the individual is the key and the stored value is the individual’s fitness.
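A minimal Python sketch of the fitness computation of Equation (1) combined with the cache described above is given below. Several details are assumptions rather than facts from the paper: the callback `count_query` is a hypothetical stand-in for the SQL queries that produce the confusion matrix, undefined ratios are scored as zero, and the cache key includes the target class in addition to the WHERE clause so that the same predicate evaluated for two classes is not confused.

```python
from typing import Callable, Dict, Tuple

Counts = Tuple[int, int, int, int]  # (TP, FP, TN, FN)


def fitness_from_counts(tp: int, fp: int, tn: int, fn: int) -> float:
    """Equation (1): accuracy * sensitivity * specificity."""
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        return 0.0  # assumption: an undefined ratio yields zero fitness
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy * sensitivity * specificity


class FitnessCache:
    """Hash table that stores fitness values keyed by (WHERE clause, class)."""

    def __init__(self, count_query: Callable[[str, str], Counts]):
        self._count_query = count_query            # hypothetical database access
        self._table: Dict[Tuple[str, str], float] = {}
        self.misses = 0                            # number of real evaluations

    def fitness(self, where_clause: str, target_class: str) -> float:
        key = (where_clause, target_class)
        if key not in self._table:                 # only query on a cache miss
            self.misses += 1
            counts = self._count_query(where_clause, target_class)
            self._table[key] = fitness_from_counts(*counts)
        return self._table[key]


def fake_counts(where_clause: str, target_class: str) -> Counts:
    """Made-up confusion-matrix counts, used only to exercise the cache."""
    return (40, 10, 45, 5)


cache = FitnessCache(fake_counts)
first = cache.fitness("(population > 200000)", "A")
second = cache.fitness("(population > 200000)", "A")  # served from the hash table
```

Because the product in Equation (1) is zero whenever sensitivity or specificity is zero, a rule that selects everything or nothing receives no credit for a high accuracy on an unbalanced class.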
4 Tests and Results
4.1 Problems Used in Tests
The performance of the proposed algorithm DMGeo is analyzed by applying it to four classification problems: two datasets available in the UCI Machine Learning Repository [14], denoted as the Heart and Wine databases, which are composed of numeric data, and two other datasets available in the Geominas repository [6], City Development and Soy Aptitude, with numeric and geographic data.

The Heart database is composed of 270 instances, of which 120 have a heart disease (class B) and 150 are normal (class A). This problem is available in a dataset with 13 numerical attributes such as age, resting blood pressure and maximum heart rate achieved, among others.

The Wine database contains data of three different groups of wine, according to their level of alcohol. The problem is composed of 178 instances, of which 59 are level A, 71 are level B and 48 are level C. The problem is stored in a dataset with 13 attributes, for example ash, alkalinity of ash and color intensity. These attributes were obtained from a chemical analysis.

The City Development database contains data of three different levels of development: high, medium and low. There are 852 cities, of which 264 are high (class A), 296 are medium (class B) and 292 are low (class C). This problem is stored in a dataset with 22 numerical attributes such as quantity of schools (public and private), quantity of industrial electricity customers and Gini index. The dataset also contains geographic attributes such as city geometry (stored as polygons), railways and highways (stored as polygonal lines [3]).

The Soy Aptitude database contains data on two different types of soil, according to their aptitude for producing soy. There are 852 cities, of which 562 have soil appropriate for soy cultivation (class A) and 290 have soil with restrictions for soy cultivation (class B). In this problem, we have only geographic attributes to work with:
cities geometry, rain incidence (stored as polygons), and soil aptitude for beans, citrus fruits and cotton cultivation (all stored as polygons).

The datasets from UCI were used to evaluate the performance of DMGeo using just numerical attributes. The main contribution of the proposed algorithm is that it can explore well the feasible region of problems whose datasets are formed by numerical and geographic attributes. The datasets from Geominas have this property. As a baseline, three standard classification algorithms were used on all databases: decision tree (J48), Radial Basis Function Neural Network (RBF) and Support Vector Machine (SVM). The results of these three techniques were compared to the ones obtained by DMGeo.

The Soy Aptitude problem cannot be solved directly with the standard tools (J48, RBF and SVM) because they do not manipulate geographic attributes. We therefore preprocessed the geographic attributes, using the tool presented in [1], in order to generate a dataset composed only of numeric attributes. The standard tools were then applied to classify this new dataset. It is important to notice that the Soy Aptitude problem is unbalanced, that is, one class has many more instances than the other (66% of class A and 34% of class B). To balance this problem, copies of instances of class B were randomly generated to make the number of instances of each class similar. The experiments were made using both the balanced and the unbalanced datasets.

The experiments using DMGeo on the City Development dataset were conducted in two different runs, one using geographic data (Table 3) and one using only numeric data (Table 2). All algorithms used the cross-validation procedure with 5 folds, and the experiment was repeated 3 times with each tool on each dataset. DMGeo used population size = 200, generations = 200, crossover probability = 90% and mutation probability = 2%.
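The balancing step and the 5-fold split described above can be sketched with the Python standard library. The sketch rests on assumptions that the paper does not state: the minority class is padded until the class sizes are exactly equal (the paper only says "similar"), the folds are plain shuffled splits without stratification, and the function names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def oversample_minority(labels: Sequence[str], rng: random.Random) -> List[int]:
    """Return row indices in which smaller classes are padded with random copies."""
    by_class: Dict[str, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(rows) for rows in by_class.values())
    balanced: List[int] = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Random duplicates of existing instances (k is 0 for the largest class).
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


def k_fold_indices(n_rows: int, k: int,
                   rng: random.Random) -> List[Tuple[List[int], List[int]]]:
    """Shuffle the row indices and return k (train, test) index pairs."""
    order = list(range(n_rows))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    return [(sorted(set(order) - set(fold)), sorted(fold)) for fold in folds]


# Class sizes of the Soy Aptitude problem: 562 instances of A and 290 of B.
soy_labels = ["A"] * 562 + ["B"] * 290
balanced_rows = oversample_minority(soy_labels, random.Random(1))
balanced_counts = Counter(soy_labels[i] for i in balanced_rows)
splits = k_fold_indices(len(soy_labels), 5, random.Random(1))
```

When oversampling is combined with cross-validation, duplicating instances before splitting lets copies of the same instance fall in both the training and the test fold; how the original experiments ordered these two steps is not stated in the paper.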
4.2 Results and Analysis
This section presents the main results obtained with DMGeo and with the standard techniques J48, RBF and SVM. First, we present the global index in Table 1. This index is the average of the results obtained for each class. For example, DMGeo, in the Wine problem, classified correctly 93% as class A, 90% as B and 99% as C. In this case, the global index is the mean value, which is (93+90+99)/3 = 94%.

Table 1. Actual global index of each tool
Dataset                        J48    RBF    SVM    DMGeo
Wine                           91%    94%    98%    98%
Heart                          78%    81%    78%    84%
Soy Aptitude (unbalanced)      44%    38%    74%    76%
Soy Aptitude (balanced)        68%    56%    73%    80%
City Development               62%    58%    57%    82%
As we can see, DMGeo performed well on all these problems. Table 2 shows the results obtained by each algorithm for each class, as well as the standard deviation (σ) across classes. It is important to emphasize that a low standard deviation means that the tool was able to achieve a more homogeneous classification.
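The two summary statistics used in this section can be checked with a short Python sketch. The paper does not say which standard deviation formula was used; the sample standard deviation (divisor n − 1) is an assumption here. For the City Development values reported in Table 3 (85%, 80%, 82%) it gives about 2.517, close to the reported 2.516. The function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def global_index(per_class_rates: Sequence[float]) -> float:
    """Average of the per-class correct classification rates."""
    return mean(per_class_rates)


def class_spread(per_class_rates: Sequence[float]) -> float:
    """Sample standard deviation across classes (assumed formula, divisor n - 1)."""
    return stdev(per_class_rates)


wine_dmgeo = [93, 90, 99]   # example given in the text for the Wine problem
city_geo = [85, 80, 82]     # Table 3: DMGeo with geographic attributes

wine_index = global_index(wine_dmgeo)     # (93 + 90 + 99) / 3
city_index = global_index(city_geo)
city_sigma = class_spread(city_geo)
```

A small spread indicates that no class is recognized much worse than the others, which is the sense in which the text calls a classification homogeneous.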
Table 2. Actual index for each class A B C
Wine σ
A B
Heart σ Soy Aptitude
A B
σ City Development (without geographic att.) σ
A B C
J48 RBF SVM DMGeo 97% 98% 93% 100% 87% 96% 90% 100% 89% 98% 96% 99% 5.292 4.583 2.000 2.000 77% 79% 83% 85% 78% 82% 72% 84% 2.121 9.192 0.707 0.707 Unbalan. Balan. Unbalan Balan Unbalan Balan Unbalan Balan 66% 68% 66% 54% 73% 74% 93% 90% 22% 68% 10% 58% 59% 74% 70% 70% 31.113 39.598 2.828 2.828 24.042 14.142 0.000 0.707 75% 78% 78% 78% 52% 47% 47% 71% 57% 47% 47% 66% 13.796 17.898 17.898 4.509
Table 3. DMGeo with geographic attributes in City Development dataset

A      B      C      σ        Global index
85%    80%    82%    2.516    82%
In this analysis, DMGeo obtained good results on the datasets that are composed of conventional data. For example, the best results for class C in the Wine dataset and class A in the Heart dataset were found by DMGeo. However, the biggest contribution of this algorithm appears when the results of the classification problem can be improved by geographic analysis. Table 3 presents the results obtained on the City Development dataset with geographic attributes: when DMGeo used the geographic attributes, the results improved. These results show that DMGeo took advantage of the topological relations of the geographic attributes to increase its performance. When the dataset contains only geographic attributes, a preprocessing tool can be applied first and a conventional classification algorithm can be used afterwards. Nevertheless, in this case, DMGeo can present a better result.
5 Conclusion
This paper proposes a new evolutionary algorithm that can be applied to classification problems in which numeric and/or geographic data may be present. The algorithm uses niches, elitism and a cache memory to improve its performance, and represents the individual as a SQL WHERE clause. In order to evaluate the performance of the designed algorithm, classification test problems were used. Classical classification algorithms, such as Neural Network, Decision Tree and Support Vector Machine, were also used to generate results to be compared with those obtained with DMGeo. The comparison shows that the proposed algorithm is competitive and robust, since it presented the best results in most cases. The main contribution of the proposed algorithm is achieved when the classification problem contains both conventional and geographic attributes.
Acknowledgment. This work has been supported in part by the Brazilian agency CNPq – Conselho Nacional de Desenvolvimento Científico e Tecnológico.
References
[1] Bogorny, V., Palma, A.T., Engel, P.M., Alvares, L.O.: Weka-GDPM – Integrating Classical Data Mining Toolkit to Geographic Information Systems. In: SBBD Workshop on Data Mining Algorithms and Applications (WAAMD 2006), Florianopolis, Brasil, October 16-20, pp. 9–16 (2006)
[2] Coello Coello, C.A., Lamont, G.B., Van Veldhuizen, D.A.: Evolutionary Algorithms for Solving Multi-Objective Problems, 2nd edn. Springer, Heidelberg (2007)
[3] Egenhofer, M.: A Model for Detailed Binary Topological Relationships. Geomatica 47, 261–273 (1993)
[4] Ester, M., Kriegel, A.F.H.P., Sander, J.: Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support. Data Mining and Knowledge Discovery 4(3-4), 193–216 (2000)
[5] Freitas, A.A.: Data Mining and Knowledge Discovery with Evolutionary Algorithms. Natural Computing Series. Springer, Germany (2002)
[6] GeoMINAS - Programa Integrado de Uso da Tecnologia de Geoprocessamento pelos Órgãos do Estado de Minas Gerais, http://www.geominas.mg.gov.br/ (accessed in January 2010)
[7] Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading (1989)
[8] Han, J., Koperski, K., Stefanovic, N.: GeoMiner: A System Prototype for Spatial Data Mining. In: SIGMOD Special Interest Group on Management Of Data, Arizona, EUA, pp. 553–556 (1997)
[9] Holden, N., Freitas, A.A.: Hierarchical classification of protein function with ensembles of rules and particle swarm optimization. Soft Computing Journal 13(3), 259–272 (2009)
[10] Koza, J.R.: Genetic Programming: on the programming of computers by means of natural selection. MIT Press, Cambridge (1992)
[11] Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
[12] Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: Data Mining with an Ant Colony Optimization Algorithm. IEEE Trans. on Evolutionary Computation, special issue on Ant Colony algorithms 6(4), 321–332 (2002)
[13] Pappa, G.L., Freitas, A.A.: Evolving rule induction algorithms with multi-objective grammar-based genetic programming. Knowledge and Information Systems 19(3), 283–309 (2009), http://dx.doi.org/10.1007/s10115-008-0171-1
[14] UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA, http://archive.ics.uci.edu/ml (accessed in Jun. 2010)
[15] Vasconcelos, J.A., Ramírez, J.A., Takahashi, R.H.C., Saldanha, R.R.: Improvements in Genetic Algorithms. IEEE Transactions on Magnetics 37(5), 3414–3417 (2001)
[16] Weka, http://www.cs.waikato.ac.nz/ml/weka/ (accessed in December 2010)
[17] Whigham, P.A.: Induction of a marsupial density model using genetic programming and spatial relationships. Ecological Modelling 131, 299–317 (2000)