Classification and Associative Classification Rule Discovery Using Ant Colony Optimization

by

Waseem Shahzad, MS (CS)

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy to the FAST National University of Computer & Emerging Sciences

Department of Computer Science FAST National University of Computer & Emerging Sciences, Islamabad, Pakistan. (September 2010)


Dedication

Dedicated to my parents & my brothers


Acknowledgements

I would like to thank Almighty Allah, the Most Merciful and Most Gracious, Who enabled me to undertake and carry out this research work. I could not have reached this goal without the prayers, moral support, and love of my parents. I would also like to thank my brothers and all other well-wishers.

This dissertation presents research undertaken at the Department of Computer Science, FAST National University of Computer & Emerging Sciences (NUCES), Islamabad, from 2008 to 2010, under the supervision of Prof. Dr. Abdul Rauf Baig, to whom I am grateful for recommending the subject and for the guidance, support, and encouragement he provided throughout the research program. His consultation, effective comments, and deliberation were always a source of inspiration. I am also thankful to Dr. Anwar Majeed Mirza, who administered my research activities. I am extremely thankful to Dr. Aftab A. Maroof, Dr. Farrukh Aslam, Dr. Arshad Ali Shahid, Dr. Ayaz Hussain, Dr. Arfan Jaffer and other colleagues for supporting me in my research work at NUCES. I would also like to thank my friends Dr. Abdul Basit, Dr. Amjad Iqbal, M. Ramzan, Hassan Mujtaba, Naveed Iqbal, Zahid Haleem, Abdul Rauf, M. Nazir, Aamir, Zahoor, Sajid, Naveed, Sohail, M. Asif, Hamid and Arif. I acknowledge the supportive role of the Higher Education Commission of Pakistan for the financial support provided through its Indigenous PhD Scheme.

Table of Contents

Chapter 1: Introduction
  1.1 Research Background
  1.2 Research Contributions
  1.3 Layout of Thesis
Chapter 2: Classification System and Classification Techniques
  2.1 Classification
  2.2 Performance Evaluation of Classification Methods
    2.2.1 Predictive Accuracy
    2.2.2 Robustness
    2.2.3 Speed
    2.2.4 Scalability
    2.2.5 Interpretability
  2.3 Types of Classifiers
    2.3.1 Comprehensible Classifiers
    2.3.2 Statistical or Mathematical Classifiers
  2.4 Summary
Chapter 3: Swarm Intelligence and Ant Miners
  3.1 Swarm Intelligence
  3.2 Ant Colony Optimization
  3.3 ACO Based Classification Rule Discovery: AntMiner Algorithms
    3.3.1 AntMiner
    3.3.2 AntMiner 2
    3.3.3 AntMiner 3
    3.3.4 AntMiner+
    3.3.5 Other Ant Miner Versions
  3.4 Summary
Chapter 4: Correlation Based Ant Miner
  4.1 Introduction
  4.2 Correlation Based AntMiner (AntMiner-C)
    4.2.1 General Description
    4.2.2 Rule Construction
    4.2.3 Rule Quality & Pheromone Update
    4.2.4 Termination of REPEAT-UNTIL Loop
    4.2.5 Pruning of Rule
    4.2.6 Final Rule Set
    4.2.7 Early Stoppage of Algorithm
    4.2.8 Default Rule
    4.2.9 Pruning of Rule Set
    4.2.10 Use of Discovered Rule Set for Classifying Unseen Samples
  4.3 Summary
Chapter 5: Investigation of Components and Parameter Optimization
  5.1 Experiments and Analysis
    5.1.1 Datasets
    5.1.2 Performance Metrics
    5.1.3 Parameters Setting
    5.1.4 Results
    5.1.5 Number of Probability Calculations
    5.1.6 Convergence Speed
  5.2 Analysis of Different Algorithmic Components
    5.2.1 Class Choice Prior or After Rule Construction
    5.2.2 Termination of Rule Construction
    5.2.3 Rule Pruning: All Rules versus Best Rule
    5.2.4 Rule Set Pruning
    5.2.5 Default Rule
  5.3 Improved AntMiner-C
  5.4 Parameter Optimization
    5.4.1 Relative Importance of Pheromones and Heuristics
    5.4.2 Evaporation Rate
  5.5 Results and Comparisons
  5.6 Time Complexity of AntMiner-C
    5.6.1 Initialization of Main WHILE Loop
    5.6.2 Single Iteration of REPEAT-UNTIL Loop
    5.6.3 Complexity of a Single Iteration of WHILE Loop
    5.6.4 Computational Complexity of Entire Algorithm
  5.7 Summary
Chapter 6: Further Improvements and Investigations
  6.1 Modified Algorithm: AntMiner-CC
  6.2 Differences with Previous Versions
  6.3 Heuristic Function of AntMiner-CC
    6.3.1 Heuristic Function for the 1st Term
  6.4 Default Rule
  6.5 Experiments and Analysis
    6.5.1 Datasets, Performance Metrics and Parameter Settings
    6.5.2 Importance of Heuristic Function
    6.5.3 Rules for Majority Class Also
    6.5.4 Termination of Rule Construction
    6.5.5 Symmetric or Asymmetric Pheromone Matrix
    6.5.6 Rule Pruning: All Rules, No Rules or Best Rule?
    6.5.7 Average Probabilities Calls of AntMiner and AntMiner-CC
  6.6 Comparison with Other Algorithms
  6.7 Summary
Chapter 7: Associative Classification Using Ant Colony Optimization
  7.1 Associative Rules Mining and Associative Classification
  7.2 Differences with AntMiner-C and AntMiner-CC
  7.3 Proposed Technique
    7.3.1 General Description
    7.3.2 Rule Construction
    7.3.3 Pheromone Initialization
    7.3.4 Selection of an Item
    7.3.5 Heuristic Function
    7.3.6 Heuristic Function for the 1st Item
    7.3.7 Rule Construction Stoppage
    7.3.8 Quality of a Rule
    7.3.9 Pheromone Update
    7.3.10 Rule Selection Process
    7.3.11 Discovered Rule Set
    7.3.12 Pruning Discovered Rule List
    7.3.13 Use of Discovered Rule Set for Classifying New Unseen Cases
  7.4 Experiments and Analysis
  7.5 Time Complexity of ACO-AC
    7.5.1 Computational Complexity of a Single Iteration of Main FOR Loop
    7.5.2 Computational Complexity of Single Iteration of Main WHILE Loop
    7.5.3 Computational Complexity of Entire Algorithm
  7.6 Summary
Chapter 8: Feature Selection Based on Ant Colony Optimization
  8.1 Introduction
  8.2 Decision Trees
  8.3 Proposed Technique
    8.3.1 General Description
    8.3.2 Search Space for ACO in Proposed Algorithm
    8.3.3 Initialization of Pheromone Values
    8.3.4 Generation of a Candidate Solution of Subsets
    8.3.5 Selection of an Attribute
    8.3.6 Heuristic Function
    8.3.7 Fitness Function
    8.3.8 Pheromone Updating
    8.3.9 Proposed Algorithm for FSS Using ACO
  8.4 Experimentation and Analysis
  8.5 Summary
Chapter 9: Conclusions & Future Work
  9.1 Conclusion
  9.2 Future Work
    9.2.1 AntMiner-C
    9.2.2 ACO-AC
    9.2.3 ACO-FSS
References

List of Figures

Figure 1-1 Steps of data mining for knowledge discovery
Figure 3-1 Environment of ants in the form of a graph
Figure 4-1 An example problem's search space represented as a graph
Figure 4-2 Proposed AntMiner-C algorithm
Figure 4-3 Selection of a term for adding in a rule's antecedent part
Figure 4-4 Asymmetric pheromone matrix for the example problem of Figure 4-1
Figure 5-1 Average accuracies over all datasets for different classification techniques
Figure 5-2 Improved AntMiner-C algorithm
Figure 5-3 Average accuracies over all datasets for different classification techniques
Figure 5-4 Sample rule list of tic-tac-toe dataset discovered by improved AntMiner-C
Figure 6-1 The AntMiner-CC algorithm
Figure 6-2 An example for understanding the working of heuristic function
Figure 6-3 The asymmetric heuristic look-up table for the example problem of Figure 6-1
Figure 7-1 Proposed ACO-AC algorithm
Figure 7-2 Flow chart of proposed ACO-AC algorithm
Figure 8-1 ID3 Algorithm
Figure 8-2 ACO Algorithm
Figure 8-3 N×N search space for ant traversal
Figure 8-4 Selection of subset of features by an ant
Figure 8-5 Information gain calculation algorithm
Figure 8-6 Proposed feature subset selection algorithm
Figure 8-7 Flow chart of proposed FSS approach


List of Tables

Table 4.1 Comparison of different versions of AntMiners
Table 5.1 Characteristics of datasets
Table 5.2 Parameters used in experiment
Table 5.3 Average predictive accuracies obtained using 10-fold cross validation
Table 5.4 Average number of rules per discovered rule set and average number of terms per rule; the results are obtained using 10-fold cross validation
Table 5.5 Average number of ant runs per iteration
Table 5.6 Results obtained by first constructing rule antecedent and then choosing class label
Table 5.7 Results obtained by imposing restriction that a constructed rule must cover a minimum of 10 samples
Table 5.8 Result of pruning each constructed rule and pruning only the best rule
Table 5.9 Comparison of results with and without pruning of redundant rules from the rule set
Table 5.10 Results obtained when the default rule is composed of the majority class label of remaining samples, compared with the case in which the remaining training samples are of the same class label
Table 5.11 Predictive accuracy results obtained for different values of alpha and beta; the default rule is based on uncovered training samples of same class labels
Table 5.12 Average number of rules obtained for different values of alpha and beta; the default rule is based on uncovered training samples of same class labels
Table 5.13 Average number of terms per rule obtained for different values of alpha and beta; the default rule is based on uncovered training samples of same class labels
Table 5.14 Predictive accuracy results obtained by using different values of evaporation rate with alpha = 1 and beta = 3
Table 5.15 Average number of rules and average terms per rule obtained by using different values of evaporation rate with alpha = 1 and beta = 3
Table 5.16 Datasets used in the experiment; the datasets are sorted on the basis of attributes, samples and classes
Table 5.17 Parameters used in experiments
Table 5.18 Average predictive accuracies obtained using 10-fold cross validation
Table 5.19 Average number of rules per discovered rule set and average number of terms per rule; the results are obtained using 10-fold cross validation
Table 6.1 Extended suite of datasets used in the experiment
Table 6.2 One fold accuracy results for different combinations of ρ, α and β
Table 6.3 Average predictive accuracies obtained using 10-fold cross validation
Table 6.4 Comparison of accuracy results after ten-fold cross validation
Table 6.5 Results obtained with the condition of a constructed rule covering a minimum of 10 samples
Table 6.6 Results of pheromone updating on one link between two chosen consecutive terms (asymmetric) and on both the links (symmetric)
Table 6.7 Results of pruning each constructed rule, no pruning at all, and pruning only the best rule (one fold only)
Table 6.8 Average number of ants and average number of probability calls of a single iteration of the REPEAT-UNTIL loop, with standard deviations, for one fold
Table 6.9 Parameters used in experiments
Table 6.10 Average predictive accuracies obtained using 10-fold cross validation
Table 6.11 Average predictive accuracies obtained using 10-fold cross validation
Table 7.1 Datasets used in the experiment; the datasets are sorted on the basis of attributes, samples and classes
Table 7.2 Parameters used in experiments
Table 7.3 Average predictive accuracies with standard deviations obtained after 10-fold cross validation
Table 7.4 Average number of rules per discovered rule set and average number of terms per rule; the results are obtained using 10-fold cross validation
Table 7.5 Average number of associative rules discovered without applying the redundant rules pruning procedure, for ten-fold cross validation with support = 1% and confidence = 50%
Table 7.6 Average accuracy, number of rules, and number of terms/rule with different coverage thresholds, obtained with ten-fold cross validation with support = 1% and confidence = 50%
Table 7.7 Average accuracy, number of rules, and number of terms/rule with different values of support threshold for the TAE dataset after ten-fold cross validation with confidence = 50% and coverage = 98%
Table 8.1 Parameters used in our experiments
Table 8.2 Datasets used in experiments
Table 8.3 Average accuracies, number of rules, and number of terms/rule without and after feature subset selection
Table 8.4 Number of reduced features after feature subset selection
Table 8.5 Comparison of predictive accuracies after selection of feature subsets by ACO and naïve Bayes

Abstract

The primary goal of this research is to investigate the suitability of ant colony optimization, a swarm intelligence based meta-heuristic developed by mimicking some aspects of the food foraging behavior of ants, for building accurate and comprehensible classifiers which can be learned in reasonable time even for large datasets. Towards this end, a novel classification rule discovery algorithm called AntMiner-C and its variants are proposed. Various aspects and parameters of the proposed algorithms are investigated by experimentation on a number of benchmark datasets. Experimental results indicate that the proposed approach builds more accurate models than commonly used classification algorithms. It is also computationally less expensive than previously available ant colony optimization based classification rule discovery algorithms.

A hybrid classifier using ant colony optimization is also proposed that combines association rules mining and supervised classification. Experiments show that the proposed algorithm has the ability to discover high quality rules. Furthermore, it has the advantage that the association rules of each class can be mined in parallel if distributed processing is used. Experimental results demonstrate that the proposed hybrid classifier achieves higher accuracy rates than other commonly used classification algorithms.

A feature subset selection algorithm is also proposed which is based on ant colony optimization and decision trees. Experiments show that better accuracy is achieved if the subset of features selected by the proposed approach is used instead of the full feature set, and the number of rules is also decreased substantially.


List of Publications

Following is the list of research publications produced as a result of the research carried out for this PhD thesis.

Accepted
1. W. Shahzad and A.R. Baig, "Compatibility as a heuristic for construction of rules by artificial ants," Journal of Circuits, Systems, and Computers, Vol. 19, No. 1, pp. 297-306, Feb 2010 (ISI-SCI, Impact Factor 2009: 0.264), related to Chapter 4 of this thesis.
2. W. Shahzad and A.R. Baig, "A hybrid associative classification algorithm using ant colony optimization," International Journal of Innovative Computing, Information and Control (IJICIC), 2010, accepted for publication (ISI-SCI, Impact Factor 2009: 2.932), related to Chapter 7.

Submitted
3. A.R. Baig and W. Shahzad, "A correlation based AntMiner for classification rule discovery," Neural Computing and Applications, 2010, 2nd round of review (ISI-SCI, Impact Factor 2009: 0.812), related to Chapters 4, 5, 6.
4. W. Shahzad, A.R. Baig, and S.W. Asghar, "Improving classifier performance by feature subset selection based on ant colony optimization," International Journal of Innovative Computing, Information and Control (IJICIC), 2010, 2nd round of review (ISI-SCI, Impact Factor 2009: 2.932), related to Chapter 8.
5. W. Shahzad, A.R. Baig, S. Khan, and F. Altaf, "ACO based discovery of comprehensible and accurate rules from medical datasets," International Journal of Innovative Computing, Information and Control (IJICIC), 2010, 1st round of review (ISI-SCI, Impact Factor 2009: 2.932), related to Chapter 6.
6. W. Shahzad and A.R. Baig, "Association rule discovery using ant colony optimization for credit risk evaluation," International Journal of Data Mining and Knowledge Discovery, 2010, 1st round of review (ISI-SCI, Impact Factor 2009: 2.95), related to Chapter 7.
7. W. Shahzad and A.R. Baig, "Discovery of classification rules by an improved Ant Miner," International Journal of Data Mining and Knowledge Discovery, 2010, 1st round of review (ISI-SCI, Impact Factor 2.95), related to Chapter 6.

Other Publications by the Author
1. W. Shahzad, F.A. Khan, and A.B. Siddiqui, "A weighted clustering algorithm using comprehensive learning particle swarm optimization in mobile adhoc networks," International Journal of Future Generation Communication and Networking, Vol. 3, No. 1, March 2010, pp. 61-70.
2. S. Ali, F.A. Khan, and W. Shahzad, "Intrusion detection using bi-classifier-based genetic algorithm," ICIC Express Letters, 2010, accepted for publication.
3. W. Shahzad, F.A. Khan, and A.B. Siddiqui, "Clustering in mobile adhoc networks using comprehensive learning particle swarm optimization (CLPSO)," in Proceedings of International Conference FGCN/CAN 2009, pp. 342-349.
4. W. Shahzad, A.B. Siddiqui, and F.A. Khan, "Cryptanalysis of four-rounded DES using binary particle swarm optimization," in Proceedings of Genetic and Evolutionary Computation Conference, 2009, pp. 2161-2166.
5. S. Khan, W. Shahzad, and F.A. Khan, "Cryptanalysis of four-rounded DES using ant colony optimization," in International Conference on Information Science and Applications (ICISA), pp. 1-7, 2010.
6. W. Shahzad, F.A. Khan, and H. Ali, "Clustering in mobile adhoc networks using swarm intelligence," book chapter in "Theory and Applications of Ad Hoc Networks," 2010, accepted for publication.
7. S. Ali, W. Shahzad, and F.A. Khan, "Remote-to-local intrusion attacks detection using an incremental genetic algorithm," International Conference for Internet Technology and Secured Transactions (ICITST), 2010, accepted for publication.
8. W. Shahzad, H. Ali, and F.A. Khan, "Multi-objective particle swarm optimization for clustering in wireless sensor networks," Journal of Applied Soft Computing, 1st round of review (ISI-SCI, Impact Factor 2009: 2.42).
9. S. Khan, W. Shahzad, and F.A. Khan, "Ant-Crypto: an attack for the cryptanalysis of data encryption standard (DES)," ICIC Express Letters, 2010, 1st round of review.

Chapter 1: Introduction

1.1 Research Background

Data is collected and available in every sphere of life. Processing and analysis of the generated data usually provides useful insights and knowledge about the system which has produced that data. The field of data mining deals with the conversion of raw data into useful information. Data mining is a collection of techniques used for extracting or mining previously unknown, useful and understandable patterns from large databases. Data mining integrates techniques from multiple disciplines such as database technology, machine learning, statistics, pattern recognition, neural networks, image processing and data visualization. There is always a requirement for efficient and scalable data mining algorithms, and this is a subject of ongoing research [1].

Figure 1-1 shows the process of data mining for extracting information from data. The first step is to extract data from the database and then perform preprocessing steps on it. Data mining techniques are used to extract data patterns. Evaluation and presentation means representing the knowledge in a way which is understandable to users. The result is the empowerment of users with knowledge.

There are different data mining techniques, including supervised classification, association rules mining (market basket analysis), unsupervised clustering, web data mining, and regression. One important technique of data mining is classification. The objective of classification is to build one or more models, based on the training data, which can correctly predict the class of test objects. Many problems from a wide range of domains can be cast as classification problems [1]. Classification has several important applications in our lives [2-5]. Examples include customer behavior prediction, portfolio risk management, identifying suspects, medical applications, sports, fraud detection, and biometric detection. This thesis deals mainly with the classification technique of data mining.

Figure 1-1 Steps of data mining for knowledge discovery

Swarm intelligence [6-11], which deals with the collective behavior of small and simple entities, has been used in many application domains. It is an intelligent, innovative, and distributed paradigm for solving optimization problems. It may seem that data mining and swarm intelligence do not have a lot in common. However, recent research studies suggest that both can be used together for a wide range of real world data mining problems, including classification, clustering, regression, and image processing. Swarm intelligence is especially suitable for those cases where other methods would be difficult to implement or too expensive [12]. Developing swarm intelligence based models for data mining that perform better than those already known is an ongoing research area.

Ant colony optimization (ACO) is a well-known technique under the umbrella of swarm intelligence [13]. Ant colonies are distributed entities. Despite the simplicity of their individuals, they show a highly structured collective organization. By utilizing this organization, ant colonies can achieve complex tasks which rise above the individual capabilities of a single ant. Examples are cooperative transport, foraging, and division of labor. In all these examples, ants coordinate their actions via stigmergy, a kind of indirect communication between ants using modifications of the environment. For example, a foraging ant drops a chemical on the ground that increases the probability that other ants will follow the same path. ACO is inspired by the foraging behavior of ant colonies and is well suited for discrete optimization problems [13]. Since its inception, ACO has been applied to solve a large number of problems. It is naturally suited to discrete optimization problems such as quadratic assignment [14], job scheduling [15], subset problems [16], network routing [17], vehicle routing [18], the graph coloring problem [19], bioinformatics [20-22] and data mining [23], which is the subject of this thesis.

1.2 Research Contributions

In this thesis a number of contributions have been made in two important areas of data mining, namely classification and feature subset selection. A new classification algorithm and its variants based on ACO have been proposed. The proposed algorithms have a higher accuracy rate and are well suited for large dimensional search spaces. A hybrid classification algorithm, combining the idea of association rules mining and supervised classification using ACO, is also proposed. This algorithm finds more accurate and compact rules. The algorithm avoids an exhaustive search for discovering all possible rules and has the additional advantage that rules can be mined for each class in parallel in a distributed manner, thus saving computation time. Experimental results demonstrate that the proposed hybrid classifier is more accurate than some other classification techniques.

Feature subset selection is the process of selecting a subset of relevant features for building learning models. A feature subset selection algorithm is proposed which incorporates ACO with a decision tree builder for finding more relevant and more appropriate features from the data. Experimental results show that better accuracy rates are achieved if we use the subset of features selected by our proposed approach instead of the full feature set. The number of rules is also substantially decreased. Experimental results also indicate that the proposed approach selects features that increase the predictive accuracy when compared with the naïve Bayes approach.

1.3 Layout of Thesis

This thesis consists of nine chapters. Chapters 2 and 3 provide a background to the research presented in later chapters. Classification systems, their performance metrics, swarm intelligence, ant colony optimization, and previously proposed classification rule discovery algorithms (called AntMiners) are presented in these chapters. Chapter 4 presents the proposed correlation based AntMiner, called AntMiner-C, for classification rule discovery. Chapter 5 describes the investigation of different components of AntMiner-C. It also gives the experimentation done to find appropriate values of the proposed algorithm's user defined parameters. Furthermore, a comparison is performed with other state of the art classification techniques. Chapter 6 presents an improved version of the proposed algorithm, called AntMiner-CC. The results are compared with AntMiner-C and other classification algorithms. Chapter 7 describes a novel associative classification algorithm which combines classification and association rules mining. Results of the algorithm are compared with eight other state of the art classification algorithms; this hybrid approach performs significantly better than the compared classification algorithms. Chapter 8 presents a proposed feature subset selection algorithm using ant colony optimization and decision trees. Experiments are performed on a number of benchmark datasets to demonstrate the worth of the proposed approach. Finally, Chapter 9 concludes this thesis and suggests some future directions of research.


Chapter 2: Classification System and Classification Techniques

2.1 Classification

There are two different categories of data mining learning models: supervised learning methods and unsupervised learning methods. Supervised learning methods use labeled training data, in which the class to which a training sample belongs is known during the learning process. This data is used to build the predictive model, and unlabeled data is used to test the model. Unsupervised learning methods use unlabeled training data and group training samples according to their similarities; one example of an unsupervised learning method is clustering.

Classification is used to determine the class membership of data samples. For example, we may use classification to predict whether to play cricket or not on a particular day. As a first step, a model is built that describes the set of classes present in the data [24-32]. The model is constructed by examining data samples described by a set of attributes. There are different types of attributes; the two main types are called numerical and categorical. Each sample belongs to a predetermined class. Since the class of training samples is known beforehand, this learning is called supervised learning. The actual classification is done on the basis of the learnt classification model and consists of assigning a class label to test samples.

A fundamental aim of research in the field of classification is to develop algorithms which learn highly accurate models from the available data. Other objectives include efficiency when dealing with large datasets and comprehensibility of the learnt model [33-34].


2.2 Performance Evaluation of Classification Methods

Classification methods are usually compared on the basis of the following criteria [35]:

2.2.1 Predictive Accuracy

This is the ability of the classification model to correctly classify unseen data. After a classification model has been built with the help of training data, its accuracy is measured on test samples whose correct class labels are known but not shown to the model. Predictive accuracy is the number of correctly classified test samples divided by the total number of test samples. For example, if we have twenty test samples and the classification model correctly classifies eighteen of them, then the accuracy of the model is 90%.
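To make this measure concrete, the following is a minimal sketch (with made-up labels, not data from the thesis) that computes predictive accuracy as the fraction of correctly classified test samples:

```python
def predictive_accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Twenty test samples, eighteen classified correctly -> accuracy = 0.9 (90%).
true      = ["yes"] * 12 + ["no"] * 8
predicted = ["yes"] * 12 + ["no"] * 6 + ["yes"] * 2   # last two samples are misclassified
print(predictive_accuracy(true, predicted))           # 0.9
```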

2.2.2 Robustness

This is the ability of the classification model to perform well on data containing noise or missing values.

2.2.3 Speed

This is the computational cost of generating the model. This cost is measured in terms of the running time of the algorithm. The running time is measured in terms of the number of steps/operations required by the algorithm and is independent of the operating system and the machine used.

2.2.4 Scalability

This is the ability to construct the model efficiently even for a large amount of high dimensional data. When we increase the size of the input, the algorithm should be able to construct the classification model as efficiently as for a small input size.

2.2.5 Interpretability

This is the ease of comprehension or understanding of the model by the user.


2.3 Types of Classifiers

A large number of classification methods are available. They can be divided into two major groups: comprehensible classifiers and statistical (or mathematical) classifiers.

2.3.1 Comprehensible Classifiers

Comprehensible classifiers are usually rule based classifiers [36-39]. They are easy to understand and interpret and are interesting for the users (or at least the domain experts). They are in contrast to mathematical classifiers, which are difficult to understand. The major benefit of these classifiers is that comprehensibility leads to the user's trust in the decisions obtained from them. They enhance the knowledge of users and can be combined with domain knowledge to give better results. Domain knowledge can also be used to correct any inaccuracies or contradictions present in these classifiers. Some of the commonly used rule induction algorithms are described below.

2.3.1.1 C4.5 Decision Tree

A decision tree is a tree structure in which each internal node represents a test condition on an attribute and leaf nodes represent classes. It is constructed top down with the help of a greedy algorithm. The aim of the algorithm is to construct a tree that best fits the training data. The tree starts with a single node. The algorithm uses information gain as a heuristic for choosing the attribute that best separates the training samples on the basis of their classes. The attribute that has the highest information gain is selected. The selected attribute becomes a test condition or node, and a branch is created for every value of the attribute. The same process is applied recursively on each branch to form the decision tree. Once an attribute has appeared in a node, it is not considered again in any of that node's descendants. After completion of the tree, every path from the root to a leaf node becomes a rule, and the leaf represents the class label of the rule [40-42].

C4.5 is a decision tree construction algorithm that can handle both continuous and discrete attributes. Continuous attributes are handled in C4.5 in the following way: for each continuous attribute it creates a threshold and then splits the training samples into those whose attribute value is above the threshold and those whose value is less than or equal to it. It can also handle missing values and attributes with different costs, and it prunes the tree after its construction [43-45].

2.3.1.2 CN2

CN2 is a sequential covering algorithm that learns one rule at a time [46]. It finds the best rule from the training data and then removes the training samples correctly classified by that rule. This process continues until there are no more training samples left in the training set. It uses beam search to evaluate the worth of each node. Beam search keeps the b most promising nodes at each depth, where b is the beam width specified by the user. An accuracy measure is used for choosing the best rule. Considering only accuracy as a measure can generate overly specific rules; therefore another estimate, called the Laplace error estimate, is used to penalize more specific rules. A threshold is also defined which specifies the minimum number of samples that a rule must cover.

2.3.1.3 Ripper

Ripper is another well-known rule learning algorithm for supervised data [47]. It is also a sequential learning algorithm: after a rule is found, all samples correctly predicted by the rule are removed from the training dataset. A unique stopping condition is introduced that depends on the description length of the examples and the rule set; this is called the minimum description length formula. Ripper divides the training data randomly into a growing set (2/3 of the samples) and a pruning set (1/3 of the samples). It builds the rules for each class from the smallest class to the largest. The initial set of rules is generated using the growing set. Ripper repeatedly adds conditions by testing every possible value of each attribute and selecting the value with the highest information gain. It then prunes the rules using the pruning set, applying the incremental reduced error pruning technique.
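As a hedged sketch of the information gain heuristic mentioned above, used by C4.5-style tree builders (and by Ripper when adding conditions), the gain of an attribute is the reduction in class entropy obtained by splitting on it. The toy attribute and samples below are invented for illustration and are not taken from the thesis.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(samples, labels, attribute):
    """Reduction in class entropy obtained by splitting on one categorical attribute."""
    base = entropy(labels)
    splits = {}
    for sample, label in zip(samples, labels):
        splits.setdefault(sample[attribute], []).append(label)
    weighted = sum(len(part) / len(labels) * entropy(part) for part in splits.values())
    return base - weighted

# Toy "play cricket" data: the attribute with the highest gain is chosen as the next node.
samples = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rainy"}, {"outlook": "rainy"}]
labels = ["play", "play", "no", "play"]
print(information_gain(samples, labels, "outlook"))   # about 0.311
```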

2.3.2 Statistical or Mathematical Classifiers

Non-comprehensible classifiers are those classifiers that are not easily understandable by users [48]. These classifiers are black boxes that take input, process it according to the learned model, and give output. Most of them use mathematical or statistical models. Some commonly used non-comprehensible classifiers are described below.

2.3.2.1 K-Nearest Neighbor

In k-nearest neighbor, a test sample is compared with existing ones by using a distance metric, and the majority class of the closest k neighbors is assigned to the test case. If the domain of an attribute is numeric then the distance between two examples can be computed by using any distance metric, e.g. Euclidean or Manhattan. If the attribute is nominal then usually the distance is set to zero if the two values are identical and to one otherwise [49-54]. (A minimal sketch of this procedure is given at the end of this section.)

The performance of KNN depends on several factors. One of these is the choice of k. If the value of k is too small then the output can be influenced by noisy samples present in the data. If the value of k is set too large then many neighbors may be from the other side of the decision boundary and belong to other classes. Another important issue is the assignment of the class label. One simple method is to assign the majority class of the neighbors. However, this can create a problem if neighbors vary widely in their distances and closer neighbors are outvoted by distant ones. One solution to this problem is to assign a weight to each neighbor according to its distance from the test sample. Weights can also be assigned to the training samples themselves on the basis of their reliability; in this way the influence of suspicious data can be minimized.

2.3.2.2 SVM (Support Vector Machine)

Support vector machine (SVM) is a technique in which a small number of boundary samples called support vectors are selected from each class and a linear function is built that separates them as widely as possible [55-56]. These support vectors are global representatives of the whole set of training samples. An optimization algorithm is used for training a support vector classifier that forms hyperplanes between data samples of different classes. In other words, it implements non-linear models by using linear models.

2.3.2.3 Naïve Bayes

Naïve Bayes is an approach that learns from probabilistic knowledge [57-58]. It is a statistical classification technique based on Bayes' theorem. It has been used to generate impressive results and is easy to program and fast to train. It calculates the probabilities of a given sample belonging to different classes. It assumes class conditional independence, that is, it assumes that attributes are independent of one another and that the influence of one attribute on determining the class is independent of any other attribute. Naïve Bayes is a special kind of Bayesian network that has been commonly used for data classification. Its predictive performance is comparable with other commonly used classifiers such as CN2 and C4.5. The Bayes classifier learns the class conditional probabilities of each attribute from supervised training data with the help of Bayes' theorem. A test sample is then assigned the class that has the highest posterior probability.

2.3.2.4 Regression

Regression is a classification technique specifically used in domains with numeric attributes [59]. Regression techniques can be linear or nonlinear. Linear regression performs a regression for each class: the output is set to one for those training samples that belong to that class and zero for the others, and the result is a linear expression for each class. In other words, linear regression approximates a numeric membership function for each class; the membership function assigns 1 to the samples that belong to the class and 0 to the other samples. For classifying a test sample, the values of all linear expressions are calculated (its membership is calculated for each class) and the class whose expression yields the largest value is chosen (the class with the highest membership is assigned).

2.3.2.5 Ada-Boost

Ada-Boost is a technique that combines the decisions of different classifiers into a single one [60]. It can be used to combine the decisions of classifiers of the same type as well as of heterogeneous classifiers. In this approach, a positive number is assigned as a weight to each training sample. The weights are not fixed and can change on the basis of the error of the classifier. In the beginning, equal weights are assigned to all training samples. The learning algorithm forms a classifier by using this data, and the weights are updated according to the output of the classifier: the weights of misclassified samples are increased and the weights of correctly classified samples are decreased. This strategy tries to give more importance to those samples which are hard to classify. Next, another classifier is built by using the re-weighted data. This process is repeated, and the final classifier is formed by combining all the learnt classifiers. Voting or averaging is used to combine the outputs of the different classifiers into a single output.
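Returning to the k-nearest neighbor procedure described at the start of this section, the following is a minimal sketch assuming numeric attributes, Euclidean distance and unweighted majority voting; the data and parameter choices are illustrative, not taken from the thesis.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_samples, train_labels, test_sample, k=3):
    """Assign the majority class among the k training samples closest to the test sample."""
    neighbors = sorted(zip(train_samples, train_labels),
                       key=lambda pair: euclidean(pair[0], test_sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two numeric attributes, two classes; the unseen point lies nearest to the "yes" cluster.
X = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.8)]
y = ["yes", "yes", "no", "no"]
print(knn_classify(X, y, (1.1, 1.0), k=3))   # "yes"
```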

2.4 Summary

This chapter has provided background for the material that follows. It gives an overview of classification systems, classification algorithms, and the different performance evaluation metrics used to measure the performance of classifiers. Next, in Chapter 3, we provide some basic information about swarm intelligence, ant colony optimization, and the working of classification rule discovery algorithms based on ant colony optimization (called AntMiners).


Chapter 3: Swarm Intelligence and Ant Miners

3.1 Swarm Intelligence

Swarm intelligence is the name given to a suite of evolutionary [61], population based computational techniques influenced by the natural world, and it has been applied to solving hard problems in artificial intelligence [62-64]. These evolutionary techniques have attracted the attention of researchers due to their inherent parallelism, reliability in finding solutions, and convergence capability. There are different examples of swarming behavior among social insects such as ants, bees, termites and wasps. An insect nest may contain hundreds, thousands, or even millions of individual insects. An individual agent is not a very intelligent organism and responds to the environment, prompted by local stimuli, in a very simple way. However, collectively these agents accomplish formidable tasks. For example, a group of ants can efficiently gather food, and temperature can be regulated by the bees in a hive.

Natural ant colonies are distributed systems that present a highly structured social organization, although the behavior of an individual ant is very simple. This organization allows the ants to achieve complex tasks which exceed the individual capabilities of a single ant. Activities of ants are coordinated via stigmergy, a form of indirect communication between ants through modifications of the environment. For example, a chemical is deposited on the ground by ants that increases the probability of selection of that path by other ants [8]. The behavior of ant colonies has several different aspects, and this has inspired different kinds of ant algorithms. Examples are brood sorting, food foraging, division of labor, and cooperative transport. These models have been useful in solving complex optimization problems [9].

3.2 Ant Colony Optimization

Ant colony optimization (ACO) is a branch of swarm intelligence (and, in general, a genre of population based heuristic algorithms) inspired by the behavior of natural ants. The first ACO algorithm was proposed by Marco Dorigo in 1992 [63]. The ACO algorithm was developed by modeling some aspects of the food foraging behavior of ants. Ants pass on information about the trail they are following by spreading a chemical substance called pheromone in the environment. Other ants that arrive in the vicinity are more likely to take the path with a higher concentration of pheromone than the paths with lower concentrations. Hence, the desirability of possible paths is proportional to their pheromone concentrations. The pheromone evaporates with time and becomes insignificant unless new pheromone is added. This indirect form of information passing via the environment helps the ants to find the shortest path to a food source.

If two paths between a food source and the ant nest are initially discovered by some ants, then the longer of the two paths will soon become unattractive to subsequent ants, because the ants following it will take longer to reach the food source and return, and hence the pheromone concentration on that path will not increase as rapidly as on the shorter path. If an established path is blocked, some ants will first go to the left and some to the right with equal probability. However, an ant taking the shorter of the two detours will return earlier than an ant taking the longer one, and hence the pheromone on the shorter path will be reinforced sooner and subsequent ants will have a higher probability of taking it. Soon a new shorter path which bypasses the blockage will be established. The more ants pursue a given route, the more attractive that route becomes to the ants that follow.

This phenomenon has been modeled in the ACO algorithm. An artificial ant constructs a solution to the problem by adding solution components one by one. When a solution is constructed, its quality (i.e., fitness) is determined and the components of the solution are assigned pheromone concentrations proportional to that quality. Subsequently, other ants construct their solutions one by one, guided by the pheromone concentrations in their search for components to be added to their solutions. The components with higher pheromone concentrations are thus identified as contributing to good solutions and repeatedly appear in the solutions. It is expected that after a while the artificial ants converge on a good, if not the optimal, solution. These simple agents cooperate with one another to find high quality solutions for problems with large search spaces.

Since its inception, ACO has been applied to solve many problems. It is naturally suited to discrete optimization problems, such as quadratic assignment, job scheduling, subset problems, network routing, vehicle routing, load dispatch in power systems, bioinformatics, and, of course, data mining. To apply ACO to a problem we need the following (a sketch of the resulting loop is given after this list):

• The complete solution can be represented as a combination of different components.
• There should be a method to evaluate the fitness or quality of a solution.
• It is desirable, though not necessary, that there is a heuristic measure for each component of the solution.


Figure 3-1 Environment of ants in the form of a graph

Figure 3-1 shows an example environment created for search by ants. It represents a simple problem of finding the shortest path from a source to a destination. The graph is represented as G = (V, E), where V denotes the set of vertices (nodes) and E the set of edges of the graph. The length of a path is given by the number of nodes in the path or, alternatively, by cost values associated with the edges comprising the path. Typically each edge of the graph connecting two nodes Vi and Vj has two associated values, called the pheromone value and the heuristic value. In this example problem the heuristic value can be the cost associated with an edge. The pheromone values are changed by the ants when they visit the nodes. When ants go from one node to another, they have to decide which node to move to next. For this purpose, the ants use the pheromone and heuristic values associated with each edge of the graph and calculate the probability of going to a particular node.

If an ant is on node i, the probability of selecting node j is calculated according to Equation (3.1):

$$p_{i,j} = \frac{[\tau_{i,j}]^{\alpha} \cdot [\eta_{i,j}]^{\beta}}{\sum_{v \in S} [\tau_{i,v}]^{\alpha} \cdot [\eta_{i,v}]^{\beta}} \qquad (3.1)$$

Where τi,j is the pheromone value associated with the link between nodes i and j, ηi,j is the heuristic value between nodes i and j, and S is the set of those nodes which are connected with node i and which have not yet been visited by the ant. The parameters α and β indicate the relative importance of the pheromone value and the heuristic value, respectively.
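To make Equation (3.1) concrete, the following is a minimal sketch of the node-selection step in Python (the dictionaries tau and eta, the candidate set S and the function name are illustrative assumptions, not part of any particular AntMiner implementation):

import random

def select_next_node(i, candidates, tau, eta, alpha=1.0, beta=1.0):
    """Choose the next node for an ant sitting on node i (Equation 3.1).

    tau[(i, j)] and eta[(i, j)] hold the pheromone and heuristic values of
    the edge i -> j; `candidates` is the list S of unvisited neighbours of i.
    """
    weights = [(tau[(i, j)] ** alpha) * (eta[(i, j)] ** beta) for j in candidates]
    total = sum(weights)
    # Roulette-wheel selection proportional to the normalized weights
    r = random.uniform(0, total)
    acc = 0.0
    for j, w in zip(candidates, weights):
        acc += w
        if acc >= r:
            return j
    return candidates[-1]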

3.3 ACO Based Classification Rule Discovery: AntMiner Algorithms

There are different versions of the Ant Miner algorithm, proposed by different authors. These are discussed below.

3.3.1 AntMiner

A few authors have applied ACO for the discovery of classification rules. The first ACO based algorithm for classification rule discovery, called Ant Miner, was proposed by Parpinelli et al. [65]. An ant constructs a rule. It starts with an empty rule and incrementally constructs it by adding one term at a time. The selection of a term to be added is probabilistic and based on two factors: a heuristic quality of the term and the amount of pheromone deposited on it by the previous ants. The authors use information gain as the heuristic value of a term. The rule construction continues until one of two situations occurs. The first is that every remaining term would, if added, cause the rule to cover fewer cases than a user specified threshold called Min_cases_per_rule (minimum number of cases covered by the rule). The second is that there are no more attributes that could be inserted in the rule because all attributes have already been used by the ant. When one of these two stopping conditions is met, the ant's tour is considered complete (the rule's antecedent part is complete). The consequent of the rule is assigned by taking a majority vote among the training samples covered by the rule. The constructed rule is then pruned to remove irrelevant terms and to improve its accuracy. The quality of the constructed rule is determined and pheromone values are

updated on the trail traversed by the ant, in proportion to the quality of the rule. After this, a new ant starts with the updated pheromone values to guide its search. When all ants have constructed their rules, the best rule among them is selected and added to a discovered rule list. The training samples correctly classified by that rule are deleted from the training set. This process continues until the number of uncovered samples is less than a threshold specified by the user. The final product is an ordered discovered rule list that is used to classify the test data.

3.3.1.1 Heuristic function

The original Ant Miner uses a heuristic function based on information theory. It is based on the entropy measure of a term, which is defined as:

$$H(W \mid A_i = V_{ij}) = -\sum_{w=1}^{k} P(w \mid A_i = V_{ij}) \cdot \log_2 P(w \mid A_i = V_{ij}) \qquad (3.2)$$

Where:
• w is the class attribute.
• k is the number of classes in the domain of the class attribute.
• Ai is the i-th attribute.
• Vij is the j-th value of the domain of the attribute Ai.
• P(w | Ai = Vij) is the empirical probability of observing class w conditional on having observed Ai = Vij.

If the entropy of a term is higher, the classes are more uniformly distributed and hence the predictive ability of the term is lower. The heuristic function is:

$$\eta_{ij} = \frac{\log_2 k - H(W \mid A_i = V_{ij})}{\sum_{i=1}^{a} x_i \cdot \sum_{j=1}^{b_i} \left(\log_2 k - H(W \mid A_i = V_{ij})\right)} \qquad (3.3)$$

Where:
• a is the total number of attributes in the data set.
• xi is set to 1 if the attribute Ai was not yet selected by the current ant, and to 0 otherwise.
• bi is the number of possible values in the domain of the i-th attribute.


This heuristic function is a local heuristic function, as it considers an individual term and does not consider correlation among attributes. This heuristic is similar to the one used in decision tree algorithms such as C4.5.

3.3.1.2 Fitness Function

The quality Q of a rule in Ant Miner is determined by:

$$Q = \frac{TP}{TP + FN} \cdot \frac{TN}{FP + TN} \qquad (3.4)$$

Where:
• TP is the number of true positives, i.e. the number of cases covered by the rule that have the class predicted by the rule.
• FP is the number of false positives, i.e. the number of cases covered by the rule that have a class different from the class predicted by the rule.
• FN is the number of false negatives, i.e. the number of cases not covered by the rule that have the class predicted by the rule.
• TN is the number of true negatives, i.e. the number of cases not covered by the rule that do not have the class predicted by the rule.
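As an illustration, Equation (3.4) can be computed directly from these four counts. The following small Python sketch (the function name and example counts are hypothetical, introduced only for illustration) shows the sensitivity-times-specificity form:

def antminer_rule_quality(tp, fp, fn, tn):
    """Sensitivity x specificity, as in Equation (3.4)."""
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    specificity = tn / (fp + tn) if (fp + tn) > 0 else 0.0
    return sensitivity * specificity

# Example: a rule covering 40 positives and 5 negatives out of 50/50 samples
print(antminer_rule_quality(tp=40, fp=5, fn=10, tn=45))  # 0.8 * 0.9 = 0.72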

3.3.1.3 Transition Rule

The transition rule used by Ant Miner is given by:

$$P_{ij} = \frac{\eta_{ij} \cdot \tau_{ij}(t)}{\sum_{i=1}^{a} x_i \cdot \sum_{j=1}^{b_i} \left(\eta_{ij} \cdot \tau_{ij}(t)\right)} \qquad (3.5)$$

Where:
• Pij is the probability of a term that is a candidate for selection in the current partial rule.
• ηij is the value of the heuristic function.
• τij(t) is the quantity of pheromone associated with the term.
• a is the number of attributes.
• bi is the number of possible values in the domain of the i-th attribute.

3.3.1.4 Pheromone Update

When an ant completes its tour, the pheromone values of all terms are updated. The pheromones of those terms that occur in the rule are increased in proportion to the quality of the rule:

$$\tau_{ij}(t+1) = \tau_{ij}(t) + \tau_{ij}(t) \cdot Q, \quad \forall i,j \in R \qquad (3.6)$$

The pheromone values of all terms are then normalized, dividing each pheromone value by the sum of all pheromone values. In this way, the pheromone values of those terms that do not occur in the rule are decreased.

3.3.1.5 Rule Pruning

Rule pruning is used to remove irrelevant terms in order to improve a rule's accuracy. The rule pruning procedure is applied to every rule constructed by an ant. The basic idea is to repeatedly remove one term at a time and check whether the removal improves the accuracy. This process continues until there is no term whose removal can increase the accuracy. The authors compare their Ant Miner algorithm with two well known rule induction algorithms, C4.5 and CN2, on different public domain datasets. They show that Ant Miner is competitive with C4.5 and CN2 in predictive accuracy. Furthermore, the rule list discovered by Ant Miner is simpler than the rules discovered by the two compared algorithms.

3.3.2 AntMiner 2

Extensions of the Ant Miner algorithm are proposed by Liu et al. in two of their works, Ant Miner2 [66] and Ant Miner3 [67-68]. Ant Miner2 has the same algorithm as the original Ant Miner; the difference is that AntMiner2 uses density estimation as the heuristic function instead of information gain. The authors show that this simpler heuristic does the job as well as the more complex one, and hence Ant Miner2 is computationally less expensive than the original Ant Miner while having comparable performance.

3.3.3 AntMiner 3

In Ant Miner3 the authors use a different pheromone update method [67-68]. They update and evaporate the pheromone of only those terms that occur in the rule and do not evaporate the pheromones of unused terms. In this way exploration is encouraged. They also use a different state transition rule, namely a pseudo-random proportional transition rule. This transition rule encourages exploitation and concentrates on the best solution instead of encouraging exploration constantly.

3.3.4 AntMiner+

David Martens et al. [69] propose a Max-Min Ant System based algorithm (AntMiner+) that differs from the previously proposed AntMiners in several aspects. Only the iteration-best ant is allowed to update the pheromone, the range of the pheromone trail is limited within an interval, the class label of a rule is chosen prior to the construction of the rule, and a different rule quality measure is used. The search space of AntMiner+ is different from that of previous AntMiner versions. An ant starts from the start vertex and walks through the environment to the stop vertex; during this walk it must visit each attribute vertex. The above mentioned algorithms are the main variants of the AntMiner algorithm.

3.3.5 Other Ant Miner Versions

Other works on AntMiner include [70], in which an algorithm for discovering unordered rule sets is presented. When classifying data with an unordered rule set, a case might be covered by more than one rule. The authors use different methods of assigning a class label to a test case. If only one of the discovered rules covers the test case, then the test case is assigned the class label of that rule. If more than one of the discovered rules covers the test case, then the test case is assigned the class label of the rule with the highest rule quality. The authors show that AntMiner performs well with the unordered approach. In [71] a hybrid ACO/PSO algorithm is proposed. The PSO algorithm is used for continuous valued attributes and ACO for nominal valued ones. These two algorithms are jointly used to construct rules. For the continuous attributes the standard PSO algorithm with constriction is used [72]. A two dimensional array is used for each continuous attribute, one dimension for the lower bound and the other for the upper bound. At each evaluation, the array is converted into a set of rule conditions and added to the rule list produced by the PSO/ACO algorithm for fitness evaluation.

The issue of continuous attributes has also been dealt with in [73]. The authors use an entropy based discretization method for handling continuous attributes during the rule discovery process, which creates discrete intervals for each continuous attribute. An AntMiner version for the multi-label classification problem can be found in [74]. In multi-label classification there are two or more class attributes to be predicted, and the consequent of a rule involves more than one class label. Each ant constructs a set of rules, where different rules predict different class attributes. The algorithm uses a pheromone matrix for each class attribute. ACO has also been applied for discovering fuzzy classification rules [75]. The algorithm discovers fuzzy rules rather than crisp rules. Multiple populations of ants are maintained and each population discovers rules of one class. Each ant constructs a fuzzy rule by adding one fuzzy condition at a time to its current partial rule. ACO has also been applied to the web page classification problem [76].

3.4 Summary

This chapter discusses the foundations from which the research problem arises. It describes swarm intelligence, a branch of evolutionary computation based on the collective behavior of real ants, bees and other social insects. The ant colony optimization metaheuristic, which falls under the umbrella of swarm intelligence, has been discussed in detail. The chapter describes how ant based algorithms work and what needs to be defined in order to map a problem onto ant colony optimization. The previously proposed AntMiner versions, namely Ant Miner, Ant Miner2, Ant Miner3, Ant Miner+ and other variants, have been described, along with the working of each version and how they differ from one another. The motivation for this research comes from the study of these different versions of Ant Miners. Chapter 4 describes the proposed AntMiner-C algorithm for the classification task of data mining.


4 Chapter 4: Correlation Based Ant Miner

Classification is a data mining technique that has been studied by statisticians and machine learning researchers for a very long time. It is a process in which classes are predefined and a classifier is built (learnt) which allocates data samples to classes. Many classification algorithms already exist, such as decision trees, neural networks, k-nearest neighbor classifiers and support vector machines. Some of them (e.g. neural networks and support vector machines) are incomprehensible and opaque to humans, while others are comprehensible (e.g. decision trees). In many applications classifiers are required to be comprehensible as well as accurate. In this chapter we propose a novel ACO based algorithm for discovering classification rules from supervised data.

4.1 Introduction

ACO has previously been applied for the discovery of classification rules. The first ACO based algorithm for classification rule discovery, called AntMiner, was proposed by Parpinelli et al. [65]. The authors use information gain as the heuristic value of a term. Extensions of the AntMiner algorithm are proposed by Liu et al. in AntMiner2 [66] and AntMiner3 [67-68]. In AntMiner2 the authors propose density estimation as a heuristic function instead of the information gain used by AntMiner. David Martens et al. [69] propose a Max-Min Ant System based algorithm (AntMiner+). In this chapter, a new ACO based classification algorithm called AntMiner-C, for the discovery of classification rules using supervised training data, is presented. The algorithm is a sequential covering algorithm; it progressively discovers rules and builds a rule set. A rule is built by first assigning it a class label and then adding terms to its antecedent part until a stopping criterion is met. Its main feature is a heuristic function based on the correlation among the attributes. The algorithm also takes into consideration the overall discriminatory capability of the term to be added in order to guide the ant in choosing the next term. Other highlights include the manner in which class labels are assigned to the rules prior to their discovery, a strategy for dynamically

stopping the addition of terms in a rule's antecedent part, and a strategy for pruning redundant rules from the rule set. We study the performance of our proposed approach on twenty six commonly used datasets and compare it with the original AntMiner algorithm, the decision tree builder C4.5, Ripper, logistic regression and support vector machines (SVM). Experimental results show that the accuracy rate obtained by AntMiner-C is better than that of the compared algorithms, although the average number of rules and average number of terms per rule are higher. The proposed algorithm AntMiner-C has several similarities with previously proposed AntMiners, yet it is also different in many ways. The main differences are presented in Table 4.1.


Table 4.1 Comparison of different versions of AntMiners

Heuristic function:
  Ant Miner: entropy based, η_aij = −Σ_{w=1}^{k} P(w | A_i = V_aij) · log2 P(w | A_i = V_aij)
  Ant Miner2 / Ant Miner3: density based, η_aij = |Majority_Class(T_aij)| / |T_aij|
  Ant Miner+: η_aij = |T_aij & Class = class_ant| / |T_aij|
  AntMiner-C: correlation based, η_ij = (|term_i, term_j, class_k| / |term_i, class_k|) · (|term_j, class_k| / |term_j|)

Initial pheromone:
  Ant Miner / Ant Miner2 / Ant Miner3: τ_aij(t=0) = 1 / Σ_{i=1}^{a} b_i
  Ant Miner+: τ_max
  AntMiner-C: τ_ij(t=0) = 1 / Σ_{i=1}^{a} b_i

Class selection:
  Ant Miner / Ant Miner2 / Ant Miner3: after rule construction
  Ant Miner+: prior selection, but each ant can choose a different class label
  AntMiner-C: prior selection, fixed for all ants of an iteration

Heuristic values:
  Ant Miner / Ant Miner2 / Ant Miner3 / Ant Miner+: one heuristic value for each term
  AntMiner-C: multiple heuristic values for each term

Rule pruning:
  Ant Miner / Ant Miner2 / Ant Miner3: all rules are pruned
  Ant Miner+: the best rule of each iteration is pruned
  AntMiner-C: only the best rule is pruned

Rule construction termination:
  Ant Miner / Ant Miner2 / Ant Miner3: on the basis of minimum cases per rule
  Ant Miner+: when all attributes have been considered
  AntMiner-C: when all samples covered by the rule have the same class label

Pheromone matrix:
  Ant Miner / Ant Miner2 / Ant Miner3 / Ant Miner+: symmetric
  AntMiner-C: asymmetric

Rule quality:
  Ant Miner / Ant Miner2 / Ant Miner3: Q = (TP / (TP + FN)) · (TN / (FP + TN))
  Ant Miner+ / AntMiner-C: Q = TP/Covered + TP/N

Pheromone update:
  Ant Miner / Ant Miner2: τ_aij(t+1) = τ_aij(t) + τ_aij(t)·Q, ∀ i, j ∈ R
  Ant Miner3 / AntMiner-C: τ_aij(t+1) = (1−ρ)·τ_aij(t) + (1 − 1/(1+Q))·τ_aij(t)
  Ant Miner+: MAX-MIN update using the iteration-best ant

Default rule:
  Ant Miner / Ant Miner2 / Ant Miner3 / AntMiner-C: majority class of remaining uncovered samples
  Ant Miner+: majority class of the complete training set

Stop training:
  Ant Miner / Ant Miner2 / Ant Miner3 / AntMiner-C: on the basis of maximum uncovered cases
  Ant Miner+: when accuracy decreases on a validation set


Table 4.1 provides a comparison between the different versions of AntMiners, including the proposed AntMiner-C.

Differences with AntMiner, AntMiner2, and AntMiner3

AntMiner-C has:
• Prior selection of the class label.
• A new heuristic function.
• Multiple heuristic values per term: in the other AntMiner versions each term has only one heuristic value, whereas in AntMiner-C a term has a different heuristic value for each term that may precede it.
• Pruning of the best rule only.
• Rule construction termination on the basis of class homogeneity of the samples covered by the rule.
• An asymmetric pheromone matrix.
• A different pheromone update equation (an exception is AntMiner3, which has the same update equation).
• A different method of pheromone normalization.
• A different equation for assessing the quality of a discovered rule and for rule pruning.

Differences with AntMiner+

AntMiner-C has:
• A different search space.
• No provision for discretization of continuous variables within the algorithm.
• A new heuristic function.
• State transition (next term selection) and pheromone update according to the Ant System.
• A different method of class selection, which is done only once; the class then remains fixed for all ant runs made for the corresponding antecedent part of the rule.
• No exclusion of the majority class from the class selection choices.
• Default rule class label assignment according to the majority class of the remaining uncovered samples of the training set.
• Early stopping of the algorithm according to a user defined maximum number of uncovered training samples.
• A criterion for early stopping of rule construction.
• Several iterations of the REPEAT-UNTIL loop for the extraction of a rule, where each iteration consists of only one ant run and the best solution is pruned after exiting from the REPEAT-UNTIL loop. In AntMiner+ the REPEAT-UNTIL loop is executed until the pheromone values on one path converge to τmax and those on all other paths to τmin; there are multiple ants per iteration (1000 ants are used in the experiments reported in [41]), and the best solution from each iteration is pruned. In other words, several solutions are pruned before exiting from the REPEAT-UNTIL loop.

In the next section we present the details of our proposed algorithm.

4.2 Correlation Based AntMiner (AntMiner-C)

In this section we describe our ACO based classification rule mining algorithm, called AntMiner-C, which is based on the correlation between the data attributes. We begin with a general description of the algorithm and then discuss the heuristic function, rule pruning, early stopping, and other components.

4.2.1 General Description

The core of the algorithm is the incremental construction, by an ant, of a classification rule of the type:

IF <term1 AND term2 AND ...> THEN <class>

Each term is an attribute-value pair related by an operator. In our current experiments we use “=” as the only possible relational operator. An example term is “Color = red”. The attribute’s name is “color” and “red” is one of its possible values. Since we use only “=”

any continuous (real-valued) attributes present in the data have to be discretized in a preprocessing step. For the ants to search for a solution, we need to define the space in which the search is to be conducted. This space is defined by the dataset used. The dimensions (or coordinates) of the search space are the attributes (i.e. variables) of the dataset (including the class attribute). The possible values of an attribute constitute the range of values for the corresponding dimension in the search space. For example, a dimension called 'Color' may have three possible values {red, green, blue}. The technique of the ant is to visit a dimension and choose one of its possible values to form an antecedent condition of the rule (e.g. Color = red). The total number of terms for a dataset is calculated according to Equation (4.1):

$$Total\_terms = \sum_{i=1}^{a} b_i \qquad (4.1)$$

Where a is the total number of attributes (excluding the class attribute) and bi is the number of possible values that can be taken on by an attribute Ai. When a dimension has been visited, it cannot be visited again by an ant, because we do not allow conditions of the type “Color = Red OR Color = Green”. The search space is such that an ant may pick a term from any dimension and there is no ordering in which the dimensions can be visited. An example search space represented as a graph is shown in Figure 4-1. In Figure 4-1 there are four attributes A, B, C and D having 3, 2, 3 and 2 possible values, respectively. An ant starts from the “Start” vertex and constructs a rule by adding conditions (attribute-value pairs or terms) for the antecedent part. After a term has been selected, all the other terms from the same attribute are taboo for the ant. Suppose the ant chooses C1 as its first term. All the other terms in the search space involving C become taboo for the ant. Subsequently the ant chooses A3, D1 and B1. The construction process is stopped when the ant reaches the “Stop” vertex. The consequent part of the rule (class label) is known to the ant prior to the rule construction. It can stop prematurely if the addition of the latest term makes the partial rule to cover only those training samples which have the chosen class label.
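A minimal sketch (Python) of how this search space can be represented is given below. The attribute names and values are hypothetical and simply mirror the example of Figure 4-1; the restriction that a visited attribute becomes taboo for the ant is shown explicitly:

# Every attribute-value pair becomes one term; Equation (4.1) is simply the
# size of that term list. Names are illustrative, not from the thesis.
attributes = {            # hypothetical dataset description (class attribute excluded)
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2", "c3"],
    "D": ["d1", "d2"],
}

terms = [(attr, value) for attr, values in attributes.items() for value in values]
total_terms = len(terms)          # Equation (4.1): sum of b_i over all attributes
print(total_terms)                # 10 for the example of Figure 4-1

# Once an ant has used an attribute, the remaining terms of that attribute are taboo:
used_attributes = {"C"}           # e.g. the ant has already chosen C = c1
available = [t for t in terms if t[0] not in used_attributes]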


Figure 4-1 An example problem’s search space represented as a graph

A general description of the AntMiner-C algorithm is shown in Figure 4-2 and its flowchart is given in Figure 4-3. Its basic structure is that of Ant Miner [65]. It is a sequential covering algorithm: it discovers a rule, and the training samples correctly covered by this rule (i.e. samples which satisfy the rule antecedent and have the class predicted by the rule consequent) are removed from the training set. The algorithm then discovers another rule using the reduced training set, after which the training set is further reduced by removing the training samples covered by the newly discovered rule. This process continues until the training set is almost empty; a user-defined number is used as a threshold for terminating the algorithm, and if the number of remaining uncovered samples in the training set is lower than this number then the algorithm stops. One rule is discovered in each execution of the outer WHILE loop. First of all, a class label is chosen from the set of class labels present in the uncovered samples of the training set. Each iteration of the REPEAT-UNTIL loop sends an ant to construct a rule. Each ant starts with an empty list of conditions and constructs the antecedent part of the rule by adding one term at a time. Every ant constructs its rule for the particular class chosen before the beginning of the REPEAT-UNTIL loop. The quality of the rule constructed by an ant is determined, pheromones are updated, and then another rule is constructed.


TrainingSet = {all training samples};
DiscoveredRuleList = {};    /* rule list is initialized with an empty list */
WHILE (Size(TrainingSet) > Max_uncovered_samples)
    t = 1;    /* counter for ants */
    j = 1;    /* counter for rule convergence test */
    Select class label;
    Initialize all trails with the same amount of pheromone;
    Initialize the correlation matrix used by the heuristic function;
    REPEAT
        Send an Ant_t which constructs a classification rule R_t for the selected class;
        Assess the quality of the rule and update the pheromone of all trails;
        IF (R_t is equal to R_t-1)
            THEN j = j + 1;    /* update convergence test */
            ELSE j = 1;
        END IF
        t = t + 1;
    UNTIL (t ≥ No_of_ants) OR (j ≥ No_rules_converg)
    Choose the best rule R_best among all rules R_t constructed by all the ants;
    Prune the best rule R_best;
    Add the pruned best rule R_best to DiscoveredRuleList;
    Remove the training samples correctly classified by the pruned best rule R_best;
END WHILE
Add a default rule to the DiscoveredRuleList;
Prune the DiscoveredRuleList; (Optional)

Figure 4-2 Proposed AntMiner-C algorithm


Figure 4-3 Flow chart of proposed AntMiner-C


Several rules are constructed through an execution of the REPEAT-UNTIL loop. The best one among them, after pruning, is added to the list of discovered rules. The algorithm terminates when the outer WHILE loop exit criterion is met. The output of the algorithm is an ordered set of rules. This set can then be used to classify unseen data. The main features of the algorithm are explained in detail in the following sub-sections. As mentioned earlier, an ant constructs the antecedent part of the rule by adding one term at a time. The choice of adding a term in the current partial rule is based on the pheromone value and heuristic value associated with the term and is explained in the next sub-section.

4.2.2 Rule Construction

Rule construction, which is an important part of the algorithm, is described below.

4.2.2.1 Class Commitment

Our heuristic function (described later) is dependent upon the class of samples and hence we need to decide beforehand the class of the rule being constructed. For each iteration of the WHILE loop of the algorithm, a class is chosen probabilistically, by roulette wheel selection, on the basis of the weights of the classes present in the yet uncovered data samples. The weight of a class is the ratio of its uncovered data samples to the total number of uncovered data samples. If 40% of the yet uncovered data samples are of class "A", then there is a 40% chance of its selection. The class is chosen once and remains fixed for all the ant runs made for the search of the next rule in the REPEAT-UNTIL loop. Our method of class commitment is different from those used by AntMiner and AntMiner+. In AntMiner the consequent part of the rule is assigned after rule construction, according to the majority class among the cases covered by the rule. In AntMiner+ each ant selects the consequent of the rule prior to antecedent construction; the consequent is selected from a set of class labels which does not contain the class label of the samples in majority. In our case, even though the class label is selected prior to rule construction, it is not selected by the individual ants and is fixed for all the ant runs of a REPEAT-UNTIL loop.
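A minimal sketch of this roulette wheel class selection is given below (Python; the function name and data layout are illustrative assumptions, not the thesis implementation, which is in Matlab):

import random

def commit_class(uncovered_labels):
    """Roulette-wheel selection of the rule consequent (Section 4.2.2.1).

    `uncovered_labels` is the list of class labels of the yet uncovered
    training samples; a class is picked with probability equal to its share.
    """
    classes = sorted(set(uncovered_labels))
    weights = [uncovered_labels.count(c) / len(uncovered_labels) for c in classes]
    return random.choices(classes, weights=weights, k=1)[0]

# Example: 40% of the uncovered samples belong to class "A"
labels = ["A"] * 40 + ["B"] * 35 + ["C"] * 25
print(commit_class(labels))   # "A" is returned with probability 0.4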


4.2.2.2 Pheromone Initialization

At the beginning of each iteration of the WHILE loop, the pheromone values on the edges between all terms are initialized with the same amount of pheromone. The initial pheromone is calculated according to Equation (4.6):

$$\tau_{ij}(t=0) = \frac{1}{\sum_{i=1}^{a} b_i} \qquad (4.6)$$

Where a is the total number of attributes (excluding the class attribute) and bi is the number of possible values that can be taken on by an attribute Ai. Since all the pheromone values are the same, the first ant has no historical information to guide its search. This method of pheromone initialization is also used by AntMiner. AntMiner+ utilizes the MAX-MIN Ant System and all the pheromone values are set equal to a value τmax.

4.2.2.3 Term Selection

An ant incrementally adds terms to the antecedent part of the rule that it is constructing. The selection of the next term is subject to the restriction that the attribute Ai of that term should not already be present in the current partial rule. In other words, once a term (i.e. an attribute-value pair) has been included in the rule, no other term containing that attribute can be considered. Subject to this restriction, the probability of selection of a term for addition to the current partial rule is given by Equation (4.2):

$$P_{ij}(t) = \frac{\tau_{ij}^{\alpha}(t)\,\eta_{ij}^{\beta}(s)}{\sum_{i=1}^{a} x_i \sum_{j=1}^{b_i} \left\{\tau_{ij}^{\alpha}(t)\,\eta_{ij}^{\beta}(s)\right\}} \qquad (4.2)$$

where τij(t) is the amount of pheromone associated with the edge between termi and termj for the current ant (the pheromone value may change after the passage of each ant), ηij(s) is the value of the heuristic function for the current iteration s (an iteration finishes when all ants have passed), a is the total number of attributes, xi is a binary variable that is set to 1 if the attribute Ai has not been used by the current ant and else set to 0, and bi is the number of values in the Ai attribute’s domain. The denominator is used to normalize the numerator value for all the possible choices. The parameters alpha (α) and beta (β) are used to control the relative importance of the pheromones and heuristic values in the probability determination of next movement of

the ant. We use α = β = 1, which means that we give equal importance to the pheromone and heuristic values; however, different values may be used (e.g. α = 2, β = 1 or α = 1, β = 3). In our current experiments we have not explored such settings. Equation (4.2) is a classical equation used in the Ant System, the MAX-MIN Ant System, the Ant Colony System (even though there the state transition also depends on one other equation), AntMiner (with α = β = 1), and AntMiner+.

4.2.2.4 Heuristic Function

The heuristic value of a term gives an indication of its usefulness and thus provides a basis to guide the search. We use a heuristic function based on the correlation of the most recently chosen term (attribute-value pair) with the candidate terms in order to guide the selection of the next term. The heuristic function is:

$$\eta_{ij} = \frac{|term_i, term_j, class_k|}{|term_i, class_k|} \cdot \frac{|term_j, class_k|}{|term_j|} \qquad (4.3)$$

The most recently chosen term is termi and the term being considered for selection is termj. The number of uncovered training samples having termi and termj and belonging to the committed class label k of the rule is given by |termi, termj, classk|.

Figure 4-3 Selection of a term for adding in a rule’s antecedent part


This number is divided by the number of uncovered training samples which have termi and which belong to classk, to find the correlation between the terms termi and termj. In Figure 4-3, suppose the specified class label is classk. An ant has recently added termi to its rule and is now looking to add another term from the set of available terms. For two competing terms (termj1 and termj2), we want the chosen term to maximize the correct coverage of the rule as much as possible and to encourage the inclusion of a term which has better potential for inter-class discrimination. These two considerations are incorporated in the heuristic function. The correlation is multiplied by the importance of termj in determining classk. The factor |termj, classk| is the number of uncovered training samples having termj and belonging to classk, and the factor |termj| is the number of uncovered training samples containing termj, irrespective of their class label. Since the class label of the rule is committed before the construction of the rule antecedent, our heuristic function is dependent on the class chosen for the rule. The heuristic function quantifies the relationship of the term to be added with the most recently added term and also takes into consideration the overall discriminatory capability of the term to be added. Consider again Figure 4-3, with the specified class label classk and termi the most recently added term. We take the case of two competing terms (termj1 and termj2), which can be generalized to more terms. We want the chosen term to maximize the correct coverage of the rule; this is encouraged by the first part of the heuristic function. The heuristic value on the link between termi and termj1 is |termi, termj1, classk| / |termi, classk|, and that on the link between termi and termj2 is |termi, termj2, classk| / |termi, classk|. We also want to encourage the inclusion of a term which has better potential for inter-class discrimination; this is made possible by the second portion of the heuristic function. The two competing terms will have the values |termj1, classk| / |termj1| and |termj2, classk| / |termj2|. The two ratios quantifying correct coverage and inter-class discrimination are multiplied to give the heuristic value of the competing terms.
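A minimal sketch of Equation (4.3), computed from counts over the uncovered training samples, is given below (Python; the data layout and function name are assumptions made for illustration):

def correlation_heuristic(term_i, term_j, class_k, uncovered):
    """A sketch of Equation (4.3): correlation of term_j with the last added
    term_i, times the discriminatory power of term_j for the committed class.

    `uncovered` is a list of (sample_terms, label) pairs for the yet uncovered
    training samples; names are illustrative, not from the thesis.
    """
    both_and_class = sum(1 for terms, y in uncovered
                         if term_i in terms and term_j in terms and y == class_k)
    i_and_class    = sum(1 for terms, y in uncovered
                         if term_i in terms and y == class_k)
    j_and_class    = sum(1 for terms, y in uncovered
                         if term_j in terms and y == class_k)
    j_total        = sum(1 for terms, y in uncovered if term_j in terms)
    if i_and_class == 0 or j_total == 0:
        return 0.0   # the combination never occurs for this class
    return (both_and_class / i_and_class) * (j_and_class / j_total)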

The correlation based heuristic function has the potential to be effective in large dimensional search spaces. The heuristic values are calculated only once, at the beginning of the REPEAT-UNTIL loop. The function assigns a zero value to the combination of those terms which do not occur together for a given class label, thus efficiently restricting the search space for the ants. In contrast, each ant of the original AntMiner continues its attempts to add terms to its rule until it is sure that every remaining term would, if added, violate the condition of the minimum number of covered samples. In AntMiner+ also, every ant has to traverse the whole search space and there are no short cuts.

4.2.2.5 Heuristic function for the 1st term

The heuristic value when considering the first term of the rule antecedent is calculated on the basis of the following Laplace-corrected confidence of the term:

$$\eta_j = \frac{|term_j, class_k| + 1}{|term_j| + Total\_classes} \qquad (4.4)$$

Where Total_classes is the number of classes present in the dataset. This heuristic function has the advantage of penalizing terms that would lead to very specific rules and thus helps to avoid over-fitting. For example, if a term occurs in just one training sample and its class is the chosen class, then its confidence is 1 without the Laplace correction; with the Laplace correction, its confidence is 0.66 if there are two classes in the data. Equations (4.3) and (4.4) have not been used before in any AntMiner. AntMiner utilizes a heuristic function based on the entropy of terms and their normalized information gain. AntMiner2 and AntMiner3 use |term_ij, majorityclass(term_ij)| / |term_ij|, and AntMiner+ uses |term_ij, class_k| / |term_ij|.
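To make the Laplace correction concrete, a small sketch of Equation (4.4) reproducing the two-class example above is given here (Python; the function name is hypothetical):

def laplace_confidence(term_class_count, term_count, total_classes):
    """Equation (4.4): Laplace-corrected confidence of a candidate first term."""
    return (term_class_count + 1) / (term_count + total_classes)

# The over-fitting example from the text: a term covering a single training
# sample of the chosen class has raw confidence 1.0, but only about 0.66 after
# the correction when the dataset has two classes.
print(laplace_confidence(term_class_count=1, term_count=1, total_classes=2))  # 0.666...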

4.2.2.6 Rule Termination

An ant continues to add terms to the rule it is constructing and stops only when all the samples covered by the rule have a homogeneous class label (the label committed prior to the REPEAT-UNTIL loop). The other possibility of termination is that there are no more


terms to add. Due to these termination criteria, it may happen that a constructed rule covers only one training sample. Our rule construction stoppage criterion of a homogeneous class label is different from that of Ant Miner and Ant Miner+. In Ant Miner the rule construction is stopped if only those terms are left unused whose addition would make the rule cover a number of training samples smaller than a user defined threshold called minimum samples per rule. In Ant Miner+ one term from each attribute is added; however, each attribute has a don't care option which allows for its non-utilization in the rule.

4.2.3 Rule Quality & Pheromone Update

This section describes how the quality of a rule is evaluated and the pheromone initialization and update strategy.

4.2.3.1 Quality of a Rule

When an ant has completed the construction of a rule, its quality is measured. The quality, Q, of a rule is computed by using the confidence and the coverage of the rule and is given by:

$$Q = \frac{TP}{Covered} + \frac{TP}{N} \qquad (4.5)$$

where TP is the number of samples covered by the rule that have the same class label as the rule's consequent, Covered is the total number of samples covered by the rule, and N is the number of samples in the training set not yet covered by any rule in the discovered rule set. The second portion is added to encourage the construction of rules with wider coverage. Equation (4.5) has been used in Ant Miner+, whereas Ant Miner uses sensitivity multiplied by specificity as the quality measure.

4.2.3.2 Pheromone Matrix Update

An example pheromone matrix is shown in Figure 4-4. The pheromone matrix is asymmetric; pheromone values on links originating from a term to other terms are kept the same in all the layers of the search space.


Figure 4-4 Asymmetric pheromone matrix for the example problem of Figure 4-1

In Figure 4-4 the first row elements are the pheromone values on the links from the Start node to the nodes of the first layer of terms. All other rows are pheromone values for links from a term in the previous layer to terms in the next layer. If an ant chooses the terms C1, A3, D1 and B1 for its rule, then the elements τ06, τ63, τ39 and τ94 are updated according to Equation (4.7). The elements of rows S, C1, A3 and D1 are then normalized; the remaining rows remain unchanged. The pheromone values are updated so that the next ant can make use of this information in its search. The amount of pheromone on the links between consecutive terms occurring in the rule is updated according to Equation (4.7):

$$\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \left(1 - \frac{1}{1+Q}\right)\tau_{ij}(t) \qquad (4.7)$$

where τij(t) is the pheromone value encountered by Antt between termi and termj. The pheromone evaporation rate is represented by ρ, and Q is the quality of the rule constructed by Antt. Equation (4.7) is according to the Ant System and has previously been used in AntMiner3, but with a different equation for determining Q. The equation updates pheromones by first evaporating a percentage of the previously present pheromone and then adding a percentage of pheromone dependent on the quality of the recently discovered rule. If the rule is good, then the pheromone added is greater than the pheromone evaporated and the terms become more attractive for subsequent ants. The


evaporation in the equation improves exploration; otherwise the presence of a static heuristic function tends to make the ants quickly converge to the terms selected by the first few ants. Note that our pheromone matrix is asymmetric, whereas that used in the original Ant Miner is symmetric. This means that if an ant constructs a rule with termi occurring immediately before termj, then the pheromone on the link between termi and termj is updated but the pheromone on the link between termj and termi is not. This is done to encourage exploration and discourage early convergence. The next step is to normalize the pheromone values. Each pheromone value is normalized by dividing it by the summation of the pheromone values of all its competing terms given the current term. In the pheromone matrix (Figure 4-4), normalization of the elements is done by dividing them by the sum of the values of the row to which they belong. Note that for those rows in which no change has occurred for the ant run under consideration, the normalization process yields the same values as before. Referring to Figure 4-1, normalization of the pheromone value of a link originating from a term is done by dividing it by the summation of all the pheromone values of the links originating from that term. This process changes the amount of pheromone associated with those terms that do not occur in the most recently constructed rule but are competitors of the selected terms. If the quality of the rule has been good and there has been a pheromone increase on the terms used in the rule, then the competing terms become less attractive for the subsequent ants; the reverse is true if the rule found is not of good quality. The normalization process is an indirect way of simulating pheromone evaporation. Note that in the original AntMiner every element of the pheromone matrix is normalized by dividing it by the sum of all elements, which is unnecessary and tends to discourage exploration and favor early convergence.
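A minimal sketch of Equation (4.7) together with the row-wise normalization described above is given here (Python; the data layout and names are illustrative assumptions, not the thesis implementation):

def update_pheromone(tau, rule_terms, Q, rho=0.15):
    """Equation (4.7) plus row-wise normalization of the asymmetric matrix.

    `tau` is a dict of dicts: tau[i][j] is the pheromone on the directed link
    from term i to term j; `rule_terms` is the ordered list of terms of the
    rule just built, starting with the 'Start' node.
    """
    # Evaporate and reinforce only the links actually used, in the used direction
    for i, j in zip(rule_terms, rule_terms[1:]):
        tau[i][j] = (1 - rho) * tau[i][j] + (1 - 1 / (1 + Q)) * tau[i][j]
    # Normalize each affected row; competing terms of the chosen term thereby
    # become less (or more) attractive depending on the rule quality Q.
    for i in rule_terms[:-1]:
        row_sum = sum(tau[i].values())
        for j in tau[i]:
            tau[i][j] /= row_sum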

4.2.4 Termination of REPEAT-UNTIL Loop

The REPEAT-UNTIL loop is used to construct as many rules as the user defined number of ants. After the construction of each rule, its quality is determined and the pheromones on the trails are updated accordingly. The pheromone values guide the construction of the next rule. An early termination of this loop is possible if the last few ants have

constructed the same rule. This implies that the pheromone values on a trail have become very high and convergence has been achieved; any further rule construction will most probably yield the same rule again, hence the loop is terminated prematurely. For this purpose, each constructed rule is compared with the last rule and a counter is incremented if both are the same. If the value of this counter exceeds a threshold "No. of rules converged", then the loop is terminated. In our experiments, we use a value of 10 for this threshold. This method of early termination of the REPEAT-UNTIL loop is also used by Ant Miner. In Ant Miner+ this loop terminates when the pheromone values on one path converge to τmax and all other paths have τmin, as required by the MAX-MIN Ant System; it has no provision for early termination of this loop.

4.2.5 Pruning of Rule

Rule pruning is the process of finding and removing irrelevant terms that might have been included in the constructed rule. Rule pruning has two advantages. First, it increases the generalization of the rule, thus potentially increasing its predictive accuracy. Second, a shorter rule is usually simpler and more comprehensible. In AntMiner-C, rule construction by an ant stops when all the samples covered by the rule have a homogeneous class label (the label committed prior to the REPEAT-UNTIL loop); the other possibility of termination is that there are no more terms to add. Those rules whose construction is stopped due to homogeneity of the class label of the covered samples are already compact, and any attempt at pruning is bound to cause non-homogeneity. However, those rules which have one term from every attribute may be improved by pruning of terms. As a compromise we do not prune all found rules but apply pruning only to the best rule discovered during an iteration of the algorithm. This means that pheromone updates are done with un-pruned rules. The rule pruning procedure starts with the full rule. It temporarily removes the first term and determines the quality of the resulting rule according to Equation (4.5). It then puts that term back, temporarily removes the second term, and again calculates the quality of the resulting rule. This process continues until all the terms present in the rule have been dealt with. After this assessment, if there is no term whose removal improves or maintains the


quality, then the original rule is retained. However, if there are one or more terms whose removal improves (or maintains) the quality of the rule, then the term whose removal most improves the quality is permanently removed. In such cases, the shortened rule is again subjected to the rule pruning procedure. The process continues until any further shortening is impossible (removal of any term present in the rule leads to a decrease in its quality) or there is only one remaining term in the rule. Rule pruning is also used by Ant Miner and Ant Miner+. However, the rule quality criterion used in Ant Miner [37] is different. Furthermore, in Ant Miner each discovered rule is subjected to rule pruning (prior to the pheromone update), whereas we prune only the best rule among the rules found during the execution of the REPEAT-UNTIL loop of the algorithm, and our pheromone update is prior to and independent of rule pruning. Pruning only the best rule reduces the computational cost of the algorithm; a potential disadvantage is that pruning all found rules might have yielded a better rule than the one found. Ant Miner+ also prunes the best rule, but it is pertinent to note that Ant Miner+ has several iterations of the REPEAT-UNTIL loop and in each iteration there are 1000 ant runs; the best rule from these 1000 rules is pruned.
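A minimal sketch of this pruning procedure, with the rule quality of Equation (4.5) written out explicitly, is given below (Python; the rule and sample representations are assumptions for illustration: a rule is a dict with a list of terms and a class label, and training samples are (set of terms, label) pairs):

def rule_quality(rule, training, n_uncovered):
    """Equation (4.5): confidence plus coverage of a rule."""
    covered = [(x, y) for x, y in training if all(t in x for t in rule["terms"])]
    tp = sum(1 for _, y in covered if y == rule["label"])
    if not covered:
        return 0.0
    return tp / len(covered) + tp / n_uncovered

def prune_best_rule(rule, training, n_uncovered):
    """Greedy pruning of the best rule (Section 4.2.5): keep removing the term
    whose removal most improves (or maintains) the Equation (4.5) quality."""
    best = dict(rule)
    while len(best["terms"]) > 1:
        base_q = rule_quality(best, training, n_uncovered)
        candidates = []
        for t in best["terms"]:
            shorter = {"terms": [u for u in best["terms"] if u != t],
                       "label": best["label"]}
            candidates.append((rule_quality(shorter, training, n_uncovered), shorter))
        q, shorter = max(candidates, key=lambda c: c[0])
        if q < base_q:
            break            # no removal improves or maintains the quality
        best = shorter
    return best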

4.2.6 Final Rule Set

The best rule is placed in the discovered rule set after pruning, and the training samples correctly covered by the rule are flagged and have no role in the discovery of other rules. The algorithm checks whether the uncovered training samples are still above the threshold defined by "Max_uncovered_samples". If that is the case, a new iteration starts for the discovery of the next rule. If not, a final default rule is added to the rule set and the rule set may optionally be pruned of redundant rules. The rule set may then be used for classifying unseen samples. These aspects are discussed below.

4.2.7 Early Stoppage of Algorithm

The algorithm can be continued until the training dataset is empty. However, this leads to rules which usually cover only one or two training samples and thus over-fit the data; such rules usually do not have generalization capability. Some of the methods to avoid this phenomenon are:

• Use a separate validation set to monitor the training. The algorithm can be stopped when the accuracy on the validation set starts to dip.
• Specify a threshold on the number of training samples present in the dataset. If, after an iteration of the REPEAT-UNTIL loop, the remaining samples in the training set are equal to or below this specified threshold, then stop the algorithm. The threshold can be defined as a percentage of the samples present in the initial training set.
• Specify a threshold on the maximum number of rules to be found.

The validation set is not appropriate for small datasets because we have to divide the total data samples into training, validation and test sets. We have opted for the second option and used a fixed threshold for all the datasets. AntMiner and Ant Miner+ both employ early stopping. We use the same option for early stopping as is done in Ant Miner while Ant Miner+ uses the validation set technique for large datasets (greater than 250 samples) and a threshold of 1% remaining samples for small datasets.

4.2.8 Default Rule

A final default rule is added at the bottom of the rule set. This rule is without any conditions and has a consequent part only. The class label assigned to it is the majority class label of the remaining uncovered samples of the training set. A default rule is used by all AntMiner versions, and our method is the same as that used by Ant Miner. Ant Miner+, however, assigns the default rule class label on the basis of the majority class of the complete set of training samples.

4.2.9 Pruning of Rule Set

Some of the rules may be redundant in the final rule set. The training samples correctly classified by a rule may all be correctly classified by one or more rules occurring later in the rule set. In such cases, the earlier rules are redundant and their removal will not decrease the accuracy of the rule set. Furthermore, the removal of redundant rules increases the comprehensibility of the rule set. We have developed a procedure which attempts to reduce the number of rules without compromising the accuracy obtained.


The rule set pruning procedure is applied to the final rule set, which includes the default rule. Each rule is a candidate for removal. The procedure checks the first rule, and if removing it from the rule list does not decrease the accuracy on the training data, it is permanently removed. After dealing with the first rule, the second rule is checked and is either retained or removed, and each subsequent rule is subjected to the same procedure in its turn. Our experiments (described later on) show that the technique is effective in reducing the number of rules. The remaining rules also tend to have fewer terms per rule on average. However, the strategy is greedy: although it makes the final classifier relatively faster, it tends to decrease predictive accuracy for some datasets. On the other hand, increased accuracy on other datasets makes us believe that the pruning sometimes results in a rule set with superior generalization capability. Hence, pending further investigation, we have made this procedure optional.
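A minimal sketch of this redundant-rule removal is given below (Python; the ordered rule list, the training data layout and the classify helper are illustrative assumptions, not the thesis implementation):

def prune_rule_set(rules, training, classify):
    """Greedy rule set pruning (Section 4.2.9): drop a rule whenever its
    removal does not decrease accuracy on the training data.

    `rules` is the ordered discovered rule list (default rule last), `training`
    a list of (sample, label) pairs, and `classify(rule_list, sample)` applies
    the ordered list to a sample.
    """
    def accuracy(rule_list):
        hits = sum(1 for x, y in training if classify(rule_list, x) == y)
        return hits / len(training)

    kept = list(rules)
    baseline = accuracy(kept)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if trial and accuracy(trial) >= baseline:
            kept = trial              # removal does not decrease training accuracy
            baseline = accuracy(kept)
        else:
            i += 1
    return kept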

4.2.10 Use of Discovered Rule Set for Classifying Unseen Samples

A new test sample, unseen during training, is classified by applying the rules in the order of their discovery. The first rule whose antecedents match the new sample is fired and the class predicted by the rule's consequent is assigned to the sample. If none of the discovered rules fires, then the final default rule is activated.
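A small sketch of this ordered-list classification is given here (Python; the rule representation and the example terms are hypothetical, loosely mirroring the search space of Figure 4-1):

def classify(rules, sample):
    """Apply the ordered rule list (Section 4.2.10) to one unseen sample.

    Each rule is assumed to be a dict {"terms": set_of_terms, "label": class};
    the last rule is the default rule with an empty antecedent, so it always fires.
    """
    for rule in rules:
        if all(term in sample for term in rule["terms"]):
            return rule["label"]

# Example with hypothetical terms
rules = [
    {"terms": {("C", "c1"), ("A", "a3")}, "label": "yes"},
    {"terms": set(), "label": "no"},          # default rule
]
print(classify(rules, {("C", "c1"), ("A", "a3"), ("B", "b1")}))  # "yes"
print(classify(rules, {("A", "a1")}))                            # "no" (default)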

4.3 Summary

This chapter sets the foundation for the next chapters. ACO has previously been used to discover classification rules from datasets. In this chapter, a novel classification algorithm using ACO is introduced. The main highlights are the use of a heuristic function based on the correlation between the recently added term and the term to be added to the rule, the manner in which class labels are assigned to the rules prior to their discovery, and a strategy for dynamically stopping the addition of terms to a rule's antecedent part. The chapter describes how the proposed AntMiner-C differs from other versions of Ant Miners. We present a complete description of the proposed approach, including the construction process of a rule by an ant, the heuristic function, the rule quality measure, and the rule pruning procedure. Next, in Chapter 5 we present the experimentation performed for showing

the worth of the proposed approach and compare the results with other state of the art classification algorithms. We also present a modified version of the original AntMiner-C.


5 Chapter 5: Investigation of Components and Parameter Optimization

In this chapter we report our experimental setup, experiments for investigating different components of the algorithm proposed in the previous chapter, and the results obtained. The investigated components are: class choice before or after rule construction, a comparison of criteria for termination of rule construction, pruning of all rules versus pruning of the best rule only, and the method of forming a default rule. The algorithm was tested on 12 datasets. As a result of these experiments we modify the algorithm in order to improve it. The improved AntMiner-C is then tested on 26 datasets.

5.1 Experiments and Analysis

This section reports some of the experiments on AntMiner-C.

5.1.1 Datasets

For the experiments of the current and the next section, we have used twelve datasets obtained from the UCI repository [77]. The main characteristics of the datasets are summarized in Table 5.1. The datasets in this suite have reasonable variety in terms of number of attributes, samples and classes, and are commonly used. The proposed algorithm works with categorical attributes, and continuous attributes need to be discretized in a preprocessing step. We use the unsupervised discretization filter of the Weka-3.4 machine learning tool for discretizing continuous attributes. This filter first computes the intervals of the continuous attributes from the training dataset and then uses these intervals to discretize them.

Table 5.1 Characteristics of datasets

Dataset                                Attributes   Samples   Classes
Wisconsin breast cancer (WBC)               9          683        2
Wine                                       13          178        3
Credit (Australia)                         15          690        2
Credit (Germany)                           19         1000        2
Car                                         6         1728        4
Tic-tac-toe                                 9          958        2
Iris                                        4          150        3
Balance-scale                               4          625        3
Teacher Assistant Evaluation (TAE)          6          151        3
Glass                                       9          214        7
Heart                                      13          270        2
Hepatitis                                  19          155        2

5.1.2 Performance Metrics

Our performance metrics are: predictive accuracy, number of rules, and number of terms per rule.

5.1.2.1 Predictive Accuracy

The predictive accuracy is defined as the percentage of true predictions among all predictions on the test set. The experiments are performed using a ten-fold cross validation procedure. A dataset is divided into ten equally sized, mutually exclusive subsets. Each of the subsets is used once for testing while the other nine are used for training. The results of the ten runs are then averaged and this average is reported as the final result.

5.1.2.2 Number of Rules

This is the average number of rules in the rule sets obtained by ten-fold cross validation.

5.1.2.3 Number of Terms per Rule

This is the average number of terms per rule over all the rule sets obtained by ten-fold cross validation.
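A minimal sketch of the ten-fold cross validation protocol is given here (Python; the helper names are illustrative assumptions, and the actual experiments of this thesis were run in Matlab and Weka):

import random

def ten_fold_accuracy(samples, train_and_test):
    """Ten-fold cross validation as described in Section 5.1.2.1.

    `samples` is the full dataset; `train_and_test(train, test)` is assumed to
    train a classifier on `train` and return its accuracy on `test`.
    """
    data = list(samples)
    random.shuffle(data)
    folds = [data[i::10] for i in range(10)]      # ten roughly equal subsets
    accuracies = []
    for i in range(10):
        test = folds[i]
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        accuracies.append(train_and_test(train, test))
    return sum(accuracies) / len(accuracies)      # averaged over the ten runs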

5.1.3 Parameters Setting

AntMiner-C has six user defined parameters: number of ants, maximum uncovered cases, evaporation rate, convergence counter, alpha and beta (the weights of the pheromone and heuristic values). The values of these parameters are given in Table 5.2. These values have been chosen because they seem reasonable and have been used by other AntMiner versions reported in the literature [65-70].

Table 5.2 Parameters used in the experiments

Parameter                  Value
Number of Ants             1000
Max. uncovered cases       10
Evaporation rate           0.15
No. of rules converged     10
Alpha                      1
Beta                       1

5.1.4 Results
We obtained and compared the results of our algorithm with those of AntMiner, C4.5, Ripper, logistic regression and support vector machines (SVM). AntMiner-C and AntMiner have been implemented by us in Matlab. For the other algorithms we use the Weka machine learning tool. The predictive accuracies, average number of rules per discovered rule set, and average number of terms per rule are shown in Tables 5.3 and 5.4. All results are obtained using ten-fold cross validation.

Table 5.3 Average predictive accuracies obtained using 10-fold cross validation

Datasets | AntMiner-C | AntMiner | C4.5 | Ripper | Logistic Reg. | SVM
BC-W | 97.54 ± 0.98 | 94.64 ± 2.74 | 94.84 ± 2.62 | 95.57 ± 2.17 | 96.56 ± 1.21 | 96.70 ± 0.69
Wine | 98.24 ± 2.84 | 90.0 ± 9.22 | 96.60 ± 3.93 | 94.90 ± 5.54 | 96.60 ± 4.03 | 98.30 ± 2.74
Credit(Aus) | 89.42 ± 4.21 | 86.09 ± 4.69 | 81.99 ± 7.78 | 86.07 ± 2.27 | 85.77 ± 4.75 | 85.17 ± 2.06
Credit(Ger) | 73.64 ± 2.67 | 71.62 ± 2.71 | 70.73 ± 6.71 | 70.56 ± 5.96 | 75.82 ± 4.24 | 75.11 ± 3.63
Car | 98.02 ± 0.96 | 82.38 ± 2.42 | 96.0 ± 2.13 | 89.17 ± 2.52 | 93.22 ± 2.10 | 93.74 ± 2.65
Tic-tac-toe | 100 ± 0.0 | 74.95 ± 4.26 | 94.03 ± 2.44 | 97.57 ± 1.44 | 98.23 ± 0.50 | 98.33 ± 0.53
Iris | 97.33 ± 4.66 | 95.33 ± 4.50 | 94.0 ± 6.63 | 94.76 ± 5.26 | 97.33 ± 5.62 | 96.67 ± 3.52
Balance-scale | 86.61 ± 6.18 | 75.32 ± 8.86 | 83.02 ± 3.24 | 80.93 ± 3.35 | 88.30 ± 2.69 | 87.98 ± 1.80
TAE | 77.33 ± 10.5 | 50.67 ± 6.11 | 51.33 ± 9.45 | 44.67 ± 10.35 | 53.33 ± 11.33 | 58.67 ± 10.98
Glass | 74.29 ± 6.43 | 53.33 ± 4.38 | 68.90 ± 8.98 | 70.48 ± 8.19 | 63.65 ± 6.72 | 57.70 ± 8.10
Heart | 77.78 ± 5.79 | 80.74 ± 4.94 | 78.43 ± 6.26 | 73.59 ± 9.57 | 77.0 ± 5.05 | 80.32 ± 6.25
Hepatitis | 87.33 ± 9.66 | 80.67 ± 8.67 | 68.25 ± 11.63 | 73.46 ± 8.21 | 64.25 ± 8.87 | 75.37 ± 8.62

Table 5.4 Average number of rules per discovered rule set and average number of terms per rule, obtained using 10-fold cross validation

Datasets | Rules/Rule Set: AntMiner-C | AntMiner | C4.5 | Ripper | Terms/Rule: AntMiner-C | AntMiner | C4.5 | Ripper
BC-W | 18.60 | 11.0 | 10.50 | 5.10 | 1.45 | 1.02 | 2.32 | 1.79
Wine | 4.10 | 5.50 | 5.30 | 3.90 | 1.65 | 1.04 | 1.41 | 1.62
Credit(Aus) | 7.80 | 3.90 | 74.80 | 4.60 | 1.58 | 1.0 | 3.22 | 1.81
Credit(Ger) | 10.0 | 8.50 | 73.60 | 4.20 | 1.71 | 1.13 | 3.21 | 2.36
Car | 57.60 | 11.40 | 80.26 | 41.10 | 2.49 | 1.03 | 2.59 | 4.01
Tic-tac-toe | 14.80 | 6.60 | 38.60 | 10.30 | 2.50 | 1.09 | 2.64 | 2.82
Iris | 10.90 | 9.20 | 5.50 | 3.90 | 1.05 | 1.0 | 1.22 | 1.03
Balance-scale | 98.30 | 17.70 | 40.10 | 11.10 | 2.51 | 1.0 | 2.85 | 2.91
TAE | 44.50 | 20.90 | 18.30 | 3.90 | 1.48 | 1.0 | 2.69 | 1.64
Glass | 41.60 | 15.50 | 15.40 | 7.20 | 2.0 | 1.01 | 2.83 | 2.33
Heart | 9.20 | 5.60 | 12.60 | 5.60 | 1.83 | 1.08 | 1.73 | 1.86
Hepatitis | 7.30 | 3.90 | 11.60 | 4.60 | 2.55 | 1.11 | 1.70 | 1.0

The results indicate that AntMiner-C achieves a higher accuracy rate than the five compared algorithms for most of the datasets. However, the number of rules and the number of terms per rule generated by our proposed technique are mostly higher. The reason is that we allow the generation of rules with low coverage, i.e. rules which cover only a few training examples are also allowed.

[Figure 5-1: Average accuracies over all datasets for different classification techniques (bar chart; y-axis: accuracy; x-axis: Ant-Miner-C, Ant-Miner, C4.5, Ripper, Log. Reg., SVM)]

Figure 5-1 shows the average accuracies achieved by the different techniques over all the datasets. This average is calculated by summing, for each technique, the ten-fold averages over all datasets and then dividing by the number of datasets. Figure 5-1 shows that the proposed AntMiner-C achieves a higher accuracy rate than the compared algorithms.

5.1.5 Number of Probability Calculations
An important aspect of AntMiner-C is its potential for extracting rules from high dimensional datasets in reasonable time. The nature of the heuristic function (Equation 4.3), combined with the fact that the class is selected prior to rule construction, reduces the search space considerably: it assigns a value of zero to all terms which do not occur together for the given class in the training samples of the dataset. The heuristic matrix is calculated once and remains fixed for one complete execution of the REPEAT-UNTIL loop. The costliest computation in the algorithm, which is done repeatedly, is the probability equation (Equation 3); one calculation of the probability equation requires one scan of the database. A counter of its usage therefore gives a direct indication of the amount of search space which an ant has had to cover. We present some probability-equation usage counts, which indicate the reduction of the search space of AntMiner-C and also highlight its performance improvement over AntMiner. The datasets used in this limited experiment are Iris, Wine and Dermatology, with 4, 13 and 33 attributes respectively. The experiment was run for one fold only. The ratios of overall probability calls of AntMiner to AntMiner-C for the three datasets are 11.07, 6.06 and 8.35, respectively.

Iris dataset (4 attributes, 150 samples, 3 classes)
AntMiner: Rule #1: ants used = 49, probability calls = 1255; Rule #2: 58, 1674; Rule #3: 58, 1549; Rule #4: 40, 1150; Rule #5: 43, 1288; Rule #6: 24, 755; Rule #7: 11, 372.
Accuracy: 86.66, rules found = 7, terms/rule = 1, grand total of probability calls = 9193.
AntMiner-C: Rule #1: ants used = 51, probability calls = 73; Rule #2: 66, 111; Rule #3: 33, 41; Rule #4: 71, 107; Rule #5: 65, 83; Rule #6: 49, 58; Rule #7: 72, 142; Rule #8: 71, 168; Rule #9: 40, 47.
Accuracy: 93.33, rules found = 9, terms/rule = 1, grand total of probability calls = 830.

Wine dataset (13 attributes, 178 samples, 3 classes)
AntMiner: Rule #1: ants used = 38, probability calls = 1023; Rule #2: 17, 477; Rule #3: 230, 5646.
Accuracy: 94.44, rules found = 3, terms/rule = 2, grand total of probability calls = 7146.
AntMiner-C: Rule #1: ants used = 74, probability calls = 178; Rule #2: 31, 51; Rule #3: 120, 587; Rule #4: 137, 362.
Accuracy: 100, rules found = 4, terms/rule = 1.25, grand total of probability calls = 1178.

Dermatology dataset (33 attributes, 366 samples, 6 classes)
AntMiner: Rule #1: ants used = 25, probability calls = 2813; Rule #2: 26, 3083; Rule #3: 504, 58656; Rule #4: 1000, 117240; Rule #5: 131, 15700; Rule #6: 21, 2542; Rule #7: 11, 1347.
Accuracy: 80.48, rules found = 7, terms/rule = 1.88, grand total of probability calls = 201,381.
AntMiner-C: Rule #1: ants used = 134, probability calls = 377; Rule #2: 47, 156; Rule #3: 80, 121; Rule #4: 467, 4268; Rule #5: 104, 259; Rule #6: 45, 82; Rule #7: 68, 121; Rule #8: 65, 130; Rule #9: 550, 4348; Rule #10: 162, 684; Rule #11: 188, 927; Rule #12: 123, 336; Rule #13: 42, 61; Rule #14: 87, 167; Rule #15: 90, 165; Rule #16: 50, 146; Rule #17: 654, 5317; Rule #18: 609, 6431.
Accuracy: 97.29, rules found = 18, terms/rule = 1.89, grand total of probability calls = 24,096.

5.1.6 Convergence Speed
In Table 5.2, we specified the value of the 'Number of ants' parameter as 1000, that is, a maximum of 1000 rules can be constructed, out of which the best one is chosen to be placed in the discovered rule set. In practice, on average, far fewer ants are used because the REPEAT-UNTIL loop terminates as soon as convergence is achieved (all of the recently discovered rules are duplicates of each other). For this purpose the threshold parameter 'Number of rules converged' (10 in our experiments) is used. The average of the actual number of ants used per iteration is reported in Table 5.5. Convergence speed is an important aspect, particularly for large and high dimensional datasets.

Table 5.5 Average number of ant runs per iteration

Dataset | Avg. ants/iteration | Dataset | Avg. ants/iteration
Breast-cancer-w | 166 | Iris | 77
Wine | 80 | Balance-scale | 148
Credit(Aus) | 175 | TAE | 108
Credit(Ger) | 349 | Glass | 137
Car | 146 | Heart | 216
Tic-tac-toe | 308 | Hepatitis | 259

5.2 Analysis of Different Algorithmic Components
This section reports experiments for analyzing several algorithmic components of AntMiner-C.

5.2.1 Class Choice Prior to or After Rule Construction
In this experiment we deviate from our algorithm by first constructing the rule antecedent and then assigning the rule consequent. The rule consequent assigned is the majority class label of the samples covered by the rule. Since our heuristic function (Equation 4) requires prior commitment of the class label, it has to be modified for this particular experiment. The heuristic function used is:

\eta_{ij} = \frac{|term_i, term_j|}{|term_i|} \qquad (5.1)

The heuristic function used for the first layer of terms is:

\eta_j = -\sum_{k=1}^{m} \frac{|term_j, class_k|}{|term_j|} \, \log_2\!\left(\frac{|term_j, class_k|}{|term_j|}\right) \qquad (5.2)
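A small sketch of how Equations (5.1) and (5.2) can be computed from co-occurrence counts is given below. The count arguments are assumed to be precomputed from the uncovered training samples; this is only an illustration, not the thesis implementation.

    import math

    def eta_pair(count_ij, count_i):
        """Eq. (5.1): fraction of samples having term_i that also have term_j."""
        return count_ij / count_i if count_i else 0.0

    def eta_first_layer(class_counts_given_j, count_j):
        """Eq. (5.2): entropy of the class distribution among samples having term_j."""
        h = 0.0
        for c in class_counts_given_j:          # |term_j, class_k| for each class k
            p = c / count_j if count_j else 0.0
            if p > 0.0:
                h -= p * math.log2(p)           # a zero probability contributes nothing
        return h

    # Example: term_j occurs in 40 samples, 30 of class A and 10 of class B
    print(eta_pair(25, 40))               # 0.625
    print(eta_first_layer([30, 10], 40))  # ~0.811 bits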

Experimental results are shown in Table 5.6. The predictive accuracy is lower than that of our original algorithm. In our opinion, the prior commitment of the class label focuses the search (all the ants in an iteration are searching for the best version of the same rule), resulting in more appropriate rules.

Table 5.6 Results obtained by first constructing the rule antecedent and then choosing the class label

Dataset | Posterior Class Commitment: Accuracy | #R | #T/R | Prior Class Commitment: Accuracy | #R | #T/R
Breast-cancer-w | 95.36 ± 1.65 | 14.8 | 1.12 | 97.54 ± 0.98 | 18.60 | 1.45
Wine | 95.88 ± 4.84 | 4.2 | 1.39 | 98.24 ± 2.84 | 4.10 | 1.65
Credit(Aus) | 85.80 ± 4.58 | 7.67 | 1 | 89.42 ± 4.21 | 7.80 | 1.58
Credit(Ger) | 70.53 ± 3.40 | 9.2 | 1.34 | 73.64 ± 2.67 | 10.0 | 1.71
Car | 95.58 ± 1.20 | 46.2 | 2.05 | 98.02 ± 0.96 | 57.60 | 2.49
Tic-tac-toe | 89.79 ± 2.98 | 12.8 | 1.77 | 100 ± 0.0 | 14.80 | 2.50
Iris | 96.00 ± 4.66 | 11 | 1 | 97.33 ± 4.66 | 10.90 | 1.05
Balance-scale | 83.23 ± 4.18 | 56.4 | 1.59 | 86.61 ± 6.18 | 98.30 | 2.51
TAE | 73.33 ± 7.70 | 44.2 | 1.01 | 77.33 ± 10.5 | 44.50 | 1.48
Glass | 65.71 ± 8.92 | 36.90 | 1.09 | 74.29 ± 6.43 | 41.60 | 2.0
Heart | 74.44 ± 7.70 | 8.70 | 1.01 | 77.78 ± 5.79 | 9.20 | 1.83
Hepatitis | 85.33 ± 4.30 | 4.30 | 1.05 | 87.33 ± 9.66 | 7.30 | 2.55

5.2.2 Termination of Rule Construction
In our algorithm, terms are added to a rule until all the samples covered by it have the same class label or until there are no more terms to be added. There is no restriction that a rule should cover a minimum number of samples; a constructed rule might cover only one sample. In this section, we report the results of an experiment in which the algorithm is the same as proposed, except for the restriction that a rule being created must cover at least a minimum number of samples. We set this threshold to 10, a value used in [65] for the same purpose. A term is added to a rule only if the rule still covers at least 10 samples after its addition. As a result, all the constructed rules cover a minimum of ten samples. The results of this experiment are shown in Table 5.7. The predictive accuracy decreases on most of the datasets. The reason is that when we restrict a rule in this way we may miss a very effective term that would have increased the accuracy of the rule. Another problem with this approach is that datasets differ in their number of dimensions and number of samples, and it is difficult to decide how this parameter should be set for each of them. Our experiment, though limited, provides some evidence that the strategy of constructing rules by adding terms without any restriction is effective. However, we note that an advantage of the restriction is that fewer rules are obtained.

Table 5.7 Results obtained by imposing the restriction that a constructed rule must cover a minimum of 10 samples

Dataset | Minimum Rule Coverage: Accuracy | #R | #T/R | No Minimum Rule Coverage: Accuracy | #R | #T/R
Breast-cancer-w | 96.64 ± 2.06 | 15.33 | 1.14 | 97.54 ± 0.98 | 18.60 | 1.45
Wine | 98.24 ± 2.84 | 4.70 | 1.49 | 98.24 ± 2.84 | 4.10 | 1.65
Credit(Aus) | 85.94 ± 2.56 | 4.20 | 1.31 | 89.42 ± 4.21 | 7.80 | 1.58
Credit(Ger) | 69.65 ± 3.67 | 7.46 | 1.43 | 73.64 ± 2.67 | 10.0 | 1.71
Car | 96.57 ± 1.04 | 52.90 | 2.57 | 98.02 ± 0.96 | 57.60 | 2.49
Tic-tac-toe | 91.89 ± 4.41 | 11.90 | 2.22 | 100 ± 0.0 | 14.80 | 2.50
Iris | 96.00 ± 5.62 | 9.60 | 1.08 | 97.33 ± 4.66 | 10.90 | 1.05
Balance-scale | 80.48 ± 4.33 | 45.40 | 1.65 | 86.61 ± 6.18 | 98.30 | 2.51
TAE | 74.00 ± 11.90 | 28.90 | 1.09 | 77.33 ± 10.5 | 44.50 | 1.48
Glass | 58.10 ± 11.40 | 17.70 | 1.93 | 74.29 ± 6.43 | 41.60 | 2.0
Heart | 80.74 ± 6.49 | 6.60 | 1.48 | 77.78 ± 5.79 | 9.20 | 1.83
Hepatitis | 85.33 ± 11.67 | 3.30 | 1.72 | 87.33 ± 9.66 | 7.30 | 2.55

5.2.3 Rule Pruning: All Rules versus Best Rule
In this experiment, we prune each rule constructed by the ants prior to the pheromone update. All other parameters are kept the same. The results of this experiment are shown in Table 5.8. The predictive accuracy is lower for nine out of the twelve datasets, but the number of rules generated is also reduced; the number of terms per rule is almost the same. The reason for the dip in accuracy is that there are many rules whose construction is stopped due to homogeneity of the class label of the samples covered. These rules are already compact, and any attempt at pruning causes non-homogeneity. The criterion for assessing the quality of a rule has two components: confidence and coverage. While attempting to prune a rule which covers samples of a homogeneous class label, the confidence decreases but this may be compensated by an increase in coverage; a rule may thus be pruned towards lower confidence but better coverage than before. Such pruned rules cause a pheromone increase on terms which may be misleading for future ants. Also, and more importantly, the final best rule selected from the pruned rules may not have the same discriminatory capability as the final rule selected from the un-pruned rules.
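Since the quality criterion combines confidence and coverage, the sketch below shows one way such a product-style quality could be computed for a candidate rule represented as a list of (attribute, value) terms. The exact quality formula of AntMiner-C is the one defined in Chapter 4; treating quality as confidence × coverage here is an assumption made only for illustration.

    def rule_quality(rule, target_class, samples):
        """Illustrative quality = confidence * coverage:
        confidence = covered-and-correct / covered
        coverage   = covered-and-correct / samples of the target class."""
        covered = [s for s in samples if all(s["x"].get(a) == v for a, v in rule)]
        correct = [s for s in covered if s["y"] == target_class]
        n_class = sum(1 for s in samples if s["y"] == target_class)
        confidence = len(correct) / len(covered) if covered else 0.0
        coverage = len(correct) / n_class if n_class else 0.0
        return confidence * coverage

    # Example: rule "outlook=sunny AND humidity=high => class 'no'"
    data = [{"x": {"outlook": "sunny", "humidity": "high"}, "y": "no"},
            {"x": {"outlook": "rain",  "humidity": "high"}, "y": "yes"},
            {"x": {"outlook": "sunny", "humidity": "high"}, "y": "no"}]
    print(rule_quality([("outlook", "sunny"), ("humidity", "high")], "no", data))  # 1.0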


Table 5.8 Results of pruning each constructed rule versus pruning only the best rule

Dataset | Pruning All Generated Rules: Accuracy | #R | #T/R | Pruning Only the Best Rule: Accuracy | #R | #T/R
Breast-cancer-w | 94.06 ± 3.95 | 13.10 | 1.21 | 97.54 ± 0.98 | 18.60 | 1.45
Wine | 94.71 ± 5.15 | 4.40 | 1.55 | 98.24 ± 2.84 | 4.10 | 1.65
Credit(Aus) | 84.93 ± 4.54 | 4.0 | 1.43 | 89.42 ± 4.21 | 7.80 | 1.58
Credit(Ger) | 73.54 ± 3.15 | 4.40 | 1.29 | 73.64 ± 2.67 | 10.0 | 1.71
Car | 96.51 ± 2.33 | 46.80 | 2.52 | 98.02 ± 0.96 | 57.60 | 2.49
Tic-tac-toe | 86.21 ± 5.07 | 10.40 | 1.87 | 100 ± 0.0 | 14.80 | 2.50
Iris | 95.33 ± 4.50 | 8.90 | 1.04 | 97.33 ± 4.66 | 10.90 | 1.05
Balance-scale | 87.26 ± 4.65 | 96.20 | 2.46 | 86.61 ± 6.18 | 98.30 | 2.51
TAE | 72.0 ± 14.67 | 45.70 | 1.52 | 77.33 ± 10.5 | 44.50 | 1.48
Glass | 78.10 ± 12.13 | 37.10 | 1.98 | 74.29 ± 6.43 | 41.60 | 2.0
Heart | 80.0 ± 6.81 | 5.0 | 1.43 | 77.78 ± 5.79 | 9.20 | 1.83
Hepatitis | 81.33 ± 10.33 | 4.20 | 2.09 | 87.33 ± 9.66 | 7.30 | 2.55

5.2.4 Rule Set Pruning
This experiment tests the effect of rule set pruning. The algorithm is run with and without the procedure, and the results are shown in Table 5.9. From the table we can see that the accuracy sometimes improves and sometimes decreases. This shows that the procedure sometimes removes rules which were not redundant, while at other times the removal results in a rule set with superior generalization capability. However, the number of rules consistently decreases when pruning is used; the remaining rules are simpler, and the average number of terms per rule also decreases in most cases. If a smaller rule set is desired, this option can be used.

Table 5.9 Comparison of results with and without pruning of redundant rules from the rule set

Dataset | Pruning of Rule Set: Accuracy | #R | #T/R | No Pruning of Rule Set: Accuracy | #R | #T/R
Breast-cancer-w | 97.39 ± 1.78 | 13.20 | 1.26 | 97.54 ± 0.98 | 18.60 | 1.45
Wine | 98.24 ± 2.84 | 3.80 | 1.55 | 98.24 ± 2.84 | 4.10 | 1.65
Credit(Aus) | 86.67 ± 3.97 | 3.10 | 1.45 | 89.42 ± 4.21 | 7.80 | 1.58
Credit(Ger) | 72.12 ± 3.76 | 8.60 | 2.06 | 73.64 ± 2.67 | 10.0 | 1.71
Car | 97.85 ± 1.16 | 48.20 | 2.48 | 98.02 ± 0.96 | 57.60 | 2.49
Tic-tac-toe | 100 ± 0.0 | 8.90 | 2.61 | 100 ± 0.0 | 14.80 | 2.50
Iris | 98.00 ± 3.22 | 6.50 | 1.03 | 97.33 ± 4.66 | 10.90 | 1.05
Balance-scale | 87.58 ± 5.49 | 58.10 | 2.26 | 86.61 ± 6.18 | 98.30 | 2.51
TAE | 78.00 ± 13.68 | 30.90 | 1.55 | 77.33 ± 10.5 | 44.50 | 1.48
Glass | 74.76 ± 6.75 | 37.20 | 1.99 | 74.29 ± 6.43 | 41.60 | 2.0
Heart | 77.78 ± 6.49 | 6.90 | 1.98 | 77.78 ± 5.79 | 9.20 | 1.83
Hepatitis | 88.0 ± 10.11 | 6.90 | 2.52 | 87.33 ± 9.66 | 7.30 | 2.55

5.2.5 Default Rule
In the original AntMiner-C we keep on discovering new rules until the number of uncovered training samples becomes less than or equal to 10. A final default rule is then added to the rule set; it does not have any conditions (antecedent part) and its class label (consequent part) is the majority class of the uncovered training samples.

Default Rule: ELSE Class Label = Majority class of uncovered training samples

In this sub-section, we experiment with an alternative method for specifying the default rule: we keep on discovering new rules until all the remaining training samples have the same class label, and then add a default rule with that class label. The results of this experiment are shown in Table 5.10. The accuracy improves on the majority of the datasets. However, the average number of rules increases. Still, there are three datasets on which the average number of rules found decreases. This means that for these datasets the training samples become homogeneous (in class label) even when more than 10 of them remain, whereas the original AntMiner-C keeps on finding rules for the same class until the uncovered samples drop to 10 or fewer. Finally, we note that the new method discovers rules with fewer terms per rule.

Table 5.10 Comparison of default-rule methods: Option 1 – default rule takes the majority class of the remaining uncovered samples (original method); Option 2 – rules are discovered until the remaining training samples have the same class label, which becomes the default class

Datasets | Accuracy: Option 1 | Option 2 | Rules/Rule Set: Option 1 | Option 2 | Terms/Rule: Option 1 | Option 2
Breast-cancer-w | 97.54 ± 0.98 | 97.42 ± 2.76 | 18.60 | 21.0 | 1.45 | 1.39
Wine | 98.24 ± 2.84 | 98.82 ± 2.48 | 4.10 | 5.60 | 1.65 | 1.38
Credit(Aus) | 89.42 ± 4.21 | 86.81 ± 3.51 | 7.80 | 5.70 | 1.58 | 1.25
Credit(Ger) | 73.64 ± 2.67 | 72.5 ± 2.51 | 10.0 | 7.63 | 1.71 | 1.63
Car | 98.02 ± 0.96 | 98.20 ± 0.82 | 57.60 | 60.60 | 2.49 | 2.56
Tic-tac-toe | 100 ± 0.0 | 100 ± 0.0 | 14.80 | 13.71 | 2.50 | 2.77
Iris | 97.33 ± 4.66 | 98.0 ± 3.22 | 10.90 | 12.90 | 1.05 | 1.08
Balance-scale | 86.61 ± 6.18 | 90.23 ± 1.96 | 98.30 | 103.70 | 2.51 | 2.39
TAE | 77.33 ± 10.5 | 76.79 ± 10.59 | 44.50 | 49.10 | 1.48 | 1.54
Glass | 74.29 ± 6.43 | 76.19 ± 8.98 | 41.60 | 47.30 | 2.0 | 2.04
Heart | 77.78 ± 5.79 | 78.89 ± 6.07 | 9.20 | 9.40 | 1.83 | 1.70
Hepatitis | 87.33 ± 9.66 | 87.75 ± 7.73 | 7.30 | 10.40 | 2.55 | 2.31

5.3 Improved AntMiner-C
This section describes the modified AntMiner-C algorithm in light of the experiments described above. The description of the improved algorithm is shown in Figure 5-2. The changes with respect to the original AntMiner-C algorithm are:

• The algorithm now keeps on discovering rules until the last remaining samples have homogeneous class labels. Previously the algorithm stopped when the number of remaining samples fell below or became equal to a user-specified number (the maximum number of uncovered cases). This change is shown in line 3 of Figure 5-2. As the experimental results show, this stopping condition yields a superior accuracy rate.
• In line 23, the default rule added at the end of the discovered rule set has the class label of the last remaining samples (which have homogeneous class labels). In the previous version of the algorithm the default class was the majority class of the remaining uncovered training samples.
• The rule set pruning procedure (listed in line 24 of Figure 4-2) is removed from the algorithm.

1   TrainingSet = {all training samples};
2   DiscoveredRuleList = {};   /* rule list is initialized with an empty list */
3   WHILE (TrainingSet has samples of more than one (non-homogeneous) class label)
4       t = 1;   /* counter for ants */
5       j = 1;   /* counter for rule convergence test */
6       Select class label;
7       Initialize all trails with the same amount of pheromone (Eq. 2);
8       Initialize the matrix of values used by the heuristic function;
9       REPEAT
10          Send an Ant_t which constructs a classification rule R_t for the selected class (Eqs. 3, 4, 5);
11          Assess the quality of the rule (Eq. 6) and update the pheromone of all trails (Eq. 7);
12          IF (R_t is equal to R_t-1)   /* update convergence test */
13              THEN j = j + 1;
14          ELSE j = 1;
15          END IF
16          t = t + 1;
17      UNTIL (t ≥ No_of_ants) OR (j ≥ No_rules_converg)
18      Choose the best rule R_best among all rules R_t constructed by all the ants;
19      Prune the best rule R_best;
20      Add the pruned best rule R_best to DiscoveredRuleList;
21      Remove the training samples correctly classified by the pruned best rule R_best;
22  END WHILE
23  Add a default rule to the DiscoveredRuleList;

Figure 5-2 Improved AntMiner-C algorithm
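The pseudocode of Figure 5-2 can be read as the Python skeleton below. All helper routines (class selection, pheromone and heuristic initialization, rule construction, quality assessment and pruning) are placeholders standing for the corresponding equations of Chapter 4; the skeleton only mirrors the control flow and is not the thesis' Matlab implementation.

    def improved_antminer_c(training_set, helpers, no_of_ants=1000, no_rules_converg=10):
        """Control-flow skeleton of the improved AntMiner-C (Figure 5-2).
        `helpers` bundles the problem-specific steps as callables."""
        discovered_rules = []
        # keep discovering rules until the remaining samples are class-homogeneous
        while len({s["y"] for s in training_set}) > 1:
            target = helpers.select_class(training_set)
            pheromone = helpers.init_pheromone(training_set)
            heuristic = helpers.init_heuristic(training_set, target)
            t, j, prev_rule, rules = 1, 1, None, []
            while t < no_of_ants and j < no_rules_converg:
                rule = helpers.construct_rule(pheromone, heuristic, target, training_set)
                helpers.update_pheromone(pheromone, rule, training_set, target)
                rules.append(rule)
                j = j + 1 if rule == prev_rule else 1      # convergence test
                prev_rule, t = rule, t + 1
            best = max(rules, key=lambda r: helpers.quality(r, target, training_set))
            best = helpers.prune(best, target, training_set)
            discovered_rules.append((best, target))
            training_set = [s for s in training_set
                            if not helpers.covers_correctly(best, target, s)]
        # default rule: class label of the (now homogeneous) remaining samples
        default_class = training_set[0]["y"] if training_set else None
        discovered_rules.append(([], default_class))
        return discovered_rules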

5.4 Parameter Optimization
This section describes some parameter optimization experiments for the AntMiner-C algorithm and their results. AntMiner-C has the following user-defined parameters:

• Number of ants
• Number of rules converged (for early termination of the REPEAT-UNTIL loop)
• The powers of the pheromone and heuristic values (α, β) in the probabilistic selection
• The pheromone evaporation rate (ρ)

The number of ants is not a sensitive parameter due to the early convergence option of the REPEAT-UNTIL loop; any high number (e.g. the 1000 used by us) serves adequately. The number of rules converged is an indication of pheromone saturation and does not seem to be a sensitive parameter as long as it is not a very small number. The relationship between α, β and ρ is more complex and needs to be analyzed experimentally; for reference, a sketch of the probabilistic selection controlled by α and β is given below.
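The α and β parameters weight the pheromone and heuristic in the probabilistic term selection of Chapter 4 (Eq. 3). The sketch below shows a standard ACO roulette-wheel selection using those weights; normalizing over the currently allowed candidate terms is an assumption about how the selection is organized, included only for illustration.

    import random

    def choose_next_term(current, candidates, tau, eta, alpha=1.0, beta=3.0, rng=random):
        """Roulette-wheel selection of the next term j given the current term i:
        P(j | i) is proportional to tau[i][j]**alpha * eta[i][j]**beta."""
        weights = [(tau[current][j] ** alpha) * (eta[current][j] ** beta) for j in candidates]
        total = sum(weights)
        if total == 0.0:
            return None              # no candidate co-occurs with the chosen class
        r, acc = rng.random() * total, 0.0
        for j, w in zip(candidates, weights):
            acc += w
            if r <= acc:
                return j
        return candidates[-1]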

5.4.1 Relative Importance of Pheromones and Heuristics
In the experiments so far we used alpha = beta = 1. In this section, we experiment with different values of alpha and beta. The value of alpha is varied from 0 to 3 while beta is kept at 1; then alpha is set to 1 and beta is varied from 1 to 4. This gives the following alpha/beta ratios: 3, 2, 1, 0, 0.5, 0.33, and 0.25. The results are shown in Tables 5.11, 5.12 and 5.13.

Table 5.11 Predictive accuracy results obtained for different values of alpha and beta. The default rule is based on the uncovered training samples having the same class label

Datasets | α=3, β=1 | α=2, β=1 | α=1, β=1 | α=0, β=1 | α=1, β=2 | α=1, β=3 | α=1, β=4
BC-W | 96.85 ± 2.85 | 95.43 ± 2.67 | 97.42 ± 2.76 | 96.28 ± 1.38 | 96.56 ± 3.05 | 97.85 ± 1.69 | 96.42 ± 1.68
Wine | 98.89 ± 2.34 | 98.33 ± 2.68 | 98.82 ± 2.48 | 98.33 ± 2.68 | 98.89 ± 3.51 | 99.44 ± 1.76 | 97.22 ± 2.93
Credit(Aus) | 86.23 ± 3.36 | 82.75 ± 3.95 | 86.81 ± 3.51 | 87.39 ± 4.88 | 85.22 ± 4.03 | 87.54 ± 3.21 | 87.10 ± 5.86
Credit(Ger) | 71.26 ± 6.14 | 71.96 ± 6.58 | 72.5 ± 2.51 | 70.20 ± 6.38 | 71.56 ± 4.93 | 72.46 ± 5.13 | 72.0 ± 2.92
Car | 97.51 ± 1.68 | 97.74 ± 1.14 | 98.20 ± 0.82 | 97.74 ± 0.92 | 97.57 ± 1.30 | 98.03 ± 1.17 | 97.69 ± 1.57
Tic-tac-toe | 98.0 ± 1.51 | 98.85 ± 2.23 | 100 ± 0.0 | 100 ± 0.0 | 100 ± 0.0 | 100 ± 0.0 | 100 ± 0.0
Iris | 96.67 ± 4.71 | 96.0 ± 4.66 | 98.0 ± 3.22 | 94.67 ± 5.26 | 98.0 ± 3.22 | 98.0 ± 4.50 | 96.67 ± 4.71
Balance-scale | 90.54 ± 4.22 | 88.18 ± 4.34 | 90.23 ± 1.96 | 88.0 ± 3.35 | 88.17 ± 3.05 | 87.49 ± 6.34 | 89.28 ± 3.37
TAE | 83.38 ± 9.09 | 70.80 ± 12.36 | 76.79 ± 10.59 | 73.50 ± 11.77 | 75.54 ± 8.12 | 81.38 ± 11.72 | 73.54 ± 10.30
Glass | 78.51 ± 6.24 | 74.76 ± 9.42 | 76.19 ± 8.98 | 73.29 ± 10.07 | 71.93 ± 9.32 | 82.27 ± 6.67 | 74.16 ± 8.78
Heart | 79.63 ± 11.35 | 81.85 ± 8.81 | 78.89 ± 6.07 | 81.85 ± 11.64 | 81.48 ± 8.0 | 80.74 ± 9.37 | 82.59 ± 6.77
Hepatitis | 89.75 ± 8.89 | 86.50 ± 9.54 | 87.75 ± 7.73 | 88.42 ± 7.75 | 89.17 ± 11.0 | 87.17 ± 7.81 | 89.67 ± 7.47

Table 5.12 Average number of rules obtained for different values of alpha and beta. The default rule is based on the uncovered training samples having the same class label

Datasets | α=3, β=1 | α=2, β=1 | α=1, β=1 | α=0, β=1 | α=1, β=2 | α=1, β=3 | α=1, β=4
BC-W | 23.40 | 21.90 | 18.60 | 20.70 | 22.10 | 20.40 | 23.10
Wine | 6.50 | 5.70 | 4.10 | 3.70 | 4.90 | 6 | 5.50
Credit(Aus) | 6 | 3.40 | 7.80 | 6.80 | 6.90 | 5.50 | 9.90
Credit(Ger) | 12.46 | 12.20 | 10.0 | 19 | 6.50 | 13.50 | 6.40
Car | 56.90 | 54.70 | 57.60 | 59.90 | 53.80 | 58 | 61.50
Tic-tac-toe | 18.90 | 16.30 | 14.80 | 15.30 | 14.10 | 12.30 | 14
Iris | 13.70 | 15.0 | 10.90 | 14.90 | 15.50 | 12.80 | 13.70
Balance-scale | 107.40 | 111.80 | 98.30 | 107 | 107 | 108.70 | 119.50
TAE | 49.20 | 51.30 | 44.50 | 53.30 | 48.50 | 48.70 | 55.10
Glass | 44.40 | 49.40 | 41.60 | 44.90 | 47.70 | 42.20 | 47.50
Heart | 13.40 | 10.90 | 9.20 | 10.90 | 9.10 | 13.10 | 12.50
Hepatitis | 12.90 | 10.90 | 7.30 | 9.70 | 9.80 | 11.30 | 9.40

Table 5.13 Average number of terms per rule obtained for different values of alpha and beta. The default rule is based on the uncovered training samples having the same class label

Datasets | α=3, β=1 | α=2, β=1 | α=1, β=1 | α=0, β=1 | α=1, β=2 | α=1, β=3 | α=1, β=4
BC-W | 1.43 | 1.40 | 1.45 | 1.35 | 1.43 | 1.42 | 1.37
Wine | 1.28 | 1.54 | 1.65 | 1.64 | 1.28 | 1.25 | 1.35
Credit(Aus) | 1.57 | 1.52 | 1.58 | 1.43 | 1.57 | 1.57 | 1.45
Credit(Ger) | 1.90 | 1.70 | 1.71 | 1.80 | 1.90 | 1.88 | 1.27
Car | 2.48 | 2.42 | 2.49 | 2.47 | 2.48 | 2.50 | 2.57
Tic-tac-toe | 2.63 | 2.64 | 2.50 | 2.69 | 2.63 | 2.74 | 2.74
Iris | 1.07 | 1.10 | 1.05 | 1.13 | 1.07 | 1.05 | 1.12
Balance-scale | 2.38 | 2.38 | 2.51 | 2.43 | 2.38 | 2.43 | 2.40
TAE | 1.56 | 1.52 | 1.48 | 1.48 | 1.56 | 1.44 | 1.47
Glass | 2.01 | 1.95 | 2.0 | 2.06 | 2.01 | 1.93 | 1.88
Heart | 1.86 | 1.48 | 1.83 | 1.70 | 1.86 | 1.79 | 1.63
Hepatitis | 2.0 | 2.17 | 2.55 | 2.18 | 2.0 | 2.41 | 2.41

From Table 5.11 we can see that better results are obtained when alpha = 1 and beta = 3. These appear to be the more appropriate values of these parameters, and we therefore use them in the subsequent experiments.

5.4.2 Evaporation Rate
In this sub-section, we experiment with different values of the evaporation rate. The pheromone update formula is given in Equation (5.3). The results are shown in Tables 5.14 and 5.15.

\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \left(1 - \frac{1}{1+Q}\right)\tau_{ij}(t) \qquad (5.3)
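A sketch of the pheromone update of Equation (5.3), followed by the row normalization described in Section 5.6.2.3, is given below. Representing the trails as a dictionary of dictionaries and applying Equation (5.3) to the links used by the constructed rule (with the remaining links evaporating implicitly through normalization) are assumptions about the data layout, not the thesis code.

    def update_pheromone(tau, rule_terms, quality, rho=0.15):
        """Apply Eq. (5.3) to the links used by the constructed rule, then
        normalize each row so that competing links sum to one."""
        factor = (1.0 - rho) + (1.0 - 1.0 / (1.0 + quality))   # (1-rho) + (1 - 1/(1+Q))
        for i, j in zip(rule_terms, rule_terms[1:]):            # consecutive terms of the rule
            tau[i][j] *= factor
        for i in tau:                                           # unused links evaporate
            row_sum = sum(tau[i].values())                      # implicitly via normalization
            if row_sum > 0:
                for j in tau[i]:
                    tau[i][j] /= row_sum

    # Example with two competing links and a rule of quality Q = 0.8
    tau = {"A=1": {"B=1": 0.5, "B=2": 0.5}}
    update_pheromone(tau, ["A=1", "B=1"], quality=0.8)
    print(tau)   # the A=1 -> B=1 link gains pheromone relative to A=1 -> B=2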

Table 5.14 Predictive accuracy results obtained using different values of the evaporation rate, with alpha = 1 and beta = 3

Datasets | ρ = 0.0 | ρ = 0.05 | ρ = 0.10 | ρ = 0.15 | ρ = 0.20
BC-W | 97.71 ± 2.27 | 97.71 ± 2.26 | 96.85 ± 2.76 | 97.85 ± 1.69 | 97.42 ± 2.35
Wine | 97.78 ± 3.88 | 96.60 ± 4.81 | 97.22 ± 4.72 | 99.44 ± 1.76 | 98.33 ± 3.75
Credit(Aus) | 85.36 ± 4.12 | 87.25 ± 4.67 | 85.80 ± 3.47 | 87.54 ± 3.21 | 86.23 ± 5.69
Credit(Ger) | 72.56 ± 5.42 | 71.36 ± 4.22 | 70.06 ± 4.0 | 72.46 ± 5.13 | 70.35 ± 4.28
Car | 97.05 ± 2.53 | 97.51 ± 1.06 | 97.45 ± 1.59 | 98.03 ± 1.17 | 97.51 ± 1.52
Tic-tac-toe | 99.79 ± 0.66 | 99.79 ± 0.66 | 100 ± 0.0 | 100 ± 0.0 | 100 ± 0.0
Iris | 96.67 ± 4.71 | 96.0 ± 4.66 | 98.0 ± 3.22 | 98.0 ± 4.50 | 96.67 ± 3.51
Balance-scale | 87.53 ± 4.90 | 87.83 ± 4.92 | 87.19 ± 3.57 | 87.49 ± 6.34 | 87.84 ± 2.83
TAE | 75.50 ± 6.27 | 70.83 ± 10.10 | 74.17 ± 5.84 | 81.38 ± 11.72 | 76.69 ± 8.53
Glass | 85.08 ± 5.61 | 83.25 ± 8.60 | 81.77 ± 6.41 | 82.27 ± 6.67 | 80.43 ± 6.92
Heart | 82.96 ± 8.94 | 85.56 ± 5.08 | 82.96 ± 5.30 | 80.74 ± 9.37 | 83.33 ± 5.59
Hepatitis | 88.46 ± 12.84 | 88.54 ± 10.49 | 85.88 ± 10.10 | 87.17 ± 7.81 | 87.25 ± 10.16

Table 5.15 Average number of rules and average terms per rule obtained using different values of the evaporation rate, with alpha = 1 and beta = 3

Datasets | Rules: ρ=0.0 | ρ=0.05 | ρ=0.10 | ρ=0.15 | ρ=0.20 | Terms/Rule: ρ=0.0 | ρ=0.05 | ρ=0.10 | ρ=0.15 | ρ=0.20
BC-W | 20.60 | 20.70 | 21.40 | 20.40 | 20.0 | 1.41 | 1.35 | 1.48 | 1.42 | 1.39
Wine | 5.70 | 5.50 | 5.90 | 6 | 5.80 | 1.44 | 1.54 | 1.50 | 1.25 | 1.40
Credit(Aus) | 4.10 | 6.90 | 7.90 | 5.50 | 6.30 | 1.31 | 1.47 | 1.49 | 1.57 | 1.61
Credit(Ger) | 13.0 | 9.40 | 5.80 | 13.50 | 7.40 | 2.22 | 1.78 | 1.59 | 1.88 | 1.80
Car | 57.0 | 61.90 | 61.80 | 58 | 60.40 | 2.50 | 2.49 | 2.55 | 2.50 | 2.47
Tic-tac-toe | 14.60 | 13.60 | 13.0 | 12.30 | 14.10 | 2.72 | 2.71 | 2.68 | 2.74 | 2.73
Iris | 13.0 | 13.10 | 13.80 | 12.80 | 13.60 | 1.11 | 1.08 | 1.09 | 1.05 | 1.11
Balance-scale | 99.20 | 94.30 | 89.20 | 108.70 | 92.20 | 2.36 | 2.34 | 2.31 | 2.43 | 2.31
TAE | 49.50 | 50.0 | 49.50 | 48.70 | 48.40 | 1.47 | 1.44 | 1.42 | 1.44 | 1.44
Glass | 49.90 | 51.10 | 47.60 | 42.20 | 47.90 | 1.88 | 1.87 | 1.89 | 1.93 | 1.85
Heart | 9.50 | 9.40 | 9.40 | 13.10 | 9.80 | 1.48 | 1.45 | 1.53 | 1.79 | 1.61
Hepatitis | 10.0 | 10.60 | 11.30 | 11.30 | 10.10 | 2.61 | 2.36 | 2.56 | 2.41 | 2.68

From the experimental results we can see that increasing the evaporation rate encourages exploration but slows down convergence (requiring more computational time), while decreasing it favors exploitation and speeds up convergence (which may lead to sub-optimal results). A balance between exploration and exploitation therefore has to be maintained. We achieve a better accuracy rate on most of the datasets with an evaporation rate of 0.15.

5.5 Results and Comparisons
In this section we compare the results of our algorithm with those of previously proposed ACO based classification rule discovery algorithms: AntMiner, AntMiner2, and AntMiner3. We also compare them with the results obtained for C4.5, which is a commonly used decision tree builder. Our performance metrics for comparison are predictive accuracy, number of rules, and number of terms per rule. The experiments are performed using a ten-fold cross validation procedure. For these comparison experiments we use twenty-six datasets from the UCI repository [77]. The datasets, sorted on the basis of their main characteristics, are shown in Table 5.16.

Table 5.16 Datasets used in the experiments, sorted by number of attributes, number of samples, and number of classes

Dataset (by attributes) | Attributes | Dataset (by samples) | Samples | Dataset (by classes) | Classes
Haberman | 3 | Hayes Roth | 132 | Haberman | 2
Iris | 4 | Iris | 150 | Transfusion | 2
Balance-scale | 4 | TAE | 151 | Mammographic-Mass | 2
Transfusion | 4 | Hepatitis | 155 | Pima Indian Diabetes | 2
Mammographic-Mass | 5 | Wine | 178 | WBC | 2
Car | 6 | Image Segmentation | 210 | Tic-tac-toe | 2
TAE | 6 | Glass | 214 | Heart | 2
Hayes Roth | 6 | SPECT (Heart) | 267 | Credit (Australia) | 2
Ecolli | 7 | Heart | 270 | Congress House Votes | 2
Pima Indian Diabetes | 8 | Zoo | 282 | Credit (Germany) | 2
WBC | 9 | Vehicle | 282 | Hepatitis | 2
Tic-tac-toe | 9 | Haberman | 307 | SPECT (Heart) | 2
Glass | 9 | Ecolli | 336 | WDBC | 2
Wine | 13 | Ionosphere | 351 | Ionosphere | 2
Heart | 13 | Dermatology | 366 | Iris | 3
Credit (Australia) | 15 | Congress House Votes | 435 | Balance-scale | 3
Zoo | 16 | WDBC | 569 | TAE | 3
Congress House Votes | 17 | Balance-scale | 625 | Hayes Roth | 3
Vehicle | 18 | WBC | 683 | Wine | 3
Credit (Germany) | 19 | Credit (Australia) | 690 | Car | 4
Hepatitis | 19 | Transfusion | 748 | Vehicle | 4
Image Segmentation | 19 | Pima Indian Diabetes | 768 | Dermatology | 6
SPECT (Heart) | 22 | Tic-tac-toe | 958 | Glass | 7
WDBC | 31 | Mammographic-Mass | 961 | Zoo | 7
Dermatology | 33 | Credit (Germany) | 1000 | Image Segmentation | 7
Ionosphere | 34 | Car | 1728 | Ecolli | 8

The values of the user-defined parameters are given in Table 5.17. The same parameter values have been retained while obtaining results for the previous AntMiners. These values have been chosen as a result of the experiments described in the previous section, because they seem reasonable, and because they have been used by other AntMiner versions reported in the literature [65-70].

Table 5.17 Parameters used in the experiments

Parameter | Value
Number of ants | 1000
Max. uncovered cases (used in other AntMiners) | 10
Min. cases per rule (used in other AntMiners) | 10
Evaporation rate | 0.15
No. of rules converged | 10
Alpha | 1
Beta | 3

The predictive accuracies, average number of rules per discovered rule set, and average number of terms per rule are shown in Tables 5.18 and 5.19.

Table 5.18 Average predictive accuracies obtained using 10-fold cross validation

Datasets | AntMiner-C | AntMiner | AntMiner2 | AntMiner3 | C4.5
BC-W | 97.85 ± 1.69 | 94.64 ± 2.74 | 92.70 ± 2.82 | 93.56 ± 3.45 | 94.84 ± 2.62
Wine | 99.44 ± 1.76 | 90.0 ± 9.22 | 90.49 ± 10.13 | 94.44 ± 5.24 | 96.60 ± 3.93
Credit(Aus) | 87.54 ± 3.21 | 86.09 ± 4.69 | 84.20 ± 4.55 | 86.67 ± 5.46 | 81.99 ± 7.78
Credit(Ger) | 72.46 ± 5.13 | 71.62 ± 2.71 | 73.16 ± 5.21 | 72.07 ± 4.32 | 70.73 ± 6.71
Car | 98.03 ± 1.17 | 82.38 ± 2.42 | 81.89 ± 2.63 | 78.82 ± 3.76 | 96.0 ± 2.13
Tic-tac-toe | 100 ± 0.0 | 74.95 ± 4.26 | 72.54 ± 5.98 | 72.02 ± 4.50 | 94.03 ± 2.44
Iris | 98.0 ± 4.50 | 95.33 ± 4.50 | 94.67 ± 6.89 | 96.0 ± 4.66 | 94.0 ± 6.63
Balance-scale | 87.49 ± 6.34 | 75.32 ± 8.86 | 72.78 ± 9.23 | 75.06 ± 6.91 | 83.02 ± 3.24
TAE | 81.38 ± 11.72 | 50.67 ± 6.11 | 53.58 ± 7.33 | 53.04 ± 10.67 | 51.33 ± 9.45
Glass | 82.27 ± 6.67 | 53.33 ± 4.38 | 56.15 ± 10.32 | 46.36 ± 10.96 | 68.90 ± 8.98
Heart | 80.74 ± 9.37 | 80.74 ± 4.94 | 78.15 ± 10.25 | 87.78 ± 6.77 | 78.43 ± 6.26
Hepatitis | 87.17 ± 7.81 | 80.67 ± 8.67 | 81.46 ± 11.89 | 80.17 ± 10.23 | 68.25 ± 11.63
Zoo | 96.0 ± 5.16 | 81.36 ± 11.30 | 79.36 ± 15.51 | 75.27 ± 11.69 | 94.0 ± 9.17
Haberman | 83.05 ± 7.80 | 71.99 ± 7.57 | 73.94 ± 5.33 | 74.72 ± 4.62 | 73.88 ± 4.66
Ecolli | 83.64 ± 6.11 | 47.52 ± 11.32 | 43.09 ± 9.76 | 44.61 ± 10.32 | 82.99 ± 7.72
Vehicle | 85.91 ± 5.62 | 56.79 ± 9.56 | 53.55 ± 11.07 | 56.0 ± 10.24 | 65.86 ± 6.76
Mammographic-Mass | 78.67 ± 4.65 | 78.25 ± 3.48 | 79.40 ± 3.64 | 82.33 ± 3.56 | 82.44 ± 4.56
Pima Indian Diabetes | 80.26 ± 6.19 | 74.63 ± 6.65 | 72.56 ± 4.46 | 70.88 ± 5.05 | 72.11 ± 6.96
Dermatology | 94.27 ± 4.51 | 58.72 ± 7.36 | 59.18 ± 14.91 | 67.47 ± 9.57 | 95.07 ± 2.80
Ionosphere | 89.71 ± 7.56 | 68.0 ± 11.09 | 66.0 ± 7.31 | 64.57 ± 8.85 | 89.98 ± 5.25
WDBC | 87.33 ± 5.46 | 85.26 ± 4.22 | 86.66 ± 4.56 | 88.05 ± 5.40 | 93.31 ± 2.72
Image Segmentation | 98.10 ± 2.61 | 70.48 ± 10.95 | 72.33 ± 9.34 | 72.98 ± 9.54 | 88.82 ± 8.04
SPECT (Heart) | 87.41 ± 4.97 | 75.38 ± 5.50 | 83.56 ± 8.34 | 78.68 ± 5.17 | 80.03 ± 51.85
Transfusion | 79.57 ± 3.04 | 77.30 ± 6.37 | 79.44 ± 3.64 | 77.56 ± 3.31 | 77.71 ± 12.32
Hayes Roth | 91.65 ± 8.37 | 75.05 ± 10.62 | 70.88 ± 15.47 | 85.49 ± 7.76 | 80.93 ± 8.24
Congress House Votes | 95.86 ± 3.71 | 94.54 ± 2.27 | 95.76 ± 2.43 | 94.72 ± 3.42 | 95.31 ± 2.57

Table 5.19 Average number of rules per discovered rule set and average number of terms per rule, obtained using 10-fold cross validation (AMC = AntMiner-C, AM = AntMiner, AM2 = AntMiner2, AM3 = AntMiner3)

Datasets | Rules/Rule Set: AMC | AM | AM2 | AM3 | C4.5 | Terms/Rule: AMC | AM | AM2 | AM3 | C4.5
BC-W | 20.40 | 11.0 | 11.40 | 11.10 | 10.50 | 1.42 | 1.02 | 1.03 | 1.03 | 2.32
Wine | 6 | 5.50 | 5.30 | 4.20 | 5.30 | 1.25 | 1.04 | 1.53 | 1.53 | 1.41
Credit(Aus) | 5.50 | 3.90 | 3.70 | 3.0 | 74.80 | 1.57 | 1.0 | 1.0 | 1.0 | 3.22
Credit(Ger) | 13.50 | 8.50 | 8.10 | 6.40 | 73.60 | 1.88 | 1.13 | 1.30 | 1.30 | 3.21
Car | 58 | 11.40 | 11.80 | 13.40 | 80.26 | 2.50 | 1.03 | 1.18 | 1.18 | 2.59
Tic-tac-toe | 12.30 | 6.60 | 6.80 | 7.10 | 38.60 | 2.74 | 1.09 | 1.20 | 1.20 | 2.64
Iris | 12.80 | 9.20 | 10.0 | 8.10 | 5.50 | 1.05 | 1.0 | 1.0 | 1.0 | 1.22
Balance-scale | 108.7 | 17.70 | 16.40 | 17.50 | 40.10 | 2.43 | 1.0 | 1.0 | 1.0 | 2.85
TAE | 48.70 | 20.90 | 9.30 | 12.30 | 18.30 | 1.44 | 1.0 | 1.05 | 1.05 | 2.69
Glass | 42.20 | 15.50 | 16.80 | 11.30 | 15.40 | 1.93 | 1.01 | 1.0 | 1.0 | 2.83
Heart | 13.10 | 5.60 | 4.50 | 5.80 | 12.60 | 1.79 | 1.08 | 1.20 | 1.20 | 1.73
Hepatitis | 11.30 | 3.90 | 3.50 | 3.0 | 11.60 | 2.41 | 1.11 | 1.0 | 1.0 | 1.70
Zoo | 7.0 | 5.10 | 5.0 | 5.60 | 7.60 | 1.59 | 1.11 | 1.24 | 1.20 | 1.60
Haberman | 58.20 | 20.70 | 17.80 | 15.5 | 3.40 | 1.56 | 1.0 | 1.0 | 1.0 | 1.58
Ecolli | 57.60 | 8.60 | 8.10 | 6.90 | 14.0 | 1.75 | 1.01 | 1.0 | 1.0 | 2.84
Vehicle | 43.0 | 14.20 | 13.30 | 13.0 | 20.10 | 1.82 | 1.02 | 1.03 | 1.05 | 3.13
Mammographic-Mass | 20.30 | 15.90 | 6.40 | 6.80 | 8.90 | 1.82 | 1.0 | 1.0 | 1.0 | 2.47
Pima Indian Diabetes | 45.60 | 15.30 | 11.30 | 11.50 | 7.80 | 1.93 | 1.0 | 1.0 | 1.0 | 2.18
Dermatology | 20.0 | 10.40 | 9.10 | 8.60 | 9.30 | 2.69 | 1.07 | 1.13 | 1.15 | 2.23
Ionosphere | 6.67 | 4.20 | 2.50 | 2.0 | 9.30 | 1.15 | 1.0 | 1.06 | 1.0 | 2.54
WDBC | 15.40 | 8.40 | 7.80 | 7.10 | 7.20 | 1.67 | 1.0 | 1.0 | 1.0 | 1.75
Image Segmentation | 26.60 | 16.20 | 17.10 | 16.80 | 10.60 | 1.41 | 1.03 | 1.04 | 1.03 | 1.99
SPECT (Heart) | 25.40 | 5.60 | 6.20 | 5.40 | 12.50 | 7.17 | 1.28 | 1.98 | 2.33 | 3.02
Transfusion | 24.80 | 10.10 | 9.20 | 8.90 | 4.20 | 1.45 | 1.0 | 1.05 | 1.0 | 1.35
Hayes Roth | 19.80 | 8.0 | 7.70 | 7.70 | 11.70 | 1.76 | 1.02 | 1.01 | 1.01 | 2.56
Congress House Votes | 7.30 | 3.0 | 3.0 | 3.30 | 7.40 | 1.95 | 1.0 | 1.17 | 1.07 | 1.84

The results indicate that AntMiner-C achieves a higher accuracy rate than the compared algorithms on twenty out of the twenty-six datasets. However, the number of rules and the number of terms per rule generated by our technique are higher than those obtained for the other AntMiner versions. The reason is that we allow the generation of rules with low coverage. The number of rules is, however, higher than that of C4.5 on some datasets and lower on others.

[Figure 5-3: Average accuracies over all datasets for different classification techniques (bar chart; y-axis: accuracy; x-axis: Ant-Miner-C, Ant-Miner, Ant-Miner2, Ant-Miner3, C4.5)]

Figure 5-3 shows the average of the accuracies achieved by the different techniques over all the datasets. This average is calculated by summing the ten-fold averages of each technique over all datasets and then dividing by the number of datasets. From the figure we can see that the proposed AntMiner-C achieves a higher accuracy rate than the compared algorithms. Figure 5-4 shows a sample rule list generated by AntMiner-C for the tic-tac-toe dataset. The rules are applied in the order in which they were discovered: the first rule is applied, and if it covers the test sample then the class label of that rule is assigned to the sample. If the first rule does not fire for the test sample then the second is applied; if the second does not fire then the third, and so on. If none of the rules fires, the test sample is assigned the default class.

if (top-left-square = x and top-middle-square = x and top-right-square = x) then class = positive
else if (middle-left-square = x and middle-middle-square = x and middle-right-square = x) then class = positive
else if (bottom-left-square = x and bottom-middle-square = x and bottom-right-square = x) then class = positive
else if (top-left-square = x and middle-left-square = x and bottom-left-square = x) then class = positive
else if (top-right-square = x and middle-middle-square = x and bottom-left-square = x) then class = positive
else if (top-left-square = x and middle-middle-square = x and bottom-right-square = x) then class = positive
else if (top-right-square = o and middle-middle-square = o) then class = negative
else if (top-right-square = x and middle-right-square = x and bottom-right-square = x) then class = positive
else if (middle-middle-square = o) then class = negative
else if (middle-middle-square = b) then class = negative
else if (top-middle-square = x and bottom-middle-square = x) then class = positive
else if (middle-middle-square = x) then class = negative
else class = positive

Figure 5-4 Sample rule list of tic-tac-toe dataset discovered by improved AntMiner-C
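The ordered application of such a rule list can be sketched as follows, with each rule represented as an (antecedent, class) pair where the antecedent is a list of attribute–value tests and an empty antecedent acts as the default rule. This is an illustrative reading of the procedure, not the thesis code.

    def classify(rule_list, sample):
        """Return the class of the first rule whose antecedent the sample satisfies.
        The last rule is the default rule with an empty antecedent, so it always fires."""
        for antecedent, label in rule_list:
            if all(sample.get(attr) == value for attr, value in antecedent):
                return label
        return None   # unreachable if a default rule is present

    # Two discovered rules plus a default rule, following the tic-tac-toe example
    rules = [([("top-left-square", "x"), ("top-middle-square", "x"), ("top-right-square", "x")], "positive"),
             ([("middle-middle-square", "o")], "negative"),
             ([], "positive")]
    print(classify(rules, {"middle-middle-square": "o"}))   # -> "negative"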

5.6 Time Complexity of AntMiner-C
The time complexity of AntMiner-C is calculated in this section. The major steps involved in calculating the computational complexity are given below.

5.6.1 Initialization of the Main WHILE Loop
Each iteration of AntMiner-C starts by selecting the class label for which the ants will construct rules; if there are c classes, this step takes O(c) time. Next the heuristic values η_ij of all terms are initialized, which takes O(v²) time, where v is the number of possible terms in the training set; this is because we compute the heuristic value of every term with respect to every other term. Next the pheromone values τ_ij of all terms are initialized. We use a pheromone matrix for storing the pheromone value of each link of the graph, so this step also takes O(v²) time.

5.6.2 Single Iteration of the REPEAT-UNTIL Loop
We calculate the computational complexity of a single iteration of this loop by considering all major steps within it. These are given below.

5.6.2.1 Rule Construction
In order to construct a rule, an ant can add at most k terms to the rule antecedent, where in the worst case k is the total number of attributes in the training set excluding the class attribute. Since the dataset is scanned whenever a term is added, the rule construction process takes O(n·k) time, where n is the size of the training set.

5.6.2.2 Measuring Rule Quality
Measuring the quality of a rule with at most k conditions over a training set of size n takes O(n·k) time.

5.6.2.3 Pheromone Updating
The pheromone values of all the terms which occur in the rule constructed by the current ant are updated by first evaporating the previous pheromone values and then adding a percentage of pheromone dependent on the quality of the discovered rule. Next the pheromone values are normalized by dividing each value by the summation of the pheromone values of all its competing terms. This step takes O(k·v) time, because a rule can have at most k conditions and each condition takes O(v) time to normalize over all its competing terms; v is again the number of possible terms in the training set.


5.6.2.4 Rule Pruning

The rule pruning process is performed only for the best rule, after the construction of rules by the ants. The first pruning iteration requires the evaluation of k conditions: the first term is temporarily removed and the quality of the resulting rule is determined, the term is put back, the second term is temporarily removed and the quality of the resulting rule is calculated again, and so on until all the terms present in the rule have been checked. Each rule evaluation takes O(n·(k − 1)) time and there are k such evaluations, hence the first pruning iteration takes O(n·k²) time. The entire pruning process is repeated up to k times, hence the complete rule pruning process takes O(n·k³) time.
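The pruning procedure just described can be sketched as a greedy loop: repeatedly try removing each term, keep a removal that does not lower the rule quality, and stop when no removal helps. The quality function is passed in as a callable; both it and the exact stopping criterion are assumptions consistent with the description, not the precise thesis implementation.

    def prune_rule(rule, quality):
        """Greedy rule pruning: while removing some term does not decrease the
        quality of the rule, drop that term and repeat."""
        best_rule, best_q = list(rule), quality(rule)
        improved = True
        while improved and len(best_rule) > 1:
            improved = False
            for k in range(len(best_rule)):
                candidate = best_rule[:k] + best_rule[k + 1:]   # temporarily remove term k
                q = quality(candidate)
                if q >= best_q:                                  # prefer the shorter rule on ties
                    best_rule, best_q, improved = candidate, q, True
                    break
        return best_rule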

5.6.2.5 Removing Training Samples
After the REPEAT-UNTIL loop terminates and the best rule has been selected and pruned, the training samples correctly classified by that rule are removed from the training set. Removing the training samples correctly classified by a discovered rule requires O(n·k) time, for a rule with at most k conditions and a training set of size n.

5.6.3 Complexity of a Single Iteration of the WHILE Loop
Adding the computational complexities of a single iteration of the REPEAT-UNTIL loop relative to a single ant gives O(n·k) + O(n·k) + O(k·v). For all ants this is multiplied by t, the number of ants, and becomes O(t·(n·k + n·k + k·v)). The computational costs of rule pruning and of removing training samples are incurred only once per WHILE iteration, therefore the complete computational complexity of a single iteration of the main WHILE loop is O(t·(n·k + n·k + k·v)) + O(n·k³) + O(n·k), which simplifies to O(t·n·k + n·k³).

5.6.4 Computational Complexity of the Entire Algorithm
For the computational complexity of the entire algorithm, we multiply the complexity of a single iteration by r, the number of rules, and add the complexity of the initialization steps: O(r·t·n·k + r·n·k³) + O(r·v²). Hence the worst-case computational complexity of AntMiner-C is

Computational complexity (AntMiner-C) = O(r·n·k³)

where r is the number of rules, n is the size of the training set and k is the number of attributes in the dataset excluding the class attribute.

5.7 Summary
Chapter 4 introduced our AntMiner-C algorithm and highlighted its main features. The present chapter investigated different aspects of the algorithm: a different method of forming the default rule, whether or not to prune the rule set, the relative importance of pheromone and heuristic in the selection of terms, and so on. To investigate these aspects, a number of experiments were performed on benchmark datasets. The result is a modified algorithm whose performance was then evaluated on a suite of twenty-six datasets and compared with three previously existing AntMiners and a decision tree builder. The experimental results show that our proposed approach achieves a higher accuracy rate than the compared algorithms. We also calculated the computational complexity of AntMiner-C. In Chapter 6, we discuss AntMiner-CC, an improved version of AntMiner-C, and report a number of experiments on a suite of twenty-six datasets.

6 Chapter 6: Further Improvements and Investigations

In the previous two chapters, i.e. Chapter 4 and Chapter 5, we proposed and investigated AntMiner-C. This chapter presents an extension of AntMiner-C called AntMiner-CC, which uses a slightly modified heuristic function and does not build rules for the majority class (it is now made the default class of the classifier). The training process is stopped when the training set contains only samples of the majority class. The details of these new modifications are discussed in this chapter. We also study the performance of our proposed approach on twenty-six commonly used datasets and compare it with ten well-known classification algorithms, three of which are ACO based. Experimental results show that the accuracy rate obtained by AntMiner-CC is better than that of the compared algorithms.

6.1 Modified Algorithm: AntMiner-CC
This section presents the algorithm called AntMiner-CC, which is an extension of the previously proposed AntMiner-C. The AntMiner-CC algorithm builds rules for all the classes present in the dataset except one; the class whose rules are not learnt becomes the default class of the final rule. AntMiner-CC has the following differences from AntMiner-C:

• A different heuristic function
• It does not search for rules for the majority class label
• A default rule with the majority class label

A general description of the AntMiner-CC algorithm is shown in Figure 6-1. Its basic structure is the same as that of the AntMiner-C algorithm.

1   TrainingSet = {all training samples};
2   DiscoveredRuleList = {};   /* rule list is initialized with an empty list */
3   WHILE (TrainingSet has samples of class labels other than the majority class label)
4       t = 1;   /* counter for ants */
5       j = 1;   /* counter for rule convergence test */
6       Select class label from the set of class labels excluding the majority class label;
7       Set up the search space and initialize all links with the same amount of pheromone;
8       Calculate heuristic values for all links;
9       REPEAT
10          Send an Ant_t which constructs a classification rule R_t for the selected class;
11          Assess the quality of the rule and update the pheromone of all trails;
12          IF (R_t is equal to R_t-1)   /* update convergence test */
13              THEN j = j + 1;
14          ELSE j = 1;
15          END IF
16          t = t + 1;
17      UNTIL (t ≥ No_of_ants) OR (j ≥ No_rules_converg)
18      Choose the best rule R_best among all rules R_t constructed by all the ants;
19      Prune the best rule R_best;
20      Add the pruned best rule R_best to DiscoveredRuleList;
21      Remove the training samples correctly classified by the pruned best rule R_best;
22  END WHILE
23  Add a default rule assigning the majority class label to the DiscoveredRuleList;

Figure 6-1 The AntMiner-CC algorithm

6.2 Differences with Previous Versions

• In this algorithm, shown in Figure 6-1, we do not discover rules for the majority class; it is made the default class of the classifier. This is shown in line 6 of Figure 6-1, in which a class is selected for building rules but the majority class is excluded from this selection.
• The training is stopped when all remaining training samples belong to the majority class. This is shown in line 3 of Figure 6-1. In the previous version (Figure 5-2), the algorithm stops when the remaining training samples belong to the same class (line 3). In the first version (Figure 4-2), the algorithm stops when the number of remaining training samples is less than the maximum number of uncovered cases (line 3).
• The default rule assigns the sample to the majority class of the training set. In the previous version (Figure 5-2) the default rule has the class label of the last few remaining samples, which have homogeneous class labels. In the first version (Figure 4-2) the default class is the majority class of the remaining uncovered training samples.

6.3 Heuristic Function of AntMiner-CC
The heuristic value of a term gives an indication of its usefulness and thus provides a basis to guide the search. In order to guide the selection of the next term we use a heuristic function based on the confidence and coverage of the most recently chosen term with the other candidate terms. The heuristic function is:

\eta_{k,i,j} = \frac{|term_i, term_j, class_k|^2}{|term_i, term_j| \cdot |term_i, class_k|} \qquad (6.1)

The most recently chosen term is termi and the term being considered for selection is termj. The number of uncovered training samples having termi and termj and which belong to the committed class label k of the rule is given by |termi, termj, classk|. This number is squared and divided by the number of uncovered training samples which have termi and termj and also by the number of uncovered training samples which have termi and which belong to classk.
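A direct computation of Equation (6.1) from co-occurrence counts over the uncovered training samples can be sketched as below; representing a sample as a dictionary of attribute values is an assumption made only for the example.

    def eta_cc(term_i, term_j, class_k, samples):
        """Eq. (6.1): |term_i,term_j,class_k|^2 / (|term_i,term_j| * |term_i,class_k|),
        computed over the currently uncovered training samples."""
        def has(sample, term):
            attr, value = term
            return sample["x"].get(attr) == value
        n_ij  = sum(1 for s in samples if has(s, term_i) and has(s, term_j))
        n_ik  = sum(1 for s in samples if has(s, term_i) and s["y"] == class_k)
        n_ijk = sum(1 for s in samples if has(s, term_i) and has(s, term_j) and s["y"] == class_k)
        return (n_ijk ** 2) / (n_ij * n_ik) if n_ij and n_ik else 0.0

    # Example: term_i = (outlook, sunny), term_j = (humidity, high), class_k = 'no'
    data = [{"x": {"outlook": "sunny", "humidity": "high"}, "y": "no"},
            {"x": {"outlook": "sunny", "humidity": "normal"}, "y": "yes"},
            {"x": {"outlook": "sunny", "humidity": "high"}, "y": "no"}]
    print(eta_cc(("outlook", "sunny"), ("humidity", "high"), "no", data))   # 1.0

Note that the value is zero whenever the two terms never occur together for the chosen class, which is what restricts the search space as described below.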


Figure 6-2 An example for understanding the working of heuristic function

Figure 6-2 provides an example for understanding the working of the proposed heuristic function. Suppose the specified class label is class_k. An ant has recently added term_i to its rule and is now looking to add another term from the set of available terms. For two competing terms (term_j1 and term_j2), we want to include the term which has better potential for inter-class discrimination, and we also want to encourage the inclusion of a term which supports better correct coverage of the rule. These two considerations are incorporated in the heuristic function. Equation (6.1) is derived as follows. The actual correlation between the committed class_k and term_i and term_j is:

Corr_{k,i,j} = \frac{P(term_i, term_j, class_k)}{P(class_k) \cdot P(term_i, term_j)} \qquad (6.2)

Correlation cannot be used directly for comparison purposes because the only information it gives is whether the correlation is negative, positive or absent; it does not specify the degree of correlation. A secondary reason is that it is not bounded between 0 and 1. Hence we simplify it in the following manner. The value of P(class_k) is the same for all competing terms, e.g. term_{j=1} and term_{j=2}, because the class has been previously chosen and is fixed. Hence we may safely ignore it for comparison purposes. Thus, we have:

\eta'_{k,i,j} = \frac{P(term_i, term_j, class_k)}{P(term_i, term_j)} \qquad (6.3)

From the definition of probability we can rewrite Equation (6.3) as:

\eta'_{k,i,j} = \frac{|term_i, term_j, class_k| / N}{|term_i, term_j| / N} = \frac{|term_i, term_j, class_k|}{|term_i, term_j|} \qquad (6.4)

This equation does not take coverage into account. To encourage coverage we add a second portion to the heuristic function:

\eta''_{k,i,j} = \frac{|term_i, term_j, class_k|}{|term_i, class_k|} \qquad (6.5)

Hence the overall heuristic function is:

\eta_{k,i,j} = \frac{|term_i, term_j, class_k|}{|term_i, term_j|} \cdot \frac{|term_i, term_j, class_k|}{|term_i, class_k|} = \frac{|term_i, term_j, class_k|^2}{|term_i, term_j| \cdot |term_i, class_k|} \qquad (6.6)

The first portion gives a higher value to the term which is strongly correlated with the already committed class_k and term_i. This means the discovery of a rule with higher confidence (term_i and term_j => class_k), if only these two terms are considered. It also encourages early termination of rule construction (described later). The second portion gives a higher value to the term which has larger coverage given class_k and term_i. In the example of Figure 6-2, with specified class label class_k and term_i recently added to the rule, we take the case of two competing terms (term_j1 and term_j2), which generalizes to more terms. We want to encourage the inclusion of the term which has better potential for inter-class discrimination; this is encouraged by the first part of the heuristic function. The heuristic value for term_j1 given term_i and class_k due to the first portion is

\frac{|term_i, term_{j1}, class_k|}{|term_i, term_{j1}|} \qquad (6.7)

and for term_j2 given term_i and class_k it is

\frac{|term_i, term_{j2}, class_k|}{|term_i, term_{j2}|} \qquad (6.8)

We also want the chosen term to maximize the correct coverage of the rule as much as possible. This is made possible by the second portion of the heuristic function. The two competing terms will have the values

\frac{|term_i, term_{j1}, class_k|}{|term_i, class_k|} \qquad (6.9)

and

\frac{|term_i, term_{j2}, class_k|}{|term_i, class_k|} \qquad (6.10)

6.3.1 Heuristic Function for the 1st Term

The heuristic value when considering the first term of the rule antecedent is calculated on the basis of the following equation:

$$\eta_{k,start,j} = \frac{|term_j, class_k|}{|term_j|}\cdot\frac{|term_j, class_k|}{|class_k|} = \frac{|term_j, class_k|^2}{|term_j|\cdot|class_k|} \qquad (6.11)$$

6.3.1.1 Advantages of the heuristic function

The heuristic values between terms are calculated only once in an iteration of the WHILE loop, before the execution of the REPEAT-UNTIL loop, and thus remain the same for a batch of ants. An example heuristic look-up table is shown in Figure 6-3. It is obtained by calculating the heuristic values between terms according to Equations (6.1) and (6.11) and then normalizing each element by dividing it by the sum of the heuristic values of its row. The heuristic look-up table is asymmetric and can be used to obtain the heuristic values on links originating from a term to other terms in any layer of the search space.
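As a concrete illustration of how such a look-up table can be computed, the following is a minimal Python sketch (the thesis implementation is in MATLAB; the data representation assumed here, a list of (term-set, class-label) pairs named `samples`, is purely illustrative). The Start row uses the first-term heuristic of Equation (6.11), the remaining rows use the pairwise heuristic of Equation (6.6), and each row is then normalized by its sum.

```python
import numpy as np

def heuristic_lookup_table(samples, class_k, terms):
    """Build the row-normalized heuristic look-up table for one committed class.

    samples : list of (term_set, label) pairs for the uncovered training data,
              where term_set is the set of attribute=value terms of the sample.
    class_k : class label committed for this batch of ants.
    terms   : list of candidate terms (attribute=value pairs).
    """
    n = len(terms)
    table = np.zeros((n + 1, n))     # row 0 = Start node (Eq. 6.11), rows 1..n = Eq. (6.6)

    def count(*required_terms, label=None):
        return sum(1 for t_set, y in samples
                   if all(t in t_set for t in required_terms)
                   and (label is None or y == label))

    n_class = count(label=class_k)
    for j, tj in enumerate(terms):                       # first-term heuristic
        tj_ck = count(tj, label=class_k)
        denom = count(tj) * n_class
        table[0, j] = (tj_ck ** 2) / denom if denom else 0.0

    for i, ti in enumerate(terms):                       # pairwise heuristic
        ti_ck = count(ti, label=class_k)
        for j, tj in enumerate(terms):
            if i == j:
                continue
            ti_tj_ck = count(ti, tj, label=class_k)
            denom = count(ti, tj) * ti_ck
            table[i + 1, j] = (ti_tj_ck ** 2) / denom if denom else 0.0

    row_sums = table.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                        # avoid division by zero
    return table / row_sums                              # normalize each row
```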


Figure 6-3 The asymmetric heuristic look-up table for the example problem of Figure 6-1

The first row elements are the heuristic values on links from the Start node to the nodes of the first layer of terms. All other rows are heuristic values for links from a term in the previous layer to terms in the next layer. All the elements are normalized by dividing them by the sum of their row. Our heuristic function has the potential to be effective in large dimensional search spaces. It assigns a zero value to the combination of those terms which do not occur together for a given class label, thus efficiently restricting the search space for the ants. Equations (6.6) and (6.11) have not been used before in any ACO based classification rule discovery algorithm. AntMiner utilizes a heuristic function based on the entropy of the terms and their normalized information gain:

$$\eta_j = \frac{\log_2 k - H(W \mid term_j)}{\sum_{competing\ terms}\left(\log_2 k - H(W \mid term_j)\right)} \qquad (6.12)$$

where the entropy H of the class attribute W given termj is defined as

$$H(W \mid term_j) = -\sum_{w=1}^{k} P(w \mid term_j)\,\log_2 P(w \mid term_j) \qquad (6.13)$$

The symbol k stands for the number of classes. The denominator in Equation (6.12) has to be calculated several times during each ant run. AntMiner2 and AntMiner3 use

$$\eta_j = \frac{|term_j, majority\_class(term_j)|}{|term_j|} \qquad (6.14)$$

and AntMiner+ uses

$$\eta_j = \frac{|term_j, class\_chosen\_by\_ant|}{|term_j|} \qquad (6.15)$$

Equation (6.14) is calculated only once for each term for the discovery of a rule in AntMiner2 and AntMiner3. In AntMiner+, Equation (6.15) is used, which is class dependent and has to be calculated as many times as the number of classes (minus the default class) for each term and each discovered rule.

6.4 Default Rule

After an execution of the REPEAT-UNTIL loop, the best rule is placed in the discovered rule set after pruning, and the training samples correctly covered by the rule are removed and have no role in the discovery of further rules. The algorithm then checks whether the remaining uncovered training samples all have the majority class label and no samples of other classes are left. If that is not the case, a new iteration of the WHILE loop starts for the discovery of the next rule. Otherwise, a final default rule is added at the bottom of the rule set, and the rule set may then be used for classifying unseen samples. The default rule is without any conditions and has a consequent part only; the class label assigned to it is the majority class label of the samples of the complete training set. A default rule is used by all previous AntMiners. AntMiner employs early stopping of the algorithm on the basis of 10 or fewer uncovered training samples, and assigns the default rule the majority class label of the remaining uncovered samples of the training set. Our method is the same as that used by AntMiner+, i.e. the default rule class label is assigned on the basis of the majority class of the complete set of training samples.
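A minimal sketch of this default-rule choice follows (an illustrative Python fragment, not the author's MATLAB code; `training_labels` is an assumed list holding the class labels of the complete training set):

```python
from collections import Counter

def default_rule_class(training_labels):
    """Consequent of the default rule: the majority class of the complete
    training set, as in AntMiner+ and AntMiner-CC."""
    return Counter(training_labels).most_common(1)[0][0]
```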

6.5 Experiments and Analysis

In this section we report our experiments and the results obtained.


6.5.1 Datasets, Performance Metrics and Parameter Settings

6.5.1.1 Datasets

In our experiments we first use a suite of twelve datasets obtained from the UCI repository to validate various aspects of the proposed algorithm and then an extended suite of twenty-six datasets for its comparison with other algorithms. The main characteristics of the extended suite of datasets are summarized in Table 6.1. The datasets have reasonable variety in terms of number of attributes, samples and classes and are commonly used. The proposed algorithm works with categorical attributes, and continuous attributes need to be discretized in a preprocessing step. We use the unsupervised discretization filter of the Weka-3.4 machine learning tool for discretizing continuous attributes in all our experiments. This filter first computes the intervals of continuous attributes from the training dataset and then uses these intervals to discretize them.

Table 6.1 Extended suite of datasets used in the experiments

| Dataset | Attributes | Samples | Classes |
| Balance-scale * | 4 | 625 | 3 |
| Breast Cancer–Wisconsin (BC-W) * | 9 | 683 | 2 |
| Car | 6 | 1728 | 4 |
| Congress House Votes * | 17 | 435 | 2 |
| Credit (Australia) | 15 | 690 | 2 |
| Credit (Germany) | 19 | 1000 | 2 |
| Dermatology * | 33 | 366 | 6 |
| Ecoli * | 7 | 336 | 8 |
| Glass * | 9 | 214 | 7 |
| Haberman * | 3 | 307 | 2 |
| Hayes Roth * | 6 | 132 | 3 |
| Heart * | 13 | 270 | 2 |
| Hepatitis * | 19 | 155 | 2 |
| Image Segmentation | 19 | 210 | 7 |
| Ionosphere * | 34 | 351 | 2 |
| Iris | 4 | 150 | 3 |
| Mammographic-Mass | 5 | 961 | 2 |
| Pima Indian Diabetes | 8 | 768 | 2 |
| SPECT (Heart) | 22 | 267 | 2 |
| Teacher Assistant Evaluation (TAE) | 6 | 151 | 3 |
| Tic-tac-toe * | 9 | 958 | 2 |
| Transfusion | 4 | 748 | 2 |
| Vehicle | 18 | 282 | 4 |
| WDBC | 31 | 569 | 2 |
| Wine | 13 | 178 | 3 |
| Zoo | 16 | 282 | 7 |

Table 6.1 contains the description of all datasets used in the experiments. The datasets marked with * are used in the first suite for validating various aspects of AntMiner-CC.

6.5.1.2 Performance Metric

Our performance metric for the discovered rule set is its predictive accuracy, defined as the percentage of testing samples correctly classified by the classifier. The experiments are performed using a ten-fold cross validation procedure: a dataset is divided into ten equally sized, mutually exclusive subsets, each of which is used once for testing while the other nine are used for training. The results of the ten runs are then averaged and this average is reported as the final result. AntMiner-CC has been implemented by us in Matlab.

6.5.1.3 Choice of Parameters

AntMiner-CC has the following user defined parameters:

• Number of ants
• Number of rules converged (for early termination of the REPEAT-UNTIL loop)
• The powers of the pheromone and heuristic values (α, β) in the probabilistic selection
• The pheromone evaporation rate (ρ)

As discussed in the previous chapter, the relationship between α, β and ρ is complex and needs to be analyzed experimentally. In order to get some indication of the influence of α, β and ρ on the performance of AntMiner-CC, different combinations of these parameters are used and the accuracy results are reported in Table 6.2. The values of the other parameters are: Number of ants = 1000, and Number of Rules Converged = 10. The results of Table 6.2 are obtained by running AntMiner-CC for one fold only. The results are highest when ρ = 0.10, α = 1 and β = 3.


The ratios α/β reported in Table 6.2 are 3, 2, 1, 0, 0.5, 0.33 and 0.25. The rows pertaining to α = 0 and β = 1 are important because they give the results when the pheromone values are totally ignored in the selection of terms. These rows have some of the worst results. This shows that pheromone is important.

Table 6.2 One fold accuracy results for different combinations of ρ, α and β

ρ,α,β

Bal.S

BCW

Cong

Derm

Ecoli

Glass

Haber

0,3,1

92.06

97.14

79.55

0.05,3,1

92.06

100.0

84.09

0.10,3,1

90.48

100.0

0.15,3,1

92.06

0.20,3,1

92.06

0,2,1 0.05,2,1 0.10,2,1

Hayes

89.19

91.18

81.82

61.29

78.57

100.0

94.12

81.82

64.51

85.71

81.82

100.0

97.06

72.73

67.74

85.71

94.29

81.82

100.0

94.12

72.73

67.74

94.29

79.55

97.29

91.18

77.27

61.29

90.48

97.14

81.82

91.12

94.12

72.73

92.06

98.57

79.55

97.29

94.12

72.73

93.65

98.57

81.82

91.89

94.12

Heart

Hepat

Ion

TTT

88.89

93.75

91.42

97.92

92.59

87.50

88.57

98.86

96.29

87.50

91.42

96.86

78.57

96.29

93.75

94.29

96.86

78.57

92.59

87.50

91.42

97.92

64.51

85.71

92.59

87.50

91.42

97.92

61.29

78.57

96.29

87.50

88.57

96.86

77.27

64.51

85.71

88.89

87.50

88.57

96.86

0.15,2,1

93.65

98.57

84.09

91.89

94.12

77.27

64.51

85.71

92.59

87.50

91.42

97.92

0.20,2,1

92.06

95.71

81.82

97.29

91.18

72.73

61.29

78.57

88.89

87.50

88.57

96.86

0,1,1

92.06

95.71

79.55

94.59

94.12

77.27

61.29

85.71

88.89

87.50

91.42

96.86

0.05,1,1

92.06

97.14

84.09

94.59

94.12

77.27

61.29

78.57

96.29

87.50

91.42

96.86

0.10,1,1

92.06

98.57

79.55

100.0

94.12

81.82

64.51

85.71

88.89

93.75

94.29

97.92

0.15,1,1

93.65

98.57

81.82

97.29

91.18

77.27

64.51

85.71

88.90

93.75

88.57

97.92

0.20,1,1

92.06

94.29

81.82

91.89

91.18

72.73

61.29

85.71

88.89

87.50

88.57

96.86

0,0,1

90.48

97.14

81.82

94.59

94.12

72.73

64.51

85.71

96.29

87.50

91.42

96.86

0.05,0,1

90.48

97.14

81.82

94.59

94.12

72.73

64.51

85.71

96.29

87.50

91.42

96.86

0.10,0,1

90.48

97.14

81.82

94.59

94.12

72.73

64.51

85.71

96.29

87.50

91.42

96.86

0.15,0,1

90.48

97.14

81.82

94.59

94.12

72.73

64.51

85.71

96.29

87.50

91.42

96.86

0.20,0,1

90.48

97.14

81.82

94.59

94.12

72.73

64.51

85.71

96.29

87.50

91.42

96.86

0,1,2

90.48

95.71

79.55

91.89

97.06

72.73

67.74

78.57

88.89

87.50

88.57

97.92

0.05,1,2

90.48

100.0

84.09

94.59

94.12

77.27

67.74

85.71

96.29

93.75

94.29

97.92

0.10,1,2

93.65

100.0

84.09

97.29

94.12

77.74

67.74

85.71

100.0

93.75

94.29

97.92

0.15,1,2

93.65

98.57

84.09

100.0

94.12

77.27

67.74

85.71

96.29

93.75

94.29

97.92

0.20,1,2

90.48

97.14

81.82

97.29

91.18

72.73

64.51

85.71

88.89

87.50

91.92

97.92

0,1,3

90.48

97.14

81.82

91.89

91.18

77.27

64.51

78.57

88.89

87.50

91.92

96.86

0.05,1,3

93.65

98.57

84.09

100.0

97.06

77.27

67.74

85.71

96.29

87.50

94.29

98.86

0.10,1,3

93.65

98.57

84.09

100.0

97.06

81.82

67.74

85.71

100.0

93.75

94.29

98.86

0.15,1,3

93.65

98.57

84.09

97.29

94.12

81.82

67.74

85.71

92.59

93.75

91.42

97.92

0.20,1,3

92.06

97.14

79.55

91.89

97.06

72.73

61.29

85.71

88.89

87.50

94.29

96.86

0,1,4

92.06

95.71

81.82

94.59

91.18

72.73

64.51

78.57

88.89

87.50

91.42

96.86

0.05,1,4

92.06

95.71

81.82

91.89

94.12

77.27

64.51

85.71

92.59

87.50

94.29

98.86

0.10,1,4

90.48

97.14

81.82

86.67

97.06

72.73

64.51

85.71

88.89

87.50

91.42

97.92

0.15,1,4

90.48

95.71

84.09

94.59

94.12

72.73

61.29

85.71

88.89

87.50

91.42

97.92

0.20,1,4

90.48

94.29

84.09

91.89

91.18

72.73

61.29

78.57

88.89

87.50

88.57

96.86


6.5.2 Importance of Heuristic Function

In this section we replace our heuristic function with the one used in AntMiner+ (Equation (6.15)). We also tried to use the heuristic functions of AntMiner, AntMiner2 and AntMiner3 (Equations (6.12)–(6.14)). However, we were not able to use them because they do not make use of prior knowledge of the class label. If we modify them to incorporate the selected class label, then they become similar to Equation (6.15). All other parameters of the algorithm were kept the same as before. The results are shown in Table 6.3 and indicate that our heuristic function performs better.

Table 6.3 Average predictive accuracies obtained using 10-fold cross validation

| Datasets | Accuracy (HF: Density Estimation) | Accuracy (HF: Corr. x Cov.) |
| Balance-scale | 89.52 ± 4.21 | 91.04 ± 4.35 |
| BC-W | 96.41 ± 2.15 | 97.28 ± 2.28 |
| Congress | 93.42 ± 4.91 | 95.20 ± 4.72 |
| Dermatology | 93.89 ± 3.46 | 94.54 ± 4.25 |
| Ecoli | 82.50 ± 8.27 | 86.85 ± 6.72 |
| Glass | 71.52 ± 10.21 | 74.74 ± 6.46 |
| Haberman | 76.85 ± 6.12 | 78.88 ± 6.54 |
| Hayes Roth | 85.09 ± 8.93 | 86.37 ± 8.70 |
| Heart | 86.52 ± 5.06 | 90.37 ± 4.35 |
| Hepatitis | 88.33 ± 9.77 | 94.25 ± 4.76 |
| Ionosphere | 91.29 ± 6.75 | 96.29 ± 2.35 |
| Tic-tac-toe | 96.91 ± 1.78 | 98.04 ± 1.91 |

6.5.3 Rules for Majority Class Also

In this experiment we discover rules for the majority class also. This is implemented by allowing the majority class to compete with the other classes while selecting a class before the execution of the REPEAT-UNTIL loop. The rest of the algorithm is the same. A second experiment is the incorporation of an early stopping criterion for WHILE loop termination in conjunction with the discovery of rules for the majority class. Following the experiments of [65], we use 10 samples as the threshold for early stopping of the WHILE loop: we stop the algorithm if, at any stage, the number of uncovered training samples becomes less than or equal to 10. In this case we use the majority class of these remaining uncovered training samples to form the default rule, instead of using the majority class label of the complete training set as the default class.


The results of these two experiments are reported in Table 6.4 and compared with the results of AntMiner-CC (last column). In most of the datasets, the discovery of rules for the majority class reduces the predictive accuracy. Furthermore, we observe that the majority class of the complete training set is a better choice for the default class as compared to early stopping of the algorithm and using the majority class of the remaining training samples as the default class.

Table 6.4 Comparison of accuracy results after ten-fold cross validation

| Datasets | Option 1: Rules of All Classes | Option 2: Rules of All Classes (early stopping) | AntMiner-CC: Rules of All Classes Minus Majority Class |
| Balance-scale | 89.92 ± 4.82 | 89.76 ± 4.41 | 91.04 ± 4.35 |
| BC-W | 96.43 ± 2.15 | 96.51 ± 2.14 | 97.28 ± 2.28 |
| Congress | 87.80 ± 4.27 | 95.43 ± 4.54 | 95.20 ± 4.72 |
| Dermatology | 94.53 ± 4.64 | 92.89 ± 6.93 | 94.54 ± 4.25 |
| Ecoli | 88.06 ± 6.94 | 85.64 ± 8.72 | 86.85 ± 6.72 |
| Glass | 79.03 ± 4.80 | 76.54 ± 8.88 | 74.74 ± 6.46 |
| Haberman | 82.75 ± 9.52 | 84.40 ± 9.29 | 78.88 ± 6.54 |
| Hayes Roth | 85.77 ± 10.50 | 83.46 ± 14.12 | 86.37 ± 8.70 |
| Heart | 84.44 ± 8.34 | 86.67 ± 5.0 | 90.37 ± 4.35 |
| Hepatitis | 84.63 ± 6.53 | 84.50 ± 8.64 | 94.25 ± 4.76 |
| Ionosphere | 85.26 ± 6.34 | 85.87 ± 7.64 | 96.29 ± 2.35 |
| Tic-tac-toe | 99.48 ± 1.13 | 98.70 ± 3.97 | 98.04 ± 1.91 |

6.5.4 Termination of Rule Construction

In AntMiner-CC, terms are added to a rule until all the samples covered by it have the same class label or until there are no more terms to be added. There is no restriction that a rule should cover a minimum number of samples; in the worst case, a constructed rule might cover only one sample. In this section we report the results of an experiment in which the algorithm used is the same as proposed, but with the inclusion of a restriction that a rule being created should cover at least a minimum number of samples. We set this threshold to 10, the value used in [65] for the same purpose. A term is added to a rule only if the rule still covers at least 10 samples after its addition. As a result, all the constructed rules cover a minimum of ten samples. The rest of the parameter settings are according to Table 6.3. The results of this experiment are shown in Table 6.5. We observe that predictive accuracy decreases on all of the datasets due to this condition. The reason is that this condition may prohibit a very effective candidate term from being added to the rule. Another problem with this approach is that it is difficult to determine a universal correct value of this parameter which is appropriate for all datasets.

Table 6.5 Results obtained with the condition of a constructed rule covering a minimum of 10 samples

| Datasets | Accuracy (Condition of Minimum 10 Samples) | Accuracy (No Condition of Minimum) |
| Balance-scale | 78.88 ± 10.74 | 91.04 ± 4.35 |
| BC-W | 95.85 ± 3.13 | 97.28 ± 2.28 |
| Congress | 85.59 ± 6.29 | 95.20 ± 4.72 |
| Dermatology | 92.30 ± 6.60 | 94.54 ± 4.25 |
| Ecoli | 73.69 ± 13.14 | 86.85 ± 6.72 |
| Glass | 65.97 ± 11.02 | 74.74 ± 6.46 |
| Haberman | 76.29 ± 11.77 | 78.88 ± 6.54 |
| Hayes Roth | 77.31 ± 13.32 | 86.37 ± 8.70 |
| Heart | 84.74 ± 6.72 | 90.37 ± 4.35 |
| Hepatitis | 85.08 ± 12.58 | 94.25 ± 4.76 |
| Ionosphere | 82.86 ± 8.98 | 96.29 ± 2.35 |
| Tic-tac-toe | 90.47 ± 8.05 | 98.04 ± 1.91 |

6.5.5 Symmetric or Asymmetric Pheromone Matrix

Our pheromone look-up table is asymmetric. If an ant chooses termj after termi, then the pheromone on the link between termi and termj is updated but the pheromone value on the link between termj and termi remains the same. This is done to encourage exploration. In this section we report the results of an experiment in which the pheromones on both links between two chosen consecutive terms are updated. This pheromone update method yields a symmetric pheromone look-up table (after the first row). During the experiment the rest of the algorithm and the parameters are kept the same as before (α = 1, β = 3, ρ = 0.10). The results are shown in Table 6.6. By observing Table 6.6 we note that asymmetric pheromone update is better than symmetric pheromone update. This is due to lesser exploration with the symmetric update: with a symmetric pheromone matrix the ants converge quickly to a solution, which may be a sub-optimal one.
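The difference between the two update schemes can be sketched as follows. This is an illustrative Python fragment only: `tau` (the pheromone look-up table) and `rule_terms` (the indices of the terms in the order the ant chose them) are assumed names, and the quality-proportional deposit used here is a stand-in for the exact pheromone update rule of AntMiner-CC.

```python
def reinforce(tau, rule_terms, quality, symmetric=False):
    """Deposit pheromone on the links used by a rule.

    tau        : 2-D pheromone look-up table, tau[i][j] = pheromone on link i -> j.
    rule_terms : indices of the chosen terms, in the order the ant selected them.
    quality    : quality of the constructed rule (scales the deposit).
    symmetric  : if True, also update the reverse link j -> i.
    """
    for i, j in zip(rule_terms, rule_terms[1:]):
        tau[i][j] += tau[i][j] * quality       # asymmetric update: only the traversed direction
        if symmetric:
            tau[j][i] += tau[j][i] * quality   # symmetric variant: both directions
```

With `symmetric=False` only the traversed direction is reinforced, which is the behaviour compared in Table 6.6 against the symmetric variant.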


Table 6.6 Results of pheromone updating on one link between two chosen consecutive terms (asymmetric) and on both links (symmetric)

| Datasets | Asymmetric | Symmetric – Consecutive |
| Balance-scale | 91.04 ± 4.35 | 89.68 ± 4.81 |
| Breast cancer – Wisconsin | 97.28 ± 2.28 | 97.99 ± 1.93 |
| Congress House Votes | 95.20 ± 4.72 | 95.64 ± 2.71 |
| Dermatology | 94.54 ± 4.25 | 94.12 ± 4.25 |
| Ecoli | 86.85 ± 6.72 | 84.66 ± 7.40 |
| Glass | 74.74 ± 6.46 | 73.26 ± 10.89 |
| Haberman | 78.88 ± 6.54 | 76.92 ± 7.21 |
| Hayes Roth | 86.37 ± 8.70 | 87.09 ± 10.30 |
| Heart | 90.37 ± 4.35 | 89.74 ± 5.31 |
| Hepatitis | 94.25 ± 4.76 | 92.92 ± 5.67 |
| Ionosphere | 96.29 ± 2.35 | 95.57 ± 3.24 |
| Tic-tac-toe | 98.04 ± 1.91 | 98.02 ± 1.78 |

6.5.6 Rule Pruning: All Rules, No Rules or Best Rule?

In AntMiner-CC we prune only the best rule found during an execution of the REPEAT-UNTIL loop before inserting it in the final rule set. In this section we report the results of two experiments. In the first experiment we prune all the rules discovered during the REPEAT-UNTIL loop (prior to pheromone update) and in the second experiment none of the rules is pruned. During the two experiments the rest of the algorithm and the parameters are kept the same as before (α = 1, β = 3, ρ = 0.10). The experiment is done on one fold of each dataset and the results are shown in Table 6.7. By observing Table 6.7 we note that rule pruning has a negative effect in only 1 out of 12 cases. In all other cases rule pruning increases the generalizing capability of the rule set. Furthermore, we note that pruning only the best rule and pruning all the rules give almost similar results. Since pruning is a costly procedure, we retain pruning of only the best rule in our algorithm.

Table 6.7 Results of pruning only the best rule, pruning each constructed rule, and no pruning at all (one fold only)

| Datasets | Pruning Only the Best Rule | Pruning All Generated Rules | Pruning None of the Rules |
| Balance-scale | 93.65 | 93.65 | 91.12 |
| Breast cancer – Wisconsin | 98.57 | 98.57 | 97.14 |
| Congress House Votes | 84.09 | 84.09 | 85.36 |
| Dermatology | 97.29 | 98.78 | 94.59 |
| Ecoli | 94.12 | 94.12 | 94.12 |
| Glass | 81.82 | 81.82 | 77.27 |
| Haberman | 68.78 | 67.74 | 64.51 |
| Hayes Roth | 85.71 | 85.71 | 85.71 |
| Heart | 92.59 | 92.59 | 88.89 |
| Hepatitis | 93.75 | 93.75 | 87.50 |
| Ionosphere | 91.42 | 91.42 | 91.14 |
| Tic-tac-toe | 97.92 | 97.92 | 96.86 |

6.5.7 Average Probability Calls of AntMiner and AntMiner-CC

Table 6.8 shows the average number of ants used and the average number of probability calls in a single iteration of the main WHILE loop of AntMiner and AntMiner-CC. These experimental results indicate that the proposed approach uses fewer ants and fewer probability calls than the original AntMiner. This suggests that the proposed approach is well suited for large dimensional datasets.

Table 6.8 Average number of ants and average number of probability calls in a single iteration of the REPEAT-UNTIL loop, with standard deviations (one fold)

| Datasets | AntMiner-CC: Avg. # Ants | AntMiner-CC: Avg. Prob. Calls | AntMiner: Avg. # Ants | AntMiner: Avg. Prob. Calls |
| Balance-scale | 75.54 ± 8.38 | 220.28 ± 40.50 | 195.86 ± 233.17 | 2958.60 ± 1213.12 |
| Breast cancer – Wisconsin | 54.45 ± 8.84 | 112.24 ± 26.67 | 250.45 ± 54.01 | 2984.48 ± 1123.54 |
| Congress House Votes | 108.77 ± 52.84 | 880.61 ± 545.32 | 90.60 ± 36.47 | 2430.0 ± 860.40 |
| Dermatology | 72.48 ± 16.86 | 445.76 ± 212.56 | 331.29 ± 110.16 | 5379.23 ± 1434.63 |
| Ecoli | 70.20 ± 17.34 | 265.27 ± 65.56 | 294.44 ± 98.53 | 2393.62 ± 681.24 |
| Glass | 68.92 ± 20.66 | 278.90 ± 145.73 | 60.44 ± 32.44 | 2272.06 ± 1395.80 |
| Haberman | 50.56 ± 6.37 | 134.10 ± 18.16 | 165.09 ± 45.41 | 5314.0 ± 1374.48 |
| Hayes Roth | 53.89 ± 8.26 | 145.92 ± 49.33 | 176.43 ± 68.37 | 4345.10 ± 1680.76 |
| Heart | 67.90 ± 15.39 | 222.36 ± 86.48 | 80.0 ± 43.78 | 4972.44 ± 1198.25 |
| Hepatitis | 48.98 ± 7.63 | 134.69 ± 16.50 | 71.45 ± 41.06 | 4831.30 ± 935.37 |
| Ionosphere | 62.75 ± 12.45 | 118.73 ± 30.44 | 352.86 ± 86.78 | 5894.54 ± 1504.08 |
| Tic-tac-toe | 122.58 ± 41.32 | 588.29 ± 198.26 | 491.91 ± 297.0 | 2142.34 ± 897.45 |


6.6 Comparison with Other Algorithms

We compare the results of our algorithm with those of AntMiner, AntMiner2, AntMiner3, C4.5, Ripper, AdaBoost, k-nearest neighbor, logistic regression, naive Bayes and support vector machines (SVM). AntMiner has been implemented by us in Matlab. For the other algorithms we use the Weka machine learning tool [24]. The main characteristics of the extended suite of datasets are summarized in Table 6.1. As stated before in Section 4.1, we use the unsupervised discretization filter of the Weka-3.4 machine learning tool [24] for discretizing continuous attributes as a preprocessing step. AntMiner-CC has five user defined parameters: number of ants, evaporation rate, convergence counter, alpha and beta. The values of these parameters are given in Table 6.9. The same parameters have been retained while obtaining results for the previous AntMiners. These values have been chosen as a result of experiments and also because they seem reasonable and have been used by other AntMiner versions reported in the literature [65-70]. The predictive accuracies of the compared algorithms are shown in Tables 6.10 and 6.11. Ten-fold cross validation is used to obtain the results.

Table 6.9 Parameters used in experiments

| Parameter | Value |
| Number of Ants | 1000 |
| Max. uncovered cases (used in other AntMiners) | 10 |
| Min. cases per rule (used in other AntMiners) | 10 |
| Evaporation rate | 0.10 |
| No. of rules converged | 10 |
| Alpha | 1 |
| Beta | 3 |


Table 6.10 Average predictive accuracies obtained using 10-fold cross validation

| Datasets | AntMiner-CC | AntMiner | AntMiner2 | AntMiner3 | C4.5 | Ripper |
| Balance-scale | 91.04 ± 4.35 | 75.32 ± 8.86 | 72.78 ± 9.23 | 75.06 ± 6.91 | 83.02 ± 3.24 | 80.93 ± 3.35 |
| BC-W | 97.28 ± 2.28 | 94.64 ± 2.74 | 92.70 ± 2.82 | 93.56 ± 3.45 | 94.84 ± 2.62 | 95.57 ± 2.17 |
| Car | 93.12 ± 1.44 | 82.38 ± 2.42 | 81.89 ± 2.63 | 78.82 ± 3.76 | 96.0 ± 2.13 | 89.17 ± 2.52 |
| Congress House Votes | 95.20 ± 4.72 | 94.54 ± 2.27 | 95.76 ± 2.43 | 94.72 ± 3.42 | 95.31 ± 2.57 | 95.66 ± 2.75 |
| Credit (Aus) | 90.0 ± 3.83 | 86.09 ± 4.69 | 84.20 ± 4.55 | 86.67 ± 5.46 | 81.99 ± 7.78 | 86.07 ± 2.27 |
| Credit (Ger) | 85.71 ± 1.38 | 71.62 ± 2.71 | 73.16 ± 5.21 | 72.07 ± 4.32 | 70.73 ± 6.71 | 70.56 ± 5.96 |
| Dermatology | 94.54 ± 4.25 | 58.72 ± 7.36 | 59.18 ± 14.91 | 67.47 ± 9.57 | 95.07 ± 2.80 | 93.96 ± 3.63 |
| Ecoli | 86.85 ± 6.72 | 47.52 ± 11.32 | 43.09 ± 9.76 | 44.61 ± 10.32 | 82.99 ± 7.72 | 79.18 ± 2.22 |
| Glass | 74.74 ± 6.46 | 53.33 ± 4.38 | 56.15 ± 10.32 | 46.36 ± 10.96 | 68.90 ± 8.98 | 70.48 ± 8.19 |
| Haberman | 78.88 ± 6.54 | 71.99 ± 7.57 | 73.94 ± 5.33 | 74.72 ± 4.62 | 73.88 ± 4.66 | 72.47 ± 6.74 |
| Hayes Roth | 86.37 ± 8.70 | 75.05 ± 10.62 | 70.88 ± 15.47 | 85.49 ± 7.76 | 80.93 ± 8.24 | 78.57 ± 14.47 |
| Heart | 90.37 ± 4.35 | 80.74 ± 4.94 | 78.15 ± 10.25 | 87.78 ± 6.77 | 78.43 ± 6.26 | 73.59 ± 9.57 |
| Hepatitis | 94.25 ± 4.76 | 80.67 ± 8.67 | 81.46 ± 11.89 | 80.17 ± 10.23 | 68.25 ± 11.63 | 73.46 ± 8.21 |
| Image Segmentation | 89.36 ± 5.24 | 70.48 ± 10.95 | 72.33 ± 9.34 | 72.98 ± 9.54 | 88.82 ± 8.04 | 83.74 ± 7.72 |
| Ionosphere | 96.29 ± 2.35 | 68.0 ± 11.09 | 66.0 ± 7.31 | 64.57 ± 8.85 | 89.98 ± 5.25 | 90.29 ± 6.90 |
| Iris | 98.0 ± 3.22 | 95.33 ± 4.50 | 94.67 ± 6.89 | 96.0 ± 4.66 | 94.0 ± 6.63 | 94.76 ± 5.26 |
| Mammographic-Mass | 68.37 ± 5.06 | 78.25 ± 3.48 | 79.40 ± 3.64 | 82.33 ± 3.56 | 82.44 ± 4.56 | 83.52 ± 4.52 |
| Pima Indian Diabetes | 87.26 ± 2.44 | 74.63 ± 6.65 | 72.56 ± 4.46 | 70.88 ± 5.05 | 72.11 ± 6.96 | 77.96 ± 7.47 |
| SPECT (Heart) | 85.38 ± 8.16 | 75.38 ± 5.50 | 83.56 ± 8.34 | 78.68 ± 5.17 | 80.03 ± 51.85 | 79.29 ± 19.43 |
| TAE | 80.79 ± 13.13 | 50.67 ± 6.11 | 53.58 ± 7.33 | 53.04 ± 10.67 | 51.33 ± 9.45 | 44.67 ± 10.35 |
| Tic-tac-toe | 98.04 ± 1.91 | 74.95 ± 4.26 | 72.54 ± 5.98 | 72.02 ± 4.50 | 94.03 ± 2.44 | 97.57 ± 1.44 |
| Transfusion | 47.66 ± 7.31 | 77.30 ± 6.37 | 79.44 ± 3.64 | 77.56 ± 3.31 | 77.71 ± 12.32 | 76.11 ± 14.54 |
| Vehicle | 82.66 ± 5.77 | 56.79 ± 9.56 | 53.55 ± 11.07 | 56.0 ± 10.24 | 65.86 ± 6.76 | 66.52 ± 11.86 |
| WDBC | 95.10 ± 3.65 | 85.26 ± 4.22 | 86.66 ± 4.56 | 88.05 ± 5.40 | 93.31 ± 2.72 | 94.03 ± 3.72 |
| Wine | 98.30 ± 2.74 | 90.0 ± 9.22 | 90.49 ± 10.13 | 94.44 ± 5.24 | 96.60 ± 3.93 | 94.90 ± 5.54 |
| Zoo | 96.09 ± 5.05 | 81.36 ± 11.30 | 79.36 ± 15.51 | 75.27 ± 11.69 | 94.0 ± 9.17 | 90.0 ± 11.55 |

Table 6.11 Average predictive accuracies obtained using 10-fold cross validation

| Datasets | AntMiner-CC | AdaBoost | kNN | Logit | NaiveBayes | SVM |
| Balance-scale | 91.04 ± 4.35 | 72.33 ± 2.39 | 89.26 ± 1.59 | 88.30 ± 2.69 | 91.04 ± 2.55 | 87.98 ± 1.80 |
| BC-W | 97.28 ± 2.28 | 94.83 ± 1.67 | 96.42 ± 1.54 | 96.56 ± 1.21 | 96.13 ± 1.19 | 96.70 ± 0.69 |
| Car | 93.12 ± 1.44 | 69.79 ± 0.48 | 93.75 ± 1.87 | 93.22 ± 2.10 | 86.04 ± 2.32 | 93.74 ± 2.65 |
| Congress House Votes | 95.20 ± 4.72 | 96.08 ± 3.27 | 92.28 ± 3.86 | 95.14 ± 3.0 | 90.54 ± 3.59 | 96.09 ± 3.06 |
| Credit (Aus) | 90.0 ± 3.83 | 85.76 ± 3.40 | 82.44 ± 7.31 | 85.77 ± 4.75 | 79.37 ± 4.57 | 85.17 ± 2.06 |
| Credit (Ger) | 85.71 ± 1.38 | 76.12 ± 2.01 | 74.43 ± 7.87 | 75.82 ± 4.24 | 74.87 ± 5.96 | 75.11 ± 3.63 |
| Dermatology | 94.54 ± 4.25 | 96.45 ± 2.88 | 95.62 ± 3.22 | 96.71 ± 2.83 | 96.98 ± 2.75 | 97.54 ± 2.99 |
| Ecoli | 86.85 ± 6.72 | 81.53 ± 7.25 | 80.86 ± 6.45 | 86.31 ± 5.38 | 85.66 ± 2.83 | 82.69 ± 5.04 |
| Glass | 74.74 ± 6.46 | 44.78 ± 1.89 | 70.95 ± 5.83 | 63.65 ± 6.72 | 51.69 ± 8.31 | 57.70 ± 8.10 |
| Haberman | 78.88 ± 6.54 | 69.76 ± 7.21 | 70.11 ± 11.95 | 73.08 ± 3.99 | 74.05 ± 4.76 | 73.44 ± 0.97 |
| Hayes Roth | 86.37 ± 8.70 | 83.30 ± 10.46 | 70.93 ± 9.59 | 53.52 ± 11.18 | 77.20 ± 11.08 | 58.13 ± 15.33 |
| Heart | 90.37 ± 4.35 | 82.33 ± 6.08 | 80.71 ± 6.17 | 77.0 ± 5.05 | 85.19 ± 8.27 | 80.32 ± 6.25 |
| Hepatitis | 94.25 ± 4.76 | 74.71 ± 15.02 | 67.62 ± 9.06 | 64.25 ± 8.87 | 74.79 ± 6.90 | 75.37 ± 8.62 |
| Image Segmentation | 89.36 ± 5.24 | 88.02 ± 6.15 | 86.57 ± 7.99 | 85.71 ± 9.28 | 78.48 ± 5.29 | 88.52 ± 5.70 |
| Ionosphere | 96.29 ± 2.35 | 90.0 ± 3.37 | 86.0 ± 6.38 | 86.57 ± 7.63 | 82.75 ± 4.94 | 87.14 ± 5.75 |
| Iris | 98.0 ± 3.22 | 95.33 ± 3.22 | 95.33 ± 4.50 | 97.33 ± 5.62 | 95.33 ± 3.22 | 96.67 ± 3.52 |
| Mammographic-Mass | 68.37 ± 5.06 | 81.88 ± 4.54 | 78.07 ± 5.99 | 82.92 ± 3.91 | 82.69 ± 3.82 | 80.32 ± 4.37 |
| Pima Indian Diabetes | 87.26 ± 2.44 | 74.33 ± 6.49 | 69.75 ± 7.13 | 77.98 ± 5.91 | 75.49 ± 4.50 | 77.06 ± 4.09 |
| SPECT (Heart) | 85.38 ± 8.16 | 84.52 ± 15.29 | 81.20 ± 16.77 | 81.94 ± 15.92 | 76.87 ± 12.72 | 87.20 ± 11.48 |
| TAE | 80.79 ± 13.13 | 41.34 ± 6.13 | 64.67 ± 7.73 | 53.33 ± 11.33 | 57.33 ± 10.91 | 58.67 ± 10.98 |
| Tic-tac-toe | 98.04 ± 1.91 | 98.23 ± 0.50 | 98.75 ± 0.66 | 98.23 ± 0.50 | 70.09 ± 5.78 | 98.33 ± 0.53 |
| Transfusion | 47.66 ± 7.31 | 75.84 ± 16.0 | 71.01 ± 14.98 | 77.17 ± 14.49 | 72.37 ± 17.22 | 76.24 ± 15.33 |
| Vehicle | 82.66 ± 5.77 | 69.36 ± 8.96 | 67.62 ± 5.63 | 77.18 ± 9.35 | 42.79 ± 2.54 | 66.57 ± 7.40 |
| WDBC | 95.10 ± 3.65 | 94.71 ± 2.89 | 95.06 ± 2.88 | 95.24 ± 2.07 | 93.48 ± 4.49 | 97.72 ± 1.45 |
| Wine | 98.30 ± 2.74 | 92.01 ± 4.93 | 96.08 ± 4.59 | 96.60 ± 4.03 | 98.30 ± 2.74 | 98.30 ± 2.74 |
| Zoo | 96.09 ± 5.05 | 91.0 ± 9.94 | 97.0 ± 6.75 | 97.0 ± 4.83 | 97.0 ± 6.75 | 95.0 ± 7.07 |

The results indicate that AntMiner-CC achieves a higher accuracy rate than the compared algorithms for most of the datasets.

6.7 Summary

This chapter proposes improvements to the previously presented algorithm: the use of a slightly modified heuristic function and the assignment of the majority class label of the complete training set to the default rule. Various possibilities are compared, such as whether to use a symmetric or an asymmetric pheromone matrix and whether to prune all rules, none of the rules, or only the best rule. The number of probability equation calls, which indicates the speed of the algorithm, is also calculated. We analyze some aspects of the improved algorithm and check its performance on a suite of twenty-six datasets. The experimental results show that the proposed improved version achieves a higher accuracy rate than other conventional algorithms. Next, Chapter 7 presents an associative classification algorithm using ACO. It is a hybrid approach developed by combining association rules mining and classification.


7 Chapter 7: Associative Classification Using Ant Colony Optimization

Classification rule discovery and association rule mining are two important data mining techniques. Association rule mining discovers all those rules from the training set that satisfy minimum support and confidence thresholds, while classification rule mining discovers a set of rules for predicting the class of unseen data [78]. In this chapter, we present a hybrid classification algorithm called ACO-AC, combining the ideas of association rule mining and supervised classification using ACO. It is class based association rule mining, in which the consequent of an association rule is always a class label. The proposed technique integrates classification with association rule mining to discover high quality rules for improving the performance of the resulting classifier. ACO is used to mine only an appropriate subset of class association rules instead of exhaustively searching for all possible rules. The mining process stops when the discovered rule set achieves a minimum coverage threshold. Strong association rules are discovered based on confidence and support, and these rules are used to classify the unseen data. This integration finds more accurate and compact rules from the training set. Another advantage of the proposed approach is that the association rules of each class can be mined in parallel in a distributed manner.

7.1 Associative Rules Mining and Associative Classification

There are different data mining techniques, including supervised classification, association rules mining or market basket analysis, unsupervised clustering, web data mining, and regression. One technique of data mining is classification. The goal of classification is to build a model of the training data that can correctly predict the class of unseen or test objects. The input of this model learning process is a set of objects along with their classes (supervised training data). Once a predictive model is built, it can be used to predict the class of test cases for which the class is not known. To


measure the accuracy of the model, the available dataset is divided into training and test sets. The training set is used to build the model and the test set is used to measure its accuracy. There are several problems from a wide range of domains which can be cast as classification problems. Therefore there is always a need for algorithms that build comprehensible and accurate classifiers.

Association rules mining (ARM) is another important data mining technique. It is used to find strong and interesting relationships among data items present in a set. A typical example of ARM is market basket analysis [79]. In market basket analysis each record contains a list of items purchased by a customer, and we are interested in finding the sets of items that are frequently purchased together. The objective is to search for interesting habits of customers. The sets of items occurring together can be written as association rules, which can be expressed as "IF ... THEN" statements: the IF part is called the antecedent of the rule and the THEN part contains the consequent of the rule. In ARM the antecedent and consequent are sets of data items called item-sets. An item-set that contains k items is called a k-item-set. An association rule is written as A => B, where A and B are sets of items. There are many real world applications of ARM, including market basket analysis, customer segmentation, electronic commerce, medicine, web mining, finance, and bioinformatics [80].

In ARM two factors are used to measure the importance of a rule [81-88]. The first is support, which is the ratio (or percentage) of transactions in which an item-set appears with respect to the total number of transactions. The second factor is confidence, which is the percentage of transactions that contain all items in the consequent as well as the antecedent, relative to the number of transactions that contain all items in the antecedent. The aim of ARM is to find all rules whose support and confidence are greater than the minimum support and confidence thresholds specified by the user. The support and confidence of a rule X => Y are calculated according to Equations (7.1) and (7.2):

$$Support(X \Rightarrow Y) = P(X \cup Y) \qquad (7.1)$$

$$Confidence(X \Rightarrow Y) = P(Y \mid X) \qquad (7.2)$$

where P(X ∪ Y) is the probability that a transaction contains X and Y together and P(Y | X) is the probability of Y given X [89]. In other words, support is the probability that a selected transaction from the database will hold all items in the antecedent and the consequent, whereas the confidence is the probability that a randomly selected transaction will contain all the items in the consequent given that the transaction contains all the items in the antecedent. For example, if a supermarket database has 50,000 transactions, out of which 1,000 include both items I1 and I2 and 400 of these also include item C, the association rule "If I1 and I2 are purchased then C is also purchased" has a support of 400 transactions, which is 400/50,000 = 0.8%, and a confidence of 40% (400/1,000).

Associative classification is a specific kind of ARM in which we are interested in finding class based association rules. A class based association rule is a rule in which the consequent is always a class label. This is the problem tackled in the current chapter. Associative classification takes advantage of ARM for finding interesting relationships among items in the dataset. The support and confidence measures are used to find important rules: we are interested in those association rules that satisfy the minimum support and confidence thresholds specified by the user. The basic problem is that of mining association rules from large amounts of data. The dataset which is used to build the class association rules contains a set of transactions described by a set of attributes, and each transaction belongs to a predetermined class. The representation of a class association rule is X => C, where X is a list of items and C is the class label. A general association rules mining approach can predict any attribute, not just the class attribute, and can predict the values of more than one attribute. Another difference is that class based association rules are normally used together as a set for the classification of unseen test cases. In associative classification a factor which is used alongside the support and confidence measures is called coverage. It is the percentage of the dataset that is (correctly) covered by a set of rules [89]. Coverage is also specified by the user.
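To make Equations (7.1) and (7.2) concrete, the following Python fragment is a small, illustrative computation of support and confidence over a list of transactions (the function and variable names are assumptions for the example, not part of the proposed algorithm):

```python
def support_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent.

    transactions           : list of sets of items.
    antecedent, consequent : sets of items.
    """
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n                          # P(X u Y), Eq. (7.1)
    confidence = both / ante if ante else 0.0   # P(Y | X), Eq. (7.2)
    return support, confidence

# Supermarket example from the text: 50,000 transactions, 1,000 contain {I1, I2},
# and 400 of those also contain C  ->  support = 0.8%, confidence = 40%.
```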


There are different challenges in associative classification. The first problem is that rule generation is based on a frequent item-set mining process, and for large databases it takes a lot of time due to the large quantity of items and samples. Secondly, associative classification generates more rules; there may be redundant rules included in the classifier, which increases the time cost when classifying objects.

Different approaches have been developed for associative classification. An algorithm was proposed by B. Liu et al. [90]. It has three main steps: rule discovery, rule selection and classification. The rule discovery process mines all rules from the training dataset where the consequent of the rule is a class label; these rules are called class association rules. The rule selection process selects a subset of rules from all discovered rules on the basis of their predictive accuracy to make a classifier. They use the confidence measure for selecting rules, as higher confidence rules usually give higher predictive accuracy [91]. Finally the classification process classifies the unseen data samples: an unseen data sample is assigned the class of the rule that has the highest confidence value and which also matches the data sample. The basic problem with their approach is that they mine all possible rules that satisfy the minimum support and confidence thresholds. This computation is very expensive for large databases. Another class based ARM algorithm, called "classification based on multiple class association rules", was proposed by W. Li et al. [92]. They use multiple rules for classifying an unseen data sample: to classify a test sample the algorithm collects a small set of high confidence rules that match the test sample and analyzes the correlation among these rules to assign the class label. They also use a tree structure for storing rules to improve the efficiency of the rule retrieving process for classification. This algorithm also generates all possible association rules.

Our proposed associative classification algorithm uses an ACO algorithm for finding interesting relationships among data items. It uses its evolutionary capability to efficiently find more interesting subsets of association rules. It does not exhaustively search for all possible association rules as conventional ARM approaches do. In each generation of the algorithm a number of rules that satisfy the minimum support and confidence thresholds are selected for the final classifier. After each generation the pheromone values are updated in such a way that better rules can be extracted in the coming generations. The final discovered rule set is the predictive model and is used to classify unseen test samples.


1   Discovered_RuleList = {};   /* initialize the rule list with the empty set */
2   TrainingSet = {all training samples};
3   Initialize min_support, min_confidence, min_coverage;   /* minimum support, confidence and coverage thresholds */
4   Initialize No_ants;   /* maximum number of ants */
5   FOR EACH CLASS C IN THE TRAINING SET
6       Rule_Set_Class = {};   /* initialize the rule set of the selected class with the empty set */
7       Initialize pheromone values of all trails;
8       Initialize the heuristic values;
9       Calculate the support of all 1-itemsets (item => C) of the training set;
10      IF (support(item) < min_support)
11          Set the pheromone value of all those items to 0;
12      END IF
13      g = 1;   /* generation count */
14      WHILE (g != no_attributes && coverage < min_coverage)
15          Temp_Rule_Set_Class = {};
16          t = 1;   /* counter for ants */
17          DO
18              Ant_t constructs a class based association rule with a maximum of g items in the rule;
19              t = t + 1;
20          WHILE (t <= No_ants)
21          FOR EACH constructed Rule
22              IF (support(Rule) >= min_support AND confidence(Rule) >= min_confidence)
23                  Insert the rule in Temp_Rule_Set_Class;
24              END IF
25          END FOR
26          Sort all the rules in Temp_Rule_Set_Class according to confidence and then support;
27          Insert the rules one by one from Temp_Rule_Set_Class into Rule_Set_Class until the coverage of Rule_Set_Class is greater than or equal to min_coverage;
28          Update pheromones;
29          g = g + 1;   /* increment generation count */
30      END WHILE
31      Insert Rule_Set_Class in Discovered_RuleList;
32  END FOR
33  Prune the discovered rule set;
34  Output: Final classifier;

Figure 7-1 Proposed ACO-AC algorithm


7.2 Differences with AntMiner-C and AntMiner-CC

The algorithms shown in Figures 4-2, 5-2, and 6-2 are sequential covering algorithms that discover one best rule at a time and remove the training samples covered by that rule. These algorithms also discover an ordered rule list, in which a test case is assigned the class label of the first rule that covers it. The proposed associative classification algorithm ACO-AC is shown in Figure 7-1 and combines the association rules mining approach with classification rule discovery. This algorithm inherits from the previous algorithms the pheromone initialization process shown in line 7, the heuristic function calculation shown in line 8, and the pheromone update method shown in line 28. The proposed ACO-AC discovers an unordered rule list and has a different rule construction process, rule quality measure, rule set pruning method, method of classifying unseen test cases, etc. The details are given in Section 7.3.1.

7.3 Proposed Technique

In this section we describe the steps of our proposed ACO-AC approach in detail.

7.3.1 General Description

The proposed approach finds a set of association rules from a training set to form a classifier. It does not mine all possible association rules but only a subset of them. Conventional association rules mining algorithms mine all possible rules, which is computationally expensive for large databases. The rules are selected on the basis of support and confidence. Each rule is of the form:

IF (item1 AND item2 AND ...) THEN class

Each item is an attribute-value pair. An example of an item is "weather = cold": the attribute's name is "weather" and "cold" is one of its possible values. The consequent of each association rule is a class label from the set of classes present in the training dataset. We use only the "=" operator as our algorithm only deals with categorical attributes. The proposed algorithm is shown in Figure 7-1 and its flow chart is given in Figure 7-2. The search for the rules is ACO based. The search space is defined in the form of a graph, where each node of the graph represents a possible value of an attribute. Rules are discovered for each class separately. A temporary set of rules is discovered


during each generation of the algorithm and inserted in a set of rules reserved for the selected class label. This process continues until the coverage of the rule set of the selected class is greater than or equal to a minimum coverage threshold specified by the user. When the rule set of the selected class has sufficient rules to satisfy the minimum coverage threshold, rules are generated for another class. The algorithm stops when the rules of all classes have been generated; the final classifier contains the rules of all classes.

At the start of the algorithm, the discovered rule set is empty and the user defined parameters are initialized; these include the minimum support, minimum confidence, minimum coverage and the number of ants used by the algorithm. As we mine the association rules of each class separately, the first step is to select a class from the set of remaining classes. The pheromone values and heuristic values on the links between items (attribute-value pairs) are then initialized. The pheromone values on incoming links to all those items that do not satisfy the minimum support threshold are set to zero, so that ants are not able to choose these items. The generation count g is set to 1. The generation count controls the maximum number of items that can be added by an ant to the antecedent part of the rule it is constructing. For example, when g = 2 an ant can add a maximum of two items to its rule antecedent. This means that in the first generation we mine one-length association rules only. In the second generation we try to have two-length rules, but we may not be able to reach length two in some cases if the support of all candidate items is below the minimum threshold. Similarly we have the third, fourth and subsequent generations. The maximum value of the generation count is the number of attributes in the dataset excluding the class attribute.


Figure 7-2 Flow chart of proposed ACO-AC algorithm


The algorithm discovers association rules for a class based on the support and confidence measures. In the DO-WHILE loop each ant constructs a rule. When all ants have constructed their rules, the support and confidence of each rule are calculated and those rules which meet the minimum support and confidence thresholds are selected. We then sort all selected rules in decreasing order, first on the basis of confidence and then on the basis of support, before trying to insert them into the rule list of the selected class. The sorted rules are inserted one by one into the rule list of the selected class, according to a criterion described below, until the coverage of the rule set is greater than or equal to a minimum coverage threshold specified by the user. If the minimum coverage criterion is not met, the pheromone values are updated and a new generation starts. If the minimum coverage criterion is met, the WHILE loop is exited and the rules of the selected class are copied into the final discovered rule list. Subsequently rules are built for another class, and this process continues until there are no more classes left. After the rules of all classes have been built, the rule set pruning procedure tries to remove redundant rules from the discovered rule set, and the remaining set is the final classifier. The complete description of the algorithm is given in the following subsections.

7.3.2 Rule Construction

Each ant constructs a single item rule in the first generation. In the second generation each ant tries to construct a rule with two items. Similarly we have three item rules in the third generation, and so on. Rules with a maximum of k items are generated in the kth generation, where k is the number of attributes in the training set excluding the class attribute.

7.3.3 Pheromone Initialization

The pheromone values on all edges are initialized before the start of the WHILE loop for each new class. The pheromone values on the edges between all items are initialized with the same amount of pheromone. The initial pheromone is:

$$\tau_{ij}(t=1) = \frac{1}{\sum_{i=1}^{a} b_i} \qquad (7.3)$$


where a is the total number of attributes in the training set excluding the class attribute and bi is the number of possible values in the domain of attribute Ai. The pheromone values of all those items which do not satisfy the minimum support threshold are set to zero. The value of zero ensures that these items cannot be selected by ants during the rule construction process.
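A minimal sketch of this initialization step, assuming items are indexed consecutively and `item_supports` holds the support of each 1-itemset for the selected class (illustrative Python, not the thesis's MATLAB code):

```python
import numpy as np

def init_pheromone(values_per_attribute, item_supports, min_support):
    """Initialize the pheromone look-up table for one class (Eq. 7.3).

    values_per_attribute : list of b_i, the number of values of each attribute.
    item_supports        : support of each 1-itemset (item => class), indexed by item.
    min_support          : minimum support threshold.
    """
    n_items = sum(values_per_attribute)
    tau0 = 1.0 / n_items                       # same initial amount on every edge
    tau = np.full((n_items, n_items), tau0)
    for j, sup in enumerate(item_supports):
        if sup < min_support:
            tau[:, j] = 0.0                    # incoming links to unsupported items get zero pheromone
    return tau
```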

7.3.4 Selection of an Item

An ant incrementally adds an item to the antecedent part of the rule that it is constructing. When an item (i.e. an attribute-value pair) has been included in the rule, no other value of that attribute can be considered. The probability of selection of an item for the current partial rule is given by Equation (7.4):

$$P_{ij}(t) = \frac{\tau_{ij}(g)\,\eta_{ij}(c)}{\sum_{i=1}^{a} x_i \sum_{j=1}^{b_i} \tau_{ij}(g)\,\eta_{ij}(c)} \qquad (7.4)$$

where τij(g) is the amount of pheromone associated between itemi and itemj in the current generation, and ηij(c) is the value of the heuristic function on the link between itemi and itemj for the currently selected class. The total number of attributes in the training dataset is a, xi is a binary variable that is set to 1 if attribute Ai has not been used by the current ant and to 0 otherwise, and bi is the number of possible values in the domain of attribute Ai. The denominator normalizes the τij(g)ηij(c) value of each possible choice by the summation of the τij(g)ηij(c) values of all possible choices. Items which have higher pheromone and heuristic values are more likely to be selected.
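The selection itself can be realized with a standard roulette wheel over the admissible items; since the denominator of Equation (7.4) only normalizes the products τij(g)ηij(c), it does not change the outcome of the draw. The following Python fragment is an illustrative sketch (the dictionaries `tau` and `eta`, keyed by candidate item for the last chosen item, are assumptions for the example):

```python
import random

def select_item(tau, eta, candidate_items):
    """Roulette-wheel selection of the next item according to Eq. (7.4).

    tau, eta        : mappings item -> pheromone / heuristic value on the link
                      from the last chosen item to that item.
    candidate_items : items of attributes not yet used by the current ant.
    """
    weights = [tau[j] * eta[j] for j in candidate_items]
    total = sum(weights)
    if total == 0:
        return None                               # no admissible item left
    r, acc = random.uniform(0, total), 0.0
    for item, w in zip(candidate_items, weights):
        acc += w
        if acc >= r:
            return item
    return candidate_items[-1]                    # numerical safety fallback
```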

7.3.5 Heuristic Function

The heuristic value of an item indicates the quality or attractiveness of that item and is used to guide the process of item selection. We use a correlation based heuristic function that calculates the correlation of candidate items with the last item (attribute-value pair) chosen by the current ant. The heuristic function is:

$$\eta_{ij} = \frac{|item_i, item_j, class_k|}{|item_i, class_k|}\cdot\frac{|item_j, class_k|}{|item_j|} \qquad (7.5)$$

The most recently chosen item is itemi and itemj is the item being considered for adding to the rule. The component |itemi, itemj, classk| is the number of uncovered training samples having itemi and itemj with class label k, the class for which the ants are constructing rules. This value is divided by the number of uncovered training samples that have itemi with classk to find the correlation between the items itemi and itemj. The other component of the heuristic function indicates the overall importance of itemj in determining classk: the factor |itemj, classk| is the number of uncovered training samples having itemj with classk, and it is divided by |itemj|, the number of uncovered training samples having itemj. The heuristic function thus considers the relationship between the items to be added to the rule and also takes into consideration the overall distribution of the item to be added. As rules are built for a specific class label, our heuristic function is dependent on the class chosen by the ant. Our heuristic function reduces the irrelevant search space during the rule construction process in order to better guide the ant in choosing the next item for its rule antecedent. It assigns a zero value to the combination of those items which do not occur together for a given class, thus efficiently restricting the search space for the ants. Therefore, it can be very useful for large dimensional search spaces.
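A small, illustrative Python sketch of Equation (7.5), assuming the uncovered training data is available as a list of (item-set, class-label) pairs named `samples`:

```python
def item_heuristic(samples, item_i, item_j, class_k):
    """Heuristic value eta_ij of Eq. (7.5) for adding item_j after item_i."""
    def count(items, label=None):
        return sum(1 for itemset, y in samples
                   if items <= itemset and (label is None or y == label))

    ij_k = count({item_i, item_j}, class_k)   # |item_i, item_j, class_k|
    i_k  = count({item_i}, class_k)           # |item_i, class_k|
    j_k  = count({item_j}, class_k)           # |item_j, class_k|
    j    = count({item_j})                    # |item_j|
    if i_k == 0 or j == 0:
        return 0.0                            # items never occur together for this class
    return (ij_k / i_k) * (j_k / j)
```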

7.3.6 Heuristic Function for the 1st Item

We use the Laplace-corrected confidence for calculating the heuristic value of the first item of the rule antecedent, given in Equation (7.6):

$$\eta_j = \frac{|item_j, class_k| + 1}{|item_j| + No\_classes} \qquad (7.6)$$

where No_classes is the total number of classes present in the dataset. The advantage of this heuristic function is that it penalizes items that would lead to very specific rules and thus helps to avoid over-fitting. For example, if an item occurs in just one training sample and its class is the selected class for which rules are built, then its confidence is 1


without the Laplace correction. If we use the Laplace correction, its confidence is (1 + 1)/(1 + 2) = 2/3 ≈ 0.67 if there are two classes in the data.
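As a minimal illustration of Equation (7.6) (an assumed Python sketch, using the same (item-set, class-label) sample representation as above):

```python
def first_item_heuristic(samples, item_j, class_k, n_classes):
    """Laplace-corrected confidence of Eq. (7.6) for the first item of a rule."""
    j_k = sum(1 for itemset, y in samples if item_j in itemset and y == class_k)
    j   = sum(1 for itemset, y in samples if item_j in itemset)
    return (j_k + 1) / (j + n_classes)

# Example from the text: an item occurring in a single sample of the selected class
# gets (1 + 1) / (1 + 2) = 2/3 with two classes, instead of 1.0 without the correction.
```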

7.3.7 Rule Construction Stoppage

An ant continues to add items to the rule in every generation; for example, if the generation counter is three then it can add a maximum of three items to the rule antecedent which it is constructing. The rule construction process can stop in two cases: first, if the value of the generation counter is equal to the total number of attributes present in the dataset (excluding the class attribute); and second, if in any generation the coverage of the rule set of that particular class reaches the minimum coverage threshold.

7.3.8 Quality of a Rule

The quality of a rule is calculated on the basis of its confidence, which is computed as:

$$Q = \frac{TP}{Covered} \qquad (7.7)$$

Here Covered is the number of training samples that match the rule antecedent and TP is the number of training samples which match the antecedent of the rule and whose class label is the same as the consequent of the rule. If the confidence value is high then the rule is considered more accurate. This value is also used for updating the pheromone values.

7.3.9 Pheromone Update

The pheromone values are updated after each generation so that in the next generation ants can make use of this information in their search. The amount of pheromone on the links between items occurring in those rules which satisfy the minimum support threshold but whose confidence is below the minimum required confidence (and which were therefore removed from the temporary rule set) is updated according to Equation (7.8):

$$\tau_{ij}(g+1) = (1-\rho)\,\tau_{ij}(g) + \left(1 - \frac{1}{1+Q}\right)\tau_{ij}(g) \qquad (7.8)$$

where τij(g) is the pheromone value between itemi and itemj in the current generation, ρ represents the pheromone evaporation rate and Q is the quality of the rule constructed by an ant. The pheromones of these rules are increased so that in the next generation ants can


explore more of the search space instead of searching around those rules which are already inserted in the discovered rule set. This pheromone strategy increases the diversity of the search by focusing on new, unexplored areas of the search space. The pheromone update on the items occurring in those rules which are rejected due to low confidence but which have sufficient support is done in two steps: first a percentage of the pheromone value is evaporated, and then a percentage of pheromone (depending upon the quality of the rule) is added. If the rule is good then the items of that rule will become more attractive in the next generation and more likely to be chosen by ants. Pheromones are evaporated to encourage exploration and to avoid early convergence. The pheromone values of the other links are updated by normalization: each pheromone value is normalized by dividing it by the summation of all pheromone values of its competing items. If the quality of a rule is good and there is a pheromone increase on the items used in the rule, then the competing items will become less attractive in the next generation due to normalization. The reverse is true if the quality of the rule is not good.
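The generation-level update can be sketched as follows. This is an illustrative Python fragment: `rejected_rules` (the rules that met the support threshold but not the confidence threshold) and the row-wise normalization over competing items are assumptions about the data layout, not a verbatim transcription of the implementation.

```python
def update_pheromone(tau, rejected_rules, rho):
    """Pheromone update after a generation (Eq. 7.8), followed by normalization.

    tau            : 2-D pheromone look-up table (list of lists).
    rejected_rules : rules with sufficient support but too-low confidence; each is a
                     (item_indices, quality) pair, where quality Q = TP / Covered (Eq. 7.7).
    rho            : pheromone evaporation rate.
    """
    for item_indices, quality in rejected_rules:
        for i, j in zip(item_indices, item_indices[1:]):
            # evaporate a fraction, then deposit an amount that grows with rule quality
            tau[i][j] = (1 - rho) * tau[i][j] + (1 - 1.0 / (1 + quality)) * tau[i][j]

    for row in tau:                               # normalize each row over competing items
        s = sum(row)
        if s > 0:
            for j in range(len(row)):
                row[j] /= s
```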

7.3.10 Rule Selection Process

After all the ants have constructed their rules during a generation, these rules are placed in a temporary set. These rules are checked against the minimum support and confidence criteria and those which do not fulfill them are removed. The next step is to insert these rules into the rule set reserved for the discovered rules of the current class. A rule is moved from the temporary rule set to the rule set of the current class only if it is found to enhance the quality of the latter set. For this purpose the top rule of the temporary rule set, called R1, is removed. This rule R1 is compared, one by one, with all the rules already present in the discovered rule set of the selected class. The comparison continues until a rule from the discovered rule set satisfies a criterion described below, or until there are no more rules left in the discovered rule set with which R1 can be compared. In the latter case, when no rule in the discovered rule set fulfills the criterion, R1 is inserted into the discovered rule set. If a rule in the discovered rule set fulfills the criterion then the rule R1 is rejected and further comparison of R1 is stopped. The criterion is as follows. Let the compared rule of the discovered rule set be called R2. If R2 is more general than R1 and the confidence of R2 is


higher than or equal to that of R1, then R2 satisfies the criterion to reject the inclusion of R1. If R2 is exactly the same as R1 then the criterion is also satisfied. The logic of this criterion is that, since R2 is already in the rule set, any data sample that matches R1 also matches R2; and since we assign the class label of the highest confidence rule, the data sample will always be classified by R2, so R1 would not increase the coverage of the rule set.
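The rejection criterion can be sketched as a simple subset-and-confidence test. This is an illustrative Python fragment which assumes that "more general" means the antecedent of R2 is a subset of the antecedent of R1, and that each rule is represented as an (antecedent item-set, confidence) pair:

```python
def should_reject(candidate, rule_set):
    """Decide whether a candidate rule R1 should be kept out of the class rule set.

    candidate : (antecedent_items: frozenset, confidence: float) for R1.
    rule_set  : list of the same pairs for the rules already selected for this class.
    """
    ant1, conf1 = candidate
    for ant2, conf2 in rule_set:
        # R2 more general than (or identical to) R1, with confidence at least as high
        if ant2 <= ant1 and conf2 >= conf1:
            return True
    return False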

7.3.11 Discovered Rule Set

When the coverage of the discovered rule set of the selected class reaches a coverage threshold, we stop the rule discovery process for that class. This process is repeated for all classes. The final discovered rule set (or list) contains the discovered rules of all classes.

7.3.12 Pruning Discovered Rule List

The discovered rule list may contain a large number of rules and some of them may be redundant. Redundant rules are those rules which do not fire for any training sample. We remove such rules from the final list. The discovered rule list is first sorted on the basis of confidence. Then it is applied to classify the samples of the training set. For each sample the rules in the discovered rule list are tested one by one in their sort order. If a rule fires for the sample then the rules below it are not tested. The rule pruning process flags all those rules which fire for at least one training sample and in this way identifies the rules which are never used. All such unused rules are deleted from the rule set. The remaining rule set becomes the final classifier and is used to predict unseen test cases. This pruning process increases the comprehensibility of the classifier, because a small number of rules can be easily understood by a domain expert. It also makes classification faster, because a test case is checked against the rules one by one and fewer rules mean fewer checks.
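A sketch of this pruning pass, assuming a rule is a triple (antecedent terms, predicted class, confidence) and a training sample is a set of items (illustrative Python), is:

    def prune_rules(rules, training_set):
        """Keep only rules that fire for at least one training sample when the rules
        are applied in order of decreasing confidence."""
        rules = sorted(rules, key=lambda r: r[2], reverse=True)
        used = set()
        for sample in training_set:
            for idx, (terms, _cls, _conf) in enumerate(rules):
                if terms <= sample:      # the rule's antecedent is satisfied
                    used.add(idx)
                    break                # rules below the firing rule are not tested
        return [rule for idx, rule in enumerate(rules) if idx in used]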

7.3.13 Use of Discovered Rule Set for Classifying New Unseen Cases

A test case unseen during training is assigned the class label of the rule that covers it and has the highest confidence among all rules covering it. This is implemented by keeping the rules sorted (from highest to lowest) on the basis of their confidence. For a test case the rules are checked one by one in their sort order, and the first rule whose antecedent matches the new test sample fires; the class predicted by the rule's consequent is assigned to the sample. If none of the discovered rules fires, the sample is assigned the majority class of the training set, which is the default class of the classifier.
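Under the same illustrative rule representation, this classification step can be sketched as:

    def classify(sample, sorted_rules, default_class):
        """sorted_rules are kept in decreasing order of confidence; the first rule whose
        antecedent matches the sample fires, otherwise the default (majority) class is used."""
        for terms, cls, _conf in sorted_rules:
            if terms <= sample:
                return cls
        return default_class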

7.4 Experiments and Analysis

We have implemented the proposed algorithm in MatLab 7.0. We conduct experiments on a machine that has 1.75 GHz dual processors with 1 GB RAM. We compare the results of the proposed approach with other state-of-the-art, well known classification algorithms: AntMiner, AntMiner-C, C4.5 (a decision tree builder), Ripper, SVM (Support Vector Machine), Logistic Regression, K-nearest Neighbour, and Naïve Bayes. The performance measures for comparison are predictive accuracy, number of rules, and number of terms per rule. The experiments are performed using a ten-fold cross validation procedure. In ten-fold cross validation a dataset is randomly divided into ten equally sized, mutually exclusive subsets. Each of the subsets is used once for testing and the other nine are used for training. The results of these ten runs are averaged and this average is reported as the final result. We use twenty-six datasets from the UCI repository [77] for comparing the different techniques. The datasets, sorted on the basis of their main characteristics, are shown in Table 7.1. They include binary and multi-class problems, and they are diverse in terms of the number of attributes, number of samples, number of classes and types of attributes. Since our proposed approach works only with categorical attributes, we discretize continuous attributes in a preprocessing step using an unsupervised discretization filter.
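For concreteness, the fold construction used in this procedure can be sketched as follows (illustrative Python; any random split into ten mutually exclusive, roughly equal parts serves the purpose):

    import random

    def ten_fold_indices(n_samples, seed=0):
        """Yield (train, test) index lists for ten-fold cross validation."""
        idx = list(range(n_samples))
        random.Random(seed).shuffle(idx)
        folds = [idx[i::10] for i in range(10)]          # ten roughly equal folds
        for i in range(10):
            test = folds[i]
            train = [j for k, fold in enumerate(folds) if k != i for j in fold]
            yield train, test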


Table 7.1 Datasets used in the experiments. The datasets are sorted on the basis of attributes, samples and classes

Dataset | Attributes | Dataset | Samples | Dataset | Classes
Haberman | 3 | Hayes Roth | 132 | Haberman | 2
Iris | 4 | Iris | 150 | Transfusion | 2
Balance-scale | 4 | TAE | 151 | Mammographic-Mass | 2
Transfusion | 4 | Hepatitis | 155 | Pima Indian Diabetes | 2
Mammographic-Mass | 5 | Wine | 178 | WBC | 2
Car | 6 | Image Segmentation | 210 | Tic-tac-toe | 2
TAE | 6 | Glass | 214 | Heart | 2
Hayes Roth | 6 | SPECT (Heart) | 267 | Credit (Australia) | 2
Ecolli | 7 | Heart | 270 | Congress House Votes | 2
Pima Indian Diabetes | 8 | Zoo | 282 | Credit (Germany) | 2
WBC | 9 | Vehicle | 282 | Hepatitis | 2
Tic-tac-toe | 9 | Haberman | 307 | SPECT (Heart) | 2
Glass | 9 | Ecolli | 336 | WDBC | 2
Wine | 13 | Ionosphere | 351 | Ionosphere | 2
Heart | 13 | Dermatology | 366 | Iris | 3
Credit (Australia) | 15 | Congress House Votes | 435 | Balance-scale | 3
Zoo | 16 | WDBC | 569 | TAE | 3
Congress House Votes | 17 | Balance-scale | 625 | Hayes Roth | 3
Vehicle | 18 | WBC | 683 | Wine | 3
Credit (Germany) | 19 | Credit (Australia) | 690 | Car | 4
Hepatitis | 19 | Transfusion | 748 | Vehicle | 4
Image Segmentation | 19 | Pima Indian Diabetes | 768 | Dermatology | 6
SPECT (Heart) | 22 | Tic-tac-toe | 958 | Glass | 7
WDBC | 31 | Mammographic-Mass | 961 | Zoo | 7
Dermatology | 33 | Credit (Germany) | 1000 | Image Segmentation | 7
Ionosphere | 34 | Car | 1728 | Ecolli | 8

The values of the user-defined parameters are given in Table 7.2. The parameters are: the number of ants used, the evaporation rate, the alpha and beta parameters (which indicate the relative importance of the pheromone and heuristic values), and the minimum support, confidence and coverage used in the algorithm.

Table 7.2 Parameters used in experiments

Parameter | Value
Number of Ants | 1000
Evaporation Rate | 0.15
Alpha | 1
Beta | 1
Minimum Support | 0.01
Minimum Confidence | 0.50
Minimum Coverage | 0.98

Table 7.3 Average predictive accuracies with standard deviations obtained after 10-fold cross validation

Datasets | ACO-AC | AntMiner-C | Ant-Miner | C4.5 | Ripper
BC-W | 99.29 ± 1.01 | 97.85 ± 1.69 | 94.64 ± 2.74 | 94.84 ± 2.62 | 95.57 ± 2.17
Wine | 100.0 ± 0.0 | 99.44 ± 1.76 | 90.0 ± 9.22 | 96.60 ± 3.93 | 94.90 ± 5.54
Credit(Aus) | 98.84 ± 1.14 | 87.54 ± 3.21 | 86.09 ± 4.69 | 81.99 ± 7.78 | 86.07 ± 2.27
Credit(Ger) | 97.80 ± 1.48 | 72.46 ± 5.13 | 71.62 ± 2.71 | 70.73 ± 6.71 | 70.56 ± 5.96
Car | 91.43 ± 3.31 | 98.03 ± 1.17 | 82.38 ± 2.42 | 96.0 ± 2.13 | 89.17 ± 2.52
Tic-tac-toe | 99.06 ± 0.92 | 100 ± 0.0 | 74.95 ± 4.26 | 94.03 ± 2.44 | 97.57 ± 1.44
Iris | 98.67 ± 2.81 | 98.0 ± 4.50 | 95.33 ± 4.50 | 94.0 ± 6.63 | 94.76 ± 5.26
Bal-scale | 91.53 ± 2.11 | 87.49 ± 6.34 | 75.32 ± 8.86 | 83.02 ± 3.24 | 80.93 ± 3.35
TAE | 86.75 ± 10.89 | 81.38 ± 11.72 | 50.67 ± 6.11 | 51.33 ± 9.45 | 44.67 ± 10.35
Glass | 91.60 ± 6.46 | 82.27 ± 6.67 | 53.33 ± 4.38 | 68.90 ± 8.98 | 70.48 ± 8.19
Heart | 99.25 ± 1.56 | 80.74 ± 9.37 | 80.74 ± 4.94 | 78.43 ± 6.26 | 73.59 ± 9.57
Hepatitis | 98.0 ± 4.50 | 87.17 ± 7.81 | 68.25 ± 11.63 | 73.46 ± 8.21 | 80.67 ± 8.67
Zoo | 97.0 ± 4.83 | 96.0 ± 5.16 | 81.36 ± 11.30 | 94.0 ± 9.17 | 90.0 ± 11.55
Haberman | 86.31 ± 4.57 | 83.05 ± 7.80 | 71.99 ± 7.57 | 73.88 ± 4.66 | 72.47 ± 6.74
Ecolli | 97.80 ± 1.48 | 83.64 ± 6.11 | 47.52 ± 11.32 | 82.99 ± 7.72 | 79.18 ± 2.22
Vehicles | 85.80 ± 6.73 | 85.91 ± 5.62 | 56.79 ± 9.56 | 65.86 ± 6.76 | 66.52 ± 11.86
Mam-Mass | 97.92 ± 2.40 | 78.67 ± 4.65 | 78.25 ± 3.48 | 82.44 ± 4.56 | 83.52 ± 4.52
Ind-Diabetes | 96.88 ± 4.47 | 80.26 ± 6.19 | 74.63 ± 6.65 | 72.11 ± 6.96 | 77.96 ± 7.47
Dermatology | 97.28 ± 3.83 | 94.27 ± 4.51 | 58.72 ± 7.36 | 95.07 ± 2.80 | 93.96 ± 3.63
Ionosphere | 98.29 ± 2.41 | 89.71 ± 7.56 | 68.0 ± 11.09 | 89.98 ± 5.25 | 90.29 ± 6.90
WDBC | 98.95 ± 1.89 | 87.33 ± 5.46 | 85.26 ± 4.22 | 93.31 ± 2.72 | 94.03 ± 3.72
Image-Seg | 91.43 ± 7.71 | 98.10 ± 2.61 | 70.48 ± 10.95 | 88.82 ± 8.04 | 83.74 ± 7.72
Spec-Heart | 98.49 ± 1.94 | 87.41 ± 4.97 | 75.38 ± 5.50 | 80.03 ± 51.85 | 79.29 ± 19.43
Transfusion | 85.71 ± 2.74 | 79.57 ± 3.04 | 77.30 ± 6.37 | 77.71 ± 12.32 | 76.11 ± 14.54
Hayes Roth | 95.66 ± 11.29 | 91.65 ± 8.37 | 75.05 ± 10.62 | 80.93 ± 8.24 | 78.57 ± 14.47
House-Votes | 99.30 ± 1.11 | 95.86 ± 3.71 | 94.54 ± 2.27 | 95.31 ± 2.57 | 95.66 ± 2.75


Table 7.3 (continued)

Datasets | ACO-AC | KNN | Log Reg | Naïve Bayes | SVM
BC-W | 99.29 ± 1.01 | 96.42 ± 1.54 | 96.56 ± 1.21 | 96.13 ± 1.19 | 96.70 ± 0.69
Wine | 100.0 ± 0.0 | 96.08 ± 4.59 | 96.60 ± 4.03 | 98.30 ± 2.74 | 98.30 ± 2.74
Credit(Aus) | 98.84 ± 1.14 | 82.44 ± 7.31 | 85.77 ± 4.75 | 79.37 ± 4.57 | 85.17 ± 2.06
Credit(Ger) | 97.80 ± 1.48 | 74.43 ± 7.87 | 75.82 ± 4.24 | 74.87 ± 5.96 | 75.11 ± 3.63
Car | 91.43 ± 3.31 | 93.75 ± 1.87 | 93.22 ± 2.10 | 86.04 ± 2.32 | 93.74 ± 2.65
Tic-tac-toe | 99.06 ± 0.92 | 98.75 ± 0.66 | 98.23 ± 0.50 | 70.09 ± 5.78 | 98.33 ± 0.53
Iris | 98.67 ± 2.81 | 95.33 ± 4.50 | 97.33 ± 5.62 | 95.33 ± 3.22 | 96.67 ± 3.52
Bal-scale | 91.53 ± 2.11 | 89.26 ± 1.59 | 88.30 ± 2.69 | 91.04 ± 2.55 | 87.98 ± 1.80
TAE | 86.75 ± 10.89 | 64.67 ± 7.73 | 53.33 ± 11.33 | 57.33 ± 10.91 | 58.67 ± 10.98
Glass | 91.60 ± 6.46 | 70.95 ± 5.83 | 63.65 ± 6.72 | 51.69 ± 8.31 | 57.70 ± 8.10
Heart | 99.25 ± 1.56 | 80.71 ± 6.17 | 77.0 ± 5.05 | 85.19 ± 8.27 | 80.32 ± 6.25
Hepatitis | 98.0 ± 4.50 | 67.62 ± 9.06 | 64.25 ± 8.87 | 74.79 ± 6.90 | 75.37 ± 8.62
Zoo | 97.0 ± 4.83 | 97.0 ± 6.75 | 97.0 ± 4.83 | 97.0 ± 6.75 | 95.0 ± 7.07
Haberman | 86.31 ± 4.57 | 70.11 ± 11.95 | 73.08 ± 3.99 | 74.05 ± 4.76 | 73.44 ± 0.97
Ecolli | 97.80 ± 1.48 | 80.86 ± 6.45 | 86.31 ± 5.38 | 85.66 ± 2.83 | 82.69 ± 5.04
Vehicles | 85.80 ± 6.73 | 67.62 ± 5.63 | 77.18 ± 9.35 | 42.79 ± 2.54 | 66.57 ± 7.40
Mam-Mass | 97.92 ± 2.40 | 78.07 ± 5.99 | 82.92 ± 3.91 | 82.69 ± 3.82 | 80.32 ± 4.37
Ind-Diabetes | 96.88 ± 4.47 | 69.75 ± 7.13 | 77.98 ± 5.91 | 75.49 ± 4.50 | 77.06 ± 4.09
Dermatology | 97.28 ± 3.83 | 95.62 ± 3.22 | 96.71 ± 2.83 | 96.98 ± 2.75 | 97.54 ± 2.99
Ionosphere | 98.29 ± 2.41 | 86.0 ± 6.38 | 86.57 ± 7.63 | 82.75 ± 4.94 | 87.14 ± 5.75
WDBC | 98.95 ± 1.89 | 95.06 ± 2.88 | 95.24 ± 2.07 | 93.48 ± 4.49 | 97.72 ± 1.45
Image-Seg | 91.43 ± 7.71 | 86.57 ± 7.99 | 85.71 ± 9.28 | 78.48 ± 5.29 | 88.52 ± 5.70
Spec-Heart | 98.49 ± 1.94 | 81.20 ± 16.77 | 81.94 ± 15.92 | 76.87 ± 12.72 | 87.20 ± 11.48
Transfusion | 85.71 ± 2.74 | 71.01 ± 14.98 | 77.17 ± 14.49 | 72.37 ± 17.22 | 76.24 ± 15.33
Hayes Roth | 95.66 ± 11.29 | 70.93 ± 9.59 | 53.52 ± 11.18 | 77.20 ± 11.08 | 58.13 ± 15.33
House-Votes | 99.30 ± 1.11 | 92.28 ± 3.86 | 95.14 ± 3.0 | 90.54 ± 3.59 | 96.09 ± 3.06

Table 7.4 Average number of rules per discovered rule set and average number of terms per rule. The results are obtained using 10-fold cross validation

Datasets | #R ACO-AC | #R AMC | #R AM | #R C4.5 | #R Ripper | #T/R ACO-AC | #T/R AMC | #T/R AM | #T/R C4.5 | #T/R Ripper
BC-W | 33.90 | 20.40 | 11.0 | 10.50 | 5.10 | 1.0 | 1.42 | 1.02 | 2.32 | 1.79
Wine | 12.0 | 6.0 | 5.50 | 5.30 | 3.90 | 1.0 | 1.25 | 1.04 | 1.41 | 1.62
Credit(Aus) | 26.80 | 5.50 | 3.90 | 74.80 | 4.60 | 1.56 | 1.57 | 1.0 | 3.22 | 1.81
Credit(Ger) | 81.80 | 13.50 | 8.50 | 73.60 | 4.20 | 2.16 | 1.88 | 1.13 | 3.21 | 2.36
Car | 32.10 | 58.0 | 11.40 | 80.26 | 41.10 | 2.68 | 2.50 | 1.03 | 2.59 | 4.01
Tic-tac-toe | 30.90 | 12.30 | 6.60 | 38.60 | 10.30 | 1.82 | 2.74 | 1.09 | 2.64 | 2.82
Iris | 18.90 | 12.80 | 9.20 | 5.50 | 3.90 | 1.19 | 1.05 | 1.0 | 1.22 | 1.03
Bal-scale | 13.50 | 108.7 | 17.70 | 40.10 | 11.10 | 1.0 | 2.43 | 1.0 | 2.85 | 2.91
TAE | 43.30 | 48.70 | 20.90 | 18.30 | 3.90 | 1.82 | 1.44 | 1.0 | 2.69 | 1.64
Glass | 58.60 | 42.20 | 15.50 | 15.40 | 7.20 | 1.92 | 1.93 | 1.01 | 2.83 | 2.33
Heart | 23.10 | 13.10 | 5.60 | 12.60 | 5.60 | 1.0 | 1.79 | 1.08 | 1.73 | 1.86
Hepatitis | 25.80 | 11.30 | 3.90 | 11.60 | 4.60 | 1.07 | 2.41 | 1.11 | 1.70 | 1.0
Zoo | 11.80 | 7.0 | 5.10 | 7.60 | 6.80 | 1.05 | 1.59 | 1.11 | 1.60 | 1.67
Haberman | 35.10 | 58.20 | 20.70 | 3.40 | 2.30 | 1.32 | 1.56 | 1.0 | 1.58 | 2.0
Ecolli | 81.80 | 57.60 | 8.60 | 14.0 | 9.10 | 2.16 | 1.75 | 1.01 | 2.84 | 2.98
Vehicles | 78.50 | 43.0 | 14.20 | 20.10 | 8.40 | 1.87 | 1.82 | 1.02 | 3.13 | 1.92
Mam-Mass | 31.30 | 20.30 | 15.90 | 8.90 | 4.30 | 1.39 | 1.82 | 1.0 | 2.47 | 2.15
Ind-Diabetes | 86.40 | 45.60 | 15.30 | 7.80 | 3.90 | 1.84 | 1.93 | 1.0 | 2.18 | 2.59
Dermatology | 39.10 | 20.0 | 10.40 | 9.30 | 8.70 | 1.66 | 2.69 | 1.07 | 2.23 | 2.99
Ionosphere | 50.60 | 6.67 | 4.20 | 9.30 | 5.80 | 1.0 | 1.15 | 1.0 | 2.54 | 2.33
WDBC | 45.60 | 15.40 | 8.40 | 7.20 | 4.70 | 1.0 | 1.67 | 1.0 | 1.75 | 2.60
Image-Seg | 58.50 | 26.60 | 16.20 | 10.60 | 9.80 | 1.34 | 1.41 | 1.03 | 1.99 | 2.83
Spec-Heart | 19.0 | 25.40 | 5.60 | 12.50 | 2.40 | 1.66 | 7.17 | 1.28 | 3.02 | 3.23
Transfusion | 17.10 | 24.80 | 10.10 | 4.20 | 3.10 | 1.36 | 1.45 | 1.0 | 1.35 | 2.07
Hayes Roth | 18.90 | 19.80 | 8.0 | 11.70 | 7.20 | 1.46 | 1.76 | 1.02 | 2.56 | 2.40
House-Votes | 5.80 | 7.30 | 3.0 | 7.40 | 2.60 | 1.0 | 1.95 | 1.0 | 1.84 | 2.05

Our performance metrics are the average predictive accuracy, which indicates the predictive power of the classifier, the number of rules (#R) and the number of terms per rule (#T/R). The predictive accuracies of the compared algorithms and the proposed approach are given in Table 7.3. The best performance is shown in bold. The experimental results indicate that ACO-AC achieves higher accuracy rates than the compared algorithms on almost all the datasets. For example, on the breast-cancer dataset the proposed approach has 99.29% average accuracy. However, the number of rules is larger than for the other rule-based classifiers (Table 7.4). The number of terms per rule is higher than for AntMiner but lower than for C4.5, AntMiner-C and Ripper.

Table 7.5 Average number of associative rules discovered without applying the redundant rule pruning procedure, for ten-fold cross validation with support = 1% and confidence = 50%

Dataset | #R without pruning | Dataset | #R without pruning
Breast-cancer-w | 40.80 | Haberman | 40.20
Wine | 13.10 | Ecolli | 214.40
Credit(Aus) | 113.80 | Vehicles | 638.10
Credit(Ger) | 214.40 | Mammographic_Masses | 34.50
Car | 49.30 | Pima_Indians_Diabetes | 160.60
Tic-tac-toe | 41.60 | Dermatology | 182.90
Iris | 35.20 | Ionosphere | 108.20
Balance-scale | 13.50 | WDBC | 89.90
TAE | 146.40 | Image Segmentation | 273.90
Glass | 215.90 | SPECT (Heart) | 50.0
Heart | 25.50 | Transfusion | 24.0
Hepatitis | 49.30 | Hayes_Roth | 30.20
Zoo | 24.40 | Congress_House_Votes | 7.50

Table 7.5 shows the average number of class association rules discovered (averaged over ten-fold cross validation). These results are before applying the redundant rule pruning procedure. Comparing them with the results of Table 7.4 (which are obtained after rule pruning) shows that the rule pruning procedure reduces the rule sets significantly. For example, the glass dataset has 215.90 rules without pruning and 58.60 rules after pruning.

Table 7.6 Average accuracy, number of rules and number of terms/rule for different coverage thresholds, obtained with ten-fold cross validation with support = 1% and confidence = 50%

Coverage | Credit(Aus): Accuracy | #R | #T/R | Tic-tac-toe: Accuracy | #R | #T/R | Mammographic Masses: Accuracy | #R | #T/R
0.98 | 98.84 ± 1.14 | 26.80 | 1.56 | 99.06 ± 0.92 | 30.90 | 1.82 | 97.92 ± 2.40 | 31.30 | 1.39
0.95 | 97.68 ± 1.70 | 14.90 | 1.0 | 98.22 ± 1.71 | 23.30 | 1.78 | 97.60 ± 1.63 | 29.40 | 1.41
0.90 | 96.81 ± 1.82 | 11.10 | 1.0 | 97.39 ± 1.79 | 20.20 | 1.78 | 95.94 ± 1.59 | 17.70 | 1.0
0.85 | 95.45 ± 2.14 | 8.0 | 1.0 | 95.41 ± 2.26 | 17.20 | 1.74 | 93.86 ± 1.66 | 15.10 | 1.0
0.80 | 95.20 ± 2.18 | 6.80 | 1.0 | 93.84 ± 2.75 | 14.70 | 1.76 | 91.79 ± 3.79 | 11.90 | 1.0

Table 7.6 shows the results for the credit (Aus), tic-tac-toe and mammographic masses datasets for different values of the coverage threshold. The experimental results indicate that the coverage has a direct impact on the number of rules and the predictive accuracy of the classifier. If we increase the value of the minimum coverage then we obtain a higher accuracy rate and a larger number of rules, and if we decrease the coverage then the accuracy rate and the number of rules also decrease. The reason is that with a low coverage value the algorithm stops early and is unable to find adequate rules for some of the training samples. The best coverage threshold is 0.98 and we have adopted this value for our final experiments. Table 7.7 shows the experimental results on the teacher assistant evaluation (TAE) dataset obtained by varying the support threshold. The threshold is varied from 0.5% to 6%. From the table we can see that the best results are obtained for a support threshold of 1%. For support equal to 0.5%, more specific rules are discovered which do not have generalization capability, and as a result accuracy decreases. When the support threshold is increased beyond 1% the accuracy rate again decreases. The reason is that some items that could increase the accuracy of the rules are not inserted in the antecedent part of the rules because they do not meet the minimum support threshold.

Table 7.7 Average accuracy, number of rules and number of terms/rule for different values of the support threshold for the TAE dataset, after ten-fold cross validation with confidence = 50% and coverage = 98%

Support | Accuracy | No. Rules | No. terms/rule
0.5% | 82.64 ± 11.32 | 49.70 | 1.92
1% | 86.75 ± 10.89 | 43.30 | 1.82
2% | 80.17 ± 8.76 | 38.40 | 1.68
3% | 72.83 ± 13.52 | 22.60 | 1.84
4% | 68.83 ± 16.40 | 18.60 | 1.88
5% | 66.88 ± 18.60 | 16.10 | 1.97
6% | 63.45 ± 15.89 | 13.18 | 1.94

7.5 Time Complexity of ACO-AC

The time complexity of ACO-AC is calculated in this section.

7.5.1 Computational Complexity of a Single Iteration of the Main FOR Loop

In each iteration of the FOR loop, the algorithm builds rules for a specific class. In each iteration the heuristic values ηij of all items are initialized. This step takes O(v²) time, where v is the number of possible items present in the training set, because we compute heuristic values on the links from an item to all other items. Next, the pheromone values τij of all items are initialized. We use a pheromone matrix for storing the pheromone values on the links between each pair of items, so this step also takes O(v²) time. The main feature of the FOR loop is the WHILE loop, whose computational complexity is given below.


7.5.2 Computational Complexity of a Single Iteration of the Main WHILE Loop

The computational complexity of a single iteration of the main WHILE loop is calculated by considering all major steps within this loop. These are given below.

7.5.2.1 Computational complexity of the inner WHILE loop

In the inner WHILE loop the algorithm first constructs rules containing only a single condition, then two conditions, and so on up to k conditions, where k is the number of attributes present in the training set excluding the class attribute.

7.5.2.1.1 Rule construction

In each iteration of this loop we run t ants. In order to construct a rule, an ant can add a maximum of k items to its rule antecedent. This process is repeated k times, hence the rule construction process for a single ant takes O(k²) time. For t ants it takes O(t·k²) time.

7.5.2.1.2 Calculating support and confidence

When all ants have constructed their rules, the support and confidence of each rule is calculated. Measuring the support and confidence of a rule with a maximum of k conditions over a training set of size n takes O(n·k) time. This process is repeated k times, hence this step takes O(n·k²) time for a single ant. For t ants the time complexity is O(t·n·k²).

7.5.2.1.3 Pheromone updating

The amount of pheromone of each item occurring in those rules which are not inserted in the discovered rule list, which are not covered by the discovered rules, and which satisfy the minimum support threshold is updated. The pheromone values of all these items are updated by first evaporating the previous pheromone values and then adding a percentage of pheromone dependent on the quality (confidence) of the rule. Next, the pheromone values are normalized by dividing each raw value by the summation of the pheromone values of all its competing terms. This step takes O(k·v) time, because a rule can have at most k conditions and each condition requires normalizing over all its competing terms, which takes O(v) time. This process is repeated k times, therefore it takes O(v·k²) time for a single ant. For t ants the time complexity is O(t·v·k²).

Adding the computational complexities of the inner WHILE loop we have: O(t·k²) + O(t·n·k²) + O(t·v·k²) = O(t·n·k²)

7.5.3 Computational Complexity of the Entire Algorithm

Each run of the FOR loop constructs rules for a single class. The computational complexity of the entire algorithm is obtained by multiplying the complexity of a single run of each loop by c, where c is the number of classes present in the dataset, and adding the complexities of the initialization steps: O(c·t·n·k²) + O(c·v²) + O(c·v²). Hence the worst-case computational complexity of the associative classification algorithm ACO-AC is: Computational complexity (ACO-AC) = O(c·t·n·k²)

Where c is the number of classes, t is the number of ants used, n is the size of the training set and k is the number of attributes present in the dataset excluding class attribute.
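Collecting these terms, and noting that the final simplification assumes the generation cost dominates the initialization cost (as in the inner-loop simplification above), the overall cost can be written as:

\[
\underbrace{O(c\,v^2) + O(c\,v^2)}_{\text{heuristic and pheromone initialization}}
\;+\; \underbrace{O(c\,t\,n\,k^2)}_{\text{rule construction, evaluation and pheromone update}}
\;=\; O(c\,t\,n\,k^2)
\]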

7.6 Summary

In this chapter, an ACO based associative classification algorithm was proposed, which combines two primary data mining paradigms: classification and association rule mining. It is a supervised learning approach for discovering association rules. ACO is used to find the most suitable set of association rules. ACO searches only a subset of association rules to form an accurate classifier instead of exhaustively searching all possible association rules in the dataset. The proposed approach avoids exhaustive search in the rule discovery process by using its evolutionary capability, and it has the ability to deal efficiently with complex search spaces. The set of discovered rules is evaluated after each generation of the algorithm, and better rules are generated in subsequent generations by adjusting the pheromone values. The main challenge of any association rule mining technique is its computational complexity. Our proposed approach does not mine all association rules, and it can also discover association rules in parallel for each class, because there is no dependency between the association rules of different classes. These two factors of non-exhaustive search and parallelism (if used) make the approach well suited for high-dimensional databases. We compare our ACO-AC approach with eight other popular classification techniques on a large number of datasets. The experimental results indicate that the proposed ACO-AC approach performs better than the state-of-the-art classification approaches. The next chapter discusses the proposed feature selection algorithm using ACO and presents a number of experiments demonstrating the worth of the proposed approach.


8 Chapter 8: Feature Selection Based on Ant Colony Optimization

Feature subset selection (FSS) is the technique of selecting a subset of relevant features for building robust learning models. It is commonly used in machine learning. FSS provides a better understanding of the data by selecting the important features within it. FSS can be implemented as an exhaustive evaluation of all possible feature subsets; however, except for datasets with a very small number of features, this is computationally expensive and hence infeasible. In this chapter, an FSS technique is proposed which combines Ant Colony Optimization (ACO) with the ID3 decision tree builder. It is a wrapper based FSS approach, in which an evaluation function calculates the fitness of a particular subset. We use an ID3 decision tree to build rules from a given feature subset, and the accuracy of those rules is taken as the fitness of the subset. An ACO search environment is set up for a given dataset and each ant probabilistically selects features on the basis of the pheromone and heuristic values associated with each link. When an ant completes its tour, the fitness of the subset of features it selected is evaluated by using the ID3 algorithm to construct a rule set based only on the features in the subset and then measuring the accuracy of that rule set, which is considered the fitness of the solution found by the ant. For evaluating the accuracy of the rule set we use ten-fold cross validation. The experimental results show that we obtain better accuracy rates when we use the subset of features selected by the proposed approach instead of the full feature set. The number of rules is also decreased substantially, and the time needed to train a classifier is reduced. We also compare the proposed approach with a naïve Bayes feature subset selection method, and the experimental results indicate that the proposed approach selects more valuable features that increase the predictive accuracy when compared with naïve Bayes.


8.1 Introduction

Feature subset selection is the process of finding a subset of features, drawn from a much larger set, that represents the full dataset [93]. There may be thousands of features present in a real world dataset, each carrying only a little information, and it would be very difficult to treat all of them. Therefore it is very important to extract or select the important features from the dataset [93]. There are many benefits of feature subset selection (FSS). It facilitates data visualization and provides better data understanding. It also reduces the complexity of the training data, which leads to reduced training times for the learning algorithm. Another very important role of FSS is to reduce the curse of dimensionality and improve prediction performance. This is achieved by removing the irrelevant features so that the above mentioned advantages can be obtained. Another benefit of FSS is that the learned model is more comprehensible, because it facilitates the discovery of rules with fewer terms. FSS algorithms can be grouped into two main classes on the basis of their evaluation criteria. If an algorithm accomplishes FSS independently of the learning algorithm, it falls in the category of the filter approach [93]. The second approach is known as the wrapper approach, in which the evaluation criterion is tied to the learning algorithm. The main procedure of the wrapper approach relies on the fitness function of the learning algorithm [94-100]. The fitness function returns the accuracy of a particular subset, which is used as the quality of that subset. In the end, we have a subset of features whose accuracy is better than that of the rest of the subsets produced. The proposed ACO-FSS approach, presented in this chapter, is a wrapper based approach. The two categories mentioned above are further divided into five sub-categories: forward selection, backward elimination, a combination of forward and backward selection, random choice, and the sample based method [93]. In forward selection we iteratively add features until a stopping criterion is met. In backward elimination we iteratively remove features. In the third method, which combines forward and backward selection, features can both be added to and removed from the subset. Feature subset selection may start with no features at all (an empty set), with all features, or with some random feature subset. The initial feature subset is usually selected with the help of some heuristic function. Different methods have been proposed for FSS. A method for FSS using ACO has been proposed in [102]. This is a hybrid approach that gives importance not only to the overall performance of the features but also to their local importance. The technique combines a wrapper approach for the overall performance of the features with a filter approach for calculating the local importance of a feature. In the first run the ants randomly choose feature subsets of a certain number of features, and only the best among them are allowed to influence the next iteration. In the next iteration the features are selected using the filter approach; the features which maximize the selection measure are selected. The filter approach is used to provide heuristic information about a feature, so that better quality features can be selected in the following iterations. The authors apply the technique to speech segments to get the best feature set from them. FSS using a genetic algorithm is presented in [103]. It treats FSS as a multi-criteria optimization problem. This approach is also helpful for feature subset selection in the automatic design of neural networks, where the neural network is used for pattern classification and knowledge discovery and its learning function classifies the given input into a finite set of classes. Decision trees or neural networks can be used as the learning algorithm. This technique is also a wrapper based multi-criteria technique for feature subset selection. The genetic algorithm is used in combination with a comparably fast inter-arrangement distance based learning algorithm for the neural network. The standard genetic algorithm is used in the implementation of the technique with a rank based selection strategy, and the population consists of individuals representing candidate solutions of the feature subset selection problem. The authors of [104] also present a wrapper based approach for FSS using GA. Yet another reference is [105], in which a method for dimensionality reduction with the help of a genetic algorithm is presented. An approach for FSS based on Bayesian networks is presented in [106]. It builds the Markov blanket of the class variable, which is one of the important local structures in a Bayesian network. FSS is done based on a local dependency analysis method that finds relationships between variables.


In [107] a wrapper based approach for FSS using fuzzy logic is presented. The authors of [108] also present a wrapper based technique in which particle swarm optimization is used to build better fuzzy rules (which are used as classifiers). The authors of [109] present a method for FSS using fuzzy classification. Fuzzy logic allows the use of overlapping class definitions and improves the interpretability of the results by supplying more precision in the classifier architecture and decision making process; this helps in improving classification [110]. E. Xing, M. I. Jordan, and S. Russell proposed a feature selection algorithm for high-dimensional genomic microarray data [111]. We have proposed a hybrid FSS algorithm using ACO and the decision tree builder ID3. It is a wrapper based FSS approach, in which each ant incrementally constructs a candidate solution, which is a selection of some features from the whole feature set present in the dataset. These features are selected on the basis of the pheromone and heuristic values associated with each feature. We use information gain as the heuristic function. ID3 builds a decision tree from the selected features and then a ten-fold cross validation procedure is followed to check the worth of these features. Pheromone values are updated according to the fitness returned by the evaluation of the learned classifier. At the end we extract the best found feature subset from the data.

8.2 Decision Trees

A decision tree is a tree-like structure of decisions and their possible consequences. Decision trees are commonly used in operational research, specifically in decision analysis. A decision tree is represented by a set of nodes. Each internal node tests an attribute, each branch corresponds to a value of that attribute, and each leaf node assigns a class label. For constructing a decision tree, an entropy measure can be used to find the attribute that best splits the training samples. It is a measure borrowed from information theory and used in the ID3 algorithm and many other algorithms designed for decision tree construction. Informally, the entropy of a set of data samples can be considered as its disorder or the non-homogeneity of its class labels. Entropy relates to information: the higher the entropy of the data, the more information is required to describe it. When we build a decision tree, our aim is to decrease the entropy of the dataset by attribute based decisions until we reach leaf nodes, at which point the subset of samples left has ideally zero entropy and contains samples of one class only. The entropy of a dataset S is measured, with respect to one attribute, in this case the target attribute, by the following equation:

\[
Entropy(S) \;=\; -\sum_{i=1}^{c} p_i \log_2 p_i
\tag{8.1}
\]

    Create a root node for the tree.
    IF all examples belong to the same class THEN
        Return the single-node tree Root, with that class label.
    END IF
    IF the set of predicting attributes is empty THEN
        Return the single-node tree Root, with label = the most common value of the target attribute in the examples.
    ELSE
        A = the attribute that best classifies the examples.
        Decision tree attribute for Root = A.
        FOR EACH possible value vi of A
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A.
            IF Examples(vi) is empty THEN
                Below this new branch add a leaf node with label = the most common target value in the examples.
            ELSE
                Below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes − {A}).
            END IF
        END FOR
    END IF
    Return Root.

Figure 8-1 ID3 Algorithm

Where pi is the proportion of samples in the dataset that take the ith value of the target attribute. This probability measure gives an indication of how uncertain we are about the data, and we use a log2 measure because it represents how many bits we would need in order to specify the class of a random sample. The ID3 algorithm is given in Figure 8-1.

8.3 Proposed Technique

This section describes our proposed ACO based FSS technique.

8.3.1 General Description

ACO is used for designing algorithms for complex combinatorial optimization problems [112-117]. The main idea of the proposed approach is to provide a fully connected N×N graph, where N is the total number of attributes (features) present in the dataset. The graph behaves as a search space for the ants to move in, where nodes are the features and links represent the connections between features of the particular dataset. Each ant constructs a candidate solution in this search space by traversing a path of nodes and links; this path is in effect a subset of the features. After an ant has completed its tour, the fitness of the traversed path (the selected features) is calculated by running the ID3 algorithm on the selected features and then checking the accuracy of the learned model. We perform a ten-fold cross validation procedure for checking the accuracy of the classifier. The average accuracy over the ten folds is the fitness of that particular feature subset and is used to update the pheromone values. This process continues until a stopping criterion is met. After termination of the algorithm, the feature set with the best accuracy is returned as the solution. In our proposed approach the basic ACO algorithm is used, which is given in Figure 8-2.

    Initialize parameters and the search space.
    WHILE (not terminated)
        Generate solutions.
        Calculate the fitness of the generated solutions.
        Evaporate and update the pheromone values.
    END WHILE

Figure 8-2 ACO Algorithm


The main factors involved in the ACO are the setting up of the search space, the initialization of pheromone values, the generation of solutions, the fitness evaluation of the generated solutions, pheromone evaporation and pheromone updating. All these steps of the proposed approach are discussed in detail in the next subsections.

8.3.2 Search Space for ACO in Proposed Algorithm

The FSS problem, like any other problem, needs a corresponding ACO search space. Defining the search space is one of the most important factors for getting good results from the algorithm, since the algorithm is heavily dependent on the search space provided. Our search space is built from the given dataset and is an N×N graph, where N is the total number of attributes in the dataset excluding the target attribute. The nodes represent the features and the connections between the nodes, i.e. the edges, when traversed by an ant, denote the choice of the next node, i.e. the next feature. The search space also contains a Start node and a Sink node. The Sink node is used to terminate the search and is connected to every node of the graph. When an ant selects the Sink node on its path, it stops adding further nodes and its path is considered complete. Figure 8-3 shows the search space of our proposed approach.

Figure 8-3 NxN search space for ant traversal


In Figure 8-3, A1, A2, A3, etc. are the names of the attributes present in the dataset. There are as many layers as there are attributes and each layer contains all the attributes. Each attribute is connected to all other attributes in the next layer. Once an ant selects an attribute, it cannot select that attribute again in its current path. This constraint is applied to avoid adding duplicate attributes to the set, so the selected subset will always comprise distinct attributes. If an ant selects the Sink node, the tour of that particular ant is terminated. Another termination criterion is reached when all the attributes have been selected by an ant and there are no further attributes to select; the ant then necessarily goes to the Sink node. The links between the nodes have pheromone values associated with them. When an ant reaches an attribute it stores the name of that attribute in its current partial path, and the visited attributes of the ant constitute a feature subset. The ACO algorithm, when used with this search space, is capable of producing feature subsets of variable size and arbitrary attribute order.

8.3.3 Initialization of Pheromone Values

Pheromone values on the edges are a basic component of ACO. They are usually initialized with some small value. In our experiments, the pheromone values on all edges are initialized at the start of the algorithm with the same amount of pheromone, so that no attribute is preferred over the others by the first ant. The initial pheromone is calculated according to Equation (8.2):

\[
\tau_{ij}(t=1) \;=\; \frac{1}{N}
\tag{8.2}
\]

Where N is the total number of attributes present in the dataset excluding the class attribute.

8.3.4 Generation of a Candidate Solution of Subsets

In this section we present the generation of subsets from the search space. Figure 8-4 is a pictorial representation of the generation of a candidate subset.


Figure 8-4 Selection of subset of features by an ant

Figure 8-4 shows an ant which starts its journey from the initial point. Suppose that it first goes to attribute A1, then visits A2 and A3, and then selects the sink node. The tour is terminated as soon as the sink node is selected, so the found subset contains A1, A2, and A3. In Figure 8-4 the black arrow shows the path traversed by the ant.

8.3.5 Selection of an Attribute

An ant uses two components to calculate the probability of moving from the present node to the next node. The first component is the amount of pheromone present on the edge between node i and node j, and the second is the heuristic value that describes the worth of a node. The probability with which an ant chooses node j as the next node, after it has arrived at node i, is shown in Equation (8.3). Node j has to be in the set S of nodes that have not yet been visited.

\[
P_{ij} \;=\; \frac{[\tau_{ij}]^{\alpha}\cdot[\eta_{ij}]^{\beta}}{\sum_{k\,\in\,S}[\tau_{ik}]^{\alpha}\cdot[\eta_{ik}]^{\beta}}
\tag{8.3}
\]

Where τij is the pheromone value associated with the edge between node i and node j, and ηij is the value of the heuristic function. The parameters α and β control the relative influence of the pheromone and heuristic values respectively.
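As an illustration, this transition rule can be implemented as a roulette-wheel selection over the not-yet-visited nodes (illustrative Python, not the original C# code; tau is the pheromone matrix and eta holds the per-attribute heuristic values, i.e. the information gains of Section 8.3.6):

    import random

    def choose_next(current, unvisited, tau, eta, alpha=1.0, beta=1.0):
        """Draw the next node j with probability proportional to
        tau[current][j]**alpha * eta[j]**beta, as in Equation (8.3)."""
        weights = [(tau[current][j] ** alpha) * (eta[j] ** beta) for j in unvisited]
        r = random.random() * sum(weights)
        acc = 0.0
        for j, w in zip(unvisited, weights):
            acc += w
            if r <= acc:
                return j
        return unvisited[-1]     # numerical fallback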


8.3.6 Heuristic Function

The heuristic function indicates the quality of an attribute. Its value greatly influences an ant's decision to move from one node to another, and a good heuristic function is very helpful when solving problems with ACO. We have used the information gain of each attribute as the heuristic function. We calculate the information gain for each attribute in the dataset. When an ant has to make a decision about the next node, the corresponding attribute's information gain is used as the heuristic value in Equation (8.3) for the probability calculation. The algorithm for calculating information gain is shown in Figure 8-5.

    InformationGain(Samples S, Attributes A, Target-Attribute T)
        Calculate the entropy of T using Equation (8.1).
        FOR each attribute A
            Calculate the information gain of A using Equation (8.4).
            Store the corresponding information gain for A.
        END FOR
    END

Figure 8-5 Information gain calculation algorithm

In the above algorithm, the entropy of the target attribute is calculated first. Then the average conditional entropy is calculated for each attribute and subtracted from the target attribute's entropy to obtain the information gain. The entropy of the target attribute is calculated using Equation (8.1), and the information gain of every attribute is calculated using Equation (8.4):

\[
Gain(S,A) \;=\; Entropy(S) \;-\; \sum_{v\,\in\,V}\frac{|S_v|}{|S|}\,Entropy(S_v)
\tag{8.4}
\]

Where V is the set of all possible values of attribute A and |Sv| is the subset of samples of S where A takes the value v, and |S| is the number of samples.
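A minimal sketch of Equations (8.1) and (8.4), assuming samples are dictionaries mapping attribute names to values and labels holds the corresponding classes (illustrative Python; the thesis implementation was in C#), is:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy of a collection of class labels, Equation (8.1)."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(samples, labels, attribute):
        """Information gain of one attribute, Equation (8.4)."""
        n = len(labels)
        gain = entropy(labels)
        for v in set(s[attribute] for s in samples):
            subset = [lab for s, lab in zip(samples, labels) if s[attribute] == v]
            gain -= (len(subset) / n) * entropy(subset)
        return gain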

8.3.7 Fitness Function

The fitness function helps us find out the worth of a specific set of features. We use an ID3 decision tree to build a model from the selected feature subset and then evaluate the learned model. When an ant generates a particular subset, we retain only those attributes from the complete dataset and run the ID3 algorithm. As a result, we obtain a classifier in the form of a decision tree, which is then evaluated. We perform this procedure ten times using ten-fold cross validation. In ten-fold cross validation the dataset is randomly divided into ten equally sized, mutually exclusive subsets. Each of the subsets is used once for testing and the remaining nine are used for training. The results of these ten runs are then averaged and this average is used as the fitness of the feature subset. The un-averaged fitness of a particular feature subset is measured by Equation (8.5):

\[
fitness \;=\; \sigma / N
\tag{8.5}
\]

Where σ is the number of examples correctly classified by the classifier and N is the total number of test cases. This fitness is calculated for each fold and then averaged.

8.3.8 Pheromone Updating

The pheromone values are updated after each ant completes its tour so that future ants can make use of this information in their search. The amount of pheromone on each link occurring in the current feature subset selected by an ant is updated according to Equation (8.6):

\[
\tau_{ij}(t+1) \;=\; (1-\rho)\cdot\tau_{ij}(t) \;+\; \Big(1-\frac{1}{1+fitness}\Big)\cdot\tau_{ij}(t)
\tag{8.6}
\]

Where τij(t) is the pheromone value between node i and node j in the current iteration, ρ represents the pheromone evaporation rate and fitness is the quality of the current path constructed by the ant. The pheromone update is done by reducing the old pheromone value and then increasing it in proportion to the quality of the selected features. If these features are good, they become more attractive to future ants and are more likely to be chosen. The pheromone values of other paths are updated by normalization.

8.3.9 Proposed Algorithm for FSS Using ACO

The proposed algorithm for FSS is given in Figure 8-6. The algorithm starts by loading the dataset, after which the initialization phase begins. In the initialization phase the dataset is divided into ten equal parts for ten-fold cross validation. After initialization of the folds the search space is constructed. The search space is an N×N graph where N is the number of attributes in the dataset excluding the target attribute. The pheromone values are initialized on all edges and a population of ants is created. In each iteration of the algorithm one ant completes its tour. When an ant starts its journey it moves to the next node according to Equation (8.3). When an ant completes its tour it has a subset of distinct features. This feature subset is then used to construct a decision tree classifier.

    Load the dataset.
    Calculate the information gain (heuristic value) of each attribute.
    Generate a population of ants.
    Initialize the parameters of ACO.
    FOR each ant
        Generate a feature subset S.
        Evaluate the feature subset S.
        IF its fitness (accuracy) is better than the previous global best THEN
            Set the current subset S and its accuracy as the global best.
        END IF
        Update the pheromone values.
        Repeat this process until the stopping criteria are met.
    END FOR
    Report the best feature subset as the final, most appropriate set.
    END

Figure 8-6 Proposed feature subset selection algorithm

When a classifier is constructed, its fitness is evaluated by testing it on the remaining fold. Then a classifier is constructed again using nine folds, but this time the nine folds include the fold that was used for testing in the previous step, and testing is done on one of the remaining folds that has not yet been used for testing. In this way we build a classifier ten times, and each time a different fold is used for testing. Then the average accuracy is calculated over all folds and the pheromone values are updated for the path.


Figure 8-7 Flow chart of proposed FSS approach

If the average accuracy of the current subset is higher than that of the previously found best subset, then this subset becomes the global best subset and its accuracy becomes the global best accuracy. In the next iteration the same procedure is applied for the next ant. This process continues until a stopping criterion is met. There are two different stopping criteria. The first criterion is the completion of a user-defined number of iterations and the second criterion is that the ants converge to a particular path. If ten consecutive ants return the same set of features, we consider that the ants have converged to a path and we stop the algorithm. Finally the global best subset is returned as the final feature subset.
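The wrapper loop of Figure 8-6 can then be sketched as follows; the ACO pieces described in Sections 8.3.2-8.3.8 are passed in as callables, so only the control flow, including the ten-identical-tours convergence test, is fixed here (illustrative Python; the original implementation was in C#):

    def aco_fss(n_attributes, init_tau, build_tour, evaluate, update_pheromone,
                n_ants=1000, convergence=10):
        """Return the best feature subset found, together with its fitness."""
        tau = init_tau(n_attributes)                      # Equation (8.2)
        best_subset, best_fitness = None, float("-inf")
        prev, repeats = None, 0
        for _ in range(n_ants):
            subset = build_tour(tau)                      # Equation (8.3), ends at the sink node
            fitness = evaluate(subset)                    # ID3 + ten-fold CV, Equation (8.5)
            tau = update_pheromone(tau, subset, fitness)  # Equation (8.6) plus normalization
            if fitness > best_fitness:
                best_subset, best_fitness = subset, fitness
            repeats = repeats + 1 if subset == prev else 1
            prev = subset
            if repeats >= convergence:                    # ten consecutive identical subsets
                break
        return best_subset, best_fitness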

8.4 Experimentation and Analysis

We have implemented the proposed algorithm in Microsoft Visual Studio 2005 using C# as the programming language. The values of the user-defined parameters are given in Table 8.1. They include the number of ants used, the evaporation rate, the convergence threshold and the values of the alpha and beta parameters, which indicate the relative importance of the pheromone and heuristic values.

Table 8.1 Parameters used in our experiments

Parameters | Values
No of ants | 1000
Evaporation rate | 0.15
Convergence threshold | 10
Alpha | 1
Beta | 1

For experimentation we use selected datasets from the UCI machine learning repository [77]. We have tried to select datasets with diverse characteristics: some of them have binary classes and others are multi-class, some have few attributes while others have relatively many, and some have few samples while others have more. Continuous attributes are discretized in a preprocessing step. The details of the datasets are given in Table 8.2, where the first column gives the name of the dataset, the second column the number of samples, the third column the number of attributes, and the last column the number of classes present in the dataset. Table 8.3 shows the experimental results with and without feature subset selection. The results are compared on the basis of average predictive accuracy (10-fold cross validation is used), number of rules and number of terms per rule. The best performance is shown in bold. These experimental results indicate that the proposed feature subset selection method selects relevant features from the datasets, causing an increase in the accuracy rate and a significant decrease in the number of rules. The performance improvement has been observed on almost all the datasets. Table 8.4 shows the total number of features and the number of reduced features after performing the feature subset selection algorithm.

Table 8.2 Datasets used in experiments

Datasets | No. of samples | No. of attributes | No. of classes
Zoo | 101 | 18 | 7
Wine | 178 | 13 | 3
Car | 1728 | 6 | 4
Dermatology | 366 | 34 | 6
Ecoli | 336 | 8 | 8
German | 1000 | 20 | 2
Glass | 214 | 10 | 7
Haberman | 307 | 3 | 2
Hayes_Roth | 132 | 5 | 3
Heart | 270 | 13 | 2
Hepatitis | 155 | 19 | 2
House_Votes 84 | 435 | 16 | 2
Iris | 150 | 4 | 3
Mamographic_Masses | 961 | 5 | 2
Parkinson-Table | 195 | 22 | 2
Pima-Indians-Diabetes | 768 | 8 | 2
Segmentation | 210 | 19 | 7
SPECT_Heart | 267 | 22 | 2
TAE | 151 | 5 | 3
Tic_tac_toe | 958 | 9 | 2
Transfusion | 748 | 4 | 2
Vehicles | 282 | 18 | 4
Agaricus-Lepiota | 8124 | 22 | 2
ANN-Train | 3772 | 21 | 3
Chess-Kr-Vs-Kp | 3196 | 36 | 2
Credit | 690 | 15 | 2
Flare_Data | 323 | 10 | 9
Liver_Disease | 345 | 6 | 2
Sick-Euthyroid | 3163 | 25 | 2
Soybean | 307 | 35 | 15
WDBC | 569 | 32 | 2
WPBC | 198 | 34 | 2


Table 8.3 Average accuracies, number of rules, number of terms/rule without and after feature subset selection

Datasets | All features: Accuracy | # Rules | # Terms/rule | After FSS (ACO): Accuracy | # Rules | # Terms/rule
Zoo | 59.5 ± 7.07 | 60.0 | 2.0 | 98.00 ± 4.20 | 12.0 | 2.53
Wine | 95.29 ± 1.82 | 16.0 | 2.34 | 95.88 ± 3.97 | 16.0 | 2.34
Car | 94.53 ± 0.82 | 160.0 | 2.38 | 95.11 ± 1.75 | 138.0 | 2.39
Dermatology | 94.17 ± 7.8 | 30.0 | 2.22 | 94.72 ± 5.3 | 23.0 | 2.24
Ecoli | 53.63 ± 6.42 | 195.0 | 2.0 | 60.30 ± 9.19 | 13.0 | 2.37
German | 74.54 ± 2.14 | 25.0 | 2.24 | 76.66 ± 3.17 | 19.0 | 2.33
Glass | 54.76 ± 3.37 | 9.0 | 2.0 | 58.73 ± 11.66 | 13.0 | 2.07
Haberman | 41.0 ± 4.71 | 11.0 | 2.0 | 43.67 ± 9.61 | 20.0 | 2.50
Hayes_Roth | 69.23 ± 10.87 | 19.0 | 2.14 | 71.538 ± 7.29 | 15.0 | 2.24
Heart | 74.81 ± 2.61 | 14.0 | 2.16 | 82.59 ± 7.20 | 30.0 | 2.34
Hepatitis | 84 ± 4.71 | 14.0 | 2.16 | 92.00 ± 8.19 | 11.0 | 2.24
House_Votes 84 | 96.04 ± 2.10 | 13.0 | 2.22 | 97.21 ± 2.64 | 30.0 | 2.93
Iris | 94.66 ± 6.88 | 10.0 | 2.0 | 94.66 ± 6.88 | 10.0 | 2.0
Mamographic_Masses | 80.21 ± 4.41 | 15.0 | 2.1 | 80.42 ± 3.43 | 25.0 | 2.32
Parkinson-Table | 54.73 ± 14.88 | 110.0 | 2.0 | 84.21 ± 3.50 | 16.0 | 2.0
Pima-Indians-Diabetes | 72.10 ± 0.93 | 22.0 | 2.1 | 72.5 ± 3.42 | 10.0 | 2.04
Segmentation | 63.80 ± 5.67 | 137.0 | 2.00 | 64.02 ± 9.20 | 121.0 | 2.0
SPECT_Heart | 90.00 ± 5.43 | 39.0 | 2.95 | 92.67 ± 4.94 | 36.0 | 2.94
TAE | 46.66 ± 4.71 | 27.0 | 2.11 | 57.33 ± 14.12 | 16.0 | 2.43
Tic_tac_toe | 85.47 ± 2.23 | 160.0 | 2.48 | 88.63 ± 3.67 | 125.0 | 2.49
Transfusion | 74.86 ± 4.77 | 16.0 | 2.06 | 76.62 ± 7.20 | 16.0 | 2.06
Vehicles | 65.35 ± 12.82 | 143.0 | 2.0 | 65.35 ± 12.82 | 136.0 | 2.0
Agaricus-Lepiota | 98.53 ± 0.26 | 9.0 | 2.0 | 98.53 ± 0.26 | 15.0 | 1.06
ANN-Train | 94.13 ± 0.93 | 45.0 | 2.14 | 94.48 ± 1.02 | 28.0 | 1.09
Chess-Kr-Vs-Kp | 98.43 ± 0.64 | 42.0 | 1.95 | 99.65 ± 0.22 | 81.0 | 2.90
Credit | 85.94 ± 2.74 | 31.0 | 2.12 | 85.94 ± 4.26 | 17.0 | 1.16
Flare_Data | 83.39 ± 3.12 | 20.0 | 2.33 | 84.05 ± 3.57 | 11.0 | 1.88
Liver_Disease | 60.0 ± 12.47 | 16.0 | 2.07 | 84.05 ± 3.57 | 12.0 | 1.06
Sick-Euthyroid | 99.96 ± 0.32 | 10.20 | 1.10 | 99.39 ± 0.37 | 10.0 | 1.10
Soybean | 75.86 ± 9.75 | 56.9 | 2.38 | 81.72 ± 6.30 | 33.0 | 1.47
WDBC | 94.28 ± 1.26 | 10.8 | 1.0 | 94.46 ± 2.44 | 10.0 | 1.0
WPBC | 71.05 ± 3.72 | 11.0 | 1.0 | 77.89 ± 9.21 | 10.0 | 1.0

Table 8.4 Number of reduced features after feature subset selection

Datasets | Total Features | Reduced Features
Zoo | 18 | 5
Wine | 13 | 9
Car | 6 | 1
Dermatology | 34 | 4
Ecoli | 8 | 6
German | 20 | 3
Glass | 10 | 7
Haberman | 3 | 2
Hayes_Roth | 5 | 2
Heart | 13 | 7
Hepatitis | 19 | 10
House_Votes 84 | 16 | 2
Iris | 4 | 1
Mamographic_Masses | 5 | 2
Parkinson-Table | 22 | 16
Pima-Indians-Diabetes | 8 | 4
Segmentation | 19 | 10
SPECT_Heart | 22 | 6
TAE | 5 | 3
Tic_tac_toe | 9 | 1
Transfusion | 4 | 3
Vehicles | 18 | 8
Agaricus-Lepiota | 22 | 20
ANN-Train | 21 | 18
Chess-Kr-Vs-Kp | 36 | 4
Credit | 15 | 5
Flare_Data | 10 | 4
Liver_Disease | 6 | 4
Sick-Euthyroid | 25 | 23
Soybean | 35 | 3
WDBC | 32 | 24
WPBC | 34 | 30


Table 8.5 Comparison of predictive accuracies after selection of feature subsets by ACO and naïve Bayes

Datasets | FSS using ACO: Decision Tree Classifier (Accuracies) | FSS using naïve Bayes: Naïve Bayesian Classifier (Accuracies) | FSS using naïve Bayes: Decision Tree Classifier (Accuracies)
Heart_disease | 82.69 ± 7.20 | 82.22 ± 2.39 | 80.84 ± 2.07
Hepatitis | 92.0 ± 8.19 | 86.00 ± 3.10 | 75.50 ± 3.39
Soybean | 81.72 ± 6.30 | 74.19 ± 2.47 | 78.30 ± 2.63
Chess_kr_vs_kp | 98.43 ± 0.64 | 92.97 ± 1.23 | 99.45 ± 0.64
House_Votes 84 | 97.21 ± 2.64 | 96.36 ± 1.52 | 95.27 ± 0.78
Iris | 94.66 ± 6.88 | 97.33 ± 2.10 | 97.33 ± 2.16
Wpbc | 77.89 ± 9.21 | 77.00 ± 3.28 | 79.93 ± 3.31
Wdbc | 94.46 ± 2.44 | 94.21 ± 1.85 | 97.51 ± 1.42
Wine | 95.88 ± 3.97 | 96.67 ± 2.26 | 93.24 ± 2.53
Credit | 85.94 ± 4.26 | 86.38 ± 1.32 | 84.80 ± 1.73
Agricus_lepoita | 99.00 ± 0.33 | 98.90 ± 1.23 | 99.80 ± 0.69
Ann_train | 94.48 ± 1.02 | 92.81 ± 1.23 | 92.81 ± 1.20
Flair_data | 84.05 ± 3.57 | 65.47 ± 2.15 | 69.48 ± 2.27
Tic-tac-toe | 88.63 ± 3.67 | 73.12 ± 2.37 | 84.86 ± 2.25
Sick euthyroid | 99.39 ± 0.37 | 90.76 ± 1.07 | 93.21 ± 1.12
Llever_disease | 65 ± 11.13 | 63.71 ± 2.81 | 64.42 ± 2.62
Glass | 58.73 ± 11.6 | 74.03 ± 2.54 | 79.36 ± 2.77
Car | 95.11 ± 1.75 | 84.50 ± 1.91 | 88.63 ± 1.08

Table 8.5 shows the comparison of the proposed feature subset selection approach with the naïve Bayes feature subset selection approach. Naïve Bayes is used both for feature subset selection and classification (3rd column) and only for feature subset selection followed by a decision tree classifier (4th column). The experimental results indicate that the decision tree classifier achieves higher accuracy rates on most of the datasets when it uses the features selected by our proposed approach.

8.5 Summary

FSS can reduce the curse of dimensionality and thus decrease the computational cost of building models. The performance of a classifier and the time required to train it are sensitive to the features used to build the classifier. ACO is an attractive approach for the FSS problem. This chapter proposed a technique for FSS using ACO. ACO is used to search for the most appropriate features from the complete set of features and a decision tree is used to build the learning models. We compared the accuracy results obtained before and after the selection of features. These results indicate that predictive accuracy is improved after performing FSS. The number of rules also decreases significantly; a smaller number of rules makes a classification system more comprehensible, because the rules are easier for domain experts to understand. We also compared the proposed approach with the naïve Bayes approach for FSS. The comparison is performed on the basis of predictive accuracy after selection of features by both approaches. The experimental results indicate that the proposed approach finds those features in the datasets that improve the predictive accuracy of the learned model. Hence ACO is a powerful approach in the field of data mining, including FSS. The next chapter concludes the thesis and provides directions for future research.


9 Chapter 9: Conclusions & Future Work

9.1 Conclusion

Data is a vital and valuable asset, and data mining is an active area of research. Without applying automatic data mining techniques it is difficult to effectively analyze large amounts of data. Researchers are interested in finding efficient and accurate classification models that achieve high accuracy rates, are comprehensible and can be learnt in reasonable time, even for large databases. The primary goal of this thesis is to develop accurate, comprehensible and efficient classification algorithms based on ACO. Towards this end we have proposed and evaluated two main algorithms, and we have also proposed a method for feature subset selection based on ACO. Our experiments show that the proposed algorithms exhibit better accuracies on most of the datasets when compared with other algorithms commonly used for such purposes. They are thus candidates for more thorough investigation and acceptance for real world and commercial applications. The proposition of these algorithms and the resulting experimentation has also advanced and contributed towards the suite of ACO based applications in particular and swarm intelligence in general.

9.2 Future Work

For future work we recommend the following:

9.2.1 AntMiner-C

• Conceiving and experimenting with a heuristic function that considers the compatibility between all (or at least more than one) of the previously selected terms and the next term to be added; currently only the compatibility between the most recently selected term and the next term is considered (see the sketch after this list).




• There are other variants of ACO that can be tested in order to improve the performance (accuracy, time complexity, etc.) of the classifier.



• The proposed approach can be applied to hierarchical and multi-label classification after suitable modifications.



• Domain knowledge can be incorporated during the learning of the algorithm to ensure the interpretability of the resulting model. For this purpose a new definition of the vertices and edges is needed so that rules that are not compatible with the domain knowledge are penalized. This relates to the knowledge fusion problem, which deals with the cohesion of extracted knowledge with the domain knowledge provided by experts. The rules extracted by the algorithm for a given dataset may not completely adhere to the available domain knowledge; there might be missing, redundant, and unjustifiable (misleading) rules. In [118], AntMiner+ is extended to incorporate hard constraints of the domain knowledge by modifying the search space and soft constraints by influencing the heuristic values. For continuous attributes, the domain knowledge will have to be incorporated in the discretization algorithm. Improving the real-world usability of the algorithm by incorporating domain knowledge is an interesting direction for future study.
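As a rough illustration of the heuristic suggested in the first bullet of this list, the sketch below treats compatibility as the co-occurrence frequency of two attribute-value terms and averages it over all terms already in the partial rule. It is an illustrative definition only, not necessarily the exact compatibility measure used in AntMiner-C.

def compatibility(term_a, term_b, records):
    # Fraction of records satisfying term_a that also satisfy term_b.
    # A term is modelled as an (attribute_index, value) pair; this is an
    # illustrative definition, not the exact measure used in AntMiner-C.
    covered = [r for r in records if r[term_a[0]] == term_a[1]]
    if not covered:
        return 0.0
    both = [r for r in covered if r[term_b[0]] == term_b[1]]
    return len(both) / len(covered)

def heuristic_all_terms(selected_terms, candidate, records):
    # Average compatibility of the candidate with *all* previously selected terms,
    # instead of only the most recently added one.
    if not selected_terms:
        return 1.0
    return sum(compatibility(t, candidate, records) for t in selected_terms) / len(selected_terms)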

9.2.2 ACO-AC

• The various parameters used in the proposed hybrid associative classification ACO-AC algorithm, especially support, confidence, and coverage, can be more thoroughly investigated (the standard support and confidence computations are sketched after this list).



• Currently the ACO-AC approach can only deal with categorical attributes, and continuous attributes need to be discretized in a preprocessing step. A direction for further research is to find a method that deals with continuous attributes directly in the algorithm.



• Work can be done to see if the approach can be extended to find general association rules instead of class association rules.




• Another future direction can be to use the proposed ACO-AC approach for finding feature subsets from data.



• Different rule selection, rule pruning, and pheromone update strategies can also be tested in an effort to further improve the performance of the algorithm.
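For reference, the support and confidence of a class association rule (IF antecedent terms THEN class) mentioned in the first bullet above are usually computed as in the following illustrative sketch; this is the standard associative classification definition, not the ACO-AC implementation itself.

def support_confidence(rule_terms, rule_class, records, class_index=-1):
    # Support and confidence of a class association rule of the form
    # IF all rule_terms THEN rule_class, over a list of records.
    # rule_terms is a list of (attribute_index, value) pairs.
    matching = [r for r in records
                if all(r[a] == v for a, v in rule_terms)]
    correct = [r for r in matching if r[class_index] == rule_class]
    support = len(correct) / len(records) if records else 0.0
    confidence = len(correct) / len(matching) if matching else 0.0
    return support, confidence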

9.2.3 ACO-FSS

• Our feature subset selection technique (ACO-FSS) uses ID3 as the base classifier. The approach can be evaluated by using some other classifiers.



• The search space can also be modified to improve the performance of the algorithm.



• Different heuristic functions can also be tried instead of information gain.

In our current work we do not attempt to optimize the various parameters used in the algorithm, for example, the evaporation rate and the alpha and beta parameters. These parameters can be tuned to further improve the performance of the proposed approach; the sketch below indicates where they enter a standard ACO step.
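In the sketch that follows, alpha and beta weight the pheromone and heuristic values in the term-selection probability, and the evaporation rate rho controls how quickly old pheromone decays before new pheromone is deposited. The default values shown are illustrative, not tuned settings from this thesis.

def transition_probabilities(pheromone, heuristic, allowed, alpha=1.0, beta=1.0):
    # Standard ACO selection rule: P(j) is proportional to tau_j^alpha * eta_j^beta
    # over the terms still allowed; alpha and beta are the parameters discussed above.
    weights = {j: (pheromone[j] ** alpha) * (heuristic[j] ** beta) for j in allowed}
    total = sum(weights.values())
    if total == 0:
        return {j: 1.0 / len(allowed) for j in allowed}
    return {j: w / total for j, w in weights.items()}

def evaporate_and_deposit(pheromone, best_terms, quality, rho=0.15):
    # Evaporate every trail by the rate rho, then deposit an amount proportional
    # to the quality of the best rule on the terms that rule used.
    for j in range(len(pheromone)):
        pheromone[j] *= (1.0 - rho)
    for j in best_terms:
        pheromone[j] += quality
    return pheromone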

References

[1]

J. Han, and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, 2006.

[2]

M.J. Berry, and G. Linoff. Data Mining Techniques for Marketing, Sales, and Customer Support. New York: John Wiley, 1997.

[3]

J. Pesce, “Stanching hospitals’ financial hemorrhage with information technology,” Health Management Technology, Vol. 24, No. 8, pp. 6-12, 2003.

[4]

W. Ceusters, “Medical natural language understanding as a supporting technology for data mining in healthcare,” Chapter 3 in: Cios K.J., eds. Medical Data Mining and Knowledge Discovery, Heidelberg: Springer-Verlag, pp. 32-60, 2000.

[5]

A.C. Tessmer, "What to learn from near misses: an inductive learning approach to credit risk assessment," Decision Sciences, Vol. 28, No. 1, pp. 105-120, 1997.

[6]

A.P. Engelbrecht, Computational Intelligence, an Introduction, 2nd edition. John Wiley & Sons, 2007.

[7]

A.P. Engelbrecht, Fundamentals of Computational Swarm Intelligence. John Wiley & Sons, 2005.

[8]

J. Kennedy, R.C. Eberhart, and Y. Shi, Swarm Intelligence. Morgan Kaufmann/ Academic Press, 2001.

[9]

M. Dorigo, and T. Stützle, Ant Colony Optimization. Cambridge, MA: MIT Press, 2004.

[10]

M. Dorigo, V. Maniezzo, and A. Colorni, “Ant System: Optimization by a colony of cooperating Agents,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol. 26, No. 1, Feb. 1996.

[11]

M. Dorigo and L.M. Gambardella, “Ant colony system: a cooperative Learning approach to the travelling salesman problem,” IEEE Transactions on Evolutionary Computation, Vol. 1, No. 1, April 1997.

[12]

Y. Yaginuma, “High-performance data mining system,” Fujitsu Scientific and Technical Journal, Special Issue: Information Technologies in the Internet Era, Vol. 36, No. 2, pp.201-210, 2000.


[13]

A. Abraham, C. Grosan, and V. Ramos, “Swarm Intelligence in Data Mining,” Studies in Computational Intelligence, Vol. 34, pp. 1-20, Springer 2006.

[14]

C.T. Hardin, and J.S. Usher, “Facility layout using swarm intelligence,” in Proceedings of IEEE Swarm Intelligence Symposium, pp. 424-427, June 2005.

[15]

S. Lorpunmanee, M.N. Sap, A.H. Abdullah, and C. Chompoo-inwai, “An ant colony optimization for dynamic job scheduling in grid environment,” World Academy of Science, Engineering and Technology, pp. 314-321, 2007.

[16]

B. Chakraborty, “Feature subset selection by particle swarm optimization with fuzzy fitness function,” in 3rd International Conference on Intelligent System and Knowledge Engineering, ISKE, pp. 1038-1042, 2008.

[17]

K. Mong Si, and W. Hong Sun, “Multiple ant-colony optimization for network routing,” in First International Symposium on Cyber Worlds Proceedings, pp. 277-281, 2002.

[18]

X. Tan, X. Luo Chen, and W.N. Jun Zhang, “Ant colony system for optimizing vehicle routing problem with time windows,” in International Conference on Computational Intelligence for Modeling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, pp. 209-214, 2005.

[19]

E. Salari, and K. Eshghi “An ACO algorithm for graph coloring problem,” in Congress on Computational Intelligence Methods and Applications, pp. 659-666, 2005.

[20]

M. Lee, S. Kim, W. Cho, S. Park, and J. Lim, “Segmentation of brain MR images using an ant colony optimization algorithm,” in Ninth IEEE International Conference on Bioinformatics and Bioengineering, pp. 366-369, 2009.

[21]

C.J. Lin, C. Chen, and C. Lee, “Classification and medical diagnosis using wavelet-based fuzzy neural networks,” International Journal of Innovative Computing, Information and Control (IJICIC), Vol.4, No.3, pp 735-748, March 2008.

[22]

R.S. Parpinelli, H.S. Lopes, and A.A. Freitas, “An ant colony based system for data mining: applications to medical data,” in Proceedings of Genetic and Evolutionary Computation Conference (GECCO-2001), Morgan Kaufmann, San Francisco, California, pp. 791–798, 2001.

[23]

R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification. John Wiley & Sons, 2000.

[24]

I.H. Witten, and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, 2005.

[25]

J.R. Quinlan, “Generating production rules from decision trees,” in Proceedings of International Joint Conference of Artificial Intelligence, pp. 304-307, San Francisco, USA, 1987.

[26]

M. Omran: Particle Swarm optimization methods for pattern recognition and image processing, Ph.D. Thesis, University of Pretoria, 2005.

[27]

M. Omran, A. Salman, and A.P. Engelbrecht “Image classification using particle swarm optimization,” in Proceedings of the 4th Asia-Pacific Conference on Simulated Evolution and Learning, Singapore, pp. 370-374, 2002.

[28]

J.L. Deneubourg, S. Goss, N. Franks, A.S. Franks, C. Detrain, and L. Chretien, “The dynamics of collective sorting: robot-like ants and ant-like robots,” in Proceedings of the First International Conference on Simulation of Adaptive Behaviour: From Animals to Animats, Cambridge, MA: MIT Press, Vol. 1, pp. 356-365, 1991.

[29]

J. Valdes, “Building virtual reality spaces for visual data mining with hybrid evolutionary-classical optimization: application to microarray gene expression data,” in Proceedings of the IASTED International Joint Conference on Artificial Intelligence and Soft Computing, pp. 713-720, 2004.

[30]

S.S. Weng, and Y.H. Liu, “Mining time series data for segmentation by using ant colony optimization,” European Journal of Operational Research, Vol. 173, No. 3, pp. 921-937, 2006.

[31]

P.S. Shelokar, V.K Jayaraman, and B.D. Kulkarni, “An ant colony classifier system: application to some process engineering problems,” Computers & Chemical Engineering, Vol. 28. No. 9, pp. 1577-1584, 2004.


[32]

A. Abraham, and V. Ramos, “Web usage mining using artificial ant colony clustering and genetic programming,”, in IEEE Congress on Evolutionary Computation (CEC2003), Australia, IEEE Press, pp. 1384-1391, 2004.

[33]

Y. Wang, P. Chang, and C. Fan, “Database classification by integrating a casebased reasoning and support vector machine for induction,” Journal of Circuits, Systems, and Computers (JCSC), Vol. 19, No. 1, pp.31-44, Feb. 2010.

[34]

B. Liu, H.A. Abbass, and B. McKay, “Classification rule discovery with ant colony optimization,” in Proceedings of IEEE/WIC International Conference of Intelligent Agent Technology, pp. 83–88, 2003.

[35]

P. Eklund, and A. Hoang, “A performance survey of public domain machine learning algorithms,” Technical Report, School of Information Technology, Griffith University, 2002.

[36]

W. Shahzad and A.R. Baig, “Compatibility as a heuristic for construction of rules by artificial ants,” Journal of Circuits, Systems, and Computers, Vol. 19, No. 1, pp. 297-306, Feb. 2010.

[37]

A.R. Baig and W. Shahzad, “A correlation based ant miner for classification rule discovery, compatibility as a heuristic for construction of rules by artificial ants” Neural Computing and Applications, 2010. (Under 2nd review).

[38]

J. Catlett, “Over-pruning large decision trees,” in Proceedings of International Joint Conference of Artificial Intelligence, San Francisco, CA, 1991, pp. 764–769.

[39]

Y. Kusunoki, M. Inuiguchi, “Rule induction via clustering decision classes,” International Journal of Innovative Computing, Information and Control (IJICIC), Vol. 4, No. 10, pp. 2663-2677, Oct. 2008.

[40]

O.T. Yıldız, and O. Dikmen, “Parallel uni-variate decision trees,” Pattern Recognition Letters, Vol. 28, No. 7, pp. 825-832, May 2007.

[41]

R. Rastogi, and K. Shim, “A decision tree classifier that integrates building and pruning,” Data Mining and Knowledge Discovery, Vol. 4, pp. 315–344, 2000.

[42]

D.R. Carvalho, and A.A. Freitas, “New results for a hybrid decision tree/genetic algorithm for data mining,” in J. Garibaldi (Ed.), Proceedings of 4th International Conference on Recent Advances in Soft Computing (RASC-2002), Nottingham Trent University, pp. 260–265, 2002.


[43]

G.J. Williams, “Inducing and combining multiple decision trees,” PhD Thesis, Australian National University, Canberra, Australia, 1990.

[44]

J.R. Quinlan, “Improved use of continuous attributes in C4.5,” Journal of Artificial Intelligence Research, Vol. 4, pp. 77-90, 1996.

[45]

T. Oates, and D. Jensen, “The effects of training set size on decision tree complexity,” in Proceedings of the 14th International Conference on Machine Learning, pp. 254-262, 1997.

[46]

P. Clark, and T. Niblett, “The CN2 induction algorithm,” Machine Learning, pp. 261–283, 1989.

[47]

W. Cohen, “Fast effective rule induction,” in Machine Learning: Proceedings of the Twelfth International Conference (ML95), pp. 852-857, 1995.

[48]

Y.W. Chen. and C.J. Lin, “Combining SVMs with various feature selection strategies,” in Feature extraction, foundations and applications, Springer-Verlag, Berlin, 2006.

[49]

M.L. Zhang and Z.H. Zhou, “A k-nearest neighbor based algorithm for multilabel classification,” in 1st IEEE International Conference on Granular Computing, Vol. 2, pp 718–721, 2005.

[50]

T. Seidl, and H. Kriegel “Optimal multi-step k-nearest neighbor search,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 154–165, 1998.

[51]

J.C. Bezdek, S.K. Chuah, and D. Leep, “Generalized k-nearest neighbor rules,” Fuzzy Sets System, Vol. 18, No.3, pp. 237–256, 1986.

[52]

S. Cost, and S. Salzberg, “A weighted nearest neighbor algorithm for learning with symbolic features,” Machine Learning, Vol. 10, No.1, pp. 57-78, 1993.

[53]

L.J. Wang, X.L. Wang and Y.C. Liu, “Combination of multiple real-valued nearest neighbor classifiers based on different feature subsets with fuzzy integral,” International Journal of Innovative Computing, Information and Control (IJICIC), Vol.4, No.2, pp 369-379, Feb. 2008.

[54]

T. Hastie, and R. Tibshirani, “Discriminant adaptive nearest neighbor classification.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, pp. 607–616, 1996.


[55]

I.W. Tsang, J.T. Kwok, and P.M. Cheung “Core vector machines: fast SVM training on very large datasets,” Journal of Machine Learning Research, Vol. 6, pp. 363–392, Dec. 2005.

[56]

Z. Wu and C. Li, “Feature selection for classification using transductive support vector machines”, in Feature Extraction, Foundations and Applications, Springer-Verlag, Berlin, 2006.

[57]

N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Journal of Machine Learning Research, Vol. 29, pp. 131–163, Dec. 1997.

[58]

R. Bouckaert, “Naive bayes classifiers that perform well with continuous variables,” Lecture Notes in Computer Science, Vol. 3339, pp. 1089 – 1094, 2004.

[59]

S.L. Cessie, and V. Houwelingen, “Ridge estimators in logistic regression,” Applied Statistics, Vol. 41, No. 1, pp. 191-201, 1992.

[60]

R.E. Schapire, and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, Vol. 37, No. 3, pp. 297–336, 1999.

[61]

A.E. Eiben, and J.E. Smith, Introduction to Evolutionary Computing. Natural Computing Series, 2nd edition, 2007.

[62]

A. Freitas, “Survey of evolutionary algorithms for data mining and knowledge discovery,” in A. Ghosh, S. Tsutsui (Eds.), Advances in Evolutionary Computation, Springer-Verlag, pp. 151-160, 2001.

[63]

M. Dorigo. Optimization, Learning and Natural Algorithms (in Italian). PhD thesis, Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy, 1992.

[64]

M. Dorigo, M. Birattari and T. Stützle, “Ant colony optimization, artificial ants as a computational intelligence technique,” IEEE Computational Intelligence Magazine, Vol. 1, No. 4, pp. 28-39, 2006.

[65]

R.S. Parpinelli, H.S. Lopes, and A.A. Freitas, “Data mining with an ant colony optimization algorithm,” IEEE Transactions on Evolutionary Computation, Vol. 6, No. 4, pp. 321–332, Aug. 2002.

[66]

B. Liu, H.A. Abbass, and B. McKay, “Density-based heuristic for rule discovery with ant-miner,” in Proceedings of 6th Australia-Japan Joint Workshop on Intelligent Evolutionary Systems. Canberra, Australia, 2002, pp. 180–184.


[67]

B. Liu, H.A. Abbass, and B. McKay, “Classification rule discovery with ant colony optimization,” in Proceedings of IEEE/WIC International Conference on Intelligent Agent Technology, 2003, pp. 83–88.

[68]

B. Liu, H.A. Abbass, and B. McKay, “Classification rule discovery with ant colony optimization,” IEEE Computational Intelligence Bulletin, Vol. 3, No. 1, Feb. 2004.

[69]

D. Martens, M. de Backer, R. Haesen, J. Vanthienen, M. Snoeck, and B. Baesens, “Classification with ant colony optimization,” IEEE Transactions on Evolutionary Computation, Vol. 11, No. 5, Oct. 2007.

[70]

J. Smaldon, and A.A. Freitas, “A new version of the Ant-Miner algorithm discovering unordered rule sets,” in Proceedings of Genetic and Evolutionary Computation Conference, (GECCO-2006), pp. 43-50, 2006.

[71]

N. Holden, and A.A. Freitas, “A hybrid PSO/ACO algorithm for classification,” in Proceeding of GECCO-2007 Workshop on Particle Swarms: The Second Decade, ACM Press, New York, 2007.

[72]

D. Bratton, and J. Kennedy, “Defining a standard for particle swarm optimization,” in Proceedings of the IEEE Swarm Intelligence Symposium (SIS ’07), pp. 120–127, Honolulu, Hawaii, USA, April 2007.

[73]

S. Swaminathan: Rule induction using ant colony optimization for mixed variable attributes, MSc Thesis, Texas Tech. Univ., 2006.

[74]

A. Chan, A. Freitas, “A new ant colony algorithm for multi-label classification with applications in bioinformatics,” in Proceedings of Genetic and Evolutionary Computation Conference (GECCO-2006), pp. 27-34, 2006.

[75]

M. Galea, and Q. Shen, “Simultaneous ant colony optimization algorithms for learning linguistic fuzzy rules,” in Abraham, A., Grosan, C., and Ramos, V. (eds.), Swarm Intelligence in Data Mining, pp. 75-99, Berlin: Springer, 2006.

[76]

N. Holden, and A.A. Freitas, “Web page classification with an ant colony algorithm,” in Proceedings of Parallel Problem Solving from Nature, LNCS 3242, pp.1092-1102, Springer, 2004.

[77]

S. Hettich, and S.D. Bay, “The UCI KDD Archive”. Irvine, CA: Dept. Inf. Comput. Sci., Univ. California, 1996 [Online]. Available: http:// kdd.ics.uci.edu.


[78]

W. Li. “Classification based on multiple association rules,” MSc Thesis, Simon Fraser University, April 2001.

[79]

R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proceedings of ACM SIGMOD International Conference on Management of Data, Washington, DC, pp. 207–216, 1993.

[80]

A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases,” in Proceedings of 21st International Conference Very Large Databases (VLDB), pp. 432–444, 1995.

[81]

T.P. Hong, C.W. Lin, and Y.L. Wu, “An efficient FUFP-tree maintenance algorithm for record modification,” International Journal of Innovative Computing, Information and Control (IJICIC), Vol. 4, No. 11, pp. 2875-2887, Nov. 2008.

[82]

B. Liu, Y. Ma, and C.K. Wong, “Improving an association rule based classifier,” in Proceedings of 4th European. Conference of Principles Practice Knowledge Discovery Databases (PKDD-2000), Vol. 1910, pp. 293-217, 2000.

[83]

J.L. Koh, and P. Yo, “An efficient approach for mining fault-tolerant frequent patterns based on bit vector representations,” in Proceedings of 10th International Conference DASFAA, pp. 179-184 , 2005.

[84]

Y.L. Cheung, and A.W. Fu, “An FP-tree approach for mining n-most interesting item-sets,” in Proceedings of the SPIE Conference on Data Mining, pp. 460-471, 2002.

[85]

C.J. Chu, V.S. Tseng, and T. Liang, “Mining temporal rare utility item-sets in large databases using relative utility thresholds,” International Journal of Innovative Computing, Information and Control (IJICIC), Vol. 4, No. 11, pp. 2775-2792, Nov. 2008.

[86]

G. Chen, H. Liu, L. Yu, Q. Wei, and X. Zhang, “A new approach to classification based on association rule mining,” Decision Support Systems, Vol. 42, No. 2, pp. 674-689, 2006.

[87]

R. Shettar, and G.T. Shobha, “Finding frequent structures in semi-structured data,” ICIC Express Letters, Vol. 3, No. 2, pp. 135-140, June 2009.


[88]

K.R. Seeja, M.A. Alam, S.K. Jain, “An association rule mining approach for coregulated signature genes identification in cancer,” Journal of Circuits, Systems, and Computers (JCSC), Vol. 18, No. 8, pp.1409-1423, Feb. 2010.

[89]

J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in Proceedings of ACM SIGMOD Intl. Conference on Management of Data, pp. 1–12, 2000.

[90]

B. Liu, H. Hsu, and Y. Ma, “Integrating classification and association rule mining,” in Proceedings of 4th International Conference on Knowledge Discovery Data Mining, pp. 80–86, 1998.

[91]

B. Liu, Y. Ma, and C.K.Wong, “Classification using association rules: Weaknesses and enhancements,” in Data Mining for Scientific and Engineering Applications, R. L. Grossman, C. Kamath, P.Kegelmeyer,V. Kumar, and R. R. Namburu, Eds., Berlin, Germany: Springer-Verlag, 2001.

[92]

W. Li, J. Han, and J. Pei, “CMAR: Accurate and efficient classification based on multiple class-association rules,” in Proceedings of IEEE International Conference on Data Mining. (ICDM ’01), pp. 369–376, 2001.

[93]

H.L. Wei and S.A. Billings, “Feature subset selection and ranking for data dimensionality reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No.1, Jan. 2007.

[94]

A.L. Blum and P. Langley, “Selection of relevant features and examples in machine learning,” Artificial Intelligence, Vol. 97, No. 2, pp. 245–271, Dec. 1997.

[95]

A. Al-Ani, M. Deriche, and J. Chebil, “A new mutual information based measure for feature selection,” Intelligent Data Analysis, Vol. 7, No. 1, pp. 43-57, 2003.

[96]

V. Gunes, M. Menard, P. Loonis, and S. Petit-Renaud, “Combination, cooperation and selection of classifiers: A state of the art,” International Journal of Pattern Recognition, Vol. 17, No. 8, pp. 1303-1324, March 2003.

[97]

J. Bins, and B.A. Draper, “Feature selection from huge feature sets,” in Proceedings of the Eighth International Conference on Computer Vision, Vol. 2, pp. 159–165, 2001.


[98]

K. Dunne, P. Cunningham, and F. Azuaje, “Solutions to instability problems with sequential wrapper based approaches to feature selection,” Technical Report 2002-28, Department of Computer Science, Trinity College, Dublin, 2002.

[99]

S.K. Lee, S.J. Yi, and B.T. Zhang, “Combining information-based supervised and unsupervised feature selection,” in Feature Extraction, Foundations and Applications, Springer-Verlag, pp. 517-520, 2006.

[100]

S. Singhi, and H. Liu, “Feature subset selection bias for classification learning,” in Proceedings of the 23rd International Conference on Machine Learning, pp. 849–856, 2006.

[101]

H.R. Rashidy Kanan, and K. Faez, “Feature selection using ant colony optimization (ACO): a new method and comparative study in the application of face recognition system,” in Proceedings of ICDM, Lecture Notes for Artificial Intelligence, Springer, Vol. 4597, pp. 63-76, 2007.

[102]

M. Deriche, “Feature selection using ant colony optimization,” in 6th International Multi-Conference on Systems, Signals and Devices, pp. 1-4, 2009.

[103]

J. Yahang, and V. Honavar, “Feature subset selection using a genetic algorithm,” IEEE Intelligent Systems, Vol. 13, No. 2, pp. 44-49, 1997.

[104]

H. Zhou, J. Wu, Y. Wang, and M. Tian, “Wrapper approach for feature subset selection using genetic algorithm,” IEEE Intelligent Systems and Their Applications, Vol. 13, No. 2, pp. 44-49, March 1998.

[105]

M. Raymer, W. Punch, E. Goodman, L. Kuhn, and A. Jain, “Dimensionality reduction using genetic algorithms,” IEEE Transactions on Evolutionary Computation, Vol. 4, No. 2, pp. 164-171, July 2000.

[106]

S. Wang, C. Len, and R. Du, “Feature subset selection based on Bayesian networks,” in 6th International Conference on Fuzzy Systems and Knowledge Discovery, pp. 184-187, Aug. 2009.

[107]

M.E. Cintra, T. Martiny, and M.C. Monardz, “Feature subset selection using a fuzzy method,” in International Conference on Intelligent Human-Machine Systems and Cybernetics, pp. 214-217, 2009.


[108]

B. Chakraborty, “Feature subset selection by particle swarm optimization with fuzzy fitness function,” in 3rd International Conference on Intelligent System and Knowledge Engineering, pp. 1038-1042, Nov. 2008.

[109]

J. Li, “Feature selection based on correlation between fuzzy features and optimal fuzzy-valued feature subset selection,” in International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp. 775-778, 2008.

[110]

S.M. Viera, J. Souza, and T.A. Runkler, “Fuzzy classification in ant feature selection,” in IEEE International Conference on Computational Intelligence, pp. 1763-1769, 2008.

[111]

E. Xing, M.I. Jordan, and S. Russel, “Feature selection for high-dimensional genomic microarray data,” in Proceedings of the Eighteenth International Conference on Machine Learning, pp. 601–608, 2001.

[112]

E. Bonabeau, M. Dorigo, and G. Theraulaz, “Inspiration for optimization from social insect behavior,” Nature, 406, pp. 39–42, 2000.

[113]

M. Dorigo, and G. Di Caro, “The ant colony optimization meta-heuristic,” in D. Corne, M. Dorigo, and F. Glover (Eds.), New Ideas in Optimization, pp. 11–32, McGraw Hill, London, UK, 1999.

[114]

M. Dorigo, G. Di Caro, and L.M. Gambardella, “Ant algorithms for discrete optimization,” Artificial Life, Vol. 5, No. 2, pp. 137–172, 1999.

[115]

V. Maniezzo, and A. Colorni, “The ant system applied to the quadratic assignment problem,” IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 5, pp. 769-778, Oct. 1999.

[116]

L.M. Gambardella, and M. Dorigo, “Ant colony system hybridized with a new local search for the sequential ordering problem,” INFORMS Journal on Computing, Vol. 2, No. 3, pp. 237–255, 2000.

[117]

A. Colorni, M. Dorigo, and V. Maniezzo, “An investigation of some properties of an ant algorithm,” in Proceedings of the Parallel Problem Solving from Nature Conference (PPSN 92), Brussels, Belgium, R. Männer and B. Manderick (Eds.), Elsevier Publishing, pp. 509-520, 1992.

[118]

D. Martens, M. de Backer, R. Haesen, B. Baesens, C. Mues, and J. Vanthienen, “Ant-based approach to the knowledge fusion problem,” in Ant Colony Optimization and Swarm Intelligence (ANTS 2006), LNCS 4150, pp. 84-95, Springer, 2006.