SOME STUDIES ON DIFFERENT DATA MINING APPROACHES Thesis Submitted for the Degree of
DOCTOR OF PHILOSOPHY (ENGINEERING) UNDER THE FACULTY OF ENGINEERING, TECHNOLOGY & MANAGEMENT
by
Soumadip Ghosh UNDER THE SUPERVISION OF
Dr. Sushanta Biswas Associate Professor Department of Engineering & Technological Studies University of Kalyani and
Dr. Debasree Chanda (Sarkar) Associate Professor Department of Engineering & Technological Studies University of Kalyani
DEPARTMENT OF ENGINEERING & TECHNOLOGICAL STUDIES
UNIVERSITY OF KALYANI KALYANI NADIA, WEST BENGAL INDIA 2016
Telephone: 2582-3457(Direct) 2582 8750 Ext. 297/298 Fax: 0091-033-2582 8282 E-mail:
[email protected]
UNIVERSITY OF KALYANI Department of Engineering & Technological Studies Kalyani, West Bengal - 741 235, India Dr. Sushanta Biswas Associate Professor
This is to certify that the thesis entitled "Some Studies on Different Data Mining Approaches", submitted by Mr. Soumadip Ghosh, who registered on 29th June 2011 for the award of the Ph.D. (Engineering) degree under the Faculty of Engineering, Technology & Management of the University of Kalyani, is absolutely based upon his own work under the supervision of Dr. Sushanta Biswas, Associate Professor, and Dr. Debasree Chanda (Sarkar), Associate Professor, Department of Engineering & Technological Studies, University of Kalyani, and that neither this thesis nor any part of it has been submitted for any degree/diploma or any other academic award anywhere before.
--------------------------------------------
Dr. Sushanta Biswas, Associate Professor, Department of Engineering & Technological Studies, University of Kalyani, Kalyani, Nadia-741235, WB, INDIA
UNIVERSITY OF KALYANI Department of Engineering & Technological Studies
Telephone: 2582-3457(Direct) 2582 8750 Ext. 297/298 Fax: 0091-033-2582 8282 E-mail:
[email protected]
Kalyani, West Bengal - 741 235, India Dr. Debasree Chanda (Sarkar) Associate Professor
This is to certify that the thesis entitled "Some Studies on Different Data Mining Approaches", submitted by Mr. Soumadip Ghosh, who registered on 29th June 2011 for the award of the Ph.D. (Engineering) degree under the Faculty of Engineering, Technology & Management of the University of Kalyani, is absolutely based upon his own work under the supervision of Dr. Sushanta Biswas, Associate Professor, and Dr. Debasree Chanda (Sarkar), Associate Professor, Department of Engineering & Technological Studies, University of Kalyani, and that neither this thesis nor any part of it has been submitted for any degree/diploma or any other academic award anywhere before.
--------------------------------------------
Dr. Debasree Chanda (Sarkar), Associate Professor, Department of Engineering & Technological Studies, University of Kalyani, Kalyani, Nadia-741235, WB, INDIA
ACKNOWLEDGMENT

First and foremost, I would like to thank the Almighty God, whose blessings have brought me to this stage. I also acknowledge my indebtedness to my country, India, which has given me all the opportunities I have craved. I take this opportunity to show my gratitude to all my teachers, from my school days to the present, who have guided me throughout my quest for knowledge, who have given me the opportunity to express my imagination, and who gave a patient hearing to my ideas even when, many a time, they were ridiculous. It was the sheer belief of my teachers that gave me the encouragement to pursue my Ph.D.

I am especially grateful to my supervisor, Dr. Sushanta Biswas, for his extremely valuable guidance during this research work. Despite his extremely busy schedule, he has gone through the thesis in depth; it is due to his effort that I have been able to complete it. I also owe thanks to Dr. Debasree Chanda (Sarkar), my co-supervisor, for her insightful and encouraging comments and suggestions throughout this research. Her involvement proved vital to the completion of this thesis. Finally, I would like to thank Dr. Partha Pratim Sarkar, HOD, Department of Engineering & Technological Studies, for his continuous support and encouragement.

On the personal front, I would like to express my gratitude to my parents, who instilled the art of reasoning, questioning, and learning in me. Without them, I would never have learned this very basic and vital method of science. I am also indebted to my uncles, aunts, brothers, and sisters for their continuous love, support, and sincere prayers during the course of this research. I am also thankful to my beloved wife, Sananda, and daughter, Suditi, who are always there to support me; without their continuous sacrifices and patience, the completion of this research project would not have been possible.

I must also express my thanks to all the teachers, my co-researchers, and the office personnel of the Department of Engineering & Technological Studies, University of Kalyani. Special thanks to Mr. J. P. Singh, Mr. Amitava Nag, Mr. Arindrajit Pal, Mr. K. Bhowal, and Mr. Sandip Ghosh, who constantly helped me in my research. Last but not least, I acknowledge the Academy of Technology, Hooghly.

Place:
Date:
Soumadip Ghosh
Contents

ACKNOWLEDGEMENTS iii
LIST OF PUBLICATIONS viii
LIST OF FIGURES x
LIST OF TABLES xii

Chapter 1 INTRODUCTION TO DATA MINING
1.1 INTRODUCTION 1
1.2 DATA MINING SYSTEM ARCHITECTURE 5
1.3 DATA MINING TECHNIQUES 6
1.3.1 Association Rule Mining 7
1.3.2 Classification 7
1.3.3 Regression 9
1.3.4 Clustering 9
1.3.5 Summarization 10
1.3.6 Outlier Detection 10
1.4 APPLICATIONS OF DATA MINING 10
1.5 PREVIOUS RESEARCH WORK 11
1.5.1 Study of the Previous Research Work on Knowledge Mining from Data 11
1.5.2 Study of the Previous Research Work on Association Rule Mining 12
1.5.3 Study of the Previous Research Work on Weather Prediction 13
1.5.4 Study of the Previous Research Work on Breast Cancer Detection 14
1.5.5 Study of the Previous Research Work on Soil Classification from Large Imagery Soil Databases 15
1.5.6 Study of the Previous Research Work on Tennis Match Result Prediction 17
1.6 THESIS MOTIVATION 18
1.7 OBJECTIVE OF THE THESIS 19
1.8 ORGANIZATION OF THE THESIS 20
1.9 REFERENCES 22

Chapter 2 DETERMINATION OF ASSOCIATION RULES USING GENETIC ALGORITHM
2.1 INTRODUCTION 27
2.2 ASSOCIATION RULE MINING 29
2.3 GENETIC ALGORITHM 32
2.4 PERFORMANCE MEASURES 34
2.5 PROPOSED METHOD 35
2.6 RESULTS AND DISCUSSION 36
2.6.1 Supermarket database 37
2.6.2 Mushroom database 39
2.6.3 Plants database 41
2.7 CONCLUSION 43
2.8 REFERENCES 43

Chapter 3 WEATHER DATA MINING USING ARTIFICIAL NEURAL NETWORK
3.1 INTRODUCTION 46
3.2 CLASSIFICATION TECHNIQUES USED 48
3.2.1 Artificial Neural Network 49
3.2.2 Support Vector Machine 53
3.2.3 Decision Tree 54
3.3 PROPOSED TECHNIQUE 57
3.4 PERFORMANCE MEASURES 61
3.4.1 Root-mean-square error 61
3.4.2 Kappa statistic 61
3.4.3 Confusion Matrix 62
3.5 RESULTS AND PERFORMANCE ANALYSIS 63
3.6 CONCLUSION 67
3.7 REFERENCES 67

Chapter 4 BREAST CANCER DETECTION USING A NEURO-FUZZY BASED CLASSIFICATION METHOD
4.1 INTRODUCTION 70
4.2 PROPOSED NEURO-FUZZY BASED METHOD 72
4.3 METHODOLOGY 76
4.4 RESULTS AND DISCUSSION 77
4.4.1 Wisconsin Breast Cancer Database 78
4.4.2 Wisconsin Diagnostic Breast Cancer Database 82
4.4.3 Mammographic Mass Database 85
4.5 CONCLUSION 89
4.6 REFERENCES 89

Chapter 5 SOIL CLASSIFICATION FROM LARGE IMAGERY DATABASES USING A NEURO-FUZZY CLASSIFIER
5.1 INTRODUCTION 92
5.2 PROPOSED NEURO-FUZZY CLASSIFIER 95
5.3 DETAILED PROCEDURE 100
5.4 RESULTS AND DISCUSSION 101
5.4.1 Statlog Landsat Satellite Database 102
5.4.2 Covertype Database 106
5.4.3 Wilt Database 110
5.5 CONCLUSION 114
5.6 REFERENCES 115

Chapter 6 TENNIS MATCH RESULT PREDICTION USING AN ADAPTIVE NEURO-FUZZY INFERENCE SYSTEM
6.1 INTRODUCTION 118
6.2 ABOUT THE DATASET 120
6.3 PROPOSED METHOD 123
6.4 DETAILED PROCEDURE 127
6.4.1 Data preprocessing 128
6.4.2 Data classification 128
6.5 RESULTS AND DISCUSSION 130
6.5.1 Men's Tennis Match Tournaments 130
6.5.2 Women's Tennis Match Tournaments 133
6.6 CONCLUSION 135
6.7 REFERENCES 136

Chapter 7 CONCLUSION AND FUTURE SCOPE OF WORK
7.1 CONCLUSION 139
7.2 FUTURE SCOPE OF WORK 140
List of Publications

Peer Reviewed Journals
1. Soumadip Ghosh, Sushanta Biswas, Debasree Sarkar, and Partha Pratim Sarkar. A Novel Neuro-fuzzy Classification Technique for Data Mining. Egyptian Informatics Journal, Elsevier, Vol. 15, No. 3, pp. 129-147, November 2014.
2. Soumadip Ghosh, Sushanta Biswas, Debasree Sarkar, and Partha Pratim Sarkar. A Tutorial on Different Classification Techniques for Remotely Sensed Imagery Datasets. Smart Computing Review Journal, Vol. 4, No. 1, pp. 34-43, February 2014.
3. Soumadip Ghosh, Sushanta Biswas, Debasree Sarkar, and Partha Pratim Sarkar. Soil Data Mining Using Decision Tree Classifier. Computer Science & Engineering Research Journal (CSERJ), Vol. 8, pp. 27-31, 2012-13.
4. Soumadip Ghosh, Shayak Sadhu, Sushanta Biswas, Debasree Sarkar, and Partha Pratim Sarkar. A Comparison of Different Classifiers for Tennis Match Result Prediction. Smart Computing Review Journal, Vol. 6, No. 1, February 2016.
5. Soumadip Ghosh, Sushanta Biswas, Debasree Sarkar, and Partha Pratim Sarkar. Breast Cancer Detection using a Neuro-fuzzy based Classification Method. Indian Journal of Science and Technology, Vol. 9, 2016 (accepted).
6. Soumadip Ghosh, Debasish Biswas, Sushanta Biswas, Debasree Sarkar, and Partha Pratim Sarkar. Soil Classification from Large Imagery Databases using a Neuro-fuzzy Classifier. Canadian Journal of Electrical and Computer Engineering (communicated; revision submitted).
7. Soumadip Ghosh, Shayak Sadhu, Sushanta Biswas, Debasree Sarkar, and Partha Pratim Sarkar. A Comparison between Different Classifiers for Tennis Match Result Prediction. Malaysian Journal of Computer Science (communicated).

Peer Reviewed International Conferences
1. Soumadip Ghosh, Sujoy Mondal, and Bhaskar Ghosh. A Comparative Study of Breast Cancer Detection Based on SVM and MLP BPN Classifier. ACES 2014 1st International Conference, pp. 1-4, 2014, IEEE.
2. Soumadip Ghosh, Sushanta Biswas, Debasree Sarkar, and Partha Pratim Sarkar. Association Rule Mining Algorithms and Genetic Algorithm: A Comparative Study. EAIT 2012 IEEE Conference, pp. 202-205, 2012, IEEE.
3. Soumadip Ghosh, Amitava Nag, Debasish Biswas, Jyoti Prakash Singh, Sushanta Biswas, Debasree Sarkar, and Partha Pratim Sarkar. Weather Data Mining Using Artificial Neural Network. RAICS 2011 IEEE Conference, pp. 192-195, 2011, IEEE.
List of Figures

Figure 1.1: Data mining as a phase in the KDD process 3
Figure 1.2: The architecture of a data mining system 5
Figure 2.1: Scalability of Apriori and GA using Supermarket database 37
Figure 2.2: Confidence vs. Time plot using Supermarket database 38
Figure 2.3: Scalability of Apriori and GA using Mushroom database 39
Figure 2.4: Confidence vs. Time plot using Mushroom database 40
Figure 2.5: Scalability of Apriori and GA using Plants database 41
Figure 2.6: Confidence vs. Time plot using Plants database 42
Figure 3.1: A multilayer feed-forward network 49
Figure 3.2: Hidden or output layer of backpropagation neural network 51
Figure 3.3: A typical artificial neuron used in a Hopfield network 52
Figure 3.4: An SVM showing the maximum marginal hyperplane between two classes 53
Figure 3.5: A sample decision tree classifier 55
Figure 3.6: The surface stations on the earth 57
Figure 3.7: The sixteen equi-spaced regions of the earth 59
Figure 3.8: Comparison of RMSE and Kappa statistic using weather database 64
Figure 3.9: Comparison using TP-Rate/Recall, FP-Rate, Precision, and F-Measure 66
Figure 4.1: A typical sigmoidal membership function 74
Figure 4.2: The proposed Neuro-fuzzy system model 76
Figure 4.3: Broad level stages of the proposed methodology 77
Figure 4.4: Comparison of RMSE and Kappa statistic using the WBC dataset 81
Figure 4.5: Comparison between classifiers using confusion matrix measures for WBC dataset 82
Figure 4.6: Comparison of RMSE and Kappa statistic using the WDBC dataset 84
Figure 4.7: Comparison between classifiers using confusion matrix measures for WDBC dataset 85
Figure 4.8: Comparison of RMSE and Kappa statistic using the MM dataset 87
Figure 4.9: Comparison between classifiers using confusion matrix measures for MM dataset 88
Figure 5.1: The Gaussian curve membership function used in fuzzification 97
Figure 5.2: The proposed Neuro-fuzzy classifier 99
Figure 5.3: Broad level stages of the detailed procedure 101
Figure 5.4: Comparison of RMSE and Kappa statistic for Landsat Satellite data set 104
Figure 5.5: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for Landsat Satellite data set 105
Figure 5.6: Comparison of RMSE and Kappa statistic for Covertype data set 108
Figure 5.7: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for Covertype data set 109
Figure 5.8: Comparison of RMSE and Kappa statistic values for Wilt data set 112
Figure 5.9: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for Wilt data set 113
Figure 6.1: Proposed ANFIS architecture equivalent to a fuzzy inference system 125
Figure 6.2: Major steps of the detailed classification procedure 129
Figure 6.3: Comparisons of RMSE and Kappa statistic based on averages using Men's datasets 131
Figure 6.4: Comparisons of confusion matrix metrics based on averages using Men's datasets 132
Figure 6.5: Comparisons of RMSE and Kappa based on averages using Women's datasets 134
Figure 6.6: Comparisons of confusion matrix metrics based on averages using Women's datasets 135
List of Tables

Table 1.1: Comparison of Clustering and Classification 9
Table 2.1: Execution Times for Different Number of Instances for Supermarket database 37
Table 2.2: Execution Times for Confidence Thresholds using Supermarket database 38
Table 2.3: Execution Times for Different Number of Instances for Mushroom database 39
Table 2.4: Execution Times for Confidence Thresholds using Mushroom database 40
Table 2.5: Execution Times for Different Number of Instances for Plants database 41
Table 2.6: Execution Times for Confidence Thresholds using Plants database 42
Table 3.1: A confusion matrix for a two-class classifier 62
Table 3.2: Comparison based on Accuracy, RMSE, and Kappa statistic 64
Table 3.3: Comparison based on TP-Rate/Recall, FP-Rate, Precision, and F-Measure 65
Table 4.1: The attributes of the WBC data set 79
Table 4.2: Comparison based on Accuracy, RMSE, and Kappa statistic using WBC 80
Table 4.3: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure using WBC 81
Table 4.4: Comparison based on Accuracy, RMSE, and Kappa statistic using WDBC 83
Table 4.5: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure using WDBC 84
Table 4.6: Comparison based on Accuracy, RMSE, and Kappa statistic using MM 86
Table 4.7: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure using MM 87
Table 5.1: Configuration parameters used in the MLP model 98
Table 5.2: Comparison of Accuracy, RMSE, and Kappa statistic for Landsat Satellite dataset 103
Table 5.3: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-measure for Landsat Satellite dataset 105
Table 5.4: Attribute Information of Forest Covertype data set 107
Table 5.5: Comparison based on Accuracy, RMSE, and Kappa statistic for Covertype data set 108
Table 5.6: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-measure for Covertype data set 109
Table 5.7: Attribute Information of Wilt data set 111
Table 5.8: Comparison based on Accuracy, RMSE, and Kappa statistic for Wilt data set 111
Table 5.9: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-measure for Wilt data set 112
Table 6.1: Dataset attribute list along with their descriptions 121
Table 6.2: Comparisons of the classifiers on the test datasets of Men's Tennis Major Tournaments 131
Table 6.3: Detailed accuracy for classifiers on Men's Tennis Major Tournaments datasets 132
Table 6.4: Comparisons of the classifiers on the test datasets of Women's Tennis Major Tournaments 133
Table 6.5: Detailed accuracy for classifiers on Women's Tennis Major Tournaments datasets 134
CHAPTER 1 INTRODUCTION TO DATA MINING
1.1 INTRODUCTION

Data mining [1] [2] [3] [4] is a relatively new, multidisciplinary field of applied computer science. It is the process of discovering patterns that represent knowledge in large databases by combining methods from artificial intelligence, machine learning, and statistics with database management systems. The primary goal of data mining is to transform the gathered information into a comprehensible form for further use. With recent technical advances in different aspects of information technology, data mining has emerged as an increasingly significant tool for turning vast quantities of digital data into meaningful knowledge, and it therefore contributes substantially to business policy and scientific research. Researchers currently apply it in a wide range of areas, such as scientific research, market analysis, forensic investigation, and criminology. Of late, the growing requirements of the information industry have led to an explosion in demand for new data mining technologies.

Data are normally facts, texts, or numbers processed by computing devices. Typical sources of data include databases, flat files, on-line transaction records, spreadsheets, and other information repositories. In the early days of information processing, patterns were extracted from data manually; the principal approaches to recognizing patterns in that era were statistical methods such as Bayes' theorem and regression analysis. The growth, ubiquity, and increasing power of computer technology have dramatically expanded data collection, storage, and management capabilities. As real-world databases have grown in size and complexity, direct manual data analysis has increasingly been augmented by automated data processing. The techniques involved include genetic algorithms (GAs), cluster analysis, fuzzy logic,
rough sets, artificial neural networks (ANNs), decision trees, and support vector machines (SVMs). Data mining is the process of applying these methods with the intent of revealing hidden patterns in large databases. The research domain bridges artificial intelligence and applied statistics on one side and database management systems on the other, exploiting various data management and indexing techniques; the goal is to run the actual learning and discovery algorithms more efficiently, thereby permitting their use on larger databases [5] [6] [7].

The abundance of data, combined with the need for powerful data analysis tools, has been described as a "data rich but information poor" situation. The rapidly growing volume of data, gathered and retained in numerous large data sources, has far exceeded the human capacity to comprehend it without powerful tools, and this widening gap between data and information calls for the systematic development of data mining tools. Patterns and relationships in the collected data are typically turned into meaningful information through analysis; the basic idea is to convert this information into useful knowledge indicating future trends or historical patterns, so as to make subsequent analysis easier. Data mining tools perform such analysis and may reveal significant information, contributing substantially to knowledge bases, business policy, and scientific and medical research.

However, a degree of data centralization is still needed to make the most of user access and strategic analysis. Dramatic progress in information technology is allowing business groups to consolidate their organizational databases into larger centralized data warehouses. A data warehouse is a repository of information gathered from multiple heterogeneous data sources, stored under a unified schema, and usually residing at a single physical site. It mostly handles read-only data and therefore needs only two basic operations: initial loading of data and access to data. It serves as the physical implementation of a decision support system and stores the information needed for management decision-making; for representation it uses a multidimensional data model called a data cube. Data warehousing [8] [9], like data mining, is a relatively new term, although the concept itself has existed for decades. It is the overall process of constructing and using data warehouses. In
essence, data warehousing is a disciplined way of maintaining a central storehouse of all organizational data. From a data warehouse viewpoint, data mining can be regarded as an advanced form of on-line analytical processing (OLAP).

Data mining is essentially the process of extracting interesting information from large amounts of data and transforming it into meaningful knowledge. Given this definition, the term "data mining" is arguably a misnomer; a more accurate name would be something like "knowledge mining from data", though that is inelegantly long. Nevertheless, the term that joins "data" and "mining" has become the popular choice in the research community. The literature contains many other terms with similar or marginally different meanings, for example, knowledge extraction, data/pattern analysis, and data archaeology.

Data mining is one of the phases of the Knowledge Discovery in Databases [3] process, or KDD process for short. The two terms are distinct: the KDD process denotes the overall procedure for deriving worthwhile knowledge from large amounts of data, while data mining refers to the specific step of extracting interesting patterns from the data through analysis. Data mining as a phase in the KDD process is shown in Figure 1.1.
Figure 1.1: Data mining as a phase in the KDD process
The knowledge discovery in databases process consists of an iterative sequence of the following steps, as depicted in Figure 1.1:

1. Data selection: In this essential step, the data relevant to the investigation task are assembled from different data sources, which may include data warehouses, on-line transaction records, relational databases, flat files, spreadsheets, or other kinds of information repositories. The resulting data form the target data set.

2. Data preprocessing: Several preprocessing tasks are applied to the target data set to ensure consistency in naming conventions, encoding structures, and attribute measures. Preprocessing includes data integration, which may combine multiple data sources, and data cleaning, where "cleaning" denotes processing the data to reduce noise and treat missing values.

3. Data transformation: This step is applied to the preprocessed data set prior to mining. For example, it is used to normalize the data set, because neural network and regression-based techniques require distance measurements for analysis; it maps attribute values into a small range such as [-1.0, +1.0] or [0.0, 1.0]. Researchers occasionally follow aggregation or consolidation approaches for data transformation instead (a short normalization sketch follows at the end of this section).

4. Data mining: This is the step that discovers interesting patterns and knowledge in the data. The procedure may refer to a knowledge base, a repository of information about a particular domain, to help the search for interesting patterns.

5. Pattern evaluation: The extracted patterns are screened against thresholds or interestingness measures to recognize the genuinely interesting knowledge, which is finally presented to the user through visualization techniques.

Steps 1 to 3 are preparatory tasks in which the data are readied for mining. Although data mining is only one step of the KDD process, the term has become the universal shorthand for the longer expression "knowledge discovery in databases".
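To make the normalization mentioned in step 3 concrete, the short Python sketch below rescales one attribute to [0.0, 1.0] by min-max scaling; the attribute name and values are hypothetical and are not taken from any data set used in this thesis:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numeric attribute values to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:                      # constant attribute: map to midpoint
        return [(new_min + new_max) / 2.0 for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

# Hypothetical 'income' attribute rescaled to [0.0, 1.0] before mining
incomes = [12000.0, 35000.0, 47000.0, 98000.0]
print(min_max_normalize(incomes))              # [0.0, 0.267..., 0.407..., 1.0]
```

The same function with new_min=-1.0 and new_max=+1.0 yields the other range mentioned above.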
1.2 DATA MINING SYSTEM ARCHITECTURE

The architecture of a data mining system [8] [9] consists of the following modules, as shown in Figure 1.2.
Figure 1.2: The architecture of a data mining system
1. Data sources: The system assembles data from different sources for the investigation task; these include data warehouses, flat files, databases, the World Wide Web (WWW), spreadsheets, and other kinds of information repositories. Data selection and data preprocessing techniques are applied to these data.

2. Database or data warehouse server: This is the central storage component, responsible for fetching the relevant data according to the data mining request or query issued by the user.
3. Knowledge base: The mining procedure may refer to a knowledge base, a repository of domain knowledge that guides the search for interesting patterns. Such knowledge may include "concept hierarchies", which organize attributes or attribute values into several levels of abstraction, and "user beliefs", which can be used to assess the interestingness of a pattern according to its unexpectedness. Other examples of domain knowledge are additional thresholds or interestingness constraints and metadata (i.e., data about data).

4. Data mining engine: This is the core of the system. It contains a set of functional modules for tasks such as summarization, association analysis, classification, regression, cluster analysis, and outlier detection.

5. Pattern evaluation: This module applies thresholds or interestingness constraints to identify interesting knowledge. It also interacts with the mining engine to help focus the search on interesting patterns.

6. Graphical user interface: This module mediates between users and the system. It lets the user submit a data mining request or query, supplies the information needed to guide the search, and presents the mined knowledge through visualization techniques appropriate to the user's application.
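A minimal sketch of how these modules might fit together is given below. The class, the toy transactions, and the thresholds are all hypothetical, and the "engine" performs only a trivial frequent-item count so that the example stays self-contained:

```python
from dataclasses import dataclass, field

@dataclass
class DataMiningSystem:
    """Toy stand-in for the modules of Figure 1.2 (names are illustrative)."""
    data_source: list                                    # databases/files/WWW etc.
    knowledge_base: dict = field(default_factory=dict)   # thresholds, metadata

    def mining_engine(self, query):
        # a real engine would dispatch to association/classification/clustering;
        # here it just counts item frequencies against a support threshold
        min_support = self.knowledge_base.get("min_support", 0.5)
        n = len(self.data_source)
        counts = {}
        for record in self.data_source:
            for item in record:
                counts[item] = counts.get(item, 0) + 1
        return {item: c / n for item, c in counts.items() if c / n >= min_support}

    def pattern_evaluation(self, patterns):
        # keep only patterns above an interestingness threshold from the knowledge base
        cutoff = self.knowledge_base.get("interestingness", 0.0)
        return {p: s for p, s in patterns.items() if s >= cutoff}

system = DataMiningSystem(
    data_source=[{"bread", "milk"}, {"bread", "butter"}, {"milk"}],
    knowledge_base={"min_support": 0.6, "interestingness": 0.6},
)
print(system.pattern_evaluation(system.mining_engine("frequent items")))
# {'bread': 0.666..., 'milk': 0.666...} (order may vary)
```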
1.3 DATA MINING TECHNIQUES

Data mining techniques [8] [9] determine the types of patterns discovered during mining. They are commonly divided into two categories: descriptive and predictive. Descriptive techniques focus on finding human-interpretable patterns that describe the data, outlining the characteristic features of the database records; examples are data summarization, association rule mining, and cluster analysis. Predictive techniques make inferences from the present data in order to forecast: they use some attributes of the database to predict unknown or future
values of other attributes of interest; examples are regression analysis, classification, and prediction. This section describes the different data mining techniques and the types of data patterns they can discover.
1.3.1 Association Rule Mining

Association rule mining (also known as frequent pattern mining) [10] [11] [12] is an important data mining technique. Frequent patterns are patterns that occur regularly in a data set; for instance, a set of items such as bread, butter, and milk that frequently appear together in transaction records is called a frequent itemset. Discovering such patterns plays a vital role in mining relationships among data, and the procedure can also support classification, cluster analysis, and other data mining tasks. It is a descriptive data mining technique.

Association rule mining (ARM) denotes the procedure of determining interesting and unexpected rules from vast databases. The field provides a very general model for finding relations between the items of a database; an association rule is an if-then rule supported by the data. Association rule mining algorithms were initially applied to the market-basket problem [4]: given a set of items and a large collection of transaction records, find associations between the items across the transactions. A typical association rule resulting from such a study might be "60 percent of all consumers who purchase a personal computer also buy antivirus software", which reveals very valuable information. Such analysis can provide new insight into customer behavior, leading to higher profits through better customer relations, customer retention, and better product placement. Association rule mining is described in detail in Chapter 2.
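As a minimal illustration of the two standard rule measures, the Python sketch below computes the support and confidence of a rule over a handful of toy transactions; the transactions and the rule are hypothetical, chosen to echo the computer/antivirus example above:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional frequency of `consequent` among transactions with `antecedent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Hypothetical market-basket transactions
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"printer", "antivirus"},
    {"computer", "antivirus"},
]
rule_body, rule_head = {"computer"}, {"antivirus"}
print(support(rule_body | rule_head, transactions))    # 0.6
print(confidence(rule_body, rule_head, transactions))  # 0.75
```

Here the rule {computer} => {antivirus} holds with support 0.6 and confidence 0.75, i.e., 75 percent of the toy transactions containing a computer also contain antivirus software.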
1.3.2 Classification

Data classification [8] [9] [13] is the task of deriving a classifier, or model, that describes and discriminates between several data classes. The procedure first applies preprocessing tasks (data cleaning, data selection, data transformation, etc.) to the original data and then divides the preprocessed data set into two sections, namely the training data set and the
test data set. The two sets should be independent of each other to avoid bias. A classification technique is also referred to as a classifier.

Classification consists of two distinct steps. The first step builds a classification model (i.e., the classifier) for a well-defined set of classes. This is the training phase, in which the technique constructs the model by learning from a training data set whose records carry class label attributes; classification is therefore a form of supervised learning. The model is then used for prediction in the testing phase, in which its accuracy is estimated on the test data set (a minimal sketch of this train/test workflow follows at the end of this subsection). Classification is a predictive data mining task.

The classification procedure is applied to large information repositories to build models that distinguish diverse data classes, and this kind of analysis can give deep insight into large-scale databases. The resulting model is based on analysis of the training data set and may take several forms, such as mathematical formulas, simple if-else rules, artificial neural networks, or decision trees. Software built around classification analyzes large databases and derives meaningful classifications and patterns for scientific, industrial, and commercial purposes. The performance of a classification model is judged by the following criteria:

Accuracy: the ability of the model to predict the class label of new or previously unseen data correctly.

Speed: the computational cost of building and using the model.

Robustness: the ability of the model to make correct predictions in the presence of noisy data or data with missing values.

Scalability: the ability to construct the model efficiently given vast amounts of data.

Interpretability: the level of understanding and insight offered by the model.
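The following Python fragment sketches the train/test workflow referenced above: it splits a handful of hypothetical labelled records into independent training and test sets and estimates accuracy with a simple one-nearest-neighbour classifier. The records and the classifier choice are illustrative only, not the methods proposed in this thesis:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=7):
    """Shuffle labelled records and divide them into training and test sets."""
    data = records[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1.0 - test_fraction))
    return data[:cut], data[cut:]

def predict_1nn(train, features):
    """Return the label of the training record closest to `features`."""
    def sq_dist(record):
        return sum((a - b) ** 2 for a, b in zip(record[0], features))
    return min(train, key=sq_dist)[1]

# Hypothetical records: (feature vector, class label)
records = [((1.0, 1.2), "A"), ((0.9, 1.0), "A"), ((1.1, 0.8), "A"), ((1.2, 0.9), "A"),
           ((3.0, 3.2), "B"), ((2.9, 3.1), "B"), ((3.2, 2.8), "B"), ((2.8, 3.0), "B")]
train, test = train_test_split(records)
correct = sum(1 for x, label in test if predict_1nn(train, x) == label)
print(f"test accuracy = {correct / len(test):.2f}")   # 1.00 on this toy data
```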
1.3.3 Regression

In statistics and machine learning, regression analysis [8] [14] [15] is a procedure for assessing the relationships between variables. It comprises several methods for modeling and examining multiple variables, where the goal is to analyze the association between a dependent variable and one or more independent variables. In particular, regression analysis can identify which of the independent variables are closely associated with the dependent variable and can characterize the forms of these relationships. It is used for numeric prediction and forecasting, attempting to find a function that represents the data with the least possible error. Regression is a predictive data mining task.
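A minimal numeric-prediction sketch, assuming the simplest case of one independent variable: the ordinary least-squares fit of a straight line. The data values are hypothetical:

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical numeric prediction: advertising spend vs. sales
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
a, b = fit_simple_regression(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")   # roughly y = 0.11 + 1.97x
```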
1.3.4 Clustering

Unlike classification, clustering [8] [9] [16] examines data objects without reference to a known class label. Class labels are absent from the training data because they are not known in advance; cluster analysis is used to create such labels. Clustering is therefore a form of unsupervised learning and can serve as a preprocessing step for classification. Clustering groups objects according to the rule of maximizing intra-class similarity and minimizing inter-class similarity: clusters are formed so that objects within the same cluster are highly similar to one another but very different from objects in other clusters. Each generated cluster is a group of entities or objects and can lead to the formulation of new rules. Clustering is a descriptive data mining task. The comparison between clustering and classification is presented in Table 1.1, and a minimal k-means sketch follows the table.

Table 1.1: Comparison of Clustering and Classification

Clustering | Classification
(1) It is a descriptive data mining technique. | It is a predictive data mining technique.
(2) It is an unsupervised learning technique. | It is a supervised learning technique.
(3) The class label attribute is not present. | The class label attribute is present.
(4) Examples of clustering techniques are K-Means, K-Medoids, etc. | Examples of classifiers are Multi-layer Perceptron, Support Vector Machine, etc.
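The following is a minimal sketch of k-means, one of the clustering techniques named in Table 1.1. The points are hypothetical, and the bare-bones implementation (fixed iteration count, no convergence test) is for illustration only:

```python
import random

def kmeans(points, k, iterations=20, seed=1):
    """Plain k-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: attach each point to its nearest centroid
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            # update step: move each centroid to the mean of its cluster
            if cluster:
                centroids[i] = tuple(sum(vals) / len(cluster) for vals in zip(*cluster))
    return centroids, clusters

# Unlabelled toy points forming two apparent groups (hypothetical values)
points = [(1.0, 1.1), (0.8, 1.0), (1.2, 0.9), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```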
1.3.5 Summarization

Data often relate to classes or concepts, and it can be beneficial to describe particular classes or concepts in summarized, brief, and yet precise terms. Such a description of a class or a concept is called a class description or concept description. Data summarization [8] [9] condenses the general features of the data in a database. The data conforming to a user-specified class are usually collected by executing queries on the database; for instance, to investigate the characteristics of electronic products whose sales increased by 20% in the year 2015, the data associated with such products could be accumulated by executing an SQL query. Summarization is a descriptive data mining task.
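A sketch of the kind of summarization query mentioned above, using an in-memory SQLite table; the schema, product names, and figures are entirely hypothetical:

```python
import sqlite3

# Hypothetical schema: yearly sales totals per product (not the thesis data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, year INTEGER, total REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("laptop", 2014, 100.0), ("laptop", 2015, 130.0),
    ("camera", 2014, 200.0), ("camera", 2015, 210.0),
])

# Products whose 2015 sales exceed their 2014 sales by at least 20%
query = """
    SELECT a.product, b.total / a.total - 1.0 AS growth
    FROM sales AS a JOIN sales AS b ON a.product = b.product
    WHERE a.year = 2014 AND b.year = 2015 AND b.total >= 1.2 * a.total
"""
print(conn.execute(query).fetchall())   # [('laptop', 0.3...)]
```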
1.3.6 Outlier Detection

A database may contain items or data objects that do not conform to the general behaviour of the data; these are considered outliers or exceptions, and their analysis is called outlier detection or outlier analysis [7] [15]. In some applications, such as fraud detection, outlier analysis can expose deceitful use of credit cards, for example by identifying purchases of unusually large amounts for a given account compared with the usual transaction range of the account holder. Statistical tests based on probability distribution models and different forms of clustering can identify noise or outliers efficiently. Outlier detection is a predictive data mining task.
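A minimal statistical sketch of the idea, flagging transaction amounts that deviate from the mean by more than a chosen number of standard deviations; the amounts and the threshold are hypothetical:

```python
import statistics

def flag_outliers(amounts, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(amounts)
    sd = statistics.pstdev(amounts)
    return [a for a in amounts if abs(a - mean) > threshold * sd]

# Hypothetical card transactions: one purchase far beyond the usual range
amounts = [42.0, 38.5, 51.0, 47.2, 39.9, 44.1, 40.3, 2500.0]
print(flag_outliers(amounts))   # [2500.0]
```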
1.4 APPLICATIONS OF DATA MINING

Data mining applies to any form of large-scale data or information processing, such as data collection, data extraction, data warehousing, and statistical analysis. It may also involve computer-based decision support systems, encompassing artificial intelligence, statistics, machine learning, database management, and business intelligence [6] [7] [8] [9]. The field has a wide range of application areas, such as:

Business: Business groups integrate their organizational databases into larger centralized data warehouses to ease the analysis of data by decision makers. Apart from that, other notable uses of data mining in this area are risk assessment,
customer relationship management, price evaluation of specific products, modeling and forecasting of credit fraud, and stock price prediction.

Scientific and technical research: Data mining is applied in areas such as electrical engineering, medicine, bioinformatics, genetics, software engineering, and education.

Spatial data mining: This is the application of data mining techniques to spatial data, seeking patterns with respect to geography. Combining data mining with Geographic Information Systems (GIS) can support certain forms of decision-making.

Temporal data mining: Data may consist of attributes created and recorded at different times; in this case, discovering meaningful associations may have to respect the temporal order of the attributes.

Sensor data mining: Wireless sensor networks (WSNs) are used to simplify the collection of data for spatial data mining in applications such as air pollution monitoring.
1.5 PREVIOUS RESEARCH WORK

There has been splendid advancement in the area of data mining during the last few decades; it is owing to exceptional research that the field has risen to great heights so rapidly. Some of the notable earlier research works in this field are discussed in this section.

1.5.1 Study of the Previous Research Work on Knowledge Mining from Data

Research on knowledge discovery from data, or data mining, has been published in numerous books, reports, conferences, and journals on statistics, machine learning, databases, and data visualization. This subsection presents a number of research works in this domain. Many scientists, researchers, analysts, and knowledge workers consider the studies [1] [2] to be the pioneering works in the field of data mining. The collection edited by Piatetsky-Shapiro and Frawley [3] contains some of the best-known articles on knowledge discovery from data. Another collection of texts
edited by Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy [4] gathers advanced research results on knowledge discovery and data mining. The studies [5] [6] [7] [8] also made significant contributions to this area that are worth mentioning.
1.5.2 Study of the Previous Research Work on Association Rule Mining

Research on association rule mining [10] [11] [12] has received a great deal of attention in recent decades. The first association rule mining algorithm, Apriori, not only created waves in the data mining research community but also inspired other data mining fields. The Apriori algorithm depends mainly on the apriori property, which states that every subset of a frequent itemset is itself a frequent itemset. Other association rule mining algorithms, such as Incremental, Pincer-Search, Partition, the Border algorithm, and the Frequent-Pattern tree (FP-tree), can also compute all possible frequent itemsets from large databases efficiently.

R. Agrawal, T. Imielinski, and A. Swami [10] first proposed the association rule mining concept for finding frequent itemsets in large databases. R. Agrawal and R. Srikant [11] [12] proposed the apriori property in the Apriori algorithm; numerous later algorithms in the domain have used this property. R. Agrawal and R. Srikant [12] also described a method for generating association rules from frequent itemsets. J. S. Park et al. [17] used hash tables to improve the efficiency of association rule mining. Transaction reduction techniques were described by R. Agrawal and R. Srikant [11], J. S. Park et al. [17], and J. Han and Y. Fu [18]. A. Savasere, E. Omiecinski, and S. Navathe [19] proposed the partitioning technique, and H. Toivonen [20] discussed the sampling approach to association rule mining. S. Brin et al. [21] explained the dynamic itemset counting method. D. W. Cheung, J. Han, V. Ng, and C. Y. Wong [22] proposed an efficient incremental updating scheme for mined association rules. J. S. Park et al. [23], R. Agrawal and J. C. Shafer [24], and D. W. Cheung et al. [25] worked on parallel and distributed versions of the Apriori algorithm.

Data mining researchers have also recommended scalable frequent itemset mining methods as alternatives to the Apriori-based approach. In this regard, J. Han, J. Pei, and Y. Yin [26] introduced an FP-tree approach for mining frequent itemsets
without candidate set generation. Researchers refer to this approach as the Frequent Pattern-Growth (FP-Growth) algorithm, which is indeed a powerful technique for finding frequent itemsets in large databases.

1.5.3 Study of the Previous Research Work on Weather Prediction

Weather prediction using classification is a significant data mining task: classification models describing different weather conditions are built from vast weather databases. This subsection presents related work on classification models for weather prediction.

Y. Radhika and M. Shashi [27] described a Support Vector Machine (SVM) based approach to weather prediction. They used time series data of the daily maximum temperature at a given location to forecast the maximum temperature of the next day at the same location, with the daily maximum temperatures of the previous n days serving as the input (n being the order of the model). They observed the performance of the forecasting model for periods of two to ten days with optimal values of the kernel function.

M. Hayati et al. [28] used a Multi-layer Perceptron (MLP) based technique on ten years of meteorological data collected from 1996 to 2006. The results confirmed that the MLP network had the least prediction error and was a worthy technique for short-term temperature forecasting systems.

B. A. Smith et al. [29] developed ANN-based models with a reduced average forecast error. For this study, they increased the number of distinct trials used in training, added input terms designating the date of a trial, increased the duration of prior weather data included in each test, and varied the number of neurons in the hidden layer of the network. They constructed models to predict the air temperature at hourly intervals from one to twelve hours ahead, training thirty neural networks, each with its own architecture and set of related parameters, and estimated the performance of the whole system by computing the mean absolute error of the resulting networks over a set of input patterns.

S. Chattopadhyay [30] proposed an ANN-based model to predict the average summer-monsoon rainfall over India. The work used factors such as monthly
summer-monsoon rainfall totals, tropical rainfall indices, and sea-surface temperature anomalies to develop the ANN model. The proposed model was compared with persistence forecasts and Multiple Linear Regression forecasts, and the superiority of the ANN over the other processes was established.

S. S. Baboo and I. K. Shereef [31] presented a Backpropagation Neural Network (BPNN) based approach to weather prediction using a real-time dataset. They compared their model with the practical workings of the meteorological department; the simulation results confirmed that the BPNN-based forecast improved not only on guidance predictions from numerical models but also on official local weather service predictions.

S. Kotsiantis et al. [32] investigated the efficiency of data mining techniques in estimating minimum, maximum, and average temperature values. They conducted many trials with well-known regression algorithms using temperature databases collected from the city of Patras in Greece, evaluating performance with several standard statistical metrics, such as the correlation coefficient and the root-mean-squared error.

1.5.4 Study of the Previous Research Work on Breast Cancer Detection

Data mining is applied in medical science and bioinformatics research for the diagnosis of breast cancer, and breast cancer detection via data mining is an area that has earned a great deal of attention in recent decades. This subsection presents the related work done in this field by other researchers.

The collection [33] edited by W. J. Clancey and E. H. Shortliffe introduced rule-based expert systems in the early days of computer-assisted clinical diagnosis. The work motivated a broad range of activity in rule-based expert systems and knowledge representation, and furthered the subsequent development of computing in medicine and bioinformatics.

O. Anunciacao et al. [34] presented a decision tree approach to breast cancer prediction for a group of people with a history of tobacco and alcohol consumption, arguing that such complex diseases are governed by multiple genetic variations.
W.-Y. Cheng et al. [35] devised a robust model to predict breast cancer survival using a training data set that included gene expression data from breast cancer patients. They first identified signatures called attractor metagenes in an analysis of multiple tumour types, and then tested these signatures as contributors within the Sage Bionetworks computing framework to demonstrate the effectiveness of their proposed method.

R. R. Janghel et al. [36] used different ANN-based models to detect and analyze breast cancer. They implemented four artificial neural network models, namely the Multilayer Perceptron (MLP), Radial Basis Function Network (RBFN), Learning Vector Quantization (LVQ), and Competitive Learning Network (CLN). The simulation results indicated that LVQ performed best on the test data set, followed in order by CLN, MLP, and RBFN.

A. Keleş and A. Keleş [37] analyzed procedures for extracting strong fuzzy rules using NEFCLASS, a well-known neuro-fuzzy software tool. The rule base used for diagnosis consisted of 9 rules; its positive predictive value was 75% and its negative predictive value was 93%.

D. Nauck et al. [38] combined a neural network and a fuzzy rule-based system in their proposed design, showing that such a combined approach can enhance the performance of data analysis and decision-making systems. The research works [39] [40] [41] also contributed immensely to breast cancer diagnosis using various other data modeling techniques and are worth mentioning.

1.5.5 Study of the Previous Research Work on Soil Classification from Large Imagery Soil Databases

Data mining techniques are typically used to analyze large imagery databases and to establish useful classifications and patterns for building geographical information system (GIS) based frameworks for industrial, scientific, and commercial purposes. Classification, as an important data mining technique, is typically used to perform soil surveys for such frameworks; soil classification, as a form of soil survey analysis, can thus mine large imagery databases and produce worthwhile knowledge for GIS-based frameworks. This subsection discusses some of the notable research works on soil classification from large imagery databases.
J. A. Shine and D. B. Carr [42] made significant contributions to this field. They examined the statistical relationships between land covers and spatial compression techniques for mapping and categorizing high-resolution digital satellite imagery, conducting their research on multispectral imagery of New York. As the data set was massive, they used lossless compression techniques to reduce the computational complexity. The study contributed significantly to the soil classification domain and extended previous work by J. A. Shine himself [43] [44]. Later, they developed useful techniques for classifying satellite imagery for soil mapping [45], applying different classification methods to the Statlog Landsat imaging database of Australia and a multispectral database of south-central Virginia. The classifiers were the classification tree (CT), artificial neural network (ANN), discriminant analysis (DA), k-nearest neighbour (k-NN), and support vector machine (SVM); k-NN performed best, followed in precision by CT, ANN, SVM, and DA.

O. Rozenstein and A. Karnieli [46] created a low-cost geographical information system (GIS) model. They combined native GIS data with satellite remote sensing data to generate a moderately accurate land map of the northern Negev region in Israel, and additional land-use data were used to improve the remote sensing classification accuracy inside the GIS framework. The work integrated unsupervised and supervised learning methods and achieved a classification accuracy of 81%.

D. Lu and Q. Weng [47] considered the classification of large-scale imagery data sets from the standpoint of the diverse image classification methods in use. Their work showed that efficient utilization of multiple characteristics of a large imaging data set improves classification accuracy significantly.

The research work [48] studied the primary remotely sensed image classifiers, including object-based, pixel-wise, and sub-pixel-wise classifiers, and emphasized the significance of including spatial and contextual information in remote sensing image classification. Furthermore, the study grouped spatial and contextual analysis techniques into three broad groups: i) texture mining,
ii) Markov random field modeling, and iii) image segmentation and object-based image analysis. The work also explained the need for developing and analyzing geographical information models for spatial and contextual classification through two specific case studies. The research works [49] [50] also contributed enormously to soil classification using various other data mining techniques.

1.5.6 Study of the Previous Research Work on Tennis Match Result Prediction

Tennis is a very popular sport worldwide. With the growth of technology, predictions are widely used in tennis, especially by coaching staffs, news agencies, and spectators; tennis prediction models are developed to evaluate the chance of a player winning a match. Many researchers have worked on forecasting the outcomes of tennis matches from past statistical data records.

Tennis match prediction using an independent and identically distributed Markov chain and a revised Markov chain is a significant work in this field, proposed by T. Barnett, A. Brown, and S. R. Clarke [51]. The proposed model treated each set as independent and identically distributed and set up a Markov chain for any two players A and B with a current score of (a, b), where a >= 0 and b >= 0; from this initial score the procedure computed the probability of either A or B winning. Since a Markov chain is memoryless, the next state depends only on the current state (a small sketch of this score-to-score recursion appears at the end of this subsection).

Time series analysis is also helpful in extracting statistical features that can be used for prediction. A. Somboonphokkaphan, S. Phimoltares, and C. Lursinsap [52] used the time series history of the data to extract useful patterns and predict results: they derived attributes from the time series history and used an MLP-based method to predict the winner of a tennis match, with data warehousing employed to prepare the data for classification. Because environmental conditions also affect a match, they used both environmental and statistical data to estimate the match result.

Using players' characteristics is another useful route to predicting match outcomes. The study by J. D. Corral and J. Prieto-Rodríguez [53] demonstrated the use of a player's physical and mental attributes to calculate the chance of winning using a Probit model. The model is a type of
regression in which the dependent variable can take only one of two values (either yes or no). It is also possible to predict match results solely from players' characteristics as input parameters, as shown by A. Panjan, N. Šarabon, and A. Filipčič [54], whose dataset contains various physical attributes of the players, such as body weight; these parameters are fed to classification and machine learning procedures to produce the result. Spatio-temporal data are useful both for predicting shots in tennis and for forecasting match results, as attempted by X. Wei, P. Lucey, S. Morgan, and S. Sridharan [55]. Outcome prediction has also been attempted for other sports: the soccer prediction algorithm designed by A. S. Timmaraju, A. Palnitkar, and V. Khanna [56] used KPP (k-past performance) to estimate football match results, and D. Buursma [57] used past statistical records to predict the results of various other games while keeping the game constraints in mind.
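The score-to-score recursion behind the Markov chain model of [51] can be sketched in a few lines. The following Python fragment assumes a simplified "first to six" set in which player A wins each game independently with probability p; real tennis scoring (deuce, tie-breaks, advantage sets) is more involved, so this illustrates only the memoryless recursion:

```python
from functools import lru_cache

P_GAME = 0.55   # hypothetical probability that player A wins any single game

@lru_cache(maxsize=None)
def win_prob(a, b, target=6, p=P_GAME):
    """Probability that A wins the set from the current score (a, b)."""
    if a == target:
        return 1.0
    if b == target:
        return 0.0
    # memoryless transition: the next state depends only on the current score
    return p * win_prob(a + 1, b, target, p) + (1 - p) * win_prob(a, b + 1, target, p)

print(f"P(A wins from 0-0) = {win_prob(0, 0):.3f}")
print(f"P(A wins from 2-4) = {win_prob(2, 4):.3f}")
```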
1.6 THESIS MOTIVATION

Nowadays data mining is considered a very promising field in the information industry and in research organizations. With recent growth in different aspects of information technology, it has become an increasingly significant tool for transforming raw data into meaningful knowledge. As real-world databases have grown in size and complexity, data mining has increasingly been enhanced with automated data processing drawn from artificial intelligence, machine learning, and statistics. In spite of all this, there may be undiscovered patterns that remain untouched by the older, existing techniques, so a gap still remains between data and information. Data mining researchers address this issue by devising better methods, and in this way the evolution continues.

The data mining research domain tries to minimize the gap between data and information by designing new algorithms, techniques, and tools. Once employed, these methods can be used to discover new information after analysis, and researchers may transform the gathered information into significant knowledge indicating new future trends. The goal is to devise new learning and
knowledge discovery algorithms that run more proficiently, thereby permitting their use on larger databases. Furthermore, novel data mining techniques perform innovative data analysis and may reveal valuable information, adding immensely to business policies, knowledge bases, and scientific research. Therefore, given the present scenario and the forecast expansion of information technology, there is a continuing need for new data mining techniques.
1.7 OBJECTIVE OF THE THESIS

The primary aim of the thesis is to design data mining techniques needed for dealing with many real-world problems, in diverse fields such as market basket analysis, weather prediction, cancer diagnosis, soil categorization, remote sensing, and so on. In this thesis, complete designs for new data mining techniques are proposed. The research study also provides comparative discussions of some of the major data mining techniques, and several error estimation methods are applied to measure the correctness of the proposed design methodologies in each of their applications. The study further considers ways to improve the effectiveness of different data mining algorithms. Hence, the fundamental objective of this thesis is to investigate the various applications of data mining techniques for discovering significant knowledge from large real-world databases. Alongside this prime objective, the dissertation has the following sub-objectives:
To analyze, extract, convert, and load data from various databases, formats, and data sources.
To use empirical data analysis techniques to recognize significant associations, patterns, or trends from complex data sets.
To make comparisons between different techniques to determine whether the data mining field can emerge as a means for researchers or analysts to implement knowledge bases.
To represent an outline of several existing data mining algorithms on large databases and find out the advantages and disadvantages of each of such algorithms. 19
To propose new data mining techniques employed to facilitate knowledge discovery from data.
To discuss the combination of data mining with other soft computing based techniques like fuzzy logic, genetic algorithm etc.
To analyze whether any modification to the existing preprocessing techniques is needed prior to data mining.
To verify the validity of any new technique being used.
To categorize untagged data or forecast about the future with applied statistics and machine learning algorithms.
To communicate data analysis and findings well through effective data visualizations.
1.8 ORGANIZATION OF THE THESIS
The thesis is compiled into seven chapters. The chapter-wise descriptions are
mentioned below.
Chapter one contains the basic introduction to data mining. It describes the knowledge discovery process in databases along with different types of data mining tasks. This chapter also discusses the applications of data mining, and several published research articles are reviewed.
Chapter two presents a comparative study between Association Rule Mining (ARM) based and Genetic Algorithm (GA) based approaches to data mining. Apriori is a well-known ARM-based algorithm for frequent itemset mining and association rule learning over transactional databases. The GA-based approach can be a better alternative for determining the frequent patterns from vast databases: GA performs a global search, and its computational complexity is smaller than that of the Apriori algorithm, which employs a greedy level-wise technique. The prime aim of this study is to design a GA-based approach and then compare its performance with the Apriori algorithm. Of the two algorithms, GA turns out to be the better.
Chapter three deals with a novel weather prediction approach based on data mining. The research study proposes an artificial neural network (ANN) based
approach for weather prediction. The approach uses a Multilayer Perceptron (MLP) for initial modeling; the MLP model is then coupled with a Hopfield network. The performance of the proposed technique is evaluated using a five-year weather data set containing 25,000 records. The database consists of attributes like temperature, wind speed, humidity, precipitation, and air pressure. The research work mainly focuses on predictive data mining, by which one can perform weather prediction accurately from vast amounts of meteorological data. The research work compares the performance of the proposed method with classification algorithms like Decision Tree (DT) and Support Vector Machine (SVM). The performance evaluation measures are root-mean-square error (RMSE), Accuracy, False Positive Rate (FP-Rate), True Positive Rate (TP-Rate), Kappa statistic, Recall, Precision, and F-Measure. The results show that the proposed neural network based approach is the best of the three.
Chapter four proposes a neuro-fuzzy system (NFS) for breast cancer disease analysis. The NFS classifier uses a sigmoidal membership function for fuzzification. Breast cancer is a severe disease found among women all over the world, and its diagnosis is a significant area of data mining research. Classification, as an essential data mining process, also helps in the clinical diagnosis and analysis of this disease. The work presents a comparative study of Multilayer Perceptron (MLP) and Support Vector Machine (SVM) classification algorithms with the proposed NFS. The study applies the NFS, MLP, and SVM classifiers to three benchmark UCI data sets for detecting breast cancer disease. The evaluation metrics are RMSE, Accuracy, FP-Rate, TP-Rate, Kappa statistic, Recall, Precision, and F-Measure. All these measures support the superiority of the proposed NFS classifier.
Chapter five develops a neuro-fuzzy classifier for soil data mining. The classifier uses a Gaussian membership function for fuzzification. The research study offers a performance comparison of this method with other classifiers like Radial Basis Function Network (RBFN), Adaptive Neuro-fuzzy Inference System (ANFIS), k-Nearest Neighbour (k-NN), and Support Vector Machine (SVM). Each of these classifiers is applied to three imagery soil databases to identify the soil classes. As usual, the evaluation measures are RMSE, Accuracy, FP-Rate, TP-Rate, Kappa statistic,
Recall, Precision, and F-Measure. Among the five classification techniques studied, the proposed neuro-fuzzy based method outperformed the others.
Chapter six proposes an Adaptive Neuro-fuzzy Inference System (ANFIS) model for tennis match result prediction. The work presents a performance comparison of ANFIS with two other powerful classifiers, namely the Radial Basis Function Network (RBFN) and the Support Vector Machine (SVM). The research study aims to predict the results of tennis matches using eight UCI databases of grand slam tennis tournaments in 2013 and evaluates the classification performance using various measures such as RMSE, Accuracy, FP-Rate, TP-Rate, Kappa statistic, Recall, Precision, and F-Measure. All these performance measures confirm the supremacy of the ANFIS classifier over the other classifiers.
In the seventh and final chapter of this thesis, the conclusion and the future scope of the overall research work are drawn. The future work may be extended in two directions. Firstly, it may introduce an adaptive method to construct a neuro-fuzzy rule-based classification system for network anomaly classification problems. The proposed method might select the significant rules by pruning unnecessary ones through an error-correction-based learning procedure, and thus may improve the reliability of TCP/IP networks. Secondly, the future work intends to combine data mining with a cloud computing based framework. Data mining is typically used to extract potentially useful information from data warehouses. Cloud computing can provide a robust, scalable and dynamic infrastructure into which one can incorporate the methods and techniques of data mining to reduce the expenses of storage and infrastructure.
1.9 REFERENCES
[1] R. Agrawal, T. Imielinski and A. Swami, "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914-925, December 1993.
[2] M. S. Chen, J. Han and P. S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866-883, December 1996.
[3] G. Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery in Databases, AAAI/MIT Press, 1991.
[4] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[5] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, Second Edition, 2009.
[6] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2002.
[7] I. H. Witten, F. Eibe, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, Third Edition, 2011.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, Second Edition, 2006.
[9] A. K. Pujari, Data Mining Techniques, Universities Press (India) Private Limited, First Edition, 2001.
[10] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," In Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD'93), pp. 207-216, 1993.
[11] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), edited by J. B. Bocca, M. Jarke, and C. Zaniolo, Morgan Kaufmann, pp. 487-499, 1994.
[12] R. Agrawal and R. Srikant, "Fast algorithm for mining association rules in large databases," Research Report RJ 9839, IBM Almaden Research Center, San Jose, CA, June 1994.
[13] E. Alpaydin, Introduction to Machine Learning, Second Edition, MIT Press, 2010.
[14] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Belmont, Wadsworth, 1984.
[15] P. J. Rousseeuw and M. Hubert, Recent Development in PROGRESS, Computational Statistics and Data Analysis, Third Edition, 1987.
[16] K. Bailey, Numerical Taxonomy and Cluster Analysis, Typologies and Taxonomies, pp. 34, 1994.
[17] J. S. Park, M. S. Chen, and P. S. Yu, "An effective hash-based algorithm for mining association rules," In Proceedings of the ACM-SIGMOD International Conference on Management of Data (SIGMOD'95), pp. 175-186, May 1995.
[18] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," In Proceedings of the International Conference on Very Large Data Bases (VLDB'95), pp. 420-431, September 1995.
[19] A. Savasere, E. Omiecinski, and S. Navathe, "An efficient algorithm for mining association rules in large databases," In Proceedings of the International Conference on Very Large Data Bases (VLDB'95), pp. 432-443, September 1995.
[20] H. Toivonen, "Sampling large databases for association rules," In Proceedings of the International Conference on Very Large Data Bases (VLDB'96), pp. 134-145, September 1996.
[21] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket analysis," In Proceedings of the ACM-SIGMOD International Conference on Management of Data (SIGMOD'97), pp. 255-264, May 1997.
[22] D. W. Cheung, J. Han, V. Ng, and C. Y. Wong, "Maintenance of discovered association rules in large databases: An incremental updating technique," In Proceedings of the International Conference on Data Engineering (ICDE'96), pp. 106-114, February 1996.
[23] J. S. Park, M. S. Chen, and P. S. Yu, "Efficient parallel mining for association rules," In Proceedings of the 4th International Conference on Information and Knowledge Management, pp. 31-36, November 1995.
[24] R. Agrawal and J. C. Shafer, "Parallel mining of association rules: Design, implementation, and experience," IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 962-969, 1996.
[25] D. W. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu, "A fast distributed algorithm for mining association rules," In Proceedings of the International Conference on Parallel and Distributed Information Systems, pp. 31-44, December 1996.
[26] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," In Proceedings of the International Conference on Management of Data (SIGMOD'00), pp. 1-12, May 2000.
[27] Y. Radhika and M. Shashi, "Atmospheric Temperature Prediction using Support Vector Machines," International Journal of Computer Theory and Engineering, vol. 1, no. 1, pp. 55-58, 2009.
[28] M. Hayati and Z. Mohebi, "Application of Artificial Neural Networks for Temperature Forecasting," World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Electronics and Communication Engineering, vol. 1, no. 4, pp. 654-658, 2007.
[29] B. A. Smith, R. W. McClendon, and G. Hoogenboom, "Improving Air Temperature Prediction with Artificial Neural Networks," International Journal of Computational Intelligence, vol. 3, no. 3, pp. 179-186, 2007.
[30] S. Chattopadhyay, "Feed forward Artificial Neural Network model to predict the average summer-monsoon rainfall in India," Acta Geophysica, vol. 55, no. 3, pp. 369-382, September 2007.
[31] S. S. Baboo and I. K. Shereef, "An Efficient Weather Forecasting System using Artificial Neural Network," International Journal of Environmental Science and Development, vol. 1, no. 4, pp. 321-326, October 2010.
[32] S. Kotsiantis, A. Kostoulas, S. Lykoudis, A. Argiriou, and K. Menagias, "Using data mining techniques for estimating minimum, maximum and average daily temperature values," International Journal of Mathematical, Physical and Engineering Sciences, vol. 1, no. 1, pp. 16-20, 2008.
[33] W. J. Clancey and E. H. Shortliffe (eds.), Readings in Medical Artificial Intelligence: The First Decade, Reading, Mass., Addison-Wesley, pp. 339-360, 1984.
[34] O. Anunciacao, B. C. Gomes, S. Vinga, J. Gaspar, A. L. Oliveira, and J. Rueff, "A Data Mining Approach for the Detection of High-Risk Breast Cancer Groups," Springer-Verlag, Berlin Heidelberg, Advances in Intelligent and Soft Computing, vol. 74, pp. 43-51, 2010.
[35] W.-Y. Cheng, T.-H. O. Yang and D. Anastassiou, "Development of a Prognostic Model for Breast Cancer Survival in an Open Challenge Environment," Sci. Transl. Med., vol. 5, no. 181, pp. 1-12, April 2013.
[36] R. R. Janghel, A. Shukla, R. Tiwari, and R. Kala, "Breast cancer diagnosis using Artificial Neural Network model," In Proceedings of the 3rd IEEE International Conference on Information Sciences and Interaction Sciences, pp. 89-94, 2010.
[37] A. Keleş and A. Keleş, "Extracting fuzzy rules for diagnosis of breast cancer," The Turkish Journal of Electrical Engineering & Computer Science, vol. 21, no. 1, pp. 1495-1503, August 2013.
[38] D. Nauck, F. Klawonn, and R. Kruse, Foundations of Neuro-fuzzy Systems, Wiley, Chichester, pp. 33-171, 1997.
[39] D. Venet, J. E. Dumont, and V. Detours, "Most random gene expression signatures are significantly associated with breast cancer outcome," PLoS Computational Biology, vol. 7, no. 10, pp. 1-8, October 2011.
[40] D. Hanahan and R. A. Weinberg, "Hallmarks of cancer: The next generation," Cell, Elsevier, vol. 144, no. 5, pp. 646-674, 2011.
[41] S. Ghosh, S. Mondal, and B. Ghosh, "A Comparative Study of Breast Cancer Detection Based on SVM and MLP BPN Classifier," In Proceedings of the 1st International IEEE Conference ACES, pp. 1-4, 2014.
[42] J. A. Shine and D. B. Carr, "Relationships between land cover and spatial statistical compression capabilities in high-resolution imagery," In Proceedings of the 34th International Conference on Interface Symposium, April 2002.
[43] J. A. Shine, "Mapping and modeling 1-meter multispectral imagery data," In Proceedings of the American Statistical Association, Alexandria, VA: American Statistical Association, 2000.
[44] J. A. Shine, "Compression and analysis of very large imagery data sets using spatial statistics," In Proceedings of the 33rd International Conference on Interface Symposium, June 2001.
[45] J. A. Shine and D. B. Carr, "A comparison of classification methods for large imagery data sets," Joint Statistical Meetings 2002, Statistics in an Era of Technological Change - Statistical Computing Section, New York City, pp. 3205-3207, August 2002.
[46] O. Rozenstein and A. Karnieli, "A comparison of methods for land-use classification incorporating remote sensing and GIS inputs," EARSeL eProceedings, vol. 10, no. 1, pp. 27-45, 2011.
[47] D. Lu and Q. Weng, "A survey of image classification methods and techniques for improving classification performance," International Journal of Remote Sensing, vol. 28, no. 5, pp. 823-870, March 2007.
[48] M. Li, S. Zang, B. Zhang, S. Li, and C. Wu, "A Review of Remote Sensing Image Classification Techniques: the Role of Spatio-contextual Information," vol. 47, pp. 389-411, 2014.
[49] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, "Soil Data Mining Using Decision Tree Classifier," Computer Science & Engineering Research Journal (CSERJ), vol. 8, pp. 27-31, 2012-13.
[50] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, "A Tutorial on Different Classification Techniques for Remotely Sensed Imagery Datasets," Smart Computing Review Journal, vol. 4, no. 1, pp. 34-43, February 2014.
[51] T. Barnett, A. Brown, and S. R. Clarke, "Developing a tennis model that reflects outcomes of tennis matches," In Proceedings of the 8th Australasian Conference on Mathematics and Computers in Sport, pp. 178-188, 2006.
[52] A. Somboonphokkaphan, S. Phimoltares, and C. Lursinsap, "Tennis Winner Prediction based on Time-Series History with Neural Modeling," In Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS 2009), March 2009.
[53] J. D. Corral and J. Prieto-Rodríguez, "Are differences in ranks good predictors for Grand Slam tennis matches?," International Journal of Forecasting, vol. 26, no. 1, pp. 551-563, 2010.
[54] A. Panjan, N. Šarabon, and A. Filipčič, "Prediction of the Successfulness of Tennis Players with Machine Learning Methods," Kinesiology, vol. 42, no. 1, pp. 98-106, 2010.
[55] X. Wei, P. Lucey, S. Morgan, and S. Sridharan, "'Sweet-Spot': Using Spatiotemporal Data to Discover and Predict Shots in Tennis," In Proceedings of the 7th Annual MIT Sloan Sports Analytics Conference 2013, Boston Convention and Research Center, March 2013.
[56] A. S. Timmaraju, A. Palnitkar, and V. Khanna, "Game ON! Predicting English Premier League Match Outcomes," CS 229 Machine Learning Final Projects, Stanford University, 2013.
[57] D. Buursma, "Predicting sports events from past results: Towards effective betting on football matches," In Proceedings of the 14th Twente Student Conference on IT, Enschede, University of Twente, January 2011.
CHAPTER 2
Determination of Association Rules Using Genetic Algorithm

2.1 INTRODUCTION
Different organizations such as business houses, banking sectors, financial institutions, and social and health services usually gather voluminous data routinely in the course of their daily operations. Such data are extremely useful for administration of the customer base. Usually, the organizational data sets are enormous, continuously growing, and consist of many complex features. While these data sets reveal attributes of the managed subjects and relations, and are thus potentially of precise use to their owners, they often have low information density. The information industry therefore needs robust, simple, and computationally efficient software tools to extract information from such data. The evolution of such tools is an essential area of data mining research. These tools mostly rely on concepts derived from computer science, machine learning, mathematics, and statistics. Mining useful information and knowledge from these large databases has thus evolved into a significant research domain [1] [2]. Data mining has drawn considerable attention in the information industries and society as a whole in recent years, owing to the widespread availability of massive quantities of data and the imminent need for transforming such data into worthwhile information and knowledge. Many practical fields like market surveys, fraud identification, customer retention, production control and scientific investigation can use this gathered knowledge efficiently. Frequent pattern mining, or association rule mining, is a central theme of data mining research. Frequent patterns are patterns (such as itemsets, sub-sequences, or sub-structures) that appear in a data set regularly. For instance, a set of objects, like bread, butter and milk, that repeatedly occur together in a transaction dataset is a frequent itemset. A sub-sequence, such as purchasing first a personal computer, then operating
system software, and then antivirus software, if it occurs repetitively in a transactional database, is a frequent sequential pattern. A sub-structure can refer to different structural forms, such as sub-graphs, sub-trees, or sub-lattices, which may be combined with itemsets or sub-sequences. If a sub-structure occurs frequently, it is a frequent structured pattern. Discovering such frequent items has a vital role to play in mining associations, correlations, and several other fascinating relationships among data. Furthermore, it aids in data classification, clustering, and other data mining tasks as well. Association rule mining [3] [4] [5] is the procedure of determining fascinating and unforeseen rules from massive data sets. The field denotes a very general model that seeks the relations between items of a database. An association rule is basically an if-then rule supported by the data. Initially, association rule mining algorithms were used for solving the market-basket problem [3]. The problem was as follows: given a set of objects and a large number of transaction records, the goal was to find associations between the objects confined to several transactions. A typical association rule resulting from such a study could be "95 percent of all customers who purchase bread and butter also buy milk", which reveals very significant information. Therefore, this kind of analysis can provide new insights into consumer buying habits that can be beneficial for business. Data mining research in the association rule mining field [6] [7] has received a lot of attention in recent decades. The primary association rule mining algorithm, named Apriori [4], not only created waves in the data mining research community, but also inspired other data mining fields as well. The Apriori algorithm and all its variations, such as the Hash-based technique [8], Pincer-search [9], Partitioning [10], Sampling [11], Dynamic itemset counting [12], Incremental [13], Border [14] etc., are used to compute all possible frequent itemsets from given transactional records. But the time complexities of these algorithms are quite high. Here, an attempt has been made to compute frequent itemsets by applying a genetic algorithm so that the overall time complexity gets reduced. The present work is basically an extension of the previous research work [15]. The rest of this chapter is organized as follows. In Section 2.2 a summary of the existing association rule mining procedures is provided. Section 2.3 gives a brief
introduction to genetic algorithm (GA). Section 2.4 describes the performance evaluation measures used to compare Apriori and GA. Section 2.5 covers the details of the proposed work. Performance analysis of the simulation results is presented in Section 2.6. Finally, Section 2.7 includes the concluding remarks.
2.2 ASSOCIATION RULE MINING
Association rule mining (ARM) is a descriptive data mining task that tries to uncover all the hidden relationships among the itemsets in transaction databases [6] [7]. The primary goal of this procedure is to discover the set of all items or attributes that repeatedly occur in several databases or transaction records. Moreover, it can determine rules to demonstrate how an itemset affects the occurrence of another itemset. ARM algorithms determine association rules of the form: if certain conditions on the values of some attributes hold true, then forecast values for some other attributes. The terminologies associated with the mining of association rules are defined here. Given are a set of all possible items I, a transaction database D, and the minimum support and minimum confidence thresholds, σ and τ respectively. Let A and B be sets of items such that A ⊆ I, B ⊆ I, and A ∩ B = ∅. A given transaction T is in D and T ⊆ I; it contains an itemset A if and only if A ⊆ T. The research study assumes that there are n records in D. A typical association rule takes the form A ⇒ B, where A is the antecedent and B is the consequent. An association rule shows how many times B has occurred if A has already occurred, depending on the support and confidence values. The support of an association rule A ⇒ B is the fraction of transactions in D that contain the itemset (A ∪ B); it is a probability, denoted by prob(A ∪ B):

    support(A ⇒ B) = prob(A ∪ B) = |{T ∈ D : (A ∪ B) ⊆ T}| / n
The confidence of a particular association rule A ⇒ B is the number of transactions that contain (A ∪ B) divided by the number of transactions that contain A. Here prob(B | A) denotes the conditional probability:

    confidence(A ⇒ B) = prob(B | A) = prob(A ∪ B) / prob(A) = support(A ⇒ B) / support(A)
The following is an association rule indicating X as a variable representing a customer in some electronics shop:

    buys(X, "computer") ⇒ buys(X, "antivirus software")   [support = 10%, confidence = 50%]

In the above example, a confidence threshold (τ) of 50% indicates that if a customer purchases a computer, there is a 50% possibility that the customer will also buy antivirus software. A 10% support threshold (σ) indicates that in 10% of all possible transactions the items computer and antivirus software are bought together. It is now easier to define the term 'frequent itemset'. An arbitrary itemset A in I (i.e., A is a subset of I) is a frequent itemset in the given database D with respect to σ if it satisfies the following inequality:

    support(A) = prob(A) ≥ σ
The association rule mining procedure has two different steps. First, it is customary to find the set of all frequent itemsets whose support values are greater than or equal to the minimum support threshold σ. In the next step, one should use these frequent itemsets to discover the association rules based on the minimum confidence threshold τ. The general concept is that if, for example, AB and A are two frequent itemsets, then one can determine whether the association rule A ⇒ B holds by checking the following inequality:

    confidence(A ⇒ B) = prob(B | A) = prob(A ∪ B) / prob(A) = support(AB) / support(A) ≥ τ
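To make these definitions concrete, a short illustrative sketch follows (the toy transactions below are invented for exposition and are not from the thesis):

# Illustrative sketch: computing support and confidence over a toy
# transaction database (hypothetical data, not from the thesis).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / n

def confidence(antecedent, consequent):
    """confidence(A => B) = support(A u B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread, butter} => {milk}
A, B = {"bread", "butter"}, {"milk"}
print(support(A | B))    # 0.4      -> rule support
print(confidence(A, B))  # 0.666... -> rule confidence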
It is easier to identify that the set of frequent itemsets for a given database D, with respect to a given σ, reveals the following significant properties–
Downward-closure property: A subset of a frequent itemset is always a frequent itemset.
Upward-closure property: A superset of an infrequent itemset is an infrequent itemset.
The above features lead to two related definitions as follows–
Maximal frequent itemset: A frequent itemset is a maximal frequent set if and only if this is a frequent itemset, but no superset of this set is a frequent itemset.
Border itemset: An itemset is a border set if it is an infrequent itemset, but all its proper subsets are frequent itemsets.
Thus, it is easier to generate all the frequent item sub-sets given the set of maximal frequent itemsets. Otherwise, if one has the information about the border sets and the maximal frequent itemsets which are not subsets of any of the border sets, then it becomes easier to produce all possible frequent itemsets. The major ARM algorithm, named Apriori, is basically a level-wise algorithm that employs the downward-closure property, also known as the apriori property. R. Agrawal and R. Srikant proposed this algorithm for mining Boolean association rules [4]. As the name suggests, the algorithm uses prior knowledge of frequent itemset properties. To be more precise, the algorithm utilizes a bottom-up search, going upward level-wise in the lattice. The algorithm performs two functions at each iteration, namely candidate generation and pruning. It moves upward in the lattice beginning from level 1 till level k, where no candidate set exists after the pruning operation. The following pseudocode describes the Apriori algorithm for discovering frequent itemsets:

// Pseudocode of the Apriori algorithm
Iteration 1
  1. Create the candidate sets in C1
  2. Preserve the frequent sets in L1
Iteration k
  1. Create the candidate sets in Ck from the frequent sets in Lk-1
     a) Join Lk-1 p with Lk-1 q, as follows:
        Add to Ck
        select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
              p.itemk-1 < q.itemk-1
     b) Create all (k-1)-subsets of each candidate set in Ck
     c) Prune from Ck every candidate set for which some
        (k-1)-subset is not present in the frequent set Lk-1
  2. Scan the database to determine the support of each candidate set in Ck
  3. Preserve the frequent sets in Lk
Essentially, a frequent set is a set having a support value greater than or equal to the user-specified minimum support (its members at level k are known as Lk). A candidate set is a potentially frequent itemset (known as Ck). Apriori and all its variants, like Partition, Incremental, Pincer-Search, the Border algorithm, etc., perform the same functions repeatedly; that is why their time complexity is quite high. So far, researchers have designed conventional association rule mining algorithms to identify positive association rules between objects in transactional datasets. Positive association rules indicate the relationships between existing objects in datasets. Besides this, negative association rules can also provide critical information; in many situations, the negation of products may play a substantial role. The GA-based approach can predict rules that contain negative attributes, as well as more than one attribute in the consequent part. In this regard, the contributions of the research works [16] [17] are worth mentioning for determining association rules.
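For concreteness, a compact Python rendering of this level-wise procedure is sketched below (illustrative only; it mirrors the pseudocode above rather than the MATLAB implementation used later in Section 2.6):

# Minimal level-wise Apriori sketch (illustrative; follows the
# pseudocode above, not the thesis implementation).
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    L = {frozenset([i]) for i in items
         if sum(1 for t in transactions if i in t) / n >= min_support}
    frequent = set(L)
    k = 2
    while L:
        # Join step: unite (k-1)-itemsets that differ in one item.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the database to keep only the frequent candidates.
        L = {c for c in C
             if sum(1 for t in transactions if c <= set(t)) / n >= min_support}
        frequent |= L
        k += 1
    return frequent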
2.3 GENETIC ALGORITHM
Genetic Algorithms (GAs) denote adaptive heuristic search techniques that rely on the evolutionary ideas of natural selection and genetics. The fundamental design concept of GAs is to simulate the processes in natural systems essential for evolution, especially those that follow the principle, first introduced by the famous scientist Charles Darwin, known as the "survival of the fittest". The GA-based methods utilize random search intelligently within a defined search space to provide a solution to a given optimization problem. GAs can address a problem for which little information is available; they are very general algorithms and so operate well in many search spaces. John Holland introduced the concept of GA [18] in 1970. It is a stochastic search algorithm based on the method of natural selection, which underlies biological evolution [19] [20]. The GA-based methods have become successful in many search, optimization, and machine learning problems. A GA operates iteratively by producing new populations of strings from the old ones. Each string is a binary encoded version of a candidate solution. A fitness function evaluates every binary string, representing its ability to solve the problem at hand [21]. The conventional GA applies genetic operators such as selection, crossover and mutation on an initially random population to create a whole generation of new strings. GA operates to provide solutions for successive generations. The probability that an individual string reproduces is proportional to the goodness of the solution it describes. Therefore, the quality of the solutions in succeeding generations improves. The procedure ends when an acceptable or optimum solution is reached. GA is appropriate for problems that require optimization with respect to some computable measure. The operations of the genetic algorithm are as follows:
Selection: This process denotes the survival of the fittest based on probability, in that fitter chromosomes are more likely to be taken to survive. Here fitness is a relative measure of how well a chromosome solves the problem.
Crossover: To perform this operation one should choose a random gene along the length of the chromosomes and then interchange all genes after that point.
Mutation: This function alters the new solutions for adding randomness in the searching procedure for better solution. The feature indicates the possibility that a bit within a chromosome is reversed such that 0 becomes 1, and 1 becomes 0.
Essentially, GA is a method of 'breeding' computer algorithms and solutions to optimization or search problems by means of simulated evolution. One applies the selection, crossover, and mutation operators repeatedly to a population of binary strings that represent potential solutions. Over time, the number of above-average individual strings increases, and the fitter building blocks are combined in competent individuals to find suitable solutions to the problem at hand. This generational process repeats until it reaches a terminating condition.
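As an illustration of these three operators, a minimal Python sketch is given below (not the thesis implementation; the function and parameter names are chosen for exposition and are reused in the sketch of Section 2.5):

# Illustrative GA operators on binary-string chromosomes
# (assumed parameter values; not the thesis implementation).
import random

def select(population, fitness):
    """Roulette-wheel selection: survival probability ~ fitness."""
    total = sum(fitness(c) for c in population)
    if total == 0:  # degenerate case: fall back to uniform choice
        return random.sample(population, 2)
    weights = [fitness(c) / total for c in population]
    return random.choices(population, weights=weights, k=2)

def crossover(parent1, parent2):
    """One-point crossover: swap all genes after a random point."""
    point = random.randrange(1, len(parent1))
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(chromosome, rate=0.01):
    """Bit-flip mutation: each gene reversed with a small probability."""
    return [1 - g if random.random() < rate else g for g in chromosome]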
2.4 PERFORMANCE MEASURES
The prime objective of association rule mining is to derive association
rules based on the concept of frequent itemsets described in Section 2.2. The Apriori algorithm employs a breadth-first search (BFS) approach, initially discovering all frequent 1-itemsets, then finding the 2-itemsets, and continuing by determining gradually larger frequent itemsets. The data is a sequence x(1), x(2), ..., x(n) of binary vectors. The study represents the database as an n × d binary matrix, where n denotes the number of data records present and d is the number of objects [22] [23]. At each level k, one should examine the database minutely to determine the support of the objects in the candidate itemset Ck. The primary decisive parameter for computing the complexity of the Apriori algorithm is the total number of candidate itemsets considered:

    C = Σk |Ck|

As usual, |C1| = d, since it is needed to consider all single objects. Besides, assuming that no object alone is infrequent, |C2| = d(d − 1)/2. Thus, the study reaches the lower bound for C as:

    C ≥ d + d(d − 1)/2

As one sees in practice that the second level is a significant portion of the total computation, one obtains a decent estimate as C ≈ d²/2. Therefore, one can develop the time complexity of Apriori, considering the dependency on the data size, as:

    T(n, d) = O(n · d²)

Thus, one has linear scalability in the size of the data but quadratic dependence on the number of attributes present in the database. But a GA consists of a population
and an evolutionary mechanism. The population is a collection of individuals that signify the most expected solutions through a mapping called a coding. One can rank a population with the help of a fitness function. Researchers usually apply the GA to a population selected from the database and estimate the fitness function after each step. The research study [24] proposes three different metrics for evaluating the performance of GA-based methods. These performance metrics depend on observations made by simulation: the probability of optimality, the average fitness value, and the probability of an evolution leap. Based on the above measures, researchers have introduced a new term named the cut-off generation k, i.e., for how many generations a GA-based approach should execute at each run. The study assumes that C = k·r is the total cost given to complete the GA process, where r means the number of repeated runs and k indicates the best cut-off generation. The value k is the number of generations which maximizes the performance with respect to an appropriate measure. A term named p(k) is introduced here, denoting the probability that the GA-based approach provides an optimal solution within k generations. If C is fixed, one would like to attain the k minimizing the failure probability [1 − p(k)]^r. Similarly, if the value of the term [1 − p(k)]^r is held constant, one would like to find the k minimizing C = k·r. Indeed, this indicates that the GA-based methods should find all frequent itemsets in linear time. Therefore, the GA-based solution provides a significant improvement in computational complexity compared to the Apriori algorithm and all its variants.
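To make the quadratic bound concrete, a back-of-envelope computation is given below (an illustrative sketch; the value of d is taken from the Supermarket database used in Section 2.6, and everything else is assumed):

# Lower bound on the number of candidate itemsets Apriori must
# examine through level 2, for d attributes (here d = 217, as in
# the Supermarket database of Section 2.6).
d = 217
C_lower = d + d * (d - 1) // 2   # all 1-itemsets plus all 2-itemsets
print(C_lower)                   # 23653 candidate itemsets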
2.5 PROPOSED METHOD
The research study proposes a GA-based method to find the frequent itemsets from large databases. The method starts from a randomly created population of individuals according to a uniform probability distribution and updates the population of individuals in succession. Basically, one such transformation of the population is a generation. In every generation, multiple individuals are randomly selected from the present population based upon the fitness function, reproduced using crossover, and modified through mutation to produce a new population. The proposed GA-based method performs the following steps (a minimal illustrative sketch of this loop is given after the steps):
1. Load Database: Load the transaction records from the large database that fit into the main memory.
2. Generation: Generate randomly an initial population of N chromosomes, where each transaction can be represented by a string of bits (i.e., binary coding).
3. Fitness Calculation: Calculate the fitness value f(x) (i.e., an objective function) for every individual chromosome x using the roulette wheel [25] technique.
4. New Population: Generate a new population by iterating the following steps until the new population is complete.
   A. Selection: Select the different individuals from the existing population as parents based on their fitness values. These binary strings should be involved in recombination.
   B. Recombination: Create new individuals, named offspring, from the parents by using GA operators like crossover and mutation with their associated probability values.
   C. Estimation: Estimate the individual fitness of the new individuals.
   D. Substitution: Substitute the least-fit population with the new individuals.
5. Test: If the end condition is fulfilled, stop and return the best solution in the present population. Otherwise, go to step 4.
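The steps above can be traced with the following minimal sketch (illustrative only; it reuses the select, crossover, and mutate helpers sketched in Section 2.3, and all parameter values are assumed):

# Illustrative sketch of the proposed GA loop for frequent itemset
# mining. Chromosomes are bit strings over the d items; fitness is
# the support of the encoded itemset. Uses the select/crossover/
# mutate helpers sketched in Section 2.3 (assumed names).
import random

def ga_frequent_itemsets(transactions, items, pop_size=50,
                         generations=100, min_support=0.1):
    n, d = len(transactions), len(items)

    def decode(chrom):
        return {items[i] for i, g in enumerate(chrom) if g}

    def fitness(chrom):
        s = decode(chrom)
        if not s:
            return 0.0
        return sum(1 for t in transactions if s <= t) / n

    # Step 2: random initial population of N chromosomes.
    population = [[random.randint(0, 1) for _ in range(d)]
                  for _ in range(pop_size)]
    for _ in range(generations):                  # Step 5: end condition
        new_population = []
        while len(new_population) < pop_size:     # Step 4
            p1, p2 = select(population, fitness)  # Step 4A
            c1, c2 = crossover(p1, p2)            # Step 4B
            new_population += [mutate(c1), mutate(c2)]
        population = new_population               # Step 4D (generational)
    return {frozenset(decode(c)) for c in population
            if fitness(c) >= min_support}

# Example (toy data):
# transactions = [{"bread", "milk"}, {"bread", "butter"}, ...]
# items = sorted({i for t in transactions for i in t})
# frequent = ga_frequent_itemsets(transactions, items)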
2.6 RESULTS AND DISCUSSION
The research study uses different benchmark databases for simulation using MATLAB software (version R2013a). These databases are namely Supermarket, Mushroom, and Plants. The study evaluates the performance of Apriori and the proposed GA method based on execution time (in seconds). Basically, the study measures the
execution times by varying the number of instances (i.e., tuples) and the confidence level thresholds on the given databases as presented below.
2.6.1 Supermarket database
The Supermarket database [26] contains 4627 instances and 217 attributes. Table 2.1 below indicates the test results of Apriori and GA for different numbers of tuples of the given database. As expected, when the number of tuples is decreased, the execution time of both algorithms also decreases. For the 4000 tuples of the Supermarket data set, Apriori requires 49 seconds, but GA requires only 7 seconds for generating the association rules.

Table 2.1: Execution Times for Different Numbers of Instances for the Supermarket database

    No. of Instances    Execution Time (in seconds)
                        Apriori         GA
    4000                49              7
    3000                37              6
    2000                25              5
The information in Table 2.1 is plotted in Figure 2.1 below, indicating the performance comparison between Apriori and GA.
Figure 2.1: Scalability of Apriori and GA using Supermarket database
In the above Figure 2.1, the performance of Apriori is compared with that of GA, based on execution time. For each algorithm, three different sizes of datasets are considered, with magnitudes of 4000, 3000, and 2000. Here the horizontal axis indicates the size of the database in number of tuples, while the vertical axis indicates the execution time in seconds. When comparing Apriori with GA, the latter takes less time for any number of instances. So, the performance of GA is much better than that of Apriori based on time for various numbers of instances of the Supermarket database. Table 2.2 below summarizes the execution times of Apriori and GA for various confidence threshold values. When the confidence threshold value is 0.5, the time taken to generate the association rules is 17 seconds for Apriori and 4 seconds for the GA method.

Table 2.2: Execution Times for Confidence Thresholds using the Supermarket database

    Confidence    Execution Time (in seconds)
                  Apriori         GA
    0.5           17              4
    0.7           36              6
    0.9           63              8
The information provided in Table 2.2 is depicted below in Figure 2.2, which provides a performance comparison between Apriori and GA.
Figure 2.2: Confidence vs. Time plot using Supermarket database
Figure 2.2 shows the relationship between the time and the confidence level. In this graph, the horizontal axis signifies the time and the vertical axis signifies the confidence level. The running time of GA with a confidence of 0.9 is much less than that of Apriori.
2.6.2 Mushroom database
The Mushroom database [27] contains 8124 instances and 22 attributes. Table 2.3 below indicates the test results of Apriori and GA for different numbers of tuples of this database. As expected, when the number of tuples is decreased, the execution time of both algorithms decreases. For the 8000 tuples of the Mushroom data set, Apriori requires 147 seconds, but GA requires only 12 seconds for generating the association rules.

Table 2.3: Execution Times for Different Numbers of Instances for the Mushroom database

    No. of Instances    Execution Time (in seconds)
                        Apriori         GA
    8000                147             12
    7000                123             11
    6000                107             10
The information in Table 2.3 is plotted in Figure 2.3 below, indicating the performance comparison between Apriori and GA.
Figure 2.3: Scalability of Apriori and GA using Mushroom database
In the above Figure 2.3, the performance of Apriori is compared with that of GA, based on time. For each algorithm, three different sizes of datasets are taken, with magnitudes of 8000, 7000, and 6000. As usual, the horizontal axis indicates the size of the database in number of tuples, while the vertical axis indicates the execution time in seconds. When comparing Apriori with GA, the latter takes less time for any number of instances. So, GA outperforms Apriori based on time for various numbers of instances of the Mushroom database. Table 2.4 below summarizes the execution times of Apriori and GA for various confidence threshold values. When the confidence threshold value is 0.5, the time taken to generate the association rules is 127 seconds for Apriori and 11 seconds for the GA method.

Table 2.4: Execution Times for Confidence Thresholds using the Mushroom database

    Confidence    Execution Time (in seconds)
                  Apriori         GA
    0.5           127             11
    0.7           173             13
    0.9           221             15
The information provided in Table 2.4 is depicted below in Figure 2.4, which provides a performance comparison between Apriori and GA.
Figure 2.4: Confidence vs. Time plot using Mushroom database
Figure 2.4 shows the relationship between the time and the confidence level. As usual, the horizontal axis of the graph denotes the time and the vertical axis indicates the confidence level. The running time of GA with a confidence of 0.9 is much less than that of Apriori.
2.6.3 Plants database
The Plants database [28] contains 22,632 instances and 70 attributes. Table 2.5 below indicates the test results of Apriori and GA for different numbers of tuples of the given database. As expected, when the number of tuples is decreased, the execution time of both algorithms decreases. For the 20,000 tuples of the Plants data set, Apriori needs 527 seconds, but GA needs only 23 seconds to produce the rules.

Table 2.5: Execution Times for Different Numbers of Instances for the Plants database

    No. of Instances    Execution Time (in seconds)
                        Apriori         GA
    20000               527             23
    18000               411             20
    15000               337             18
The information in Table 2.5 is plotted in Figure 2.5 below, indicating the performance comparison between Apriori and GA.
Figure 2.5: Scalability of Apriori and GA using Plants database
In the above Figure 2.5, the performance of Apriori is compared with that of GA, based on time. For each algorithm, three different sizes of datasets are considered, with magnitudes of 20000, 18000, and 15000. In the usual manner, the horizontal axis indicates the size of the database in number of tuples, while the vertical axis indicates the execution time in seconds. When comparing Apriori with GA, the latter takes less time for any number of instances. So, the performance of GA is certainly better than that of Apriori based on time for various numbers of instances of the Plants database. Table 2.6 below summarizes the execution times of Apriori and GA for various confidence threshold values. When the confidence threshold value is 0.5, the time taken to generate the association rules is 445 seconds for Apriori and 21 seconds for the GA method.

Table 2.6: Execution Times for Confidence Thresholds using the Plants database

    Confidence    Execution Time (in seconds)
                  Apriori         GA
    0.5           445             21
    0.7           532             23
    0.9           731             27
The information provided in Table 2.6 is depicted below in Figure 2.6, which provides a performance comparison between Apriori and GA.
Figure 2.6: Confidence vs. Time plot using Plants database
Figure 2.6 shows the relationship between the time and the confidence level. In this graph, the horizontal axis signifies the time and the vertical axis signifies the confidence level. The running time of GA with a confidence of 0.9 is much less than that of Apriori. The research study applies the Apriori and GA-based methods to different large datasets. In fact, the successful simulation on the very large Plants database further supports the efficiency of the proposed method.
2.7 CONCLUSION
The research study has dealt with the inspiring association rule mining problem of finding frequent itemsets using the proposed GA-based method. The method described here is a simple and efficient one. It has been successfully tested on different large data sets, namely Supermarket, Mushroom and Plants. The proposed GA-based method provides a considerable improvement in computational complexity when compared to the Apriori algorithm, and the simulated results also support this. The study analyzes the performance by varying the number of instances and the confidence level thresholds. The competence of both algorithms is evaluated based on the time to generate the association rules. From the simulated results presented, it can be concluded that the GA performs better than the Apriori algorithm. The results reported in this chapter are correct and appropriate. However, a more widespread empirical evaluation of the proposed method will be the objective of future research. The inclusion of some other interestingness measures mentioned in the literature is also part of the planned future work.
2.8 REFERENCES
[1] R. Agrawal, T. Imielinski and A. Swami, "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914-925, 1993.
[2] M. S. Chen, J. Han and P. S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866-883, 1996.
[3] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," In Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD'93), pp. 207-216, 1993.
[4] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), edited by J. B. Bocca, M. Jarke, and C. Zaniolo, Morgan Kaufmann, pp. 487-499, 1994.
[5] R. Agrawal and R. Srikant, "Fast algorithm for mining association rules in large databases," Research Report RJ 9839, IBM Almaden Research Center, San Jose, CA, 1994.
[6] A. K. Pujari, Data Mining Techniques, Universities Press (India) Private Limited, First Edition, 2001.
[7] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, Second Edition, 2006.
[8] J. S. Park, M. S. Chen, and P. S. Yu, "An effective hash-based algorithm for mining association rules," In Proceedings of the ACM-SIGMOD International Conference on Management of Data (SIGMOD'95), pp. 175-186, May 1995.
[9] D.-I. Lin and Z. M. Kedem, "Pincer Search: A New Algorithm for Discovering the Maximum Frequent Set," In Proceedings of the 6th International Conference on Extending Database Technology: Advances in Database Technology, pp. 105-119, March 1998.
[10] A. Savasere, E. Omiecinski, and S. Navathe, "An efficient algorithm for mining association rules in large databases," In Proceedings of the International Conference on Very Large Data Bases (VLDB'95), pp. 432-443, September 1995.
[11] H. Toivonen, "Sampling large databases for association rules," In Proceedings of the International Conference on Very Large Data Bases (VLDB'96), pp. 134-145, September 1996.
[12] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket analysis," In Proceedings of the International Conference on Management of Data (SIGMOD'97), pp. 255-264, May 1997.
[13] D. W. Cheung, J. Han, V. Ng, and C. Y. Wong, "Maintenance of discovered association rules in large databases: An incremental updating technique," In Proceedings of the International Conference on Data Engineering (ICDE'96), pp. 106-114, February 1996.
[14] Y. Aumann, R. Feldman, O. Lipshtat, and H. Manilla, "Borders: An Efficient Algorithm for Association Generation in Dynamic Databases," Journal of Intelligent Information Systems, vol. 12, no. 1, pp. 61-73, 1999.
[15] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, "Association Rule Mining Algorithms and Genetic Algorithm: A Comparative Study," In Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT 2012), pp. 202-205, 2012.
[16] A. Ghosh and B. Nath, "Multi-objective rule mining using genetic algorithms," Information Sciences, vol. 163, pp. 123-133, 2004.
[17] M. Anandhavalli, S. K. Sudhanshu, A. Kumar and M. K. Ghose, "Optimized association rule mining using genetic algorithm," Advances in Information Mining, vol. 1, no. 2, pp. 01-04, 2009.
[18] M. Pei, E. D. Goodman, and F. Punch, "Feature Extraction using genetic algorithm," In Proceedings of the International Symposium on Intelligent Data Engineering and Learning (IDEAL'98), Hong Kong, October 1998.
[19] D. Beasley et al., "An overview of genetic algorithms," Parts 1 & 2, University Computing, vol. 15, no. 2 & 4, pp. 58-69 & 170-181, 1993.
[20] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2008.
[21] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley, 1989.
[22] M. Hegland, "The Apriori Algorithm - a Tutorial," CMA, Australian National University, WSPC/Lecture Notes Series, pp. 22-27, March 2005.
[23] B. Liu, W. Hsu, S. Chen and Y. Ma, "Analyzing the Subjective Interestingness of Association Rules," IEEE Intelligent Systems, vol. 15, no. 5, pp. 47-55, 2000.
[24] K. Sugihara, "Measures for Performance Evaluation of Genetic Algorithms," In Proceedings of the 3rd Joint Conference on Information Sciences (JCIS '97), 1997.
[25] M. D. Vose, The Simple Genetic Algorithm: Foundations and Theory, MIT Press, Cambridge, 1999.
[26] Supermarket database: http://storm.cis.fordham.edu/~gweiss/data-mining/wekadata/supermarket.arff
[27] Mushroom database: https://archive.ics.uci.edu/ml/datasets/Mushroom
[28] Plants database: https://archive.ics.uci.edu/ml/datasets/Plants
CHAPTER 3
Weather Data Mining Using Artificial Neural Network
3.1 INTRODUCTION
Data mining [1] [2] [3] helps in extracting hidden predictive information from large databases. It is a powerful and genuinely new technology with boundless potential to explore significant information in databases and data warehouses. Data mining examines databases thoroughly for hidden patterns and discovers predictive information that experts may overlook, as it goes beyond their anticipations. There is always a growing aspiration to utilize this innovative technology in novel domains of application, so as to convert these large passive databases into valuable, worthwhile knowledge. Meteorology is one such area, where data mining can greatly improve the efficiency of its analysts by transforming their voluminous, unmanageable, and easily ignored information into usable pieces of knowledge. In earlier days, scientists used to perform weather forecasting by testing barometric pressure, existing weather conditions, and sky condition with manual computation. But nowadays, researchers develop forecast models based on data mining and soft computing to predict future weather conditions. However, human effort is still vital in selecting the best possible model for forecasting, which involves pattern recognition skills, teleconnections, information about model performance, and understanding of model biases. There are many reasons why weather forecast models are becoming less effective. The main reasons are the disorderly nature of the atmosphere and the enormous computational power needed to solve the equations describing the atmospheric conditions. The other reasons are the inaccuracy involved in evaluating the initial conditions and the incomplete information about atmospheric processes.
Data mining consists of two different research areas [4] [5], descriptive and predictive. Descriptive data mining defines task-related databases in concise, enlightening, and discriminatory forms, whereas predictive data mining, mainly based on data analysis, develops models for the database and predicts the trends and features of unknown data. Classification is an important predictive data mining technique that can be helpful for weather prediction. Well-known classification methods based on the artificial neural network, support vector machine, and decision tree are used here to develop useful weather forecasting models. The artificial neural network (ANN) [6] [7] is a dominant data modeling tool that can represent complex input-output relationships. The goal of neural network technology is to develop an artificial system that can perform intelligent tasks similar to the human brain. A multilayer perceptron (MLP) [8] [9] network consists of at least three layers: one input layer, one output layer, and at least one hidden layer in between them. The nodes in the input layer and the hidden layer(s) have weights associated with them. An MLP model exhibits two modes of operation: feedforward and backpropagation. A Hopfield network [10], on the other hand, is a form of recurrent network. It serves as a content-addressable memory system with binary threshold units. An MLP and Hopfield neural network based combined approach can build an efficient classification tool. The Support Vector Machine (SVM) [11] [12] is another dominant data modeling tool used in the machine learning field. It may build one or more hyperplanes in a high-dimensional feature space for classification, regression, or other analysis tasks. SVM utilizes a hyperplane to discriminate between classes. When classes intersect with each other, the common idea is to employ a hyperplane that minimizes the error of the data points along the boundary line between classes; these points are the support vectors. A decision tree (DT) [13] [14] follows a tree structure that is quite similar to a flowchart. Here every non-terminal node denotes a test on an attribute, each branch denotes an outcome of the test, and each terminal node holds a class label. The topmost node in a tree is the root node. It is easy to convert a decision tree to classification rules. During tree construction, one should use attribute selection measures to choose the attribute that best divides the tuples into different classes.
Well-known attribute selection measures are Information gain, Gain ratio, and the Gini index. While constructing a decision tree, many of its branches may reflect noise in the training dataset. In this scenario, one can employ a tree-pruning approach to identify and remove such branches and so enhance the accuracy on the test dataset. Conventional algorithms for decision tree induction, like ID3, C4.5, and CART, implement a greedy method for building decision trees in a top-down, recursive, divide-and-conquer manner. The research study has used predictive data mining to forecast precise weather conditions using a meteorological database named Weather Underground. The present work is basically an extension of the previous research work [15]. In this context, the contributions of the research works [16] [17] [18] [19] [20] [21] are also worth mentioning. Initially, the idea is to construct a classification model, called the ANN model, from the training dataset using the MLP and Hopfield neural network based combined approach. After that, one should test the accuracy of this model by using the test dataset. The proposed method uses the benchmark dataset for weather forecasting and, therefore, compares its performance with the SVM and DT classification models. The research study has estimated the performance of these classification techniques in terms of different evaluation measures like Accuracy, Kappa statistic, root-mean-square error (RMSE), True Positive Rate (or Recall), Precision, False Positive Rate, and F-Measure. The rest of the chapter is organized as follows: Section 3.2 provides the description of the different classification models being used; Section 3.3 explains the proposed method; Section 3.4 describes the various measures for performance evaluation; Section 3.5 presents the results of the performance analysis; and Section 3.6 draws some conclusions on this work.
3.2 CLASSIFICATION TECHNIQUES USED
In this study, the proposed ANN model uses a combination of the MLP and Hopfield network models. The descriptions of the various models, namely ANN, SVM and DT, are given below.
3.2.1 Artificial Neural Network
The Artificial Neural Network (ANN) can perform intelligent tasks similar to the human brain. It is widely known for its exceptional precision and incredible learning capability, and it also exhibits high tolerance to noisy data. One of the competent methods of data classification from the ANN domain is the Multilayer Perceptron (MLP) model. An MLP network is a set of connected input and output nodes, with each path having a specific weight. During learning these weights are adjusted so as to produce the correct output from the set of inputs. Learning is performed on a multilayer feed-forward neural network by the backpropagation algorithm. Here, the neural network model learns a set of weights for a particular predicted class label through iterations. The network model contains an input layer, one or more hidden layers, and an output layer. Each of the layers is built up of nodes. The inputs to the network correspond to the attributes present in the training dataset. Inputs are fed into the nodes that make up the input layer. These inputs are then mapped to another level of nodes in the neural network model, known as the hidden layer; the output of this hidden layer can be an input to another hidden layer, as illustrated in Figure 3.1. The number of hidden layers is determined by the complexity of the problem and should be chosen carefully. After the last hidden layer, the nodes are mapped to a set of class labels. This last layer of the model is the output layer of the network; therefore it can be stated that the neural network must contain at least three layers.
Figure 3.1: A multilayer feed-forward network
As shown in Figure 3.1, the neural network structure appears like a connected graph, with each level forming a layer: X is the input layer and Y is the output layer. The network is feed-forward because the weighted edges do not cycle back to the input layer or to a previous layer. A multilayer feed-forward network with a sufficient number of hidden layers can approximate the values of a function closely. Before training the network, the study needs to decide the number of nodes for the input and output layers and the number of hidden layer(s). The backpropagation technique learns iteratively from the training data; the neural network is altered for each training tuple to map the input to an appropriate target value (class label). For each training tuple the weights are modified to reduce the mean squared error between the actual result and the predicted result. This modification is done backward from the output layer down through the hidden layers to the first layer, hence the name backpropagation. The steps for learning and testing are described below:

A) Initialization of the weights: The weights in the network are initialized to small random numbers (e.g., values in the range -1.0 to +1.0). Each of the nodes in the neural network has a bias associated with it; this bias is also initialized to a small random number. The following steps describe the processing of the training tuples.

B) Forward propagation of the inputs: The training tuple is first fed to the input layer. Each attribute corresponds to an individual input node. The input passes through this node unchanged, i.e., for an input node, its output $O_j$ is equal to the input value $I_j$. The input and output of each node in the hidden and output layers are then calculated. Each node of the neural network, as depicted in Figure 3.2, has a number of input connections coming from the previous layer, and each connection is assigned a specific weight. To compute the net input, a weighted sum is calculated as:

$$I_j = \sum_{i} w_{ij} O_i + \theta_j \qquad (3.1)$$

Thus, equation (3.1) calculates the net input $I_j$ for node $j$, where $O_i$ is the output of the previous node $i$ and $w_{ij}$ is the weight of the connection. The bias $\theta_j$ acts as a threshold to vary the activity of the node.
Figure 3.2: Hidden or output layer of backpropagation neural network
Each node in the hidden and output layers takes the calculated net input and applies an activation function to it, as shown in Figure 3.2. This function represents the activation of the neuron represented by that node. The activation function used here is the tan-sigmoid function, so the output of node $j$ is given as:

$$O_j = \frac{2}{1 + e^{-2 I_j}} - 1 \qquad (3.2)$$

The function given in equation (3.2) is also known as a squashing function, as it maps a large input domain onto a small range of values between -1 and +1. The study computes the output values $O_j$ for each of the nodes, including those of the output layer.

C) Propagation of the error: The error generated by the network is propagated backward to modify the necessary values and correct the error. For any output-layer node $j$ the error $Err_j$ is given as:

$$Err_j = O_j (1 - O_j)(T_j - O_j) \qquad (3.3)$$

where $O_j$ is the actual output of node $j$ and $T_j$ is the known target value of the training tuple. To compute the error in each hidden layer, the study takes $Err_k$ as the error of node $k$ in the next higher layer; the error at node $j$ is then given as:

$$Err_j = O_j (1 - O_j) \sum_{k} Err_k\, w_{jk} \qquad (3.4)$$
The weights and biases are updated to correct the errors. The weights are updated according to the following equations:

$$\Delta w_{ij} = (l)\, Err_j\, O_i \quad \text{and} \quad w_{ij} = w_{ij} + \Delta w_{ij} \qquad (3.5)$$

where $\Delta w_{ij}$ is the change in weight and $l$ is the learning rate, a constant typically chosen between 0.0 and 1.0. Similarly, the bias is modified as:

$$\Delta \theta_j = (l)\, Err_j \quad \text{and} \quad \theta_j = \theta_j + \Delta \theta_j \qquad (3.6)$$

where $\Delta \theta_j$ is the change in bias.
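A minimal sketch of one backpropagation iteration over equations (3.1)-(3.6) is given below, for a tiny 2-2-1 network with illustrative sizes and data (these are assumptions, not values from this study). The thesis states the error terms (3.3)-(3.4) with the logistic derivative O(1 - O); with the tan-sigmoid of equation (3.2) the corresponding derivative is 1 - O^2, which is what the sketch uses.

```python
# One backpropagation update for a single training tuple (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def tansig(x):
    # Tan-sigmoid activation of equation (3.2); output lies in (-1, +1)
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

x = np.array([0.5, -0.3])          # one training tuple (two input attributes)
t = np.array([1.0])                # its target value (class label)
W1 = rng.uniform(-1, 1, (2, 2))    # input -> hidden weights, small random init
b1 = rng.uniform(-1, 1, 2)         # hidden biases (theta_j)
W2 = rng.uniform(-1, 1, (2, 1))    # hidden -> output weights
b2 = rng.uniform(-1, 1, 1)
lr = 0.1                           # learning rate l

# Forward propagation: net input I_j = sum_i w_ij * O_i + theta_j, then O_j
h = tansig(x @ W1 + b1)
o = tansig(h @ W2 + b2)

# Error terms, equations (3.3)-(3.4), with the tan-sigmoid derivative 1 - O^2
err_o = (1 - o**2) * (t - o)
err_h = (1 - h**2) * (W2 @ err_o)

# Weight and bias updates of equations (3.5) and (3.6)
W2 += lr * np.outer(h, err_o); b2 += lr * err_o
W1 += lr * np.outer(x, err_h); b1 += lr * err_h
```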
The learning rate is used to avoid getting stuck at a local minimum in the decision space. If the learning rate is too small, learning occurs very slowly; if it is too high, the search may oscillate between inadequate solutions. A common rule of thumb is to set the learning rate to 1/t, where t is the number of iterations completed so far in training the network. After building the MLP model, it is combined with a Hopfield network to develop a classification model. A Hopfield network is a form of recurrent network that serves as a content-addressable memory system with binary threshold units. Figure 3.3 below shows a typical neuron used in a Hopfield network.
Figure 3.3: A typical artificial neuron used in a Hopfield network
Hopfield networks are built from artificial neurons that are fully connected. Each neuron has n inputs, and with each input i there is an associated weight $w_i$. Each neuron also has an output, denoted by the symbol o. The state of the output is preserved until the neuron is updated. Updating the neuron involves the following operations:

• The value of each input $x_i$ is determined and the weighted sum of all inputs, $\sum_i w_i x_i$, is calculated.

• The output state of the neuron is set to +1 if the weighted input sum is greater than or equal to 0, and to -1 if the weighted input sum is less than 0.

• A neuron holds its output state until it is updated again.
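The update rule above can be sketched as follows. This is a minimal illustration, assuming a symmetric weight matrix with a zero diagonal (the usual condition for convergence), not the full combined model of this study.

```python
import numpy as np

def hopfield_update(weights, state):
    """Asynchronously update each neuron: set its state to +1 if its
    weighted input sum is >= 0, else to -1, until no state changes."""
    state = state.copy()
    changed = True
    while changed:
        changed = False
        for i in range(len(state)):
            s = weights[i] @ state            # weighted sum of inputs
            new = 1 if s >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
    return state

W = np.array([[0, 1], [1, 0]])                # symmetric, zero diagonal (assumed)
print(hopfield_update(W, np.array([1, -1])))  # settles into a stable pattern
```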
3.2.2 Support Vector Machine

A Support Vector Machine (SVM) is a promising model for the classification of both linear and nonlinear data. SVM uses a nonlinear mapping to transform the original data into a higher dimension, where it searches for the linear optimal separating hyperplane. A hyperplane is a decision boundary separating two classes. Support vectors are the essential training tuples from the training dataset. With a sufficiently high dimension and an appropriate nonlinear mapping, two classes can be separated with the help of the support vectors and the margins they define. Training an SVM is slow, but SVMs are very accurate owing to their ability to model nonlinear decision boundaries; this is why SVM has been selected to classify the given dataset. To explain SVM, consider a dataset $D$ given as $\{(X_1, y_1), (X_2, y_2), \ldots, (X_{|D|}, y_{|D|})\}$, where $X_i$ is a training tuple with corresponding class label $y_i$. Each class label $y_i$ can take either +1 or -1. The study then needs to search for the separating hyperplane. An infinite number of hyperplanes can exist between the two classes; in SVM one searches for the maximum marginal hyperplane (MMH), as illustrated below in Figure 3.4.
Figure 3.4: An SVM showing the maximum marginal hyperplane between two classes
Here, Figure 3.4 considers the possible hyperplanes and their associated margins. A margin is the shortest distance between the hyperplane and the closest training tuple of either class, measured along a side parallel to the hyperplane. For accurate classification it is desirable to have the maximum possible distance between the margins. A separating hyperplane can be written as:

$$W \cdot X + b = 0 \qquad (3.7)$$

where $W$ is the weight vector and $b$ is the bias. The tuples in the dataset can thus be sub-grouped into two classes with margins $H_1$ and $H_2$ as:

$$H_1: W \cdot X + b \geq +1 \quad \text{for } y_i = +1$$
$$H_2: W \cdot X + b \leq -1 \quad \text{for } y_i = -1 \qquad (3.8)$$

Tuples that fall on or above $H_1$ belong to the first class, and tuples on or below $H_2$ belong to the second class. Any tuples that fall exactly on $H_1$ or $H_2$ are the support vectors. It is important to note that the hyperplane is positioned in the middle of the two margins, so the maximum margin is $2 / \|W\|$. Here, an SVM model with a polynomial kernel is selected for this research study. The nonlinear version of an SVM can be represented using a kernel function $K$ as:

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \qquad (3.9)$$

where $\Phi(x)$ is the nonlinear mapping function employed to map the training tuples in the database. For SVM, a polynomial kernel of degree (exponent) $d$ is defined as:

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^d \qquad (3.10)$$

After building the SVM model, it is applied to the test dataset to obtain the classification results.
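A minimal sketch of the polynomial kernel of equation (3.10) follows; the vectors and degree used here are illustrative values, not taken from the study.

```python
import numpy as np

def polynomial_kernel(xi, xj, d=2):
    # Equation (3.10): K(x_i, x_j) = (x_i . x_j + 1)^d
    return (np.dot(xi, xj) + 1.0) ** d

# With d = 1 the kernel reduces to a shifted linear kernel
k = polynomial_kernel(np.array([1.0, 2.0]), np.array([0.5, -1.0]), d=3)
```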
3.2.3 Decision Tree

A decision tree (DT) classifier is built by learning from the tuples in the training dataset. A decision tree appears like a flowchart in a tree-like structure, where each internal node denotes a condition test on an attribute and each branch leaving that node denotes an outcome of the test. A leaf node in the decision tree holds a class label. The nodes of the tree divide the tuples into different groups at each level until the tuples fall into distinct class labels. There are three commonly known variations of the decision tree algorithm:

• ID3 (Iterative Dichotomiser)
• C4.5
• CART (Classification and Regression Trees)

A sample decision tree is illustrated in Figure 3.5 below.
Figure 3.5: A sample decision tree classifier
ID3, C4.5, and CART are developed using a greedy approach. Most of these methods follow a top-down approach, in which the tree starts with a set of tuples and their corresponding classes. The tree begins with a single node containing all the tuples. If all the tuples belong to the same class, this node becomes a leaf and the tree is not divided further. Otherwise, the set of tuples is divided into two or more parts according to the splitting criterion. The splitting criterion splits the set of tuples into subsets corresponding to the groupings induced by the splitting attribute, and it is chosen so that the tuples are grouped in the best possible way. If a node is marked with the splitting criterion, branches are created for each outcome of the test; by applying this principle recursively the decision tree is built. An attribute selection needs to be performed to obtain the best splitting criterion, one that can correctly and accurately separate the tuples into classes. This attribute selection method (also called the splitting rules) determines how the tuples
are to be divided. The attribute having the best score is selected as the splitting attribute. There are three popular attribute selection measures: information gain, gain ratio, and the Gini index. The present work uses a C4.5-based implementation of the decision tree, whose attribute selection measure is the gain ratio. The concept of the gain ratio is described below. Let $D$ be the training dataset consisting of class-labeled tuples. Suppose the class label attribute has $m$ distinct values describing $m$ distinct classes, $C_i$ (for $i = 1, 2, \ldots, m$). Let $C_{i,D}$ be the set of tuples of class $C_i$ in $D$, and let $|D|$ and $|C_{i,D}|$ denote the number of tuples in $D$ and $C_{i,D}$, respectively. The expected information needed to classify a tuple in $D$ is given by

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (3.11)$$
where $p_i$ is the probability that a random tuple in $D$ belongs to class $C_i$, estimated by the ratio $|C_{i,D}|/|D|$. Now, suppose the tuples in $D$ are to be divided on some attribute $A$ having $v$ distinct values, $(a_1, a_2, \ldots, a_v)$, as observed from the training dataset. If $A$ is discrete-valued, these values relate directly to the $v$ outcomes of a test on $A$. Attribute $A$ can be used to split $D$ into $v$ divisions or subsets, $(D_1, D_2, \ldots, D_v)$, where $D_j$ comprises those tuples in $D$ that have outcome $a_j$ of $A$. The expected information still required to classify a tuple after this partitioning is

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, Info(D_j) \qquad (3.12)$$
Information gain (used by ID3) is defined as the difference between the original information requirement (i.e., based on the proportion of data classes) and the new requirement (i.e., obtained after dividing on $A$). That is,

$$Gain(A) = Info(D) - Info_A(D) \qquad (3.13)$$
Generally, the attribute A with the highest information gain is selected as the splitting attribute. But, the information gain metric is biased toward tests with many outcomes. Basically, it prefers to select attributes having a large number of values. That is why another term has been introduced which is known as the gain ratio. C4.5, a successor of ID3, uses the concept of gain ratio to avoid this bias. It applies a kind of
normalization to the information gain using a "split information" value defined analogously to Info(D) as

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right) \qquad (3.14)$$

This value denotes the potential information produced by splitting the training dataset $D$ into $v$ divisions, corresponding to the $v$ outcomes of a test on attribute $A$. Hence, the gain ratio is defined as

$$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)} \qquad (3.15)$$
Typically, the attribute with the largest gain ratio is designated as the splitting attribute.
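The computation in equations (3.11)-(3.15) can be sketched as below. This is a minimal illustration for a single discrete attribute, not the full C4.5 induction procedure.

```python
import math
from collections import Counter

def info(labels):
    # Info(D) = -sum p_i log2 p_i, equation (3.11)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """GainRatio(A) of equations (3.12)-(3.15) for one discrete attribute,
    given its value per tuple and the class label per tuple."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    info_a = sum(len(g) / n * info(g) for g in groups.values())                 # (3.12)
    gain = info(labels) - info_a                                                # (3.13)
    split = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())   # (3.14)
    return gain / split if split > 0 else 0.0                                   # (3.15)

# Hypothetical example: outlook attribute vs. a play/no-play class label
print(gain_ratio(["sunny", "sunny", "rain", "rain"], ["no", "no", "yes", "no"]))
```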
3.3 PROPOSED TECHNIQUE

Weather prediction is a significant application in climatology. Weather is a continuous, data-intensive, multidimensional, dynamic process, which makes weather forecasting a formidable challenge. Figure 3.6 below shows different surface stations on the earth.
Figure 3.6: The surface stations on the earth
As already stated, the present research study relies on the classification approach of data mining. Classification is the process of developing a model that defines and separates several data classes from each other. Initially, the classification procedure applies some preprocessing tasks (data cleaning, data transformation, etc.) to the original database. Then, the method divides this preprocessed dataset into two different sections, namely the training dataset and the test dataset. These datasets should be independent of each other to avoid bias.

Classification involves two different steps. The first step builds a classification model indicating a well-defined set of classes. This is the training phase, where the classification technique constructs the model by learning from a given training dataset accompanied by the related class label attributes. After that, the classification model is applied for prediction in the testing phase. This step estimates the accuracy of the derived model using a test dataset independent of the training dataset. To test the proposed system, a sample dataset was taken from the Weather Underground website [22]. This dataset contains real-time observations of the weather for a particular period. For this study, the complete data of the previous five years, from January 2007 to December 2011, are used. The dataset has 25,000 records with attributes such as Temperature (°C), Wind speed (km/hour), Humidity (%), Precipitation (mm), and Pressure (hPa). All these attributes are real-world data that signify different characteristics of the weather information base used in this research study. Initially, the following preprocessing techniques are applied to the dataset before the classification task, with a small illustrative sketch given after the two steps:

Data cleaning: It represents the preprocessing of data for eliminating or diminishing noise and the treatment of missing values. A missing value is normally substituted by the arithmetic mean of that attribute.

Data transformation: In this step the dataset is normalized, because the ANN-based technique requires distance measurements in the training phase. It converts attribute values to a small-scale range like -1.0 to +1.0.
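A minimal sketch of these two preprocessing steps, assuming the data arrive as a numeric matrix with missing values encoded as NaN (an assumption for illustration):

```python
import numpy as np

def preprocess(X):
    """Mean-imputation (data cleaning) followed by min-max scaling to
    the range [-1, +1] (data transformation), as described above."""
    X = X.astype(float)
    col_mean = np.nanmean(X, axis=0)          # per-attribute mean, ignoring NaN
    idx = np.where(np.isnan(X))
    X[idx] = np.take(col_mean, idx[1])        # replace missing values by the attribute mean
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)    # guard against constant columns
    return 2.0 * (X - lo) / span - 1.0        # scale each attribute to [-1, +1]

X = np.array([[25.0, np.nan], [30.0, 65.0], [28.0, 70.0]])   # hypothetical records
print(preprocess(X))
```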
Afterwards, the weather dataset is divided into two disjoint subsets, namely the training set and the test set. The study employs the 10-fold cross-validation technique for creating the training and test datasets separately. This particular approach to dataset division avoids biasing the data; the technique always creates training and test sets completely independent of each other. In the present work, the MLP and Hopfield network models are combined to develop an artificial neural network (ANN) model using the training dataset. Later, the test dataset is used to investigate the performance of the proposed ANN model in predicting the atmospheric condition. The names of the weather dataset attributes with their abbreviations are as follows: Temperature denoted by T, Wind speed by WS, Humidity by H, Precipitation by P, and Pressure by PR. The proposed weather prediction system makes use of the following geographical structure of the earth, as shown below in Figure 3.7.
Figure 3.7: The sixteen equi-spaced regions of the Earth
The proposed technique essentially depends on the following steps:

1. Divide the globe into sixteen equi-spaced areas (regions).

2. Designate each area by an individual node that carries five input parameters: temperature, wind speed, humidity, precipitation, and air pressure.

3. Connect the nodes with each other using directed edges, i.e., develop a mesh topological network structure. Links between any two nodes are bi-directional, because temperature, wind speed, humidity, precipitation, and air pressure can flow in both directions.

4. Construct an MLP network model for each pair of locations (including the self-loop). The model consists of a single input layer, a single output layer, and one hidden layer in between. The input and output layers each consist of five neurons. The selection of the number of processing elements (PEs) in the hidden layer is also an essential parameter, and a thorough investigation helps to select it [23]. The number of nodes in the hidden layer, denoted by $l$, is given by the following equation (a short sketch of equations (3.16)-(3.17) follows these steps):

$$l = (\text{number of inputs} + \text{number of outputs}) \times 2/3 \qquad (3.16)$$

5. Every node $i$ has an input $S_i$ and an output $X_i$. $S_i$ can be represented as a five-tuple (T, WS, H, P, PR); it is essentially a weighted combination with temperature ($W_T$), wind speed ($W_{WS}$), humidity ($W_H$), precipitation ($W_P$), and air pressure ($W_{PR}$) weights as its components. Therefore $S_i$ is defined as:

$$S_i = W_1 W_T X_{i1} + W_2 W_{WS} X_{i2} + W_3 W_H X_{i3} + W_4 W_P X_{i4} + W_5 W_{PR} X_{i5} \qquad (3.17)$$

where $W_1, W_2, W_3, W_4$, and $W_5$ are scalars. The output $X_i$ is defined as

$$X_i = f(S_i) \qquad (3.18)$$

where $f$ is the node activation (tan-sigmoid) function.

6. Initialize the scalars with random values between -1.0 and +1.0. After building the MLP network, develop a Hopfield neural network model using the training dataset.
7. Test the network model with the test dataset. The system must accomplish the flow of temperature, wind speed, humidity, precipitation, and air pressure to establish equilibrium. This procedure continues iteratively, and in every iteration the bias and weight values are adjusted until the network converges. Finally, the proposed artificial neural network (ANN) model is applied to the benchmark dataset for weather forecasting, and its performance is compared with the support vector machine and decision tree based classification models. The study performs the performance evaluation of these classifiers using several measures such as accuracy, Kappa statistic, root-mean-square error (RMSE), True Positive Rate (or Recall), Precision, False Positive Rate, and F-Measure.
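As referenced in step 4, a small sketch of the hidden-layer sizing rule (3.16) and the net input (3.17) is given below. The scalar and component weights are illustrative random values, not fitted ones.

```python
import numpy as np

n_inputs, n_outputs = 5, 5                  # T, WS, H, P, PR in and out
l = round((n_inputs + n_outputs) * 2 / 3)   # equation (3.16): 7 hidden nodes here

rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, 5)                   # scalars W1..W5 (initialized in step 6)
w_feat = rng.uniform(-1, 1, 5)              # component weights WT, WWS, WH, WP, WPR
x = np.array([0.2, -0.4, 0.7, 0.1, -0.9])   # node outputs Xi1..Xi5 (hypothetical)

S_i = np.sum(W * w_feat * x)                # equation (3.17): net input of node i
```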
3.4 PERFORMANCE MEASURES

The research work estimates the performances of these classification models on the basis of the different performance measures described below.
3.4.1 Root-mean-square error (RMSE)

RMSE [24] is a well-known measure of the difference between the values predicted by a classifier and the values actually observed from the system being modeled. The RMSE of a classifier's estimate of the measured variable $e_{classifier}$ is the square root of the mean squared error:

$$RMSE = \sqrt{\frac{\sum_{k=1}^{n} \left(e_{discovered,k} - e_{classifier,k}\right)^2}{n}} \qquad (3.19)$$

where $e_{discovered}$ are the observed values and $e_{classifier}$ are the predicted values for record $k$. Here, $n$ denotes the number of data records present in the database.
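Equation (3.19) translates directly into code; the values below are hypothetical.

```python
import math

def rmse(discovered, predicted):
    # Equation (3.19): square root of the mean squared difference
    n = len(discovered)
    return math.sqrt(sum((d - p) ** 2 for d, p in zip(discovered, predicted)) / n)

print(rmse([1.0, 0.0, 1.0], [0.9, 0.2, 0.8]))   # small error -> small RMSE
```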
3.4.2 Kappa statistic

The Kappa statistic [25], represented by κ, is a well-known performance metric in statistics. It measures the reliability of agreement among different raters or judges. The value of κ is estimated as:

$$\kappa = \frac{prob(O) - prob(C)}{1 - prob(C)} \qquad (3.20)$$

Here prob(O) is the probability of observed agreement among the raters, and prob(C) is the probability of agreement expected by chance. If κ = 1, the judges are in complete agreement; if κ = 0, the judges agree no more than would be expected by chance.
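As a small illustration of equation (3.20), with hypothetical probabilities rather than values from this study:

```python
def kappa(prob_o, prob_c):
    # Equation (3.20): kappa = (prob(O) - prob(C)) / (1 - prob(C))
    return (prob_o - prob_c) / (1.0 - prob_c)

k = kappa(0.95, 0.50)   # -> 0.9, interpreted as 'almost perfect agreement'
```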
3.4.3 Confusion matrix

In the soft computing field, the confusion matrix [26] is a specific tabular representation illustrating a classification algorithm's performance. It is a table layout that permits a more thorough analysis than accuracy alone. Each column of the matrix denotes the patterns in a predicted class, while each row indicates the patterns in the actual class. Table 3.1 below displays the confusion matrix for a two-class classifier with the following entries:

• true positive (tp): the number of 'positive' patterns classified as 'positive';
• false positive (fp): the number of 'negative' patterns classified as 'positive';
• false negative (fn): the number of 'positive' patterns classified as 'negative';
• true negative (tn): the number of 'negative' patterns classified as 'negative'.

Table 3.1: A confusion matrix for a two-class classifier

                              Predicted Class
                              positive    negative
Actual Class    positive      tp          fn
                negative      fp          tn
A two-class confusion matrix defines several standard terms. The accuracy is the number of correctly classified examples divided by the total number of examples:

$$accuracy = \frac{tp + tn}{tp + tn + fp + fn} \qquad (3.21)$$
The precision is the ratio of the predicted positive examples found to be correct, as calculated using the following equation:
$$precision = \frac{tp}{tp + fp} \qquad (3.22)$$
The fp-rate is the proportion of negative examples incorrectly classified as positive:

$$fp\text{-}rate = \frac{fp}{fp + tn} \qquad (3.23)$$
The tp-rate or recall is the proportion of positive occurrences discovered correctly:

$$recall = tp\text{-}rate = \frac{tp}{tp + fn} \qquad (3.24)$$
In some situations high precision may be more relevant, while in others high recall may be more significant; in most settings, however, one tries to improve both. The combined form of these values is called the f-measure, usually expressed as the harmonic mean of the two:

$$f\text{-}measure = \frac{2 \times precision \times recall}{precision + recall} \qquad (3.25)$$
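A minimal sketch computing equations (3.21)-(3.25) together from the counts of Table 3.1; the counts below are hypothetical.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, fp-rate, recall, and f-measure,
    equations (3.21)-(3.25), from a two-class confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    fp_rate = fp / (fp + tn)
    recall = tp / (tp + fn)                                  # tp-rate
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, fp_rate, recall, f_measure

print(confusion_metrics(tp=90, fp=10, fn=5, tn=95))
```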
3.5 RESULTS AND PERFORMANCE ANALYSIS

Initially, the idea is to implement an MLP network with three layers (one input layer, one hidden layer, and one output layer) using MATLAB software. The activation function at each node is the tan-sigmoid, with initial weights taken in the range -1.0 to +1.0. Finally, the results obtained by the MLP model are fed to a Hopfield network; this combined MLP and Hopfield neural network approach can build an efficient model. The study evaluates the accuracy of prediction using the test dataset. In this way the ANN model has been built for weather prediction. In the testing phase, the ANN, SVM, and DT classifiers are applied to the benchmark dataset for the investigation and performance analysis described below. The study has employed the 10-fold cross-validation technique for creating the training and test datasets separately, so the resulting training and test datasets are entirely disjoint. The results described here are solely based on the simulation.
As usual, the three classification models, namely ANN, SVM, and DT, are applied to the test dataset for classification. The study has evaluated these classifiers in terms of different performance measures like classification accuracy, RMSE, and the Kappa statistic value, as presented below in Table 3.2. One should note that a lower value of RMSE indicates a better classification performance, while a higher value of the Kappa statistic usually denotes an improvement in classification. The RMSE and Kappa statistic values should lie between 0.0 and 1.0.

Table 3.2: Comparison based on accuracy, RMSE, and Kappa statistic

Classifier   Accuracy (%)   RMSE     Kappa statistic
ANN          97.6           0.1254   0.9192
SVM          87.5           0.2548   0.7812
DT           86.3           0.2839   0.7687
From Table 3.2 it is seen that the ANN classifier has an accuracy of 97.6%, the SVM model has a classification accuracy of 87.5%, and the DT model has an accuracy of 86.3%. Accuracy-wise, ANN has performed significantly better than SVM and DT. The study has then analyzed the performance of each classifier based on the RMSE and Kappa statistic values collected from Table 3.2. This information is better represented as the 3-D column chart in Figure 3.8 below, which shows a performance comparison among these classifiers.
Figure 3.8: Comparison of RMSE and Kappa statistic using weather database
The research work has employed very commonly used evaluation measures like the RMSE value and the Kappa statistic index. As is evident from Figure 3.8, the Kappa statistic value of the ANN model lies in the range 0.81-1.0; according to the interpretation of the Kappa statistic, the agreement of the ANN method is considered 'almost perfect'. The Kappa statistic values of the SVM and DT methods lie in the range 0.61-0.80, which is considered 'substantial'. Based on the results, ANN comes out first with an RMSE value of 0.1254 and a Kappa statistic value of 0.9192, followed by SVM with an RMSE value of 0.2548 and a Kappa statistic value of 0.7812; DT stands last with the highest RMSE value (0.2839) and the lowest Kappa statistic value (0.7687). Therefore, with regard to performance measures like classification accuracy, RMSE, and Kappa statistic, the proposed ANN classifier has performed the best. Next, the performances of these models are compared on the basis of the TP-Rate, FP-Rate, Precision, Recall, and F-Measure values derived from the confusion matrix of each classifier with respect to the test dataset. For any classifier, one expects higher values of TP-Rate (or Recall), Precision, and F-Measure, and a lower value of FP-Rate. The detailed accuracy for these classifiers is shown below in Table 3.3.
Table 3.3: Comparison based on TP-Rate/Recall, FP-Rate, Precision, and F-Measure

Classifier   TP-Rate/Recall   FP-Rate   Precision   F-Measure
ANN          97.5 %           3.4 %     97.4 %      97.4 %
SVM          87.4 %           12.8 %    87.4 %      87.4 %
DT           86.3 %           13.7 %    86.2 %      86.2 %
The research work presents the performance comparison of these classification models based on the weighted averages of the different performance measures from Table 3.3, shown as a 3-D column chart in Figure 3.9 below. As before, these measures are the TP-Rate, FP-Rate, Precision, Recall, and F-Measure values derived from the confusion matrix of each classifier used here.
Figure 3.9: Comparison using TP-Rate/Recall, FP-Rate, Precision, and F-Measure
From Table 3.3, one can see that the values of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for the proposed ANN classifier are 97.5%, 3.4%, 97.4%, and 97.4% respectively, whereas for the SVM classifier these values are 87.4%, 12.8%, 87.4%, and 87.4% respectively. The DT model has TP-Rate/Recall, FP-Rate, Precision, and F-Measure values of 86.3%, 13.7%, 86.2%, and 86.2% respectively. Certainly, the ANN model has the highest values for TP-Rate/Recall, Precision, and F-Measure and the lowest value for FP-Rate among all; indeed, the FP-Rate is very high for the SVM and DT classifiers compared with the proposed ANN classification model. Regarding the F-Measure as the best single performance measure derived from a confusion matrix, the ANN model has the highest F-Measure at 97.4%, the SVM classifier has an F-Measure of 87.4%, while for the DT model the value is just 86.2%. Figure 3.9 also supports this observation. With regard to the different performance measures used, the simulation has obtained better results on average for the proposed combined MLP and Hopfield neural network approach compared with the SVM and DT based classification models. When tested on the real dataset covering a five-year period, the performance of the combined neural network model is more than satisfactory, as there are not substantial numbers of categorization errors. In fact, the neural network based model has the highest values for Accuracy, Kappa statistic index, TP-Rate/Recall, Precision, and F-Measure and
the lowest values for RMSE and FP-Rate on average. An algorithm with higher precision and lower error rates is considered effective because it has more powerful classification capability and predictive power in the weather prediction field. Therefore, the proposed ANN based model has outperformed the SVM and DT classifiers in all respects.
3.6 CONCLUSION

The present research study is the first weather prediction approach that effectively combines the MLP and Hopfield network models. In conclusion, the research study has accomplished the objective of analyzing and investigating the proposed method against SVM and DT based on different performance measures like Accuracy, RMSE, Kappa statistic, TP-Rate, FP-Rate, Precision, Recall, and F-Measure. Based on the performance evaluation using the weather database, the best method is the proposed ANN-based method. The proposed neural network based method has an accuracy of 97.6% on the dataset, which is undoubtedly better than that of the SVM and DT algorithms. This classifier also has the lowest RMSE and FP-Rate values and the highest F-Measure and Kappa statistic values compared with SVM and DT. The simulated results suggest that, among the three classifiers studied and analyzed, the proposed ANN classifier has the potential to considerably improve on conventional classification methods for use in the weather forecasting field. The combined approach to weather forecasting is capable of yielding good results and can be considered an alternative to traditional meteorological approaches. This approach is also able to determine the nonlinear relationship that exists between the historical data (temperature, wind speed, humidity, precipitation, air pressure, etc.) supplied to the system during the training phase and, on that basis, make a prediction of what the weather will be in the future.
3.7 REFERENCES

[1] R. Agrawal, T. Imielinski and A. Swami, "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914-925, 1993.
[2] M. S. Chen, J. Han and P. S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866-883, 1996.
[3] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, Second Edition, 2006.
[5] A. K. Pujari, Data Mining Techniques, Universities Press (India) Private Limited, First Edition, 2001.
[6] N. K. Bose and P. Liang, Neural Network Fundamentals with Graphs, Algorithms, and Applications, McGraw-Hill, pp. 119-209, 1996.
[7] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Second Edition, 1998.
[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[9] P. J. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, New York, NY: John Wiley & Sons, 1994.
[10] R. Rojas, Neural Networks: A Systematic Introduction, Springer-Verlag, Berlin, 1996.
[11] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, September 1995.
[12] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes: The Art of Scientific Computing, New York: Cambridge University Press, Third Edition, 2007.
[13] J. R. Quinlan, "Simplifying decision trees," International Journal of Man-Machine Studies, vol. 27, no. 3, pp. 221-234, 1987.
[14] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Belmont: Wadsworth, 1984.
[15] S. Ghosh, A. Nag, D. Biswas, J. P. Singh, S. Biswas, D. Sarkar, and P. P. Sarkar, "Weather Data Mining Using Artificial Neural Network," in Proceedings of the RAICS 2011 IEEE Conference, pp. 192-195, 2011.
[16] Y. Radhika and M. Shashi, "Atmospheric Temperature Prediction using Support Vector Machines," International Journal of Computer Theory and Engineering, vol. 1, no. 1, pp. 55-58, 2009.
[17] M. Hayati and Z. Mohebi, "Application of Artificial Neural Networks for Temperature Forecasting," World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Electronics and Communication Engineering, vol. 1, no. 4, pp. 654-658, 2007.
[18] B. A. Smith, R. W. McClendon, and G. Hoogenboom, "Improving Air Temperature Prediction with Artificial Neural Networks," International Journal of Computational Intelligence, vol. 3, no. 3, pp. 179-186, 2007.
[19] S. Chattopadhyay, "Feed forward Artificial Neural Network model to predict the average summer-monsoon rainfall in India," Acta Geophysica, vol. 55, no. 3, pp. 369-382, September 2007.
[20] S. S. Baboo and I. K. Shereef, "An Efficient Weather Forecasting System using Artificial Neural Network," International Journal of Environmental Science and Development, vol. 1, no. 4, pp. 321-326, October 2010.
[21] S. Kotsiantis, A. Kostoulas, S. Lykoudis, A. Argiriou, and K. Menagias, "Using data mining techniques for estimating minimum, maximum and average daily temperature values," International Journal of Mathematical, Physical and Engineering Sciences, vol. 1, no. 1, pp. 16-20, 2008.
[22] Weather database for the duration of 5 years (January 2007 to December 2011) [URL: http://www.wunderground.com/history].
[23] A. Elisseeff and H. Paugam-Moisy, "Size of multilayer networks for exact learning: analytic approach," Advances in Neural Information Processing Systems, vol. 9, USA: MIT Press, pp. 162-168, 1997.
[24] J. S. Armstrong and F. Collopy, "Error Measures For Generalizing About Forecasting Methods: Empirical Comparisons," International Journal of Forecasting, vol. 8, pp. 69-80, 1992.
[25] J. Carletta, "Assessing agreement on classification tasks: The Kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249-254, 1996.
[26] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment, vol. 62, no. 1, pp. 77-89, 1997.
CHAPTER 4

Breast Cancer Detection Using a Neuro-fuzzy Based Classification Method

4.1 INTRODUCTION

Data mining applications are used in the fields of medical science and bioinformatics for the diagnosis of critical diseases [1] [2]. Breast cancer [3] is a life-threatening disease and, aside from AIDS, has probably become the most intensely studied subject in the search for cures in the present decade. It is a kind of cancer arising from breast tissue cells, and the disease is far more common in adult females than in adult males. It can be very harmful to women because it can lead to the loss of a breast or can even be fatal. A medical survey [4] in 2008 reveals that breast cancer accounts for 22.9% of all cancers in women and results in 13.7% of cancer deaths among them. Medical diagnosis and survival rates for the disease mostly depend on the type and stage of the cancer, the method of treatment, and the geographical location of the patient. The diagnosis of breast cancer begins when a patient or physician finds a mass or an abnormal chemical change on a mammogram. Clinical screening or self-examination may also discover an irregularity in the shape of a woman's breast. Doctors weigh several factors when choosing a diagnostic test: the medical condition and age of the patient, the type of cancer believed likely, the severity of the symptoms, and previous clinical examination results. Physicians use diagnostic analysis for cancer and to determine whether it has spread to parts of the body outside the chest area. For most cancer types, a biopsy is the only way to diagnose cancer accurately. If a biopsy is not feasible, the clinician may suggest other medical tests that will help the diagnosis; well-known imaging tests such as diagnostic mammography, MRI, or ultrasound might be done to determine the severity of the breast cancer.
Classification [5], an important data mining technique, is used to diagnose and analyze breast cancer at an early stage. Early detection of the disease, followed by proper medical treatment, may save the life of the patient; such analysis is therefore essential in medical science as well as in bioinformatics. Classification is the method of determining a classifier or model that describes and discriminates several data classes from each other. Initially, the classification procedure applies some preprocessing tasks (data cleaning, data transformation, etc.) to the original data. Then, the method divides this preprocessed dataset into two different sections, namely the training dataset and the test dataset; these datasets should be independent of each other to avoid bias. Classification consists of two different steps. The first step develops a classification model indicating a well-defined set of classes. This is the training phase, where the classification technique constructs the model by learning from a given training dataset accompanied by the related class label attributes. After that, the classification model is applied for prediction in the testing phase. This step estimates the accuracy of the derived model using a test dataset independent of the training dataset. The Artificial Neural Network (ANN) [6] [7] [8] is a predominant data modeling tool that can perform intelligent tasks similar to the human brain. It is widely known for its excellent precision and extraordinary learning capability even when very little information is available. One of the reliable and efficient methods of data classification from the ANN domain is the Multilayer Perceptron (MLP) model [9] [10]. An MLP model contains several layers of nodes arranged in a directed graph structure, with connections between adjacent layers; MLP uses the backpropagation technique to train the network. The Support Vector Machine (SVM) [11] is a very powerful supervised learning model used in the machine learning field. It constructs one or more hyperplanes in a high-dimensional feature space for classification, regression, or other analysis tasks. SVM employs a hyperplane to differentiate between classes. When classes overlap, the hyperplane is chosen to minimize the error of the data points along or across the boundary between the classes; these points are the support points or support vectors.
Due to the presence of ambiguity in the training dataset, overlapping borders among classes, and vagueness in describing characteristics, some uncertainties can still occur at any stage of a classification system. The fuzzy set theory [12] [13] [14] is flexible enough to manage the various aspects of uncertainty in real-life situations. The Neuro-fuzzy [15] [16] technique is a combined approach based on ANN and fuzzy set theory. A typical Neuro-fuzzy approach exploits these two techniques in an efficient way: it combines the human-like logical reasoning of fuzzy systems with the learning and connectionist structure of ANNs to develop the Neuro-fuzzy System (NFS) model. The present work proposes a neuro-fuzzy based classification model for the detection of breast cancer. This research work is essentially based on the previous research work [17]. The research studies [18] [19] [20] [21] [22] [23] [24] [25] [26] also made significant contributions in this area that are worth mentioning. The work applies the proposed model, along with the MLP and SVM classifiers, to three standard UCI databases for performance comparison. These benchmark databases are Wisconsin Breast Cancer (WBC), Wisconsin Diagnostic Breast Cancer (WDBC), and Mammographic Mass (MM). The study has investigated the performances of the NFS, MLP, and SVM classifiers using different measures such as Accuracy, root-mean-square error (RMSE), True Positive Rate, False Positive Rate, Kappa statistic, Precision, Recall, and F-Measure. The research study is organized as follows: Section 4.2 explains the proposed method; Section 4.3 describes the methodology in detail; Section 4.4 presents the results of the performance analysis; and Section 4.5 states the conclusion of the work.
4.2 PROPOSED NEURO-FUZZY BASED METHOD

The research study intends to design a neuro-fuzzy based classification method for detecting breast cancer disease. The objective is to exploit the feature-wise degree of belonging of the sample data patterns (i.e., data records) to all the classes, attained using a fuzzification procedure. The study has used the sigmoidal membership function (MF) for fuzzification. The fuzzification process produces a membership matrix whose overall number of elements equals the product of the number of data patterns and the number of classes present in the database. The proposed hybrid classification system assigns memberships for each characteristic feature of a data pattern to the different classes, thereby developing the membership matrix. These matrix elements are the input to an ANN model, whose number of output nodes equals the number of data classes. A defuzzification process is then performed on the ANN output: a hard classification of the input data patterns is achieved using a maximum operation on the output of the ANN, as in a traditional fuzzy classification based system. The proposed Neuro-fuzzy classification method takes a training dataset comprising multiple data patterns, fuzzifies the data pattern values with a sigmoidal MF, and then computes the degree of membership of the individual patterns to the various classes. The study assumes that the database has Q input patterns and P data classes. The proposed method is divided into three different phases, which are described below.

A. Phase 1 (Fuzzification Phase): The primary phase, fuzzification, constructs a membership matrix of order Q × P that consists of the degrees of membership of the Q patterns to the P classes. Each item in this matrix is a membership function of the form $m_{i,j}(z_i)$, where $z_i$ is the i-th pattern value of the input pattern vector z, with indices i = 1, 2, ..., Q and j = 1, 2, ..., P. The membership function can thus be defined as

$$m_{i,j}(z_i) = \text{degree of membership of pattern } i \text{ with respect to class } j \qquad (4.1)$$

The input pattern vector z is designated as

$$z = [z_1, z_2, \ldots, z_Q]^T \qquad (4.2)$$

where T is the matrix transpose operator. For fuzzification, the study has used the well-known sigmoidal MF. It is asymmetric in nature and depends upon two parameters a and b, as specified by the equation

$$m_{i,j}(z_i) = m_{i,j}(z_i; a, b) = \frac{1}{1 + e^{-a(z_i - b)}} \qquad (4.3)$$
where the parameter a controls the slope at the crossover point $z_i = b$. The sigmoidal MF is open either to its right or to its left depending on the sign of the parameter a, and can thus represent 'very positive' or 'very negative'. It is also easy to model and train in an ANN model, which is why a Neuro-fuzzy model with a sigmoidal MF is proposed and used in this work. A sigmoidal MF is shown below in Figure 4.1.
Figure 4.1: A typical sigmoidal membership function
By modifying the values of the two parameters a and b, the study can find the desired MF, which offers more flexibility for the classification task. Applying the above MF, the membership matrix for a particular pattern vector z looks like this:

$$M(z) = \begin{bmatrix} m_{1,1}(z_1) & m_{1,2}(z_1) & m_{1,3}(z_1) & \cdots & m_{1,P}(z_1) \\ m_{2,1}(z_2) & m_{2,2}(z_2) & m_{2,3}(z_2) & \cdots & m_{2,P}(z_2) \\ m_{3,1}(z_3) & m_{3,2}(z_3) & m_{3,3}(z_3) & \cdots & m_{3,P}(z_3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ m_{Q,1}(z_Q) & m_{Q,2}(z_Q) & m_{Q,3}(z_Q) & \cdots & m_{Q,P}(z_Q) \end{bmatrix} \qquad (4.4)$$

where $m_{i,j}(z_i)$ is the degree of membership of the i-th pattern in input vector z to class j, with i = 1, 2, ..., Q and j = 1, 2, ..., P. For example, $m_{3,5}(z_3)$ denotes the degree of membership of the 3rd pattern to class 5. The membership matrix is used as input to an ANN model, as presented below.
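A minimal sketch of the Phase 1 fuzzification using equations (4.3)-(4.4) follows. The per-class parameters a and b here are assumed illustrative values, not parameters fitted in this study.

```python
import numpy as np

def sigmoidal_mf(z, a, b):
    # Equation (4.3): sigmoidal membership with slope a and crossover point b
    return 1.0 / (1.0 + np.exp(-a * (z - b)))

def membership_matrix(z, a, b):
    """Builds the Q x P matrix of equation (4.4) for a pattern vector z of
    length Q, given P-vectors a and b of per-class MF parameters."""
    return sigmoidal_mf(z[:, None], a[None, :], b[None, :])

z = np.array([0.2, 0.8, 0.5])     # Q = 3 pattern values (hypothetical)
a = np.array([4.0, -4.0])         # P = 2 classes; the sign sets the open side
b = np.array([0.5, 0.5])
M = membership_matrix(z, a, b)    # shape (3, 2), entries m_{i,j}(z_i)
```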
B. Phase 2 (Building the ANN model): The second phase builds an MLP classification model or classifier. In this phase, the above membership matrix is converted into a P × Q vector by transposing all rows and columns; this vector is then taken as input to an MLP classifier. The classifier consists of a single input layer, a single output layer, and one hidden layer in between. The number of nodes in the input layer equals the number of elements in the above-mentioned membership matrix, and the node count in the output layer equals the number of data classes in the database. This ANN model uses gradient descent with momentum as its learning rule and employs the tan-sigmoid as the transfer function in the hidden and output layers. The selection of the number of processing elements (PEs) in the hidden layer is also an essential parameter, and a thorough investigation helps to select it [27]. The number of nodes in the hidden layer, denoted by h, is given by the following equation as

$$h = \sqrt{\text{number of inputs} \times \text{number of outputs}} \qquad (4.5)$$
C. Phase 3 (Defuzzification Phase): The third and final phase, defuzzification, is the opposite of the first phase. In this phase, the proposed Neuro-fuzzy classifier performs a hard classification by applying a maximum operation to the activation outputs of the MLP model. An input pattern is associated with a specific class j provided that the pattern has the highest class membership value with respect to class j compared with the other classes. Therefore, an unknown pattern x is assigned to class j, based on the concept of "highest class membership value," if and only if

$$M_j(x) > M_i(x) \quad \forall\, i \in \{1, 2, \ldots, P\},\ i \neq j \qquad (4.6)$$

where $M_i(x)$ is the activation output of the i-th neuron in the output layer of the MLP model. The block diagram of the proposed Neuro-fuzzy system model is shown below in Figure 4.2.
Figure 4.2: The proposed Neuro-fuzzy system model
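A minimal sketch of the Phase 3 defuzzification rule of equation (4.6), assuming the MLP output activations are already available (the activation values below are hypothetical):

```python
import numpy as np

def defuzzify(activations):
    # Equation (4.6): assign the pattern to the class whose output neuron
    # has the highest activation (the maximum operation of Phase 3)
    return int(np.argmax(activations))

predicted_class = defuzzify(np.array([0.12, 0.85, 0.31]))   # -> class index 1
```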
4.3 METHODOLOGY

The methodology is built on the NFS, MLP, and SVM classification techniques, which are applied to three benchmark breast cancer datasets, namely Wisconsin Breast Cancer (WBC), Wisconsin Diagnostic Breast Cancer (WDBC), and Mammographic Mass (MM). The broad-level stages of the proposed methodology are described here in detail.

Stage 1: The following preprocessing techniques are applied to each dataset before the classification task:

Stage-1a. Data cleaning: It represents the preprocessing of data for eliminating or decreasing noise and the treatment of missing values. A missing value is normally substituted by the arithmetic mean of that attribute.

Stage-1b. Data transformation: In this step the dataset is normalized, because the ANN-based technique requires distance measurements in the training phase. It converts attribute values to a small-scale range like -1.0 to +1.0.

Stage 2: Afterwards, every single dataset is divided into two subsets, namely the training set and the test set. The study has employed the 10-fold cross-validation technique for creating the training and test datasets separately.

Stage 3: The proposed Neuro-fuzzy system (NFS) is applied to the training set for building a classification model. The training set is also given to the MLP and SVM techniques individually for developing the other classification models.

Stage 4: The three classification models (NFS, MLP, and SVM) are then applied to the test set for estimating the performance of each classifier.

Stage 5: The performance evaluations of these classifiers are then done on the basis of different performance measures like Accuracy, root-mean-square error (RMSE), Kappa statistic, Precision, Recall, and F-Measure.

The broad-level stages of the proposed methodology are depicted below in Figure 4.3.
Figure 4.3: Broad level stages of the proposed methodology
4.4 RESULTS AND DISCUSSION

After the models are built by the above-mentioned classification techniques, they are applied to the test dataset for performance evaluation. The work has estimated the performances of these classification models on the basis of different performance measures such as classification accuracy, root-mean-square error (RMSE) [28], and Kappa statistic [29], together with an assessment of the True Positive Rate (TP-Rate), False Positive Rate (FP-Rate), Precision, Recall, and F-Measure values resulting from the confusion matrix [30], as discussed in Section 3.4 of the third chapter. The three classification techniques, namely NFS, MLP, and SVM, are trained and tested on three benchmark breast cancer datasets using MATLAB software (version R2015a).
The MLP classification method uses the same set of configuration parameters as used in the MLP network structure of the proposed NFS model. Furthermore, the optimal configuration for developing an SVM classifier is described here. Several possible combinations, such as the number of folds used, the value of the random seed, and different kernel-based techniques, were investigated in simulation. Finally, an SVM model with a polynomial kernel was selected as the best option. A nonlinear version of SVM can be represented using a kernel function K as:

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \qquad (4.7)$$

where $\Phi(x)$ is the nonlinear mapping function employed to map the training tuples in the database. For SVM, a polynomial kernel of degree (exponent) d is defined as:

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^d \qquad (4.8)$$
The polynomial kernel based SVM model with parameters such as cache size = 250007 and exponent = 1.0 gives superior performance compared with the other possible models. The SVM model has other configuration parameters such as complexity parameter = 1.0, number of folds = -1, random seed value = 1, tolerance parameter = 0.001, and epsilon value for round-off error = 1.0e-12. In the testing phase, the NFS, MLP, and SVM classifiers are applied to the three UCI datasets for the investigation and performance analysis described below. The results described here are solely based on the simulations the study has carried out.
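A hedged sketch of such a polynomial-kernel SVM using scikit-learn is given below. The library choice and the parameter mapping are assumptions for illustration, since the study reports its configuration in MATLAB/Weka-style terms: C maps to the complexity parameter, tol to the tolerance, degree to the exponent d, and cache_size (in MB here) only loosely corresponds to the reported cache size.

```python
from sklearn.svm import SVC

# Polynomial kernel of equation (4.8): (x_i . x_j + 1)^d with d = 1 (exponent = 1.0)
svm = SVC(kernel="poly", degree=1, coef0=1.0,
          C=1.0, tol=1e-3, cache_size=200)
# Typical usage (X_train, y_train, X_test are hypothetical arrays):
# svm.fit(X_train, y_train); y_pred = svm.predict(X_test)
```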
4.4.1 Wisconsin Breast Cancer Database

The study has used the benchmark Wisconsin Breast Cancer (WBC) database [31] from the UCI machine learning repository for the diagnosis of breast cancer. The dataset has 699 tuples and contains 11 attributes. The class label attribute consists of two values, namely Benign and Malignant. The numbers of instances per class, with percentage values, are Benign: 458 (65.5%) and Malignant: 241 (34.5%). There are 16 instances in this dataset that contain a single missing attribute value, denoted by "?", which are resolved during the preprocessing phase. All the attributes of the dataset, along with their ranges of values, are shown in Table 4.1 below.
Table 4.1: The attributes of the WBC data set

Sl. No.   Attribute                      Domain
1.        Sample code number             id number
2.        Clump Thickness                1-10
3.        Uniformity of Cell Size        1-10
4.        Uniformity of Cell Shape       1-10
5.        Marginal Adhesion              1-10
6.        Single Epithelial Cell Size    1-10
7.        Bare Nuclei                    1-10
8.        Bland Chromatin                1-10
9.        Normal Nucleoli                1-10
10.       Mitoses                        1-10
11.       Class                          2 for benign, 4 for malignant
The meanings of each of the attributes in the table are described here. The first attribute, Sample code number, is not considered for classification. The remaining attributes with serial numbers 2 to 10 are important in the classification procedure and are referred to as the input attributes. The Clump thickness property indicates that cancerous cells are grouped in multiple layers, while normal (benign) cells tend to be combined in a single layer. Uniformity of cell size/shape denotes the similarity in size and shape of normal cells, whereas cancer cells tend to vary in size and shape. Marginal adhesion of breast tissue cells is an important characteristic for ascertaining cancer: the loss of adhesion is a symptom of malignancy, as cancer cells lose the ability to stay together. Regarding the feature Single epithelial cell size, the cell size is associated with the uniformity of cell size mentioned above; epithelial cells that are considerably enlarged may be malignant. The term Bare nuclei denotes nuclei not surrounded by cell cytoplasm; these are usually observed in benign tumors. Bland chromatin describes the uniform texture of the nucleus normally seen in benign cells; in cancer cells the chromatin becomes coarser. Normal nucleoli are very small structures observed in cells and are generally negligible in size; nucleoli in cancerous cells are usually more noticeable, and at times there is more than one. Mitosis is nuclear division combined with cytokinesis that creates two identical cells. Pathologists
generally count the number of mitoses to determine the degree of cancer. The 11th attribute is the class attribute; it has two values, namely benign (normal) and malignant (cancerous), interpreted by the numeric values 2 and 4 respectively. Each of the three classifiers, namely NFS, MLP, and SVM, is applied individually for classifying the test dataset. To evaluate the performance of the classifiers, the simulation uses different measures like Accuracy, root-mean-square error (RMSE), and Kappa statistic, as shown below in Table 4.2.

Table 4.2: Comparison based on Accuracy, RMSE, and Kappa statistic using WBC

Classifier   Accuracy   RMSE     Kappa statistic
NFS          97.8 %     0.1062   0.9784
MLP          86.3 %     0.1932   0.8614
SVM          87.6 %     0.1822   0.8762
From Table 4.2, it is seen that the classification accuracy of the NFS classifier is 97.8%; the MLP model has a classification accuracy of 86.3%, while the SVM model has a classification accuracy of 87.6%. Certainly, accuracy-wise, NFS has performed better than MLP and SVM. The study then investigates the performance comparison of these classifiers based on the RMSE and the Kappa statistic index. The research work has used the well-known RMSE measure, which should be as low as possible, and the Kappa statistic, which is regarded as a very good estimator of inter-rater agreement between classes. As is evident from Table 4.2, the Kappa statistic value of the selected algorithms lies near the range 0.81-1.0; following the interpretation of the Kappa statistic, the result is considered 'almost perfect agreement'. The results indicate that NFS gives the best performance, with an RMSE value of 0.1062 and a Kappa statistic index of 0.9784, followed by SVM with an RMSE value of 0.1822 and a Kappa statistic index of 0.8762; MLP stands last with the highest RMSE value (0.1932) and the lowest Kappa statistic value (0.8614). Considering the measures used, the SVM classifier has performed slightly better than the MLP classifier. Hence, based on measures like accuracy, RMSE, and Kappa statistic, the NFS model gives the best performance. Figure 4.4 below shows this statistical evidence using a 3-D column diagram.
Figure 4.4: Comparison of RMSE and Kappa statistic using the WBC dataset
These models are then compared in Table 4.3 below using the TP-Rate/Recall, FP-Rate, Precision, and F-Measure values constructed from the confusion matrix. When assessing classifier performance, one expects higher values of TP-Rate/Recall, Precision, and F-Measure, and a lower value of FP-Rate.

Table 4.3: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure using WBC

Classifier   TP-Rate/Recall   FP-Rate   Precision   F-Measure
NFS          97.7 %           3.2 %     97.6 %      97.6 %
MLP          86.2 %           12.8 %    86.2 %      86.2 %
SVM          87.5 %           11.7 %    87.4 %      87.4 %
From Table 4.3, it is seen that the values of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for the proposed NFS classifier are 97.7%, 3.2%, 97.6%, and 97.6% respectively, whereas for the MLP classifier they are 86.2%, 12.8%, 86.2%, and 86.2% respectively. The SVM model has TP-Rate (or Recall), FP-Rate, Precision, and F-Measure values of 87.5%, 11.7%, 87.4%, and 87.4% respectively. Certainly, the NFS model has the best values for TP-Rate, Precision, Recall, and F-Measure and the lowest value for the FP-Rate measure. Since the F-Measure is the best single performance metric resulting from the confusion matrix of an individual classifier, it is worth noting that the NFS model has the best F-Measure at 97.6%, the SVM classifier has an F-Measure of 87.4%, while for the MLP model the value is just 86.2%. This information is visualized using a 3-D column diagram in Figure 4.5 below.
Figure 4.5: Comparison between classifiers using confusion matrix measures for WBC dataset
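The confusion-matrix measures compared above can be derived as in the following short sketch (an illustration with toy labels, not the simulation output of the study):

```python
# A minimal sketch (not the thesis code) of deriving the confusion-matrix
# measures for a binary classifier; the toy label arrays are assumptions.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tp_rate = tp / (tp + fn)          # TP-Rate, also called Recall
fp_rate = fp / (fp + tn)          # FP-Rate
precision = tp / (tp + fp)
f_measure = 2 * precision * tp_rate / (precision + tp_rate)

print(tp_rate, fp_rate, precision, f_measure)
```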
Hence, based on all these performance measures, the NFS model has delivered the best performance.

4.4.2 Wisconsin Diagnostic Breast Cancer Database

Next, the research work has used the Wisconsin Diagnostic Breast Cancer (WDBC) database [32] from the UCI machine learning repository for classification performance analysis. The data set has 569 tuples and consists of 32 attributes, including the ID number, the diagnosis (class attribute), and 30 real-valued input features. The class attribute has only two values, namely Benign (B) and Malignant (M). The numbers of instances per class, with percentage values, are: Benign 357 (62.7%) and Malignant 212 (37.3%). Attributes 3 to 32 of this data set are calculated from the digitized image of a Fine Needle Aspirate (FNA) of a breast tissue mass; they identify the features of the cell nuclei present in this digitized image. The attributes in the WDBC data set are described below.

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued characteristics are calculated for each cell nucleus:
a) the radius (the mean value of distances from the centre to points on the perimeter)
b) the texture (the standard deviation of gray-scale values)
c) the perimeter
d) the area
e) the smoothness (the local variation in the radius lengths)
f) the compactness (perimeter^2 / area - 1.0)
g) the concavity (the severity of concave portions of the contour)
h) the concave points (the number of concave portions of the contour)
i) the symmetry
j) the fractal dimension ("coastline approximation" - 1)

Essentially, the mean value, standard error (SE) value, and largest value (mean of the three largest values) of these features are computed for each digitized image, resulting in exactly 30 features. For example, field 3 is the mean radius value, field 13 is the radius SE value, and field 23 is the largest radius value. All the feature values in the data set are coded with four significant digits. The first attribute, ID number, is the identification number of each tuple in the database and is irrelevant to the classification process; therefore it is not considered for classification. The second attribute, Diagnosis, denotes the class attribute and takes exactly two values, malignant and benign, denoted by 'M' and 'B' respectively. The three classification models, namely NFS, MLP, and SVM, are applied to the test data set for classification. The study has evaluated the performance of these classifiers on the basis of different performance measures like classification accuracy, RMSE, and the Kappa statistic, as presented below in Table 4.4.

Table 4.4: Comparison based on Accuracy, RMSE, and Kappa statistic using WDBC

Classifier   Accuracy   RMSE     Kappa statistic
NFS          95.9 %     0.1254   0.9504
MLP          84.4 %     0.2265   0.8552
SVM          82.7 %     0.2624   0.8302
From Table 4.4, it is understood that the classification accuracy of the NFS classifier is 95.9%. The MLP model has a classification accuracy of 84.4%, while the SVM model has an accuracy of 82.7%. Surely, NFS has better accuracy than MLP and SVM. The research work then conducts a performance comparison among these classifiers using evaluation metrics like RMSE and Kappa statistic. The Kappa statistic values of the selected classification algorithms lie within 0.81-1.0; by the interpretation of the Kappa statistic, the result is 'almost perfect agreement'. The results show that the NFS model gives the best performance, with an RMSE value of 0.1254 and a Kappa statistic of 0.9504, followed by MLP with an RMSE of 0.2265 and a Kappa statistic of 0.8552; SVM holds the last place, with the highest RMSE (0.2624) and the lowest Kappa statistic (0.8302). Figure 4.6 below shows this using a 3-D column diagram.
Figure 4.6: Comparison of RMSE and Kappa statistic using the WDBC dataset
Next, the performances of these classification models are compared using the TP-Rate/Recall, FP-Rate, Precision, and F-Measure values built from the confusion matrix of each classifier. The detailed accuracy measures for these classifiers are shown in Table 4.5 below.

Table 4.5: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure using WDBC

Classifier   TP-Rate/Recall   FP-Rate   Precision   F-Measure
NFS          95.7 %           4.6 %     95.8 %      95.7 %
MLP          84.2 %           13.2 %    84.2 %      84.2 %
SVM          82.6 %           14.5 %    82.5 %      82.5 %
From Table 4.5, it is seen that the values of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for the proposed NFS classifier are 95.7%, 4.6%, 95.8%, and 95.7% respectively, whereas for the MLP classifier these values are 84.2%, 13.2%, 84.2%, and 84.2% respectively. The SVM model has TP-Rate/Recall, FP-Rate, Precision, and F-Measure values of 82.6%, 14.5%, 82.5%, and 82.5% respectively. Indeed, the NFS model has the highest values for TP-Rate/Recall, Precision, and F-Measure and the lowest value for the FP-Rate measure. It is now easier to ascertain this statistical information using the 3-D column diagram indicated below in Figure 4.7.
Figure 4.7: Comparison between classifiers using confusion matrix measures for WDBC dataset
The NFS model has the highest F-Measure value, 95.7%; the MLP classifier has an F-Measure of 84.2%, while for the SVM model the value is just 82.5%. Therefore, the NFS model has given the best performance based on all these evaluation criteria.
4.4.3 Mammographic Mass Database

Lastly, the research work has used the Mammographic Mass (MM) database [33] from UCI for the research study. Over the last decades, numerous Computer Aided Diagnosis (CAD) tools have been utilized to decrease the high number of needless breast biopsies. These systems are a collaborative endeavour of many health groups. CAD-based systems assist physicians in deciding whether to perform a biopsy on a localized abnormal structural change in breast tissue or to rely on a mammogram test as an alternative. In this domain, the researchers typically employ a
tool called the Breast Imaging-Reporting and Data System (BI-RADS) for quality assurance; the tool has become popular for use with digital mammography. The data set contains 961 tuples and has 6 attributes. The attributes in this data set are the assessment of the BI-RADS tool; the Age of the patient; different BI-RADS features associated with a mammographic mass, such as Shape, Margin, and Density; and finally the class attribute, named Severity. The class label column consists of only two values, namely Benign (normal) and Malignant (cancerous). The numbers of instances per class, with percentage values, are: Benign 516 (53.7%) and Malignant 445 (46.3%). There are some missing attribute values in this data set, which are resolved during the preprocessing stage.

The implications of each of the attributes in the MM data set are described below. The first attribute denotes the BI-RADS assessment, which lies within 1 to 5; a higher value indicates a greater chance of malignancy. The second attribute, Age, represents the patient's age in years. The third column, Shape, designates the shape of a mammographic mass, and its value lies within 1 to 4. The fourth column, Margin, denotes the mass margin, and its value ranges from 1 to 5. The fifth column, Density, identifies the density of the mammographic mass, and its value lies within 1 to 4. The sixth attribute, Severity, denotes the class attribute and takes exactly two values, benign and malignant, indicated by the numeric values '0' and '1' respectively. All these features are associated with digital mammography assessment performed by CAD-based tools like BI-RADS.

As usual, the three classifiers, namely NFS, MLP, and SVM, are applied to the test data set for classification. The work has measured the performance of these classifiers using different evaluation metrics like classification accuracy, RMSE, and the Kappa statistic, as presented below in Table 4.6.

Table 4.6: Comparison based on Accuracy, RMSE, and Kappa statistic using MM

Classifier   Accuracy   RMSE     Kappa statistic
NFS          87.6 %     0.2759   0.7938
MLP          77.9 %     0.3753   0.6025
SVM          78.7 %     0.3623   0.6354
From Table 4.6, it is seen that the classification accuracy of the NFS classifier is 87.6%. The MLP model has an accuracy of 77.9%, while the SVM model has an accuracy of 78.7%. Evidently, NFS has better classification accuracy than MLP and SVM. The study then inspects the performance of these classifiers based on the RMSE and the Kappa statistic collected from Table 4.6. Here, the Kappa statistic values of the selected algorithms range from 0.61 to 0.80; following the interpretation of the Kappa statistic, this is considered 'substantial' agreement. Based on the above result, NFS again gives the best performance, with 0.2759 as the RMSE value and 0.7938 as the Kappa statistic, followed by SVM with 0.3623 as its RMSE value and 0.6354 as the Kappa statistic. MLP holds the last position, with the maximum RMSE value (0.3753) and the lowest Kappa statistic (0.6025). Figure 4.8 below demonstrates this information using a 3-D column diagram.
Figure 4.8: Comparison of RMSE and Kappa statistic using the MM dataset
Afterward, the study compares the performance of each classifier in Table 4.7 below, based on the TP-Rate/Recall, FP-Rate, Precision, and F-Measure metrics created from the confusion matrix.

Table 4.7: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure using MM

Classifier   TP-Rate/Recall   FP-Rate   Precision   F-Measure
NFS          87.5 %           10.8 %    87.6 %      87.5 %
MLP          77.8 %           20.1 %    77.8 %      77.8 %
SVM          78.6 %           19.3 %    78.5 %      78.5 %
From Table 4.7, it is observed that the values of the TP-Rate/Recall, FP-Rate, Precision, and F-Measure metrics for the NFS model are 87.5%, 10.8%, 87.6%, and 87.5% respectively, whereas the MLP classifier has these values as 77.8%, 20.1%, 77.8%, and 77.8% respectively. The SVM model has TP-Rate/Recall, FP-Rate, Precision, and F-Measure values of 78.6%, 19.3%, 78.5%, and 78.5% respectively. Certainly, the NFS model has the highest values for TP-Rate/Recall, Precision, and F-Measure and the lowest value for FP-Rate among all. It is now easier to validate this statistical information in the form of the 3-D column diagram specified below in Figure 4.9.
Figure 4.9: Comparison between classifiers using confusion matrix measures for MM dataset
Thus, based on all these assessment measures, the NFS model has delivered the best performance on the MM data set.

As a whole, with respect to all the performance evaluation metrics used for the three UCI datasets, the study has obtained superior results for the proposed NFS classification method compared to the MLP and SVM-based models. The NFS technique has the largest values for accuracy, Kappa statistic, TP-Rate/Recall, Precision, and F-Measure and the smallest values for the RMSE and FP-Rate metrics during the simulation. Indeed, the NFS method, showing excellent predictive ability and a reduced error rate, has outperformed the MLP and SVM models in all respects. Therefore, the proposed Neuro-fuzzy based method is highly efficient in predicting the presence (i.e., malignant) or absence (i.e., benign) of the breast cancer disease.
4.5 CONCLUSION

The research study has proposed a Neuro-fuzzy based classification method for breast cancer detection and successfully established its efficiency using three UCI datasets, namely WBC, WDBC, and MM. The method utilizes and integrates the primary benefits of artificial neural networks, such as immense parallelism, adaptivity, robustness, and optimality, with the imprecision and vagueness management capability of fuzzy sets. Furthermore, the proposed classification model builds a membership matrix that offers information on the feature-wise degree of membership of a data pattern to all the classes instead of considering only a specific class.

The research work analyzes and investigates the performance of the proposed NFS classification method compared to the MLP and SVM classifiers. The study has used different metrics, such as accuracy, RMSE, Kappa statistic, TP-Rate/Recall, FP-Rate, Precision, and F-Measure, to accomplish the performance evaluation. The proposed neuro-fuzzy classification method has an accuracy of 97.8% using the WBC dataset, 95.9% using the WDBC dataset, and 87.6% using the MM dataset. These results are significantly better than those of the MLP and SVM algorithms (average accuracy higher by more than 10 percentage points). Considering the three benchmark UCI datasets, the NFS classifier also has the lowest RMSE value and the highest F-Measure and Kappa statistic values compared to the MLP and SVM classifiers. Thus, it is concluded that the proposed classification method has great potential, in terms of classification capability and predictive power, for use in the Medical Science and Bioinformatics research fields.
4.6 REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914-925, December 1993.
[2] M. S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866-883, December 1996.
[3] Coyne and S. Borbasi, "Living the experience of breast cancer treatment: The younger women's perspective," Australian Journal of Advanced Nursing, vol. 26, no. 4, pp. 6-13, June 2009.
[4] P. Boyle and B. Levin, "World Cancer Report 2008," International Agency for Research on Cancer, World Health Organization, 2008.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann, pp. 285-378, 2005.
[6] R. Rojas, "The Backpropagation Algorithm," in Neural Networks: A Systematic Introduction, Springer-Verlag, Berlin, pp. 151-184, 1996.
[7] S. Haykin, Neural Networks: A Comprehensive Foundation, Second Edition, Prentice Hall, pp. 253-277, 1998.
[8] N. K. Bose and P. Liang, Neural Network Fundamentals with Graphs, Algorithms, and Applications, McGraw-Hill, pp. 119-209, 1996.
[9] P. J. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, John Wiley and Sons, New York, pp. 67-163, 1994.
[10] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[11] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, September 1995.
[12] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, no. 3, pp. 338-353, 1965.
[13] B. Liu, Uncertainty Theory: An Introduction to its Axiomatic Foundations, Springer-Verlag, Berlin, pp. 191-346, 2004.
[14] D. Dubois and H. Prade, Fuzzy Sets and Systems, Academic Press, New York, pp. 255-348, 1980.
[15] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, USA, pp. 333-393, 1997.
[16] C.-T. Lin and C. S. G. Lee, Neural Fuzzy Systems: A Neuro-Fuzzy Synergism to Intelligent Systems, Prentice Hall, pp. 313-459, 1996.
[17] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, "A Novel Neuro-fuzzy Classification Technique for Data Mining," Egyptian Informatics Journal, Elsevier, vol. 15, no. 3, pp. 129-147, November 2014.
[18] W. J. Clancey and E. H. Shortliffe, eds., Readings in Medical Artificial Intelligence: The First Decade, Addison-Wesley, Reading, Mass., pp. 339-360, 1984.
[19] O. Anunciacao, B. C. Gomes, S. Vinga, J. Gaspar, A. L. Oliveira, and J. Rueff, "A Data Mining Approach for the Detection of High-Risk Breast Cancer Groups," Advances in Intelligent and Soft Computing, vol. 74, Springer-Verlag, Berlin Heidelberg, pp. 43-51, 2010.
[20] W.-Y. Cheng, T.-H. O. Yang, and D. Anastassiou, "Development of a Prognostic Model for Breast Cancer Survival in an Open Challenge Environment," Science Translational Medicine, vol. 5, no. 181, pp. 1-12, April 2013.
[21] R. R. Janghel, A. Shukla, R. Tiwari, and R. Kala, "Breast cancer diagnosis using Artificial Neural Network model," in Proceedings of the 3rd IEEE International Conference on Information Sciences and Interaction Sciences, Chengdu, China, pp. 89-94, 2010.
[22] A. Keleş and A. Keleş, "Extracting fuzzy rules for diagnosis of breast cancer," The Turkish Journal of Electrical Engineering & Computer Science, vol. 21, no. 1, pp. 1495-1503, August 2013.
[23] D. Nauck, F. Klawonn, and R. Kruse, Foundations of Neuro-Fuzzy Systems, Wiley, Chichester, pp. 33-171, 1997.
[24] D. Venet, J. E. Dumont, and V. Detours, "Most random gene expression signatures are significantly associated with breast cancer outcome," PLoS Computational Biology, vol. 7, no. 10, pp. 1-8, October 2011.
[25] D. Hanahan and R. A. Weinberg, "Hallmarks of cancer: The next generation," Cell, Elsevier, vol. 144, no. 5, pp. 646-674, 2011.
[26] S. Ghosh, S. Mondal, and B. Ghosh, "A Comparative Study of Breast Cancer Detection Based on SVM and MLP BPN Classifier," in Proceedings of the 1st International IEEE Conference ACES, pp. 1-4, 2014.
[27] A. Elisseeff and H. Paugam-Moisy, "Size of multilayer networks for exact learning: analytic approach," Advances in Neural Information Processing Systems, vol. 9, MIT Press, USA, pp. 162-168, December 1996.
[28] J. S. Armstrong and F. Collopy, "Error Measures for Generalizing About Forecasting Methods: Empirical Comparisons," International Journal of Forecasting, vol. 8, no. 1, pp. 69-80, June 1992.
[29] J. Carletta, "Assessing agreement on classification tasks: The Kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249-254, June 1996.
[30] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment, vol. 62, no. 1, pp. 77-89, October 1997.
[31] Breast Cancer Wisconsin (Original) Data Set, UCI machine learning repository. [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29]
[32] Breast Cancer Wisconsin (Diagnostic) Data Set, UCI machine learning repository. [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29]
[33] Mammographic Mass Data Set, UCI machine learning repository. [https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass]
CHAPTER 5
Soil Classification from Large Imagery Databases using a Neuro-fuzzy Classifier

5.1 INTRODUCTION

Data mining [1] [2] techniques typically analyze large imagery databases and derive useful classifications and patterns to develop geographical information system (GIS) based frameworks for industrial, scientific, and commercial purposes. Research studies in the data mining field have used various techniques of data analysis, including decision trees, genetic algorithms, machine learning, and other statistical analysis methods [3] [4], for building such GIS-based frameworks. Large imagery soil databases, covering a vast geographical area, provide a distinctive prospect for developing land use and land cover information describing several soil classes. Regularly updated land use and land cover information is crucial to many GIS-based environmental and socio-economic applications, comprising regional and urban planning, conservation and management of natural resources, etc. This research objective leads to a new area of data mining referred to as soil data mining. The present work outlines a research study involving useful data mining techniques to improve the effectiveness and classification accuracy on massive imagery soil databases.

General information related to land comes from soil survey analysis: the procedure of categorizing various soil types, or using other attributes of the ground cover, over a particular geographical area for soil mapping [5]. Field sampling usually provides the major data for soil survey analysis. The goal is to perform categorization of large imageries in soil mapping; this kind of information can be very beneficial for researchers and knowledge workers performing the analysis. Classification [6], as an important data mining technique, is used to perform soil surveys for building GIS-based frameworks. Thus, soil classification, as a form of soil survey analysis, can analyze large imagery databases and develop worthwhile
knowledge for GIS-based frameworks. That is why such analysis is essential for the research field of soil data mining. Various classification techniques have been established for producing updated land use and land cover information at diverse scales.

Artificial Neural Network (ANN) [7] [8] [9] is a prevailing modeling tool that can achieve human-like rational thinking. It is widely known for excellent accuracy and an extraordinary learning capability even when only negligible facts are accessible. One of the proficient classification methods from the ANN domain is the Multilayer Perceptron (MLP) model [10] [11]. An MLP model contains several layers of nodes arranged in a directed graph structure, with connections between adjacent layers; MLP uses the backpropagation technique to train the network. The Radial Basis Function Network (RBFN) [12] [13] is another influential model, which uses a radial basis function (RBF) as the activation function. The output of an RBFN is a linear combination of RBFs of the inputs and neuron parameters. RBFNs are typically used in system control, approximation of functions, classification, prediction of time series, etc.

The k-Nearest Neighbour (k-NN) [14] is an instance-based learning method for classifying objects using the rationale of the nearest training examples within the search space. The procedure compares a given test pattern with training patterns that are similar to it. A suitable distance metric is used to assign a new data point to the most frequently occurring class in its neighbourhood; typical distance metrics are the Euclidean distance for continuous-valued variables and the Hamming distance for discrete-valued variables, as sketched below. The method works well in classifying imagery databases where the data reveal spatial properties.

Support Vector Machine (SVM) [15] is another powerful supervised learning model used in the machine learning field. It constructs one or more hyperplanes in a high-dimensional feature space for regression, classification, or other analysis tasks. SVM employs a hyperplane to differentiate between classes; when classes overlap, the hyperplane is chosen to minimize the error of the data points along the boundary between classes. These points are the support points or support vectors.
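For illustration, the two distance metrics mentioned for k-NN can be sketched in Python as follows (the sample vectors are assumptions, not data from the study):

```python
# A small illustrative sketch (not from the thesis) of the two k-NN
# distance metrics described above.
import numpy as np

def euclidean(p, q):
    # Euclidean distance for continuous-valued feature vectors.
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def hamming(p, q):
    # Hamming distance: count of positions where discrete values differ.
    return sum(a != b for a, b in zip(p, q))

print(euclidean([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # sqrt(14) ~ 3.742
print(hamming(['a', 'b', 'c'], ['a', 'x', 'c']))     # 1
```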
However, no matter how good a classifier is, several uncertainties or ambiguities can still arise at any stage of a classification procedure. They occur due to the existence of vagueness in the input data, intersecting boundaries between several classes, and imprecision in describing features. The fuzzy set theory [16] [17] [18] is flexible enough to deal with the several aspects of indecisiveness in real-life circumstances. ANN combined with the fuzzy set theory based method gives the Neuro-fuzzy (NF) technique [19] [20] [21]. The hybrid approach combines the human-like rational thinking of fuzzy based models with the learning capability and connection-oriented structure of ANNs by using fuzzy sets and linguistic model-based procedures. One variant of the neuro-fuzzy model is the Adaptive Neuro Fuzzy Inference System (ANFIS) [19]. ANFIS is a powerful classification model that generates a set of interpretable IF-THEN rules. The nodes of an adaptive neural network are associated with certain parameters that decide the final output. It normally employs a hybrid learning algorithm, combining gradient descent and least-squares methods, to adjust the neural network parameters efficiently in an adaptive network.

The research study proposes a Neuro-fuzzy system (NFS) based classification technique to determine various soil classes from large imagery soil databases. Basically, it is an extension of the previous research work [22] mentioned in chapter 4. The research studies [23] [24] [25] [26] [27] [28] [29] [30] [31] also made some noteworthy contributions in the area of soil classification. The proposed method considers the feature-wise degrees of membership of large imagery databases to the existing soil classes, computed using a fuzzification process. The process produces a membership matrix with an element count equal to the product of the number of data records and the number of classes present. These matrix elements are the input to an artificial neural network. The present work applies this technique to three UCI databases, namely Statlog Landsat Satellite, Forest Covertype, and Wilt, for soil classification. The study aims to find the soil classes using the proposed NFS classifier and then compare its performance with RBFN, k-NN, SVM, and ANFIS-based models. Numerous measures, for example root-mean-square error (RMSE), Kappa statistic, Accuracy, False Positive Rate, True Positive Rate, Precision, Recall, and F-Measure, are used for the quantitative analysis of the results.
The research work is organized in the following manner: Section 5.2 gives a description of the proposed Neuro-fuzzy classifier. Section 5.3 explains the detailed procedure in terms of the proposed Neuro-fuzzy classification method and the RBFN, k-NN, SVM, and ANFIS classifiers. Section 5.4 discusses the classifier performance analysis and simulation results, and Section 5.5 provides the concluding remarks.
5.2 PROPOSED NEURO-FUZZY CLASSIFIER

The present work proposes a Neuro-fuzzy classification method to identify different soil types from large imagery databases. In the area of soil classification using imageries, researchers are still in search of a procedure that provides enhanced performance. Essentially, the proposed neuro-fuzzy technique here is tuned to provide optimized performance for soil classification from large imagery databases. As already stated, it is an extension of the earlier research work [22] mentioned in chapter 4. The basic idea is to consider the feature-wise degrees of membership of the imagery databases to the soil classes, computed using a fuzzification process. The proposed hybrid classifier uses a Gaussian membership function (MF) for the fuzzification purpose. As usual, the process generates a membership matrix with a total number of elements equal to the product of the number of data records and the number of obtainable classes. The proposed classifier gives memberships for each data pattern in the imagery database to all existing classes to form the membership matrix. These matrix elements are the input to an ANN model, whose number of output neurons equals the number of soil classes present. A defuzzification process is then applied to the ANN output: a hard classification of the input data patterns is achieved using a maximum operation on the output of the ANN, as in a traditional fuzzy classification based system. The proposed Neuro-fuzzy method is divided into three phases, described below.
A. First Phase (Fuzzification):

The first phase, fuzzification, takes a sample data set containing various data patterns, fuzzifies the pattern values with a Gaussian MF, and then computes the degree of membership of each data pattern to the various classes. In that sense, it is quite similar to the fuzzification process mentioned in chapter 4. Suppose the data set consists of B input patterns and A data classes. The data set can then be defined in terms of the input pattern vector x, where 'T' denotes the matrix transpose operation:

x = [x_1, x_2, \ldots, x_B]^T \qquad (5.1)
The phase fundamentally constructs a membership matrix of order (B × A) from the input pattern vector x, consisting of the degrees of membership of the B patterns to the A classes. Each element in the matrix is a membership function of the form g_{u,v}(x_u), where x_u is the u-th feature value of the input pattern vector x, with indices u = 1, 2, ..., B and v = 1, 2, ..., A. Thus, the membership function is described as follows:

g_{u,v}(x_u) = \text{degree of membership of pattern } u \text{ with respect to class } v \qquad (5.2)
As already stated, the research work has used the well-known Gaussian MF for fuzzification. The Gaussian curve MF has a smooth curvature and is non-zero at all points. It is also symmetric in nature and depends upon two parameters, σ and c, as given by the following equation:

g_{u,v}(x_u) = g_{u,v}(x_u; \sigma, c) = e^{-\frac{(x_u - c)^2}{2\sigma^2}} \qquad (5.3)
Basically, the parameters c and σ here represent the Gaussian curve MF's center and width respectively; they control the shape and curvature of the MF. By modifying the values of these parameters, the desired MF can be achieved, which offers more flexibility for classification. The MF is easy to model and train in an ANN model, and it is suitable for performing image classification using imagery databases. So a Neuro-fuzzy model with a Gaussian MF, which is straightforward to implement in the neural network model, is proposed and used in this work. The Gaussian curve MF used here is shown below in Figure 5.1.
Figure 5.1: The Gaussian curve membership function used in fuzzification
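A minimal Python sketch of evaluating the Gaussian MF of Eq. (5.3) is given below; the sample feature column is an assumption, and the center/width recipe follows the description in the next paragraph:

```python
# A minimal sketch (an illustration, not the thesis code) of the
# Gaussian MF of Eq. (5.3) with parameters chosen from a data column.
import numpy as np

def gaussian_mf(x, c, sigma):
    # Degree of membership of feature value x, per Eq. (5.3).
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

# Hypothetical training values for one feature of one class.
col = np.array([4.1, 4.8, 5.0, 5.4, 6.2])
c = col.mean()                         # center: mean of the data
sigma = (col.max() - col.min()) / 2.0  # width: half the value range

print(gaussian_mf(5.0, c, sigma))  # near the center, membership ~ 1
print(gaussian_mf(7.5, c, sigma))  # far from the center, membership drops
```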
The Gaussian curve MF presented in Figure 5.1 has its center at point c, with c = (a + b)/2.0, where a and b are the two crossover points of the curve. The curve is symmetric in nature, and therefore σ = |b - c| = |c - a|. The membership value at the crossover points is 0.5, and at the center c its value is 1.0 (i.e., maximum). The membership values are selected so that every training pattern acquires a membership value of 1.0 when it is at the center of the MF; as it moves away from the center, its value gradually drops and reaches 0.5 at the boundary of the training data set. The center position c is taken as the mean (i.e., average) value of the training data set, that is, c = mean_val(x) (the mean value of the data set for a pattern x). The two crossover points a and b of the Gaussian MF curve are calculated as a = mean_val(x) - [max_val(x) - min_val(x)]/2.0 and b = mean_val(x) + [max_val(x) - min_val(x)]/2.0, where min_val and max_val are the minimum and maximum values, respectively, of the data set for a specific data pattern x. Such a selection of a and b confirms that most of the training data patterns will have membership values ≥ 0.5, while test data patterns will have membership values in the interval [0, 1]. The study applies the above MF to the input pattern vector x to construct the membership matrix. The resultant matrix looks like this:
G(x) = \begin{bmatrix}
g_{1,1}(x_1) & g_{1,2}(x_1) & g_{1,3}(x_1) & \cdots & g_{1,A}(x_1) \\
g_{2,1}(x_2) & g_{2,2}(x_2) & g_{2,3}(x_2) & \cdots & g_{2,A}(x_2) \\
g_{3,1}(x_3) & g_{3,2}(x_3) & g_{3,3}(x_3) & \cdots & g_{3,A}(x_3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
g_{B,1}(x_B) & g_{B,2}(x_B) & g_{B,3}(x_B) & \cdots & g_{B,A}(x_B)
\end{bmatrix} \qquad (5.4)
Here g_{u,v}(x_u) is the degree of membership of the u-th data pattern in input vector x to class v, with u = 1, 2, ..., B and v = 1, 2, ..., A. For example, g_{2,4}(x_2) denotes the degree of membership of the 2nd pattern to class 4. The membership matrix is used as input to an ANN model, as presented below.

B. Second Phase (Developing the ANN model):

The second phase constructs an MLP classifier. In this phase, the membership matrix mentioned above is transformed into an (A × B) vector by performing a transpose operation on all tuples and attributes. The vector is the input to an MLP classifier. The classifier consists of a single input layer, a single output layer, and two hidden layers in between. Hence, the current ANN model is structure-wise different (i.e., having two hidden layers) from the one discussed in chapter 4. Table 5.1 below provides the values of several configuration parameters used in the MLP model.

Table 5.1: Configuration parameters used in the MLP model

Network Parameter                    Value
Number of hidden layers              two
Number of neurons in input layer     element count in membership matrix
Number of neurons in output layer    data classes present
Learning rule                        gradient descent with momentum
Transfer function used               tan-sigmoid
The selection of the number of neurons in the hidden layers is also an important design choice, and a thorough investigation helps in selecting the number of processing elements (PEs) present in the hidden layers [32]. The number of PEs in the first hidden layer, denoted h_1, is given by the following equation:

h_1 = \sqrt{\text{number of inputs} \times \text{number of outputs}} \qquad (5.5)

The number of neurons in the second hidden layer, denoted h_2, is specified by the equation:

h_2 = \frac{2}{3}\sqrt{\text{number of inputs} \times \text{number of outputs}} \qquad (5.6)
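A small sketch of these sizing rules, as reconstructed in Eqs. (5.5)-(5.6), follows; the input/output counts are illustrative assumptions:

```python
# An illustrative sketch of the hidden-layer sizing heuristics of
# Eqs. (5.5)-(5.6); the example counts below are assumptions.
import math

def hidden_layer_sizes(n_inputs, n_outputs):
    h1 = round(math.sqrt(n_inputs * n_outputs))                 # Eq. (5.5)
    h2 = round((2.0 / 3.0) * math.sqrt(n_inputs * n_outputs))   # Eq. (5.6)
    return h1, h2

# e.g., 36 membership inputs and 6 soil classes (Landsat-like sizes)
print(hidden_layer_sizes(36, 6))  # -> (15, 10)
```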
C. Third Phase (Defuzzification):

The final phase employs the defuzzification procedure, which is conceptually the opposite of the first phase (as mentioned in chapter 4). In this phase, the proposed classification model implements a hard classification by applying a maximum operation to the activation outputs of the MLP model. An input pattern z is assigned to the class t, based on the concept of "maximum class membership value", if and only if

G_t(z) > G_w(z) \quad \forall w \in \{1, 2, \ldots, A\},\; w \neq t \qquad (5.7)

where G_w(z) is the activation output of the w-th node in the last (output) layer of the MLP model. The block diagram of the proposed neuro-fuzzy classifier is shown below in Figure 5.2.
Figure 5.2: The proposed Neuro-fuzzy classifier
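To make the three phases concrete, the following is a minimal Python sketch, an illustration rather than the thesis's implementation: it fits per-class Gaussian MF parameters, builds a (B × A) membership matrix by averaging the feature-wise memberships per class (this averaging is an assumption of the sketch), trains a stand-in scikit-learn MLP in place of the configuration of Table 5.1, and lets predict() perform the argmax defuzzification of Eq. (5.7):

```python
# A condensed, assumption-heavy sketch of the three phases; the toy
# data, the feature-averaging step, and the MLP settings are all
# illustrative choices, not the thesis configuration.
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_mf_params(X, y):
    # Phase 1 prep: per class and per feature, center = mean and
    # width = half the value range, as described for Eq. (5.3).
    params = {}
    for cls in np.unique(y):
        Xc = X[y == cls]
        c = Xc.mean(axis=0)
        sigma = (Xc.max(axis=0) - Xc.min(axis=0)) / 2.0 + 1e-12
        params[cls] = (c, sigma)
    return params

def fuzzify(X, params):
    # Phase 1: membership matrix of order (B x A); each entry is the
    # Gaussian membership of a pattern to a class, averaged over features.
    cols = []
    for cls, (c, sigma) in sorted(params.items()):
        g = np.exp(-((X - c) ** 2) / (2.0 * sigma ** 2))
        cols.append(g.mean(axis=1))
    return np.column_stack(cols)

# Hypothetical toy data: 6 patterns, 2 features, 2 classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2],
              [5.0, 6.0], [5.2, 5.8], [4.9, 6.1]])
y = np.array([0, 0, 0, 1, 1, 1])

params = fit_mf_params(X, y)
G = fuzzify(X, params)

# Phases 2-3: the membership matrix feeds an MLP; predict() applies an
# argmax over the class outputs, mirroring the rule of Eq. (5.7).
clf = MLPClassifier(hidden_layer_sizes=(4, 3), max_iter=2000,
                    random_state=0).fit(G, y)
print(clf.predict(fuzzify(X, params)))
```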
5.3 DETAILED PROCEDURE

The research study makes use of the NFS, RBFN, k-NN, SVM, and ANFIS classification techniques on three large imagery soil data sets, namely Statlog Landsat Satellite, Forest Covertype, and Wilt. The broad-level stages of the procedure are described here in detail.

Stage 1: Initially, some preprocessing techniques are applied to each soil data set before the classification task.

Stage-1a. Data cleaning: the preprocessing of data to exclude or diminish noise and to manage missing values. A missing value is usually substituted by the arithmetic mean of that attribute.

Stage-1b. Data transformation: this step normalizes the data set, because the neural network model requires distance measurements for classification analysis. It transforms the attribute values to a small interval such as -1.0 to +1.0.

Stage 2: Afterward, every data set is divided into two separate subsets, the training set and the test set. The present work employs the 10-fold cross-validation technique to generate the training and test data sets so that they are independent of each other, avoiding bias.

Stage 3: The proposed NFS technique employs the training set for building a classification model. The training set is also supplied to the RBFN, k-NN, SVM, and ANFIS techniques independently for building the other models.

Stage 4: The five classification models (NFS, RBFN, k-NN, SVM, and ANFIS) are then applied to the test data set for assessing the performance of each classifier.

Stage 5: Well-known metrics such as root-mean-square error (RMSE), Kappa statistic, accuracy, False Positive Rate (FP-Rate), True Positive Rate (TP-Rate), Precision, Recall, and F-Measure are used for the quantitative analysis of the results generated by these classifiers.

The broad-level stages of the detailed procedure are portrayed below in Figure 5.3.
Figure 5.3: Broad level stages of the detailed procedure
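The preprocessing and validation stages above can be expressed as a short pipeline; in the sketch below, the synthetic data, the SVC stand-in classifier, and all parameter values are illustrative assumptions rather than the thesis setup:

```python
# A minimal sketch of Stages 1-2: mean imputation, scaling to [-1, +1],
# and 10-fold cross-validation; everything here is an assumed stand-in.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)         # toy labels
X[rng.random(X.shape) < 0.05] = np.nan          # some missing values

pipe = make_pipeline(
    SimpleImputer(strategy="mean"),             # Stage-1a: data cleaning
    MinMaxScaler(feature_range=(-1.0, 1.0)),    # Stage-1b: transform to [-1, +1]
    SVC(kernel="rbf"),                          # stand-in classifier
)

scores = cross_val_score(pipe, X, y, cv=10)     # Stage 2: 10-fold CV
print(scores.mean())
```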
5.4 RESULTS AND DISCUSSION

The five classification techniques, namely NFS, RBFN, k-NN, SVM, and ANFIS, are trained and tested on three UCI soil data sets using the MATLAB software (version R2015a). So far, the research study has discussed the configuration of the NFS model only; the present section describes the configurations of the other classification models used in the simulation.
SVM: The study uses an SVM model with a Gaussian radial-basis function (RBF) kernel for image classification. A nonlinear version of SVM can be represented using a kernel function K as:

K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \qquad (5.8)

Here \Phi(x) is the nonlinear mapping function employed to map the data tuples in the imagery database. An SVM model with a Gaussian RBF kernel is defined as:

K(x_i, x_j) = e^{-\frac{\| x_i - x_j \|^2}{2\sigma^2}} \qquad (5.9)
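A small sketch of the RBF kernel of Eq. (5.9) follows; the sigma value and sample vectors are assumptions. Note that scikit-learn's SVC expresses the same kernel through gamma = 1/(2σ²):

```python
# An illustrative sketch (not the thesis code) of the Gaussian RBF
# kernel of Eq. (5.9).
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2))
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))  # distant points -> near 0
```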
k-NN: The instance-based learning model here uses the Mahalanobis distance as the distance metric to select the nearest neighbours.

RBFN: The radial-basis function network (RBFN) model employs a Gaussian radial-basis function (RBF) activation function in the hidden layer units. It uses the fuzzy C-means algorithm to find the RBF centers.

ANFIS: A Sugeno-type Fuzzy Inference System (FIS) model. The classifier applies a hybrid learning algorithm combining backpropagation gradient descent and least-squares methods to adjust the neural network parameters.

After building the models with the classification techniques mentioned above, the classifiers are applied to the test data set for performance evaluation. The study estimates the performance of these models using various evaluation metrics, such as root-mean-square error (RMSE) [33], Kappa statistic [34], and various measures derived from the confusion matrix [35]. The confusion matrix measures are accuracy, False Positive Rate (FP-Rate), True Positive Rate (TP-Rate), Precision, Recall, and F-Measure. The study applies the NFS, RBFN, k-NN, SVM, and ANFIS classifiers to three UCI imagery data sets for performance analysis, as described below. The results reported here are solely based on the simulation.
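For the Mahalanobis-based k-NN configuration described above, a minimal scikit-learn sketch looks as follows; the synthetic data and parameter choices are assumptions:

```python
# A sketch (assumed setup, not the thesis code) of a k-NN classifier
# using the Mahalanobis distance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The Mahalanobis distance needs the inverse covariance matrix 'VI'.
VI = np.linalg.inv(np.cov(X, rowvar=False))
knn = KNeighborsClassifier(n_neighbors=5, metric="mahalanobis",
                           metric_params={"VI": VI}, algorithm="brute")
knn.fit(X, y)
print(knn.score(X, y))
```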
5.4.1 Statlog Landsat Satellite Database

The research work uses the UCI Statlog Landsat Satellite [36] database of agricultural land in Australia to classify different soil classes constituting dissimilar soil types. The database was built considering only a small division (82 rows × 100 columns) of the original Landsat multispectral scanner (MSS) imaging data set. It has four spectral bands in a single image frame. In a satellite image, a 3 × 3 (= 9) square neighbourhood of pixels was designated, and the corresponding four spectral values of the pixels were computed. This multivariate data set consists of 6435 tuples, 36 (= 9 pixels in the neighbourhood × 4 spectral bands) input features, and one class attribute. Each of the input attributes is quantitative in nature, and its value lies between 0 and 255. The classification method is associated with the central pixel of each neighbourhood zone; thus, the work considers only four input attributes, as suggested by UCI. The class label attribute comprises six values indicating six categories of soil: red soil (class 1), cotton crop soil (class 2), grey soil (class 3), damp grey soil (class 4), vegetation stubble soil (class 5), and very damp grey soil (class 7). There is no example of category 6 (mixture type) in the data set. Each of the five classifiers, namely NFS, RBFN, k-NN, SVM, and ANFIS, is applied to the test data set for analysis. Their performances are assessed using measures like RMSE, classification accuracy, and Kappa statistic, as given below in Table 5.2.

Table 5.2: Comparison of Accuracy, RMSE, and Kappa statistic for Landsat Satellite dataset

Classifier   Accuracy   RMSE     Kappa statistic
NFS          97.6 %     0.1446   0.9283
RBFN         86.7 %     0.2798   0.8167
k-NN         87.5 %     0.2634   0.8243
SVM          85.4 %     0.2917   0.8051
ANFIS        90.7 %     0.2386   0.8651
From Table 5.2, it is seen that the NFS classification model has a classification accuracy of 97.6%. The accuracy values of the RBFN, k-NN, SVM, and ANFIS models are 86.7%, 87.5%, 85.4%, and 90.7% respectively. So, based on accuracy, NFS has performed better than RBFN, k-NN, SVM, and ANFIS. The study then analyzes each classifier's performance using the RMSE and Kappa statistic values. The RMSE and Kappa statistic values of any classifier lie between 0.0 and 1.0; a lower RMSE value indicates a better classifier performance. The Kappa statistic estimates the chance-corrected agreement between the classified data and the reference data, so a higher Kappa value is desirable for a classifier.
It is seen that the Kappa statistic values of the designated algorithms lie within 0.81-1.0. Following the interpretation of the Kappa statistic, the performance of these classification approaches indicates 'almost perfect agreement'. Based on the result, NFS holds the first position with an RMSE of 0.1446 and a Kappa value of 0.9283. ANFIS comes next, with an RMSE of 0.2386 and a Kappa value of 0.8651, followed by the k-NN model with an RMSE of 0.2634 and a Kappa value of 0.8243, and then RBFN with an RMSE of 0.2798 and a Kappa value of 0.8167. The SVM model is the worst performer, with the highest RMSE (0.2917) and the lowest Kappa value (0.8051). Therefore, using evaluation measures such as RMSE, Accuracy, and Kappa statistic, the proposed NFS classifier has performed the best. The information is represented as a 3-D column diagram in Figure 5.4 below, showing a performance comparison of these classifiers.
Figure 5.4: Comparison of RMSE and Kappa statistic for Statlog Landsat Satellite data set
Then, the classification models are compared using the TP-Rate/Recall, FP-Rate, Precision, and F-Measure metrics derived from the confusion matrix of each classifier. The detailed accuracy for these classification models is presented below using percentage values (%) in Table 5.3. For assessing the performance of a classifier, higher values of TP-Rate/Recall, Precision, and F-Measure, and a smaller value of FP-Rate, are desirable.
Table 5.3: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for Landsat Satellite dataset

Classifier   TP-Rate/Recall   FP-Rate   Precision   F-Measure
NFS          97.6 %           3.1 %     97.4 %      97.5 %
RBFN         86.7 %           12.4 %    86.6 %      86.6 %
k-NN         87.5 %           10.7 %    87.4 %      87.4 %
SVM          85.3 %           13.5 %    85.2 %      85.2 %
ANFIS        90.6 %           9.4 %     90.6 %      90.6 %
The research work compares each classifier's performance using the values of the evaluation metrics from Table 5.3. The information is presented as a 3-D column diagram in Figure 5.5 below.
Figure 5.5: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for Statlog Landsat Satellite data set
From Table 5.3 and Figure 5.5, it is noticed that the values of the TP-Rate/Recall, FP-Rate, Precision, and F-Measure metrics for the NFS classifier are 97.6%, 3.1%, 97.4%, and 97.5% respectively, whereas the RBFN classifier has these values as 86.7%, 12.4%, 86.6%, and 86.6% respectively. The k-NN model has the values 87.5%, 10.7%, 87.4%, and 87.4%, and the SVM model has 85.3%, 13.5%, 85.2%, and 85.2% respectively. The ANFIS classifier has these values as 90.6%, 9.4%, 90.6%, and 90.6% respectively. Assuredly, NFS has the highest values for the TP-Rate/Recall, Precision, and F-Measure metrics and the lowest value for the FP-Rate compared to the other classifiers. Researchers typically consider the F-Measure the most informative single measure derived from the confusion matrix. Accordingly, the NFS model has the highest F-Measure value, 97.5%; the ANFIS model has an F-Measure of 90.6%, the k-NN model 87.4%, the RBFN classifier 86.6%, and the SVM model 85.2%. Undeniably, the NFS classifier has performed significantly better than the other classifiers in all respects.
5.4.2 Covertype Database

Afterward, the study uses the UCI Covertype database [37] to forecast the soil classes constituting dissimilar forest cover types using cartographic variables. The database was created by choosing only a small segment (30 × 30-meter cells) from Region 2 of the United States Forest Service (USFS) Resource Information System (RIS) data. The independent cartographic variables were derived from records originally obtained from the USFS and the United States Geological Survey (USGS) archives. The records were in raw form and consist of binary columns of information for independent qualitative variables representing wilderness areas and land types. This database was not developed using remotely sensed imagery data, but it is certainly a classical example of a massive imagery soil database used within the research community. The area of study comprised four wilderness areas located in the Roosevelt National Forest in north central Colorado in the United States: Rawah (area 1), Neota (area 2), Comanche Peak (area 3), and Cache la Poudre (area 4). These zones represent forests with minimal human-induced disturbance, so the existing forest cover types are more a consequence of ecological processes than of forest management practices. Of them, area 2 had the highest mean elevation, followed by area 1 and area 3, while area 4 had the lowest mean elevation.

This multivariate data set contains 5,81,012 rows and 55 columns. Of them, 54 columns (10 quantitative variables, four binary wilderness areas, and 40 binary soil type variables) are the input features, and the last column denotes a class attribute with seven soil class values, ranging from soil type 1 to soil type 7. So there are fifty-four input attributes and one class attribute present in the given database. Traditional classification techniques might consume a huge amount of computer time to build models from such a gigantic imagery database, so it is recommended to take only a certain percentage of data from the original database to reduce the complexity of the whole procedure. Table 5.4 below presents the information related to the Covertype data set attributes.

Table 5.4: Attribute Information of Forest Covertype data set

Serial Number   Attribute                              Data Type
1               Elevation                              quantitative
2               Aspect                                 quantitative
3               Slope                                  quantitative
4               Horizontal_Distance_To_Hydrology       quantitative
5               Vertical_Distance_To_Hydrology         quantitative
6               Horizontal_Distance_To_Roadways        quantitative
7               Hillshade_9am                          quantitative
8               Hillshade_noon                         quantitative
9               Hillshade_3pm                          quantitative
10              Horizontal_Distance_To_Fire_Points     quantitative
11-14           Wilderness_Area                        quantitative
15-54           Soil_Type                              quantitative
55              Cover_Type                             integer (1 to 7)
The meanings of the attributes in the table are described here. The first attribute, Elevation, denotes the height in meters; the second attribute, Aspect, is the aspect value expressed in degrees azimuth, while the third column, Slope, indicates the slope value in degrees. The fourth and fifth columns denote the horizontal and vertical distances in meters to the nearest surface water features respectively, while the sixth column signifies the distance to the nearest roadway in the same unit. Attributes 7, 8, and 9 indicate the hillshade index during the summer solstice at 9 am, 12 noon, and 3 pm respectively. Column 10 denotes the horizontal distance in meters to the nearest wildfire ignition points. Attributes 11 to 14 represent four binary wilderness areas, while columns 15 to 54 designate 40 binary soil type variables. All the input attributes are quantitative, and their values lie in the range of 0 to 255. The class attribute, Cover_Type, has seven values denoted by the numbers 1 to 7.

The five models, namely NFS, RBFN, k-NN, SVM, and ANFIS, are applied to the test set for classification. The research work evaluates the performance of these classifiers on the basis of different performance measures like classification accuracy, RMSE, and Kappa statistic, as presented below in Table 5.5.

Table 5.5: Comparison based on Accuracy, RMSE, and Kappa statistic for Covertype data set

Classifier   Accuracy   RMSE     Kappa statistic
NFS          88.4 %     0.1806   0.7932
RBFN         75.7 %     0.3192   0.6948
k-NN         76.8 %     0.2928   0.7039
SVM          78.3 %     0.2734   0.7231
ANFIS        80.3 %     0.2541   0.7521
From Table 5.5, it is seen that NFS has a classification accuracy of 88.4%. The accuracy values of the RBFN, k-NN, SVM, and ANFIS models are 75.7%, 76.8%, 78.3%, and 80.3% respectively. Assuredly, in terms of accuracy, NFS has performed much better than RBFN, k-NN, SVM, and ANFIS. The study then analyzes each classifier's performance using the RMSE and Kappa statistic measures. The information is represented as a 3-D column diagram in Figure 5.6 below, showing a performance comparison of these classifiers.
Figure 5.6: Comparison of RMSE and Kappa statistic for Covertype data set
From Table 5.5 and Figure 5.6, it is found that the Kappa statistic values of the designated algorithms lie within 0.61-0.80. Using the interpretation of the Kappa statistic, the performance of these classification procedures is 'substantial'. Based on the result, NFS holds the first position with an RMSE of 0.1806 and a Kappa statistic of 0.7932. ANFIS holds the succeeding position with an RMSE of 0.2541 and a Kappa value of 0.7521, followed by SVM with an RMSE of 0.2734 and a Kappa value of 0.7231, and then the k-NN model with an RMSE of 0.2928 and a Kappa value of 0.7039. RBFN stands last, with the highest RMSE value (0.3192) and the lowest Kappa statistic value (0.6948). Therefore, regarding measures such as RMSE, Accuracy, and Kappa statistic, the NFS classifier has performed the best.

Afterward, these models are compared for performance analysis using the TP-Rate/Recall, FP-Rate, Precision, and F-Measure metrics. Table 5.6 below provides the performance analysis of these classification models.

Table 5.6: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for Covertype dataset

Classifier   TP-Rate/Recall   FP-Rate   Precision   F-Measure
NFS          88.3 %           11.7 %    88.3 %      88.3 %
RBFN         75.7 %           24.3 %    75.7 %      75.7 %
k-NN         76.6 %           23.4 %    76.6 %      76.6 %
SVM          78.3 %           21.7 %    78.3 %      78.3 %
ANFIS        80.2 %           19.8 %    80.2 %      80.2 %
The study compares their performances using the confusion matrix measures from Table 5.6. The information is presented as a 3-D column diagram in Figure 5.7 below.
Figure 5.7: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for Covertype dataset
From Table 5.6 and Figure 5.7, it is seen that the values of the TP-Rate/Recall, FP-Rate, Precision, and F-Measure metrics for the NFS model are 88.3%, 11.7%, 88.3%, and 88.3% respectively. The RBFN classifier has these values as 75.7%, 24.3%, 75.7%, and 75.7% respectively. The k-NN model has the values 76.6%, 23.4%, 76.6%, and 76.6%, and the SVM model has 78.3%, 21.7%, 78.3%, and 78.3% respectively. The ANFIS classifier has these values as 80.2%, 19.8%, 80.2%, and 80.2% respectively. The NFS classifier has the highest values for TP-Rate/Recall, Precision, and F-Measure and the lowest value for FP-Rate among all. The NFS model has the highest F-Measure, 88.3%; the ANFIS model has an F-Measure of 80.2%, the SVM model 78.3%, the k-NN model 76.6%, and the RBFN classifier 75.7%. Indeed, NFS has given an improved performance even with a lesser amount of data compared to the other classifiers.
5.4.3 Wilt Database

Finally, the research work uses the UCI Wilt data set [38] for classifying soil types related to diseased trees. It is a high-resolution remotely sensed data set that contains training and testing instances from a remote sensing survey of Quickbird imagery. The multivariate imaging database consists of 4889 tuples and has six attributes, including the class attribute. The class attribute has two values: 'w', denoting diseased trees, and 'n', indicating other land cover. There are 74 instances of the 'w' class and 4265 of the 'n' class. The data set consists of image segments constructed by applying a segmentation operation to the pan-sharpened image. These image segments carry the spectral information derived from the multispectral image bands of the Quickbird imagery; they also contain the texture information derived from the panchromatic image band (Pan band). Table 5.7 below presents the attribute information of the Wilt data set.
Table 5.7: Attribute Information of Wilt data set

Serial Number   Attribute   Data Type
1               GLCM_Pan    Numeric
2               Mean_G      Numeric
3               Mean_R      Numeric
4               Mean_NIR    Numeric
5               SD_Pan      Numeric
6               Class       Categorical: 'w' (diseased trees), 'n' (all other land cover)
The significance of each attribute in the table is identified here. The first attribute, GLCM_Pan, denotes the Gray Level Co-Occurrence Matrix (GLCM) texture with respect to the Pan band. The 2nd, 3rd, and 4th columns indicate the mean Green value, the mean Red value, and the mean Near Infrared (NIR) value respectively. The fifth attribute, SD_Pan, is the standard deviation of the panchromatic band. The database has a class attribute with two values, 'w' and 'n', indicating diseased trees and all other land cover respectively.

As usual, the five classification models, namely NFS, RBFN, k-NN, SVM, and ANFIS, are applied to the test set for performance analysis. The present work measures the performance of these classifiers using different measures like RMSE, Accuracy, and the Kappa statistic, as presented below in Table 5.8.

Table 5.8: Comparison based on Accuracy, RMSE, and Kappa statistic for Wilt data set

Classifier   Accuracy   RMSE     Kappa statistic
NFS          98.3 %     0.1014   0.8732
RBFN         87.4 %     0.2048   0.7679
k-NN         88.2 %     0.1849   0.7836
SVM          86.5 %     0.2247   0.7457
ANFIS        90.2 %     0.1556   0.8274
From Table 5.8, it is seen that the NFS model has an accuracy of 98.3%. The accuracy values of the RBFN, k-NN, SVM, and ANFIS models are 87.4%, 88.2%, 86.5%, and 90.2% respectively. Admittedly, NFS has performed better than RBFN, k-NN, SVM, and ANFIS in terms of accuracy. The study then analyzes each classifier's performance using the RMSE and Kappa statistic measures.
Figure 5.8: Comparison using RMSE and Kappa statistic values for Wilt data set
From Table 5.8 and Figure 5.8, it is observed that the Kappa values of the NFS and ANFIS models lie within 0.81-1.0; following the interpretation of the Kappa statistic, the performances of NFS and ANFIS denote 'almost perfect agreement'. The Kappa statistic values of the RBFN, k-NN, and SVM methods lie within 0.61-0.80, which denotes 'substantial' agreement. According to the result, NFS holds the first position with an RMSE of 0.1014 and a Kappa value of 0.8732. ANFIS holds the next position with an RMSE of 0.1556 and a Kappa value of 0.8274, followed by k-NN with an RMSE of 0.1849 and a Kappa value of 0.7836, and then the RBFN model with an RMSE of 0.2048 and a Kappa value of 0.7679. SVM turns out to be the worst performer, with the highest RMSE (0.2247) and the lowest Kappa value (0.7457). Therefore, with reference to measures such as RMSE, accuracy, and Kappa statistic, the NFS method has performed the best.

Then, the study compares their performances using TP-Rate/Recall, FP-Rate, Precision, and F-Measure. The information is presented below in Table 5.9.

Table 5.9: Comparison based on TP-Rate/Recall, FP-Rate, Precision, and F-Measure for Wilt data set

Classifier   TP-Rate/Recall   FP-Rate   Precision   F-Measure
NFS          98.2 %           5.3 %     98.2 %      98.2 %
RBFN         87.4 %           14.9 %    87.3 %      87.3 %
k-NN         88.2 %           12.7 %    88.1 %      88.1 %
SVM          86.5 %           16.8 %    86.4 %      86.4 %
ANFIS        90.1 %           9.9 %     90.1 %      90.1 %
The study compares each classifier's performance using the information on the various evaluation criteria from Table 5.9. The statistical information is then presented as a 3-D column diagram in Figure 5.9 below.
Figure 5.9: Comparison of TP-Rate/Recall, FP-Rate, Precision, and F-Measure for Wilt data set
From Table 5.9 and Figure 5.9, it is noticed that the values of the TP-Rate/Recall, FP-Rate, Precision, and F-Measure metrics for the NFS model are 98.2%, 5.3%, 98.2%, and 98.2% respectively. The RBFN classifier has these values as 87.4%, 14.9%, 87.3%, and 87.3% respectively. The k-NN model has the values 88.2%, 12.7%, 88.1%, and 88.1%, and the SVM model has 86.5%, 16.8%, 86.4%, and 86.4% respectively. The ANFIS classifier has these values as 90.1%, 9.9%, 90.1%, and 90.1% respectively. Certainly, the NFS classifier has achieved the highest values of TP-Rate/Recall, Precision, and F-Measure and the lowest value of FP-Rate. Taking the F-Measure as the key performance measure, the NFS model has an F-Measure of 98.2%; the ANFIS model has an F-Measure of 90.1%, the k-NN classifier 88.1%, the RBFN model 87.3%, and the SVM model 86.4%. Again, NFS has performed meaningfully better than the other classifiers in all respects.
Concerning the different performance measures used, the study has obtained excellent results for the proposed NFS classifier compared to the RBFN, k-NN, SVM, and ANFIS-based models. In fact, the NFS classifier has the highest values for accuracy, Kappa statistic, TP-Rate/Recall, Precision, and F-Measure and the lowest values for RMSE and FP-Rate. An algorithm with high accuracy and a low error rate is considered effective because it offers greater classification capability and predictive power in the soil data mining field. Indeed, the proposed NFS model has outperformed the RBFN, k-NN, SVM, and ANFIS classifiers in terms of all the evaluation measures used in the simulation.
5.5 CONCLUSION

The research study deals with the determination of soil classes from large imagery databases. The present work proposes a neuro-fuzzy method for soil classification and successfully establishes its efficiency using three UCI datasets, namely Statlog Landsat Satellite, Forest Covertype, and Wilt. The method utilizes and integrates the primary benefits of artificial neural networks, such as massive parallelism, adaptivity, robustness, and optimality, with the imprecision- and vagueness-handling capability of fuzzy sets. Furthermore, the proposed classification model builds a membership matrix that provides the feature-wise degree of membership of a data pattern to all the classes instead of assigning it to a single class; this property in turn delivers improved generalization ability. In conclusion, the research work has accomplished its objective of investigating the proposed neuro-fuzzy classifier for soil classification and comparing its performance with RBFN, k-NN, SVM, and ANFIS using different evaluation measures: RMSE, Kappa statistic, accuracy, FP-Rate, TP-Rate (or Recall), Precision, and F-Measure. The most promising technique, based on performance evaluation across the three UCI data sets, is the proposed NFS. It has an accuracy of 97.6% on the Statlog Landsat Satellite data set, 88.4% on the Forest Covertype data set, and 98.3% on the Wilt data set. These values are clearly better than those of the RBFN, k-NN, SVM, and ANFIS classifiers. The NFS model also has the lowest RMSE value and the highest F-Measure and Kappa statistic values compared to the other classifiers. Indeed, it has performed significantly better
than the other dominant classifiers used here. Therefore, it can be concluded that the proposed NFS classifier has the potential to replace traditional classification approaches in the applied soil data mining field. It is also observed that the proposed neuro-fuzzy classifier offers enhanced performance even with less training data. For large imagery databases, the performance of the NFS classifier is meaningfully higher than that of the other predominant classification methods used (the accuracy is higher by about 7-8%). Thus, its ability to learn from a smaller percentage of training data makes it practically applicable to any large imagery soil database with a vast number of input features and classes.
CHAPTER 6
Tennis Match Result Prediction Using an Adaptive Neuro Fuzzy Inference System
6.1 INTRODUCTION

Tennis is one of the most popular games, both played and watched, worldwide. It is a sport played either individually (singles) or in teams of two (doubles). Four main grand slam tennis tournaments are held every year, namely the Australian Open, French Open, Wimbledon, and US Open; these are the most famous tennis tournaments in the world. Needless to say, the playing surfaces of these mega tennis events differ: the Australian and US Opens are played on hard courts, the French Open on clay courts, and Wimbledon on grass courts. Each court surface has its own characteristics and produces variation in the bounce and speed of the ball. Clay courts give a slower-paced ball and an accurate bounce with extra spin; hard courts give a quicker-paced ball and a very accurate bounce; grass courts give quicker ball movement and a more unpredictable bounce. Furthermore, the scoring systems of men's and women's matches in grand slam tournaments also differ. In men's singles matches, the player who wins three sets out of five wins the match, whereas in women's singles matches, the first player to win two sets out of three wins the match. Due to the growth of technology, predictions are widely used in tennis matches, especially by news agencies, spectators, and coaching staff. A tennis prediction model is developed to evaluate a player's chance of winning a match. When a game is played, the result depends on many factors, including the playing environment, the players' skill, and the results of past matches. Many researchers have worked on forecasting the outcome of tennis matches using past statistical data records. But predicting the theoretical outcome of tennis
matches still remains a challenging task and has been of keen interest to many researchers. Indeed, there is ample scope for significant improvement in the quality of prediction and the interpretation of results. The present research study aims to predict the outcome of a tennis singles match using past match records of the grand slam tournaments. Data mining [1] [2] is a computational technique used for discovering useful knowledge from large data reservoirs and is an essential step towards the discovery of valuable knowledge. In this field of research, various data analysis techniques, including machine learning, artificial intelligence, and other statistical analysis methods [3] [4], have been used. Artificial intelligence in combination with statistical analysis gives rise to machine learning algorithms. The field of machine learning usually deals with classification algorithms [5] [6] that have the ability to learn from raw data. Initially, the researchers apply data preprocessing techniques to the original data. After preprocessing, a model is built to predict the output labels (a win or a loss in this case) from the given data. Classification is a significant data mining technique used to extract useful information from a large-scale real-time database by matching records against given patterns and assigning them to a collection of target classes or categories. Several well-known classification techniques are used in this work to predict the tennis match result. The artificial neural network (ANN) [7] [8] [9] is a computational model inspired by the human central nervous system, used to estimate or approximate functions that depend on a large number of unknown inputs. An ANN combined with a fuzzy set theory based method yields the neuro-fuzzy model; this hybrid approach unites the human-like reasoning of fuzzy models with the learning ability and connection-oriented structure of ANNs. One powerful neuro-fuzzy model, the Adaptive Neuro Fuzzy Inference System (ANFIS) [10] [11], generates a set of interpretable IF-THEN rules. The nodes of an adaptive neural network are associated with parameters that decide the final output; ANFIS typically uses a hybrid learning algorithm combining gradient descent and least-squares methods to adjust these network parameters efficiently. The Radial Basis Function Network (RBFN) [12] [13] is an influential model that uses the radial basis function (RBF) as its activation function. RBFNs are typically used in system control, function approximation, classification, and time series prediction. The support vector machine (SVM) [14] [15] is another well-known classifier that can analyze data and identify patterns for classification and regression analysis. The present study is based on this research work [16]. In this context, the contributions of the works [17] [18] [19] [20] [21] [22] [23] are also worth mentioning. The work first proposes an ANFIS-based model and then compares its performance with two powerful classifiers, namely RBFN and SVM. The study employs eight benchmark UCI data sets for performance investigation. It also uses different evaluation measures, namely RMSE, Accuracy, False Positive Rate, True Positive Rate, Kappa statistic, Recall, Precision, and F-Measure, for evaluating classifier performance. The research study is organized as follows: Section 6.2 gives the description of the dataset being used; Section 6.3 explains the proposed method; Section 6.4 describes the detailed procedure; Section 6.5 presents the results of the performance analysis; and Section 6.6 gives the conclusion of the work.
6.2 ABOUT THE DATASET

The research study uses the benchmark Tennis Match Statistics dataset [24] for Grand Slam tournaments provided by UCI. Four major tennis tournaments are held each year, namely the Australian Open, French Open, Wimbledon, and US Open. These databases contain past tennis match records from 2013. In total, there are eight datasets covering all the men's and women's grand slam tournaments. Although all the datasets share a common format, the study treats men's and women's tournaments as separate datasets because each has slightly different rules of play. For example, in men's matches, the player who wins three sets out of five wins the match, whereas in women's matches, the first player to win two sets out of three wins the match. Each dataset consists of 42 attributes and 127 tuples. The common format of these tennis match databases is given in Table 6.1 below.
Table 6.1: Dataset attribute list along with their descriptions

Sl. No.  Attribute Name               Value Type       Short Description
1        Player 1                     String           Name of Player 1
2        Player 2                     String           Name of Player 2
3        Result of the match (class)  0/1              Referenced on Player 1: Result = 1 if Player 1 wins (FNL.1 > FNL.2)
4        FSP.1                        Real Number      First Serve Percentage for Player 1
5        FSW.1                        Real Number      First Serve Won by Player 1
6        SSP.1                        Real Number      Second Serve Percentage for Player 1
7        SSW.1                        Real Number      Second Serve Won by Player 1
8        ACE.1                        Integer Number   Aces Won by Player 1
9        DBF.1                        Integer Number   Double Faults committed by Player 1
10       WNR.1                        Number           Winners Earned by Player 1
11       UFE.1                        Number           Unforced Errors committed by Player 1
12       BPC.1                        Number           Break Points Created by Player 1
13       BPW.1                        Number           Break Points Won by Player 1
14       NPA.1                        Number           Net Points Attempted by Player 1
15       NPW.1                        Number           Net Points Won by Player 1
16       TPW.1                        Number           Total Points Won by Player 1
17       ST1.1                        Integer Number   Set 1 Result for Player 1
18       ST2.1                        Integer Number   Set 2 Result for Player 1
19       ST3.1                        Integer Number   Set 3 Result for Player 1
20       ST4.1                        Integer Number   Set 4 Result for Player 1
21       ST5.1                        Integer Number   Set 5 Result for Player 1
22       FNL.1                        Integer Number   Final Number of Games Won by Player 1
23       FSP.2                        Real Number      First Serve Percentage for Player 2
24       FSW.2                        Real Number      First Serve Won by Player 2
25       SSP.2                        Real Number      Second Serve Percentage for Player 2
26       SSW.2                        Real Number      Second Serve Won by Player 2
27       ACE.2                        Integer Number   Aces Won by Player 2
28       DBF.2                        Integer Number   Double Faults committed by Player 2
29       WNR.2                        Number           Winners Earned by Player 2
30       UFE.2                        Number           Unforced Errors committed by Player 2
31       BPC.2                        Number           Break Points Created by Player 2
32       BPW.2                        Number           Break Points Won by Player 2
33       NPA.2                        Number           Net Points Attempted by Player 2
34       NPW.2                        Number           Net Points Won by Player 2
35       TPW.2                        Number           Total Points Won by Player 2
36       ST1.2                        Integer Number   Set 1 Result for Player 2
37       ST2.2                        Integer Number   Set 2 Result for Player 2
38       ST3.2                        Integer Number   Set 3 Result for Player 2
39       ST4.2                        Integer Number   Set 4 Result for Player 2
40       ST5.2                        Integer Number   Set 5 Result for Player 2
41       FNL.2                        Integer Number   Final Number of Games Won by Player 2
42       T_Round                      Integer Number   Round of the tournament at which the current game is played
All the attributes listed in the database carry their usual meanings in the game of tennis. Among the 42 attributes, attribute number 3 is the class attribute, which indicates the result of the match. The dataset encodes
the result of the match as either 1 or 0 with respect to player 1: it is 1 if player 1 wins and 0 otherwise. Therefore, the study considers two classes in this dataset, namely class 1 and class 0. The remaining 41 attributes are input attributes. The first two input attributes give the names of player 1 and player 2 respectively. The attributes with serial numbers 4 to 22 are referenced on player 1. The fourth attribute, FSP.1, indicates the first serve percentage for player 1, and the fifth attribute, FSW.1, denotes the first serves won by player 1. The sixth attribute, SSP.1, specifies the second serve percentage for player 1, and the seventh attribute, SSW.1, the second serves won by player 1. The eighth attribute, ACE.1, denotes the aces won by player 1, and the ninth attribute, DBF.1, the double faults committed by player 1. The tenth attribute, WNR.1, means the winners earned by player 1, and the eleventh attribute, UFE.1, the unforced errors committed by player 1. The twelfth attribute, BPC.1, denotes the break points created by player 1, and the thirteenth attribute, BPW.1, the break points won by player 1. The fourteenth attribute, NPA.1, denotes the net points attempted by player 1, and the fifteenth attribute, NPW.1, the net points won by player 1. Attribute number 16, TPW.1, denotes the total points won by player 1. The attributes with serial numbers 17 to 21 use the common variable format STX.1, where X is the set number; STX.1 denotes the result of set X for player 1, with X = 1, 2, 3, 4, 5. Attribute number 22, FNL.1, indicates the final number of games won by player 1. The attributes with serial numbers 23 to 41 are referenced on player 2 and follow the same sequence of properties as attributes 4 to 22. The last attribute, T_Round, indicates the round of the tournament at which the current game is played. It is observed that some of the input attributes have missing values. The attributes ST3.1, ST4.1, ST5.1, ST3.2, ST4.2, and ST5.2 may contain N/A values. In the men's singles tournament datasets, the attributes ST4.1, ST5.1, ST4.2, and ST5.2 may assume N/A values when the fourth and fifth sets are not played; these attributes are not valid in the women's tournament datasets. Similarly, a women's singles match may not require a third set, so the attributes ST3.1 and ST3.2 may contain N/A values. However,
the correctness of the dataset should be maintained even when missing values are present. These missing values pose a problem for the classification step, since they are treated as non-numeric values and can alter the overall prediction result of the classifier model. So the work modifies the dataset in such a way that the N/A values are replaced by appropriate values, so that the output of the tuples is not altered and the consistency of the dataset is maintained in accordance with the rules of tennis.
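A short pandas sketch of this consistency requirement is given below. The file name is hypothetical, the column names follow Table 6.1 (the actual CSV headers may differ slightly), and the zero-replacement policy for unplayed sets is one plausible choice, not necessarily the exact rule used in the thesis.

```python
import pandas as pd

# Hypothetical file name; the UCI archive ships one CSV per tournament and gender.
df = pd.read_csv("AusOpen-men-2013.csv")

# The class attribute equals 1 exactly when Player 1 wins more games overall,
# i.e. Result = 1 iff FNL.1 > FNL.2 (attribute 3 in Table 6.1).
assert ((df["FNL.1"] > df["FNL.2"]).astype(int) == df["Result"]).all()

# One possible policy (an assumption): a set that was never played contributes
# no games, so its N/A result is replaced by 0 for both players. This keeps
# FNL.1 and FNL.2, and hence the class label, unchanged.
unplayed_sets = ["ST3.1", "ST4.1", "ST5.1", "ST3.2", "ST4.2", "ST5.2"]
df[unplayed_sets] = df[unplayed_sets].fillna(0)
```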
6.3 PROPOSED METHOD

The proposed work builds an ANFIS model [10] [11] using the training data set. ANFIS is a kind of fuzzy inference system (FIS) that relies on fuzzy rules and fuzzy reasoning. Fuzzy inference is a modern computing framework based on fuzzy set theory. The study discusses the implementation details of ANFIS for predicting the output class label of an input tuple. Building an ANFIS starts with the construction of a fuzzy inference system. An FIS is based on the concepts of fuzzy set theory, fuzzy if-then rules, and fuzzy reasoning. In fuzzy set theory there are many logical levels between the two extremes of binary logic, while the set-theoretic operations and other definitions remain analogous to those of traditional set theory. Fuzzy set theory is used to interpret fuzzy if-then rules and fuzzy reasoning. A fuzzy if-then rule (also known as a fuzzy rule) takes the following form:

\[ \text{If } x \text{ is } A, \text{ then } y \text{ is } B \tag{6.1} \]

This is the general form of a fuzzy rule, where A and B are linguistic values defined by fuzzy sets on the universes of discourse X and Y respectively. The statement "x is A" is called the antecedent or premise, and the statement "y is B" is called the consequence or conclusion. The rule can also be abbreviated as $A \rightarrow B$. This relation defines a correspondence between the two variables x and y, in which the fuzzy rule is a binary relation R on the product space X×Y, and fuzzy set theory is used to describe these expressions on the product space. Fuzzy reasoning (also called approximate reasoning) then draws conclusions from the set of fuzzy rules and known facts.
The fuzzy inference system builds on this set theory, the rules, and the reasoning mechanism. It is a computational framework with a broad range of applications. The basic structure of an FIS is composed of three essential components: a rule base, a database (or dictionary), and a reasoning mechanism. The rule base contains a selection of valid fuzzy rules; the database contains the definitions of the membership functions used in the fuzzy rules; and the reasoning mechanism derives a reasonable output or conclusion from the rules and the given facts. A basic fuzzy inference system can take either fuzzy or crisp input (viewed as fuzzy singletons), but it always produces fuzzy sets as output. Since this study needs crisp output only, defuzzification is performed; the idea is to extract the crisp value that best represents the fuzzy set. There are several FIS models available, such as the Mamdani, Sugeno, and Tsukamoto fuzzy models. The present study uses the Sugeno fuzzy model to construct the ANFIS structure. The Sugeno fuzzy model (or Takagi-Sugeno fuzzy model) was proposed to develop a systematic approach to generating fuzzy rules from a dataset of inputs and outputs. A typical Sugeno fuzzy rule takes the following form:

\[ \text{If } x \text{ is } A \text{ and } y \text{ is } B, \text{ then } z = f(x, y) \tag{6.2} \]

Here A and B are fuzzy sets in the antecedent, and z = f(x, y) is a crisp function in the conclusion that depends on the input variables x and y, or any function that can correctly describe the output. If f(x, y) is a first-order polynomial, the resulting FIS is called a first-order Sugeno fuzzy model; if f is constant, the FIS is a zero-order Sugeno fuzzy model. The zero-order Sugeno fuzzy model gives a smooth function of its input variables as output, provided the membership functions overlap. The fuzzy reasoning for a first-order Sugeno fuzzy model is slightly different: each rule has a crisp output given by the polynomial in the consequent of that rule. The Sugeno fuzzy model can be considered a simple, basic model for computing the FIS output, yet it can generate complex models depending on the constructed fuzzy rules. It cannot strictly follow the compositional rule of inference during the reasoning step, but it is very popular for sample-data-based fuzzy modeling. Using this model, the work builds an adaptive network that results in the ANFIS model. An adaptive network is a network structure consisting of a number of nodes interconnected by directed links; each node represents a processing element, and the links are weighted. The input-output behaviour of this network is determined by a collection of modifiable parameters. In the context of the adaptive network, the present work considers an MLP-like network, i.e., the inputs are fed into the first layer of nodes and the output is obtained from the last layer. The research study considers a first-order Sugeno fuzzy model and, for simplicity, an FIS with two inputs x and y and one output z, with the following two fuzzy rules:

\[ \text{Rule 1: If } x \text{ is } A_1 \text{ and } y \text{ is } B_1, \text{ then } f_1 = p_1 x + q_1 y + r_1 \tag{6.3} \]

\[ \text{Rule 2: If } x \text{ is } A_2 \text{ and } y \text{ is } B_2, \text{ then } f_2 = p_2 x + q_2 y + r_2 \tag{6.4} \]

These two rules help to build an adaptive network that correctly reflects them as a mapping from the input to the output product space. The block diagram of the neuro-fuzzy system model is shown below in Figure 6.1.
Figure 6.1: Proposed ANFIS architecture equivalent to a fuzzy inference system
The training dataset is applied to the ANFIS-based classification model for training. The model includes five layers, namely the input layer, input MF layer, rule layer, output MF layer, and output layer. The adaptive network structure is shown in Figure 6.1, where each node in a specific layer performs the same kind of function. The i-th node of a particular layer takes input from the previous layer and produces an output $O_{l,i}$. This mapping for each layer of the given ANFIS model can be described as follows:

Layer 1 (Input layer): Each i-th node in layer 1 is an adaptive node with the node function

\[ O_{1,i} = \mu_{A_i}(x), \quad i = 1, 2 \qquad \text{or} \qquad O_{1,i} = \mu_{B_{i-2}}(y), \quad i = 3, 4 \tag{6.5} \]

Here x (or y) is the input to node i, and each node is assigned a linguistic label ($A_i$ or $B_{i-2}$). The term $\mu_{A_i}(x)$ denotes the membership function for $A_i$ and can be any valid parameterized function; its parameters are called the premise parameters. The membership function used here is the generalized bell-shaped MF, which depends on three parameters a, b, and c:

\[ \mu(x; a, b, c) = \frac{1}{1 + \left| \dfrac{x - c}{a} \right|^{2b}} \tag{6.6} \]

Layer 2 (Input MF layer): Each node in layer 2 is a fixed node labeled $\Pi$, whose output is the product of the incoming signals:

\[ O_{2,i} = w_i = \mu_{A_i}(x)\, \mu_{B_i}(y), \quad i = 1, 2 \tag{6.7} \]

This output represents the firing strength of a rule.

Layer 3 (Rule layer): In layer 3, every node is a fixed node labeled N, which calculates the normalized firing strength. The normalized firing strength of the i-th node is the ratio of that rule's firing strength to the sum of the firing strengths of all rules:

\[ O_{3,i} = \bar{w}_i = \frac{w_i}{\sum_j w_j}, \quad i = 1, 2 \tag{6.8} \]

Layer 4 (Output MF layer): In layer 4, the output is computed using the parameter set embedded in the node function; these parameters are referred to as the consequent parameters. For every i-th node the node function is

\[ O_{4,i} = \bar{w}_i f_i = \bar{w}_i \left( p_i x + q_i y + r_i \right) \tag{6.9} \]

Layer 5 (Output layer): In the last layer, each node is a fixed node labeled $\Sigma$, denoting a summation operation. The node computes the overall output as the sum of all incoming signals:

\[ O_5 = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i} \tag{6.10} \]
Thus, the ANFIS model is constructed on the adaptive network structure. This ANFIS architecture is not unique, as the layers can be rearranged. The training step in ANFIS can use either the backpropagation learning algorithm or a hybrid learning algorithm. The present work uses the hybrid learning algorithm to tune the parameters of the implemented fuzzy inference system. The hybrid algorithm combines the least-squares method and backpropagation gradient descent to train the model from the training dataset, while checking that the training procedure does not overfit the data.
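To make Eqs. (6.5)-(6.10) and the least-squares half of the hybrid rule concrete, here is a minimal NumPy sketch of the two-input, two-rule first-order Sugeno ANFIS described above. It is an illustration under simplifying assumptions (a fixed two-rule structure, premise parameters held constant during the least-squares step), not the exact MATLAB implementation used in the experiments.

```python
import numpy as np

def gbell(x, a, b, c):
    """Generalized bell membership function of Eq. (6.6)."""
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

def anfis_forward(x, y, premise, consequent):
    """Forward pass of the two-input, two-rule first-order Sugeno ANFIS.

    premise:    [(a, b, c) for A1, B1, A2, B2]   -- layer-1 parameters
    consequent: [(p, q, r) for rule 1 and rule 2] -- layer-4 parameters
    """
    A1, B1, A2, B2 = premise
    w1 = gbell(x, *A1) * gbell(y, *B1)          # layer 2: firing strengths (Eq. 6.7)
    w2 = gbell(x, *A2) * gbell(y, *B2)
    nw1, nw2 = w1 / (w1 + w2), w2 / (w1 + w2)   # layer 3: normalization (Eq. 6.8)
    f1 = consequent[0][0] * x + consequent[0][1] * y + consequent[0][2]
    f2 = consequent[1][0] * x + consequent[1][1] * y + consequent[1][2]
    return nw1 * f1 + nw2 * f2                  # layers 4-5 (Eqs. 6.9-6.10)

def fit_consequents(X, Y, target, premise):
    """Least-squares step of the hybrid rule: with the premise parameters fixed,
    the output is linear in (p1, q1, r1, p2, q2, r2), so these can be solved
    directly by linear least squares."""
    A1, B1, A2, B2 = premise
    rows = []
    for x, y in zip(X, Y):
        w1 = gbell(x, *A1) * gbell(y, *B1)
        w2 = gbell(x, *A2) * gbell(y, *B2)
        nw1, nw2 = w1 / (w1 + w2), w2 / (w1 + w2)
        rows.append([nw1 * x, nw1 * y, nw1, nw2 * x, nw2 * y, nw2])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(target), rcond=None)
    return theta.reshape(2, 3)  # [(p1, q1, r1), (p2, q2, r2)]
```

In the full hybrid algorithm, this least-squares step alternates with gradient-descent updates of the premise parameters a, b, and c.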
6.4 DETAILED PROCEDURE

The research study considers four datasets for each of the men's and women's grand slam tennis tournaments. The detailed procedure is divided into two major steps: data preprocessing followed by data classification.
6.4.1 Data preprocessing

The following data preprocessing techniques are applied to the dataset.

Data cleaning: Data cleaning is one of the most important steps before classifying the dataset. It attempts to fill in missing values, smooth the noise present in the dataset, and correct inconsistencies. For this dataset, the study applies two main preprocessing filters: replacing missing values, and replacing N/A elements with suitable values without violating the rules of a tennis match. In this dataset, the attributes NPA.1, NPW.1, NPA.2, and NPW.2, which represent the net points attempted and won by players 1 and 2 respectively, contain missing values; the reason is that the particular player did not attempt any net point in the match. A missing value is normally substituted by the arithmetic mean of that attribute. The attributes ST3.1, ST4.1, ST5.1, ST3.2, ST4.2, and ST5.2, which represent the set results for each player, also contain N/A values; a set result is N/A if that set was not played because the match result had already been decided. The work replaces such values with appropriate values that do not conflict with the final result of the game.

Data transformation: The procedure normalizes the datasets, because ANN-based techniques require distance measurements in the training phase. It converts attribute values to a small-scale range such as -1.0 to +1.0.
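A compact pandas sketch of the two cleaning filters and the transformation step is given below; the exact replacement policy used in the thesis is not spelled out, so mean substitution and min-max rescaling to [-1, 1] are shown here as plausible stand-ins.

```python
import pandas as pd

def preprocess(df, numeric_cols):
    """Sketch of data cleaning plus [-1, 1] normalization (assumptions noted).

    - Missing net-point statistics (NPA.*/NPW.*) are replaced by the column mean.
    - Numeric attributes are then linearly rescaled to the range [-1.0, +1.0].
    """
    for col in ["NPA.1", "NPW.1", "NPA.2", "NPW.2"]:
        df[col] = df[col].fillna(df[col].mean())   # mean substitution for missing values

    for col in numeric_cols:
        lo, hi = df[col].min(), df[col].max()
        if hi > lo:                                # guard against constant columns
            df[col] = 2 * (df[col] - lo) / (hi - lo) - 1
    return df
```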
6.4.2 Data classification

Afterwards, each tennis match dataset is divided into two disjoint subsets, namely the training set and the test set. The study employs a 10-fold cross-validation scheme of data distribution to avoid bias, so the training and test sets are entirely disjoint. The simulation uses three well-known classification techniques, namely ANFIS, RBFN, and SVM, for training and testing on the eight benchmark tennis match UCI databases. Finally, the work compares the simulated results generated by the individual classifiers for quantitative analysis. The major steps of the detailed procedure are depicted below in Figure 6.2.
Figure 6.2: Major steps of the detailed classification procedure
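For reference, the 10-fold cross-validation scheme of Section 6.4.2 can be sketched as follows. The experiments were run in MATLAB; this scikit-learn version uses an SVM as a stand-in classifier (ANFIS is not available in scikit-learn) and assumes X and y are NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def cross_validate(X, y, n_splits=10, seed=1):
    """10-fold cross-validation: every tuple is tested exactly once, and the
    training and test folds are always disjoint."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])  # stand-in classifier
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores)
```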
6.5 RESULTS AND DISCUSSION

As usual, the three classification techniques, namely ANFIS, RBFN, and SVM, are trained and tested on the eight UCI tennis match databases using MATLAB (version R2015a). The datasets are divided into men's and women's match results and are trained and tested to generate the statistics for the accuracy evaluation of each classifier. The classification models are evaluated on classification accuracy, root-mean-square error (RMSE) [25], Kappa statistic [26], and the confusion matrix [27]. The True Positive Rate (TP-Rate), False Positive Rate (FP-Rate), Precision, Recall, and F-Measure values are assessed from the confusion matrix generated by each individual classifier model. In the testing phase of the ANFIS, RBFN, and SVM classifiers, the test sets generated from the UCI Machine Learning Repository data are applied to each classifier, and the performance analysis for each is described below. The results are divided into two parts, men and women. The work considers all four major tennis tournaments and labels them as follows:
Australian Open (in mid-January) = 1
French Open (in May/June) = 2
Wimbledon (in June/July) = 3
US Open (in August/September) = 4
Men’s Tennis Match Tournaments The dataset for men and women have the same attribute list as given in Table
6.1. Each of the three classifiers namely ANFIS, RBFN, and SVM are applied to the four test datasets for classification. The performance comparisons of these classifiers are done on the basis of different evaluation measures like Accuracy, RMSE, and the Kappa statistic as shown below in Table 6.2. The results suggest that these measures are the averages of these classifiers corresponding to one of the grand slam tennis tournaments mentioned earlier. The performance evaluation measure RMSE is intended to be set aside as low as possible. 130
Table 6.2: Comparisons of the classifiers on the test datasets of Men's Tennis Major Tournaments

Classifier   Dataset   Accuracy (%)   RMSE     Kappa statistic
ANFIS        1         98.65          0.1844   0.9539
             2         98.61          0.1842   0.9544
             3         98.96          0.1906   0.9495
             4         97.59          0.1825   0.9454
RBFN         1         89.52          0.2619   0.8623
             2         91.47          0.2549   0.9012
             3         89.79          0.2592   0.8981
             4         89.25          0.2692   0.8895
SVM          1         93.12          0.2144   0.9044
             2         93.11          0.2164   0.9036
             3         92.12          0.2126   0.8867
             4         92.56          0.2215   0.9096
After testing, it is observed from Table 6.2 that the ANFIS classifier has an average accuracy of 98.45%, compared to average accuracies of 90.01% for the RBFN model and 92.73% for the SVM model. The results show that ANFIS has the lowest RMSE value, followed by the SVM and RBFN classifiers. The Kappa statistic varies between 0.8 and 1.0, indicating almost perfect agreement, with ANFIS having the highest value and RBFN the lowest. Considering the statistical measures of the different classifiers employed, the ANFIS classifier provides a moderately better result than SVM and RBFN. The comparisons of the RMSE and Kappa statistic measures based on the average values of these classifiers are shown below in Figure 6.3.
Figure 6.3: Comparisons of RMSE and Kappa statistic based on averages using Men's datasets
Next, the performance evaluation is done using the TP-Rate/Recall, FP-Rate, Precision, and F-Measure values generated from the confusion matrix of each classifier; the averaged results are shown in Table 6.3 below. A good classifier should have high TP-Rate, Recall, Precision, and F-Measure values and a low FP-Rate value.

Table 6.3: Detailed accuracy for classifiers on Men's Tennis Major Tournaments datasets

Classifier   Dataset   TP-Rate/Recall   FP-Rate   Precision   F-Measure
ANFIS        1         98.65%           2.38%     98.65%      98.65%
             2         98.61%           2.63%     98.61%      98.61%
             3         98.96%           1.12%     98.96%      98.96%
             4         97.59%           3.76%     97.59%      97.59%
RBFN         1         89.54%           9.14%     89.54%      89.54%
             2         91.55%           8.32%     91.55%      91.55%
             3         90.37%           8.95%     90.37%      90.37%
             4         88.95%           9.55%     88.95%      88.95%
SVM          1         93.12%           7.57%     93.12%      93.12%
             2         93.11%           5.95%     93.11%      93.11%
             3         92.12%           6.87%     92.12%      92.12%
             4         92.56%           6.97%     92.56%      92.56%

It is observed that the ANFIS classifier demonstrates higher Precision and a lower error rate than the SVM and RBFN classifiers. The information is presented below in Figure 6.4.
Figure 6.4: Comparisons of confusion matrix metrics based on averages using Men's datasets
Regarding F-Measure as the best evaluation metric derived from the confusion matrix of an individual classifier, the ANFIS classifier establishes an F-Measure value of 98.12%, which is significantly better than the F-Measure values of the other models. The average accuracy of the ANFIS classifier is more than 5% higher than those of the RBFN and SVM classifiers, and it also has the lowest FP-Rate magnitude. The results certainly show that ANFIS produces superior performance compared to the others.

6.5.2 Women's Tennis Match Tournaments

The women's dataset is similar to the men's dataset, with the same attributes as described in Table 6.1. As usual, the three classification models, namely ANFIS, RBFN, and SVM, are applied to the test datasets for classification. The study evaluates the performance of these classifiers using measures such as Accuracy, RMSE, and the Kappa statistic, as presented below in Table 6.4.

Table 6.4: Comparisons of the classifiers on the test datasets of Women's Tournaments

Classifier   Dataset   Accuracy (%)   RMSE     Kappa statistic
ANFIS        1         97.87          0.1705   0.9797
             2         97.29          0.1731   0.9687
             3         98.15          0.1714   0.9773
             4         97.39          0.1805   0.9636
RBFN         1         91.77          0.2565   0.9014
             2         90.73          0.2631   0.8917
             3         89.35          0.2706   0.8761
             4         88.91          0.3043   0.8835
SVM          1         92.75          0.1905   0.9351
             2         92.35          0.1815   0.9267
             3         92.13          0.2014   0.9189
             4         92.35          0.2205   0.9203
Table 6.4 illustrates the different primary evaluation parameters based on the classification results. It can be observed that the ANFIS classifier has an average accuracy of 97.68%, while the SVM classifier has 92.40% and the RBFN classifier produces the worst results with an average accuracy of 90.19%. The results from each classifier are further evaluated to obtain the RMSE and Kappa statistics. The Kappa statistic measures of these classifiers lie within the range 0.8-1.0, indicating 'almost perfect agreement'. The ANFIS classifier gives the highest Kappa value and is moderately better than the SVM classifier, while RBFN exhibits relatively inferior results compared to both. According to the performance evaluation results of Table 6.4, ANFIS again comes out first compared to the RBFN and SVM classifiers. The comparisons of the RMSE and Kappa statistic measures based on the average values of these classifiers are shown below in Figure 6.5.
Figure 6.5: Comparisons of RMSE and Kappa based on averages using Women's datasets
Next, the performance evaluation is done using the metrics derived from the confusion matrix of each individual model. The TP-Rate/Recall, FP-Rate, Precision, and F-Measure values are calculated from the generated confusion matrix; each of these evaluation metrics, illustrated in Table 6.5, is important for extracting the performance evaluation parameters from the results.

Table 6.5: Detailed accuracy of the classifiers on Women's Tennis Major Tournaments datasets

Classifier   Dataset   TP-Rate/Recall   FP-Rate   Precision   F-Measure
ANFIS        1         97.77%           2.28%     97.23%      97.77%
             2         97.24%           2.87%     97.19%      97.24%
             3         98.25%           2.17%     97.95%      98.25%
             4         97.29%           3.66%     97.19%      97.29%
RBFN         1         91.77%           8.88%     91.77%      91.77%
             2         90.73%           9.44%     90.73%      90.73%
             3         89.35%           9.19%     89.35%      89.35%
             4         88.91%           11.02%    88.91%      88.91%
SVM          1         92.25%           6.79%     92.25%      92.25%
             2         92.37%           7.34%     92.37%      92.37%
             3         92.23%           7.25%     92.23%      92.23%
             4         92.45%           7.65%     92.45%      92.45%
It is observed from Table 6.5 that the ANFIS classifier establishes a higher Precision and a lower error rate than the SVM and RBFN classifiers. The results also show that the ANFIS classifier demonstrates an F-Measure value of 97.53%, certainly better than the average values given by the SVM and RBFN classifiers. In fact, the average accuracy of the ANFIS classifier is more than 5% higher than those of the RBFN and SVM classification models. The information is presented below in Figure 6.6.
Figure 6.6: Comparisons of confusion matrix metrics based on averages using Women's datasets
In view of all the evaluation measures used, the study has obtained promising results for the ANFIS model compared to RBFN and SVM. The ANFIS model has the highest values for Accuracy, Kappa statistic, TP-Rate/Recall, Precision, and F-Measure and the lowest values for RMSE and FP-Rate. Indeed, ANFIS outperforms the RBFN and SVM classifiers in terms of all the performance measures used. Assuredly, the ANFIS model can predict the outcome of singles grand slam tennis matches with a high degree of precision, as its average accuracy lies within 97.5% to 99.5%.
6.6 CONCLUSION

In conclusion, the research study has attained its goal of evaluating the proposed ANFIS model against the RBFN and SVM classification algorithms based on different performance measures, namely Accuracy, RMSE, Kappa statistic, TP-Rate,
FP-Rate, Precision, Recall, and F-Measure. The best method based on performance evaluation across the eight UCI data sets is the ANFIS classifier. Over these eight benchmark data sets, this classifier also has the lowest RMSE value and the highest F-Measure and Kappa statistic values compared to RBFN and SVM. These results suggest that, among the three classifiers studied and analyzed, the ANFIS classifier has the potential to improve significantly on conventional classification methods for use in tennis match result prediction. Indeed, the average accuracy of the proposed ANFIS model is more than 5% higher than those of the RBFN and SVM classifiers.
6.7 REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914-925, 1993.

[2] M. S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866-883, 1996.

[3] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.

[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, Second Edition, 2006.

[5] A. K. Pujari, Data Mining Techniques, Universities Press (India) Private Limited, First Edition, 2001.

[6] N. K. Bose and P. Liang, Neural Network Fundamentals with Graphs, Algorithms, and Applications, McGraw-Hill, 1996.

[7] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Second Edition, 1998.

[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.

[9] P. J. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, New York, NY: John Wiley & Sons, 1994.

[10] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, USA, 1997.

[11] C.-T. Lin and C. S. G. Lee, Neural Fuzzy Systems: A Neuro-Fuzzy Synergism to Intelligent Systems, Prentice Hall, 1996.

[12] K. J. Hunt, R. Haas, and R. Murray-Smith, "Extending the functional equivalence of radial basis function networks and fuzzy inference systems," IEEE Transactions on Neural Networks, vol. 7, no. 3, pp. 776-781, 1996.

[13] D. S. Broomhead and D. Lowe, "Radial basis functions, multivariable functional interpolation and adaptive networks," Royal Signals and Radar Establishment, Technical Report 4148, pp. 1-34, 1988.

[14] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, 1995.

[15] A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.

[16] S. Ghosh, S. Sadhu, S. Biswas, D. Sarkar, and P. P. Sarkar, "A Comparison of Different Classifiers for Tennis Match Result Prediction," Smart Computing Review Journal, vol. 6, no. 1, February 2016.

[17] T. Barnett, A. Brown, and S. R. Clarke, "Developing a tennis model that reflects outcomes of tennis matches," in Proceedings of the 8th Australasian Conference on Mathematics and Computers in Sport, Coolangatta, Queensland, pp. 178-188, 2006.

[18] A. Somboonphokkaphan, S. Phimoltares, and C. Lursinsap, "Tennis Winner Prediction based on Time-Series History with Neural Modeling," in Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS 2009), Hong Kong, March 2009.

[19] J. D. Corral and J. Prieto-Rodríguez, "Are differences in ranks good predictors for Grand Slam tennis matches?," International Journal of Forecasting, vol. 26, no. 1, pp. 551-563, 2010.

[20] A. Panjan, N. Šarabon, and A. Filipčič, "Prediction of the Successfulness of Tennis Players with Machine Learning Methods," Kinesiology, vol. 42, no. 1, pp. 98-106, 2010.

[21] X. Wei, P. Lucey, S. Morgan, and S. Sridharan, "Sweet-Spot: Using Spatiotemporal Data to Discover and Predict Shots in Tennis," in Proceedings of the 7th Annual MIT Sloan Sports Analytics Conference, Boston, March 2013.

[22] A. S. Timmaraju, A. Palnitkar, and V. Khanna, "Game ON! Predicting English Premier League Match Outcomes," CS 229 Machine Learning Final Projects, Stanford University, Autumn 2013.

[23] D. Buursma, "Predicting sports events from past results: Towards effective betting on football matches," in Proceedings of the 14th Twente Student Conference on IT, University of Twente, Enschede, The Netherlands, January 2011.

[24] Tennis Major Tournament Match Statistics Data Set, UCI Machine Learning Repository, University of California, Irvine, 2014.

[25] J. S. Armstrong and F. Collopy, "Error Measures for Generalizing About Forecasting Methods: Empirical Comparisons," International Journal of Forecasting, vol. 8, pp. 69-80, 1992.

[26] J. Carletta, "Assessing agreement on classification tasks: The Kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249-254, 1996.

[27] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment, vol. 62, no. 1, pp. 77-89, 1997.
CHAPTER 7
Conclusion and Future Scope of Work
7.1 Conclusion

The research study addresses various categories of data mining approaches, especially association rule mining and classification-based techniques. The research work has not only made an effort to enhance several modern-day data mining techniques but has also presented some new approaches in this domain. The proposed genetic algorithm based approach attempts to reduce the time complexity of finding the frequent itemsets compared to the Apriori algorithm. The introduction of custom-made neuro-fuzzy classification approaches, and their use to improve predictive power in classification, is the most significant achievement of this research work. Due to the growth of technology, predictions are widely used in solving many real-world problems, for example weather forecasting, cancer detection, soil classification, and tennis match prediction. Several powerful prediction models have been developed to provide efficient solutions for such real-world problems. These models are compared with traditional machine learning techniques through quantitative analysis of their classification performance, and for each research domain under study, the proposed model is found to be superior to the other conventional methods. In this way, the limitations and drawbacks of some existing data mining techniques are brought out. Finally, in this research work, artificial neural network and fuzzy logic techniques have been combined to present a neuro-fuzzy based system for building a robust prediction model.
7.2 Future Scope of Work

The future scope of this research work may be extended in two directions: first, designing a neuro-fuzzy rule-based framework for network anomaly detection problems, and second, exploring how to integrate cloud computing with data mining so that information retrieval becomes efficient on cloud-based infrastructures. Besides this, all the new techniques presented in this thesis have the potential to support considerable further research. The future work intends to provide an adaptive framework for constructing a neuro-fuzzy rule-based classification system for network anomaly detection, which is an active research area. The proposed method will use an error correction-based learning procedure that adjusts the degree of confidence of each fuzzy rule according to its classification performance; that is, if a pattern is misclassified by a particular fuzzy rule, the grade of certainty of that rule is reduced. The goal is to classify network traffic into two classes, namely 'anomaly' and 'normal', using a neuro-fuzzy rule-based classification technique. The neuro-fuzzy method will employ this error correction-based learning procedure to select the significant rules by pruning the unnecessary ones, giving it the potential to detect anomalies and thereby improve the reliability of TCP/IP networks. The future research work also aims to integrate data mining techniques with a cloud-based framework. Data mining is typically used to extract potentially useful information from raw data, and cloud computing can provide a robust, scalable, and dynamic infrastructure into which previously known data mining techniques and methods can be integrated. Therefore, data mining methods and applications are significantly required in the cloud computing domain. Implementing data mining techniques in the cloud computing paradigm will let users discover significant knowledge from data warehouses while reducing the expenses of storage and infrastructure.
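As a rough illustration of the planned error correction-based learning procedure, the sketch below lowers the certainty grade of a rule that misclassifies a pattern and slightly reinforces a rule that classifies it correctly. Everything here is hypothetical: the rule interface (matches, predicted_class, certainty) and the update constants are placeholders for a design that is left to future work.

```python
def update_certainty(rules, pattern, true_class, eta_down=0.1, eta_up=0.01):
    """Error-correction sketch: reduce the certainty grade of a fuzzy rule that
    misclassifies a pattern; slightly reinforce it when it is correct.
    eta_down and eta_up are illustrative learning constants (assumptions)."""
    for rule in rules:
        if not rule.matches(pattern):   # rule.matches is a hypothetical predicate
            continue
        if rule.predicted_class == true_class:
            rule.certainty = min(1.0, rule.certainty + eta_up * (1.0 - rule.certainty))
        else:
            rule.certainty = max(0.0, rule.certainty - eta_down * rule.certainty)
```

Rules whose certainty falls below a chosen threshold could then be pruned, which is one way to realize the rule-selection step described above.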