Noisy, untrusting, heterogeneous data cover and hide the golden nuggets of the patterns and the predictive abilities every user shall be looking to. This paper ...
A Study on Applications of Machine Learning Techniques in Data Mining Areej Shhab, Gongde Guo and Daniel Neagu Department of Computing, University of Bradford, UK, BD7 1DP Email :{A.Shhab, G.Guo, D.Neagu}@bradford.ac.uk
Abstract: Data Mining (DM) is considered as an interdisciplinary discipline: people try to solve large scale DM problems using Artificial Intelligence (AI) techniques for real-world applications such as industrial, social, healthcare, predictive toxicology, and bio, chemo-informatics. This paper intends to address some confusion issues related to overlapping notions in Data Mining and AI techniques and proposes a study on AI techniques traditionally used to solve specific Data Mining problems, as highlighted by scientific papers published in the past 10 years. The experimental results of four different AI techniques: Instance-based Learning algorithms, Decision Trees, Rule Induction and Artificial Neural Networks, tested on some public datasets and a toxicity dataset, Phenols, indicate that Artificial Neural Networks technique obtains the best average classification accuracy. Specifically, Artificial Neural Networks such as Multi-Layer Perceptron (MLP) algorithm are competitive methods for predictive toxicology data mining.
1
Introduction
Artificial Intelligence techniques show a new and up-rising development under the new evolution of storage and processing performances of computing machines. Moreover, the data stored starts to signal a new crisis: not only the availability, but also quality, sources, and mainly the meaning are new issues the users are interested in. Noisy, untrusting, heterogeneous data cover and hide the golden nuggets of the patterns and the predictive abilities every user shall be looking to. This paper gently introduces some of the main terms of Machine Learning and Data Mining in order to identify the overlapping areas, applications and possible confusions generated by the various points of view. A case study focused on dynamics of main techniques and applications, as shown by the scientific papers published in the last 10 years reveals some of the trends in these research areas.
2
General Concepts
We introduce in this section some of the most used terms in Artificial Intelligence, Knowledge Discovery and Data Mining, as defined in the majority literature sources.
2
Areej Shhab, Gongde Guo and Daniel Neagu
Information is generally defined [1, 2] as Data that have been processed and presented in a form suitable for human interpretation with the purpose of revealing meanings, such as patterns or rules. Data is defined [1] as the Facts regarding things such as people, objects, events, which can be described as the other face of the information in its numerical form, which can be digitally transmitted or processed. Models are defined [3] as creating a representation of patterns worthy of being standard ones. Knowledge is the theoretical and practical comprehension [2] of a certain domain that supports making decisions. Intelligence is the capability of learning, understanding and finding solutions for problems in a specific domain. Artificial Intelligence is a branch of Computer Science dealing with helping computing machines to provide solutions for dilemmas facing scientists and building intelligent systems in a way that human acts; knowledge and expertise form the "knowledge base" of the intelligent systems. 2.1 Machine Learning Machine Learning (ML) Techniques define the ability of a computing machine to improve its performance based on previous results [4], as arising from the current trends in research publications. Currently its applications to real life problems are extensively developed. Figure 1 presents a list of Machine Learning techniques [4] currently used for data mining problems.
Nearest Neighbour Instance-based Nearest Neighbour Learning Algorithms Method Method Decision Decision Trees Trees
Artificial Artificial Neural NeuralNetworks Networks Genetic GeneticAlgorithms Algorithms Machine MachineLearning Learning Techniques Techniques
Rule Rule Induction Induction
Inductive InductiveLogic Logic programming programming
Fig. 1. Machine learning techniques applied to data mining cases 2.2 Why Data Mining? The amount of the data seems to increase rapidly every single day for the majority domains related to information processing, and the need to find a way to mine and get knowledge from databases is still crucial.
A Study on Applications of Machine Learning Techniques in Data Mining
3
Data Mining (DM) defines the automated extraction procedures of hidden predictive information from databases [5]. DM problems addressed by intelligent systems are: 1. Diagnosis: Defining malfunctions based on an object behaviour then recommending solutions. 2. Pattern Recognition: Identification of objects and images by their shapes, forms, outlines, colour, surface, texture, temperature, or other attribute, usually by automatic means. 3. Prediction: Predicting the behaviour of an object in the future pertaining to its past one. 4. Classification: Assigning an object to a pre-defined class. 5. Clustering: Dividing a heterogeneous group of objects into homogeneous subgroups. 6. Optimisation: Improving the quality of solutions until an optimal one is found. 7. Control: Governing the behaviour of an object to meet specified requirements in the real-time [5].
3
Applications of ML Techniques to DM Tasks
From the point of view of applications to DM tasks, ML techniques can be categorized by their usability, performance and impact: 1. Artificial Neural Networks for DM: ANN is widely used for DM tasks in many disciplines such as pathology, biology, statistics, image processing, pattern recognition, optimising of numerical analysis as well as controlling systems. 2. Genetic Algorithms for DM: Genetic Algorithms, generally evolutionary computation techniques are well-known solving approaches for DM problems in chemistry, biotechnology, movement prediction, bio informatics and adaptive control for working systems. 3. Inductive Logic Programming for DM: ILP is shows a restricted area of applications, comparing with other machine learning techniques; however, it has been applied to diagnosis (diseases diagnostic), classification and clustering problems, controlling robotics systems etc. 4. Rule Induction for DM: symbolic rule induction shows some applicability for optimisation (a good example would be Semantic query optimisation). 5. Decision Trees for DM: DT is a powerful DM tool to solve problems in most of real world cases (prediction, classification etc.). Moreover, by using decision tree induction process, control rules can be derived. 6. Instance-based Learning Algorithms for DM: Instance-Based Learning (IBL) is defined as the generalizing of a new instance to be classified from the stored training examples, which is widely used for classification tasks.
4
Areej Shhab, Gongde Guo and Daniel Neagu
3.1 Scientific Literature Data Mining In order to identify the applicability of ML techniques reported as used by DM scientific literature, our empirical study is based on papers and related reports available via Web of Knowledge [6] and shows a broad use of different Machine Learning algorithms applied for DM problems. We searched the abstracts and titles of research papers showing both, a ML technique and a DM application, as listed above. The table below gives a picture of contributions of machine learning techniques to data mining problems, based on their applicability as reported in this large set of research literature. The symbol (√) refers to the frequency of using each machine learning technique for each data mining problem. Therefore, one (√) means seldom used (very low using), (√√√√√) means it is widely used, (√√) means low using, (√√√) means medium using and (√√√√) means good using. Every degree can be classified on three roughly different levels by adding: + (refers to the best contribution in this level) and – (refers to the worst contribution). Table 1. The impact of ML techniques used for DM tasks in our case study DM/ML Diagnosis Prediction Classification Clustering Pattern Recognition Optimisation Control
ANNs √√√√ √√√√√ √√√√√ √√√√√√√√ √√√√ √√√√√
Gas √√√√√√+ √√√√+ √√√ √√√ √√√√√+ √√√√√+
ILP √√+ √+ √√√√-
RI √ √+ √ √ √√√
DTs √ √+ √√√√√√√+ √√ √√ √√ √√+
IBk √+ √√√√√√√√√√ √√-
Some quantitative results on ML occurrence in the scientific literature in the period of time 1995-2004 are revealed in Table 2: to compare practically different Machine Learning approaches to Data Mining tasks. This study emphasizes the high usability of ANNs and GAs as well as DTs, compared with the rest of the methods.
Table2. A quantitative comparison of different ML Techniques applied to DM cases ML ANNs GAs ILP RI DTs IBk
95 48 59 3 13 88 38
96 261 159 8 4 129 52
97 501 426 16 10 174 66
98 791 544 5 12 226 70
99 684 517 12 11 221 59
00 725 847 11 14 258 71
01 1055 929 14 24 283 85
02 1154 1022 7 19 315 82
03 1915 1691 22 15 452 108
04 2624 2252 20 21 437 110
A Study on Applications of Machine Learning Techniques in Data Mining
5
3000
ANNs
2500
GAs
2000
ILP
1500
RI 1000 ANNs GAs ILP RI DTs
500
0 2004
2003
2002
2001
2000
DTs
IBk
IBk 1999
1998
1997
1996
1995
Fig. 2. Comparative results on ML techniques applied to DM tasks over the period 1995-2004, as reported in scientific literature. 3.2 A Classification Tasks Case Study Some tests using Weka [7] to apply some ML algorithms for DM problems for 10 datasets chosen from UCI ML Repository are presented. The objective is to correlate the statistical results described in the Section 3.1 with practical results of applying ML techniques to these datasets. Some information about these datasets is presented in Table 3. Table 3. Some information about the datasets Dataset Diabetes Iris Labor Heart Soybean Contact-Lenses Weather Glass Breast-Cancer Credit-a
NF 8 4 16 13 35 4 4 9 9 15
NN 0 0 5 3 35 4 1 0 7 6
NO 8 4 8 7 0 0 2 9 0 6
NB 0 0 3 3 0 0 1 0 2 3
NI 768 150 57 270 683 24 14 214 286 690
CD 268:500 50:50:50 37:20 120:150 19 classes 4:5:15 5:9 70:17:76:0:13:9:29 201:85 307:383
In Table 3, the meaning of the title in each column is as follows: NF-Number of Features, NN-Number of Nominal features, NO-Number of Ordinal features, NBNumber of Binary features, NI-Number of Instances, CD-Class Distribution. Four well-known algorithms from the Weka package have been applied (with implicit parameter values) for classification tasks: •
J4.8 is a Weka implementation of a decision tree learner.
•
JRip is a classifier applies rule induction techniques and implement a prepositional rule learner.
6
Areej Shhab, Gongde Guo and Daniel Neagu
•
IBK is an instance-based learning algorithm.
•
MLP (Multi-Layer Perceptron) is a supervised trained artificial neural network which uses back-propagation to classify instances.
The 10-fold cross validation method is used in the experiment to evaluate the different algorithms on 10 datasets. The detailed experimental results are presented in Table 4. Table 4. Classification results of four classifiers evaluated on ten public datasets Datasets/Algorithms Diabetes Iris Labor Heart statlog Soybean Contact-Lenses Weather Glass Breast-Cancer Credit-a Average
IBK 71.09 96.00 91.22 81.48 87.70 58.30 64.28 66.35 73.07 85.94 77.54
J4.8 73.82 96.00 73.68 76.66 91.50 83.30 64.28 66.82 75.52 86.09 78.76
JRip 76.04 94.60 77.19 78.88 92.50 75.00 57.14 68.69 70.97 85.80 77.68
MLP 75.39 97.30 85.96 78.14 93.41 70.80 78.57 67.76 64.68 84.20 79.62
From the experimental results, it is difficult to draw a conclusion that which DM technique is better than the others. Different DM techniques obtain better performances on some datasets but worse on other datasets. There is no significant difference in performance between these four methods. Obviously, hybrid intelligent systems and multiple classifier systems are worthy of investigating in order to obtain better performance for a specific application.
4
Predictive Toxicology Data Mining
The above described four different methods are further applied to a real-world application for toxicity prediction of the environmental effects of chemicals. We performed some experiments on a dataset of toxicity values for approximately 250 chemicals, all which contained a similar chemical feature, namely a phenolic group. The toxicity of the phenols was assessed in the ciliated protozoan Tetrahymena pyriformis, according to [8]. The full toxicity dataset was reported originally by Cronin et al [9] following experimental measurement of the effects by Schultz and coworkers (College of Veterinary Medicine, University of Tennessee, Knoxville TN, USA). As the original Phenols dataset contains 173 features, some of them contain irrelevant or redundant information, or contain noisy and unreliable data. Before conducting classification, we used different feature selection methods to extract eight subsets from the original dataset Phenols with as less irrelevant or redundant information as possible. These eight feature selection methods used in our
A Study on Applications of Machine Learning Techniques in Data Mining
7
experiments are: CFS – Correlation-based Feature Selection; Chi –Chi-squared ranking filter; CS – Consistency Subset evaluator; GR – Gain Ratio feature evaluator; IG - Information Gain ranking filter; kNNMFS – kNN Model-based Feature Selection [10,11]; ReliefF – ReliefF ranking filter; SVM – SVM feature evaluator. Table 5. Classification results of four classifiers evaluated on Phenols dataset and its eight subsets Datasets CFS Chi CS GR IG KNNMFS ReliefF SVM Phenols Average
IBK 73.2 71.6 74.0 72.4 74.0 72.8 69.2 78.0 72.4 73.1
J4.8 73.2 71.6 72.8 72.0 68.8 76.8 78.4 79.2 72.8 74.0
JRip 76.4 76.0 70.8 71.6 73.6 76.0 72.0 77.2 75.2 74.3
MLP 74.4 76.0 78.4 74.0 74.0 79.2 72.8 82.0 78.0 76.5
The experimental results are presented in Table 5. It is clear that MLP obtains the highest average classification accuracy among 4 different algorithms on Phenols dataset and its eight subsets. It shows that Artificial Neural Networks for predictive toxicology data mining is worthy of further research in order to obtain better performance.
5
Conclusions
This study discusses the usability and performance of ML techniques for DM problems. The outcomes of our statistical study and experiments proved that single classifier can not be discriminative well enough on all datasets considered. No matter these techniques are largely used or not in the scientific literature. Experimental results of four AI techniques tested on some public datasets and a real-world application of toxicity prediction further indicate that hybrid intelligent system and multiple classifier system methods are worthy of further research in order to obtain better performance for a specific application such as predictive toxicology data mining. Specifically referring to the results of Artificial Neural Networks technique, it presents the high ability to solve DM tasks likely better than others. The outputs of the public datasets came with undirected results in terms of ANN performance on randomly chosen datasets while for the predictive toxicology application dataset – with using feature selection method to raise the classification accuracy - came with the best results. Further work will be underpinned on our previous ones to justify the use of specific ML techniques for real-world applications such as predictive toxicology data mining.
8
Areej Shhab, Gongde Guo and Daniel Neagu
Acknowledgement This work was partly supported by the EPSRC project – Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach, Grant reference: GR/T02508/01.
References 1.
J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001. 2. D.Howe. The Free On-line Dictionary of Computing. (1993-2004). Supported by Imperial College Department of Computing, Copyright @1993 by Denis Howe. 3. T.Mitchell. Machine Learning. MIT Press and McGraw-Hill, 1997. 4. M. A. Bramer. Knowledge Discovery and Data Mining. London: The Institution of Electrical Engineers, 1999. 5. M. Negnevitsky. Artificial Intelligence. Essex: Pearson Educated Limited, 2002. 6. ISI Web of Knowledge, home page: http://wok.mimas.ac.uk/, Copyright © 2002 Thomson ISI. 7. L.H. Witten and E.Frank. Data Mining: Practical Machine Learning Tools with Java Implementations, Morgan Kaufmann, San Francisco, 2000. 8. T.W.Schultz, G.D.Sinks, M.T.D.Cronin. Identification of Mechanisms of Toxic Action of Phenols to Tetrahymena Pyriformis from Molecular Descriptor In: Chen, F., Schuurmann, G. (Eds.), Quantitative Structure-Activity Relationships in Environmental Sciences-VII. SETAC Press, Presacola, FL, USA, pp.329-342, 1997. 9. M.T.D.Cronin, A.O.Aptula, J.C.Duffy et al. Comparative Assessment of Methods to Develop QSARs for the Prediction of the Toxicity of Phenols to Tetrahymena Pyriformis, Chemosphere 49(2002), pp. 1201-1221, 2002. 10. G.Guo, D.Neagu and M.T.D.Cronin. A Study on Feature Selection for Toxicity Prediction. Proceedings of FSKD’05 (in press), 27-29 August, Changsha, China, 2005. 11. G.Guo, H.Wang, D.Bell, Y.Bi and K.Greer. KNN Model–Based Approach in Classification. CoopIS/DOA/ ODBASE 2003:986-996, 2003.