ILA-2: An Inductive Learning Algorithm for Knowledge Discovery
Mehmet R. Tolun (*)
Department of Computer Engineering
Eastern Mediterranean University
Gazimagusa, T.R.N.C., Turkey
[email protected]

Hayri Sever
Department of Computer Science
Hacettepe University
Beytepe, Ankara, Turkey 06530
[email protected]

Mahmut Uludağ
Information Security Research Institute
TÜBİTAK UEKAE
PK 21 Gebze, Kocaeli, Turkey 41470
[email protected]

Saleh M. Abu-Soud
Department of Computer Science
Princess Sumaya University College for Technology
P.O. Box 925819, Amman - Jordan 11110
[email protected]
(*) Corresponding author
Running Title: ILA-2: An Inductive Learning Algorithm
This research is partly supported by the State Planning Organization of the Turkish Republic under research grant 97-K-12330. All of the data sets were obtained from the University of California-Irvine's repository of machine learning databases and domain theories, managed by Patrick M. Murphy. We acknowledge Ross Quinlan and Peter Clark for the implementations of C4.5 and CN2, and Ron Kohavi, whose MLC++ library we used to execute the OC1, CN2, and ID3 algorithms.
Abstract

In this paper we describe the ILA-2 rule induction algorithm, an improved version of the novel inductive learning algorithm ILA. We first outline the basic ILA algorithm and then present how it is improved using a new evaluation metric that handles uncertainty in the data. Through this soft-computing metric, users can reflect their preferences by means of a penalty factor that controls the performance of the algorithm. ILA-2 also has a faster pass criterion, not available in basic ILA, which reduces the processing time without sacrificing much accuracy. We experimentally show that the performance of ILA-2 is comparable to that of well-known inductive learning algorithms, namely CN2, OC1, ID3, and C4.5.

Keywords: Data Mining, Knowledge Discovery, Machine Learning, Inductive Learning, Rule Induction.
1. Introduction

A knowledge discovery process involves extracting valid, previously unknown, potentially useful, and comprehensible patterns from large databases. As described in (Fayyad, 1996; Simoudis, 1996), this process is typically made up of selection and sampling, preprocessing and cleaning, transformation and reduction, data mining, and evaluation steps. The first step in the data mining process is to select a target data set from a database and possibly to sample the target data. The preprocessing and data cleaning step handles noise and unknown values, as well as accounting for missing data fields, time sequence information, and so forth. The data reduction and transformation step involves finding relevant features depending on the goal of the task, applying certain transformations to the data, such as converting one type of data to another (e.g., changing nominal values into numeric ones, discretizing continuous values), and/or defining new attributes. In the mining step, the user may apply one or more knowledge discovery techniques to the transformed data to extract valuable patterns. Finally, the
evaluation step involves interpreting the result (or discovered pattern) with respect to the goal/task at hand. Note that the data mining process is not linear and involves a variety of feedback loops, as any one step can result in changes to preceding or succeeding steps. Furthermore, the nature of a real-world data set, which may contain noisy, incomplete, dynamic, redundant, continuous, and missing values, makes all steps critical on the path from data to knowledge (Deogun et al., 1997; Matheus, Chan, and Piatetsky-Shapiro, 1993).

One of the methods used in the data mining step is inductive learning, which is mostly concerned with finding general descriptions of a concept from a set of training examples. Practical data mining tools generally employ a number of inductive learning algorithms. For example, Silicon Graphics' data mining and visualization product MineSet uses MLC++ as a base for its induction and classification algorithms (Kohavi, Sommerfield, and Dougherty, 1996). This work focuses on establishing causal relationships from attribute values to class labels via a heuristic search that starts with values of individual attributes and goes on to consider double, triple, or further combinations of attribute values in sequence until the example set is covered.

Once a new learning algorithm has been introduced, it is not unusual to see follow-up contributions that extend the algorithm in order to improve it in various ways. These improvements are important in establishing a method and clarifying when it is and is not useful. In this paper, we propose extensions to improve upon such a new algorithm, namely the Inductive Learning Algorithm (ILA) introduced by Tolun and Abu-Soud (1998). ILA is a consistent inductive algorithm that operates on non-conflicting example sets. It extracts a rule set covering all instances in a given example set. A condition (also known as a description) is defined as an attribute-value pair. Conditions are the building blocks for constructing the antecedent of a rule, in which one or more conditions may be conjoined
with one another; the consequent of the rule is associated with a particular class. The representation of rules in ILA is suitable for data exploration in that a description set in its simplest form is generated to distinguish a class from the other ones. The induction bias used in ILA selects a rule for a class from a set of promising rules if and only if the coverage proportion of the rule¹ is maximum. To implement this bias, ILA runs in a stepwise forward iteration that cycles up to as many times as the number of attributes, until all positive examples of a single class are covered. Each iteration searches for a description (or a combination of descriptions) that covers a relatively larger number of training examples of a single class than the other candidates do. Having found such a description (or combination), ILA generates a rule whose antecedent consists of that description. It then marks the examples covered by the rule just generated so that they are not considered in further iteration steps.

The first modification we propose is the ability to deal with uncertain data. In general, two different sources of uncertainty can be distinguished. One of these is noise, defined as non-systematic errors in gathering or entering data. This may happen because of incorrect recording or transcription of data, or because of incorrect measurement or perception at an earlier stage. The second situation occurs when the descriptions of examples are insufficient to induce certain rules. In this paper, a data set is called inconsistent if the descriptions of its examples are not sufficient to induce rules. This case is also known as incomplete data, to point out the fact that some relevant features needed to extract non-conflicting class descriptions are missing. In real-world problems this often constitutes the greatest source of error, because data is usually organized and collected around the needs of organizational activities, which makes the data incomplete from the knowledge discovery point of view. Under such
circumstances, the knowledge discovery model should have the capability of providing approximate decisions with some confidence level.

The second modification is a greedy rule generation bias that reduces learning time at the cost of an increased number of generated rules. This feature is discussed in Section 3. ILA-2 is an extension of ILA with respect to the modifications stated above. We have empirically compared ILA-2 with ILA using real-world data sets. The results show that ILA-2 is better than ILA in terms of accuracy in classifying unseen instances, size of the classifiers, and learning time. We also compare some well-known inductive learning algorithms to our own: ID3 (Quinlan, 1983), C4.5, C4.5rules (Quinlan, 1993), OC1 (Murthy, Kasif, and Salzberg, 1994), and CN2 (Clark and Niblett, 1989). Test results with unseen examples show that ILA-2 is comparable to both the CN2 and C4.5 algorithms.

The organization of this paper is as follows. In the following section, we briefly introduce the ILA algorithm and illustrate its execution on an example task. In Section 3, modifications to the ILA algorithm are described. In Section 4, the time complexity analysis of ILA-2 is presented. Finally, ILA-2 is empirically compared with five well-known induction algorithms over nineteen different domains.
2. The ILA Inductive Learning Algorithm

ILA works in an iterative fashion. In each iteration the algorithm searches for a rule that covers a large number of training examples of a single class. Having found such a rule, ILA first removes the covered examples from further consideration by marking them, and then appends the rule to the end of its rule set. In other words, the algorithm works on a rules-per-class
¹ The coverage proportion of a rule is the proportion of positive examples (and no negative examples) covered by the rule relative to the size of its description (i.e., the number of conjuncts); the selected rule is the one for which this proportion is maximum among the candidates.
basis. For each class, rules are induced to separate the examples in that class from the examples in the other classes. This produces an ordered list of rules rather than a decision tree. The details of the ILA algorithm are given in Figure 1.

Figure 1

A good description is a conjunction of attribute-value pairs that covers some positive examples and no negative examples of a given class. The goodness measure returns the good description with the maximum number of occurrences in positive examples. ILA constructs production rules in a general-to-specific way, i.e., starting off with the most general rule possible and producing more specific rules whenever it is deemed necessary. The advantages of ILA can be stated as follows:

• The rules are in a suitable form for data exploration, namely a description of each class in the simplest way that enables it to be distinguished from the other classes.
• The rule set is ordered in a modular fashion, which makes it possible to focus on a single rule at a time. Direct rule extraction is preferred over decision trees, as the latter are hard to interpret, particularly when there is a large number of nodes.
2.1 Description of the ILA Algorithm with a Running Example
In describing ILA we shall make use of a simple training set. Consider the training set for object classification given in Table 1, consisting of seven examples with three attributes and a class attribute with two possible values.

Table 1

Let us trace the execution of the ILA algorithm on this training set. After reading the object data, the algorithm starts with the first class (yes) and generates hypotheses in the form of
descriptions, as shown in Table 2. A description is a conjunction of attribute-value pairs; descriptions are used to form the left-hand sides of rules in the rule generation step.

Table 2
For each description, the numbers of positive and negative instances are found. Descriptions 6 and 8 are the ones with the most positives and no negatives. Since description 6 comes first, it is selected by default. Hence the following rule is generated:
Rule 1: IF shape = sphere -> class is yes
Upon generation of Rule 1, examples 3 and 7, which are covered by that rule, are marked as classified. These examples are no longer taken into account in hypothesis generation. In the next pass, the algorithm generates the descriptions shown in Table 3.
Table 3
In the second hypothesis space, there are four descriptions of equal quality: 1, 2, 3, and 5. The system selects the first of them to generate a new rule.
Rule 2: IF size = medium -> class is yes
Example 1, covered by this rule, is marked as classified, and the next set of descriptions is generated (Table 4).

Table 4
This time only the second description satisfies the ILA quality criterion, and is used for the generation of the following rule:
Rule 3: IF color = green -> class is yes
Since example 5 is covered by this rule, it is marked as classified. All examples of the current class (yes) are now marked as classified. The algorithm continues with the next class (no), generating the following rules:
Rule 4: IF shape = wedge -> class is no
Rule 5: IF color = red and size = large -> class is no
The algorithm stops when all of the examples in the training set are marked as classified, i.e., when all the examples are covered by the current rule set.
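To make the procedure above concrete, the following minimal Python sketch reproduces this trace on the data of Table 1. It follows the description of ILA given above and in Figure 1, but the names (ATTRS, matches, induce_rules) and the tie-breaking details are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the ILA induction loop, run on the toy data of Table 1.
from itertools import combinations

ATTRS = ["size", "color", "shape"]
TRAIN = [  # (size, color, shape, class), as in Table 1
    ("medium", "blue", "brick", "yes"),
    ("small", "red", "wedge", "no"),
    ("small", "red", "sphere", "yes"),
    ("large", "red", "wedge", "no"),
    ("large", "green", "pillar", "yes"),
    ("large", "red", "pillar", "no"),
    ("large", "green", "sphere", "yes"),
]

def matches(example, description):
    """True when the example satisfies every attribute-value pair."""
    return all(example[ATTRS.index(a)] == v for a, v in description)

def induce_rules(train):
    rules = []
    classes = list(dict.fromkeys(ex[-1] for ex in train))  # preserve class order
    for cls in classes:
        negatives = [ex for ex in train if ex[-1] != cls]
        unclassified = [ex for ex in train if ex[-1] == cls]
        size = 1                                   # current description size j
        while unclassified and size <= len(ATTRS):
            # Generate candidate descriptions of the current size from the
            # still-unclassified positive examples (generation order preserved).
            candidates = []
            for ex in unclassified:
                for combo in combinations(range(len(ATTRS)), size):
                    d = tuple((ATTRS[i], ex[i]) for i in combo)
                    if d not in candidates:
                        candidates.append(d)
            # A "good" description covers positives but no negative examples.
            good = [d for d in candidates
                    if not any(matches(n, d) for n in negatives)]
            if not good:
                size += 1                          # specialise further
                continue
            # Goodness measure: maximum coverage of unclassified positives;
            # ties are broken by generation order, as in the trace above.
            best = max(good, key=lambda d: sum(matches(p, d) for p in unclassified))
            rules.append((best, cls))
            unclassified = [p for p in unclassified if not matches(p, best)]
    return rules

for desc, cls in induce_rules(TRAIN):
    print("IF " + " and ".join(f"{a} = {v}" for a, v in desc) + f" -> class is {cls}")
```

Running this sketch yields the five rules of the trace above (the conditions of the last rule may appear in a different order within its antecedent).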
3. Extensions to ILA

Two main problems of ILA are over-fitting and long learning time. The over-fitting problem is due to ILA's bias towards generating a classifier that is consistent with the training data. However, training data often include noisy examples, causing over-fitting in the generated classifiers. We have developed a novel heuristic function that counteracts this bias in the presence of noisy examples. We also propose another modification to make ILA faster, which considers the possibility of generating more than one rule in each iteration step.

The ILA algorithm and its extensions have been implemented using the source code of the C4.5 programs. Therefore, in addition to the extensions stated above, the algorithm has also been enhanced with some features of C4.5, such as rule sifting and default class selection (Quinlan, 1993). During the classification process, the ILA system first extracts a set of rules using the ILA algorithm. In the next step, the extracted rules for the classes are ordered to minimize false positive errors, and then a default class is chosen. The default class is the one with the most training instances not covered by any rule. Ties are resolved in favor of more
frequent classes. Once the class order and the default class have been established, the rule set is subjected to a post-pruning process. If there are one or more rules whose omission would actually reduce the number of classification errors on the training cases, the first such rule is discarded and the set is checked again. This last step allows a final global scrutiny of the rule set as a whole in the context in which it will be used.

The ILA-2 algorithm can also handle continuous features by discretizing them with the entropy-based algorithm of Fayyad and Irani (1993). This algorithm uses a recursive entropy minimization heuristic for discretization and couples it with a Minimum Description Length criterion (Rissanen, 1986) to control the number of intervals produced over the continuous space. In the original paper by Fayyad and Irani, this method was applied locally at each node during tree generation; it was later found to be quite promising as a global discretization method (Ting, 1994). We have used the implementation of Fayyad and Irani's discretization algorithm provided within the MLC++ library (Kohavi, Sommerfield, and Dougherty, 1996).
3.1 The Novel Evaluation Function
In general, an evaluation function's score for a description should increase in proportion to both the number of positive instances covered, denoted by TP, and the number of negative instances not covered, denoted by TN. In order to normalize the score, a simple metric also takes into account the total numbers of positive instances, P, and negative instances, N; it is given in (Langley, 1996) as

(TP + TN) / (P + N),
where the resulting value ranges between 0 (when no positives and all negatives are covered) and 1 (when all positives and no negatives are covered). This ratio may be used to measure the overall classification accuracy of a description on the training data.

We may now turn to the description evaluation metric used in ILA, which can be expressed as follows: the description should not occur in any of the negative examples of the current class AND must be the one with the maximum number of occurrences in the positive examples of the current class. Since this metric assumes no uncertainty to be present in the training data, it searches a given description space to extract a rule set that classifies the training data perfectly. It is, however, a well-known fact that an application targeting real-world domains should address how to handle uncertainty. Generally, uncertainty-tolerant classification requires relaxing the constraint that the induced descriptions must classify the consistent part of the training data (Clark and Niblett, 1989), which is equivalent to saying that classification methods should generate almost-true rules (Matheus, Chan, and Piatetsky-Shapiro, 1993). Considering this point, a noise-tolerant version of the ILA metric has been developed. This idea is also supported by one of the guiding principles of soft computing: "Exploit the tolerance for imprecision, uncertainty, and partial truth to achieve tractability, robustness, and low solution cost" (Zadeh, 1994). Using these ideas, a new quality criterion is established, in which the quality of a description D is calculated by the following heuristic:

Quality(D) = TP - PF * FN,

where TP (True Positives) is the number of positive examples correctly covered by the description; FN (False Negatives) is the number of negative examples wrongly covered by it; and PF (Penalty Factor) is
a user-defined parameter. Note that, to reason about uncertain data, the evaluation measure of ILA-2 maximizes the quality value of a rule for a given PF. The penalty factor determines the negative effect of FN examples on the quality of descriptions. It is closely related to the well-known sensitivity measure used for accuracy estimation. Sensitivity (Sn) is usually defined as the proportion of correctly predicted items to the number of items covered, i.e.,

Sn = TP / (TP + FN).

The penalty factor can be related to sensitivity as

Sn = PF / (PF + 1),

since a description has positive quality (TP - PF * FN > 0) exactly when its sensitivity exceeds PF / (PF + 1). User-defined PF values may thus be converted to the sensitivity measure using the above equation. As seen in Figure 2, the sensitivity value approaches one as the penalty factor increases.

The advantage of the new heuristic may be seen in an example case. With a penalty factor of five, consider two descriptions, one with TP = 70 and FN = 2, the other with TP = 6 and FN = 0. The ILA quality criterion selects the second description. The soft criterion, however, selects the first one (quality 70 - 5*2 = 60 versus 6 - 5*0 = 6), which is intuitively more predictive than the second.

Figure 2

When the penalty factor approaches the number of training examples, the selection made with the new formula is the same as with the ILA criterion. On the other hand, a zero penalty factor means that the number of negative training examples has no importance and only the number of positive examples is considered, which is quite an optimistic choice.
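As an illustrative aside, the following small Python sketch computes the quality heuristic and the sensitivity threshold implied by a given penalty factor; the function names are ours, and the numbers reproduce the example above.

```python
# A small sketch of the ILA-2 quality heuristic and the penalty-factor /
# sensitivity relation discussed above; the function names are illustrative.

def quality(tp, fn, pf):
    """Quality(D) = TP - PF * FN, where FN counts negative examples covered."""
    return tp - pf * fn

def sensitivity_threshold(pf):
    """Quality(D) > 0 exactly when TP / (TP + FN) > PF / (PF + 1)."""
    return pf / (pf + 1.0)

# The example from the text, with a penalty factor of five: the soft criterion
# prefers the description covering 70 positives and 2 negatives (quality 60)
# over the "perfect" one covering only 6 positives (quality 6).
print(quality(tp=70, fn=2, pf=5))          # 60
print(quality(tp=6, fn=0, pf=5))           # 6
print(round(sensitivity_threshold(5), 2))  # 0.83, the value quoted in Section 5
```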
Properties of the classifiers generated with different values of the penalty factor for the splice dataset are given in Table 5. The number of rules and the average number of conditions per rule increase as the penalty factor increases; that is, ILA-2 with smaller penalty factors constructs smaller classifiers.

Table 5

As seen from Figure 3, for all reasonable values of the penalty factor (1-30) we get better results than ILA in terms of estimated accuracy. For this data set the maximum estimated accuracy is obtained when the penalty factor is 10; increasing the penalty factor further does not yield better classifiers. However, when we look at the accuracy on the training data, we see a proportional increase in accuracy as the penalty factor increases.

Figure 3

3.2 The Faster Pass Criterion
In each iteration, after an exhaustive search for good descriptions in the search space, ILA selects only one description to generate a new rule. This seems an expensive way of extracting rules. If there exists more than one description of the same quality, however, all of these descriptions may be used for rule generation. This second approach usually decreases the processing time; on the other hand, it might produce redundant rules and increase the size of the output rule set. The idea was implemented in the ILA system, and the option that activates this feature is referred to as FastILA. For example, in the case of the promoter data set, FastILA reduced the processing time from 17 seconds to 10 seconds; the number of final rules also decreased by one and the total number of conditions by two. The experiments show that, if the size of the classification rules is not extremely important and less processing time is desirable, then the FastILA option is more suitable. A minimal sketch of this selection step is shown below.
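The following minimal Python sketch contrasts the basic selection of a single best description with the faster pass criterion; it assumes a scoring function such as the quality heuristic of Section 3.1, and the names are illustrative rather than taken from the authors' code.

```python
# A minimal sketch of the faster pass criterion (FastILA). It assumes a scoring
# function such as the quality heuristic of Section 3.1; names are illustrative.

def select_descriptions(candidates, score, fast_pass=False):
    """Return the description(s) that generate rules in the current iteration."""
    if not candidates:
        return []
    scored = [(score(d), d) for d in candidates]
    best = max(s for s, _ in scored)
    if fast_pass:
        # FastILA: assert a rule for every description tied at the best score;
        # redundant rules are removed later by the rule-sifting step.
        return [d for s, d in scored if s == best]
    # Basic behaviour: only the first best-scoring description yields a rule.
    return [next(d for s, d in scored if s == best)]
```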
As seen in Table 6, ILA-2 (ILA with the faster pass criterion) generates a higher number of initial rules than ILA, about 5.5 times as many on average for the evaluation data sets. This is because the faster pass criterion permits more than one rule to be asserted at once. However, after the rule generation step is finished, the sifting process removes all the unnecessary rules.

Table 6

4. The Time Complexity Analysis

In the time complexity analysis of ILA, only upper bounds for the critical components of the algorithm are provided, because the overall complexity of any algorithm is domain-dependent. Let us define the following notation:

- e is the number of examples in the training set.
- Na is the number of attributes.
- c is the number of class attribute values.
- j is the number of attributes in a description.
- S is the number of descriptions in the current test description set.
The following procedure is applied when ILA-2 runs for a classification task. The ILA-2 algorithm comprises two main loops. The outer loop is performed for each class attribute value, which takes time O(c). In this loop, the size of the descriptions for the hypothesis space (also known as the version space) is first set to one. A description is a combination of attribute-value pairs; descriptions are used as the left-hand sides of the IF-THEN rules in the rule generation step. Then an inner loop is executed. The inner loop continues until all examples of the current class are covered by the generated rules or the maximum description size is reached; thus it takes time O(Na).
Inside the inner loop, a set of temporary descriptions is first generated for the given description size using the training examples of the current class, O((e/c) · Na). Then the occurrences of these descriptions are counted, O(e · S). Next, good descriptions maximizing the ILA quality measure are searched for, O(S). If good descriptions are found, they are used to generate new rules, and the examples matching these new rules are marked as classified, O(e/c). When no description passes the quality criterion, the description size for the next iteration is incremented by one. The steps of the algorithm, with their costs, can be summarized as follows:

for each class (c) {
    for each attribute (Na) {
        Hypothesis generation in the form of descriptions;  // O((e/c) · Na)
        Frequency counting for the descriptions;            // O(e · S)
        Evaluation of descriptions;                         // O(S)
        Marking covered instances;                          // O(e/c)
    }
}
Therefore, the overall time complexity is

O(c · Na · ((e/c) · Na + e · S + S + e/c)) = O(e · Na² + c · Na · S · (e + 1) + e · Na).

Replacing e + 1 with e, the complexity becomes

O(e · Na² + c · Na · e · S + e · Na) = O(e · Na · (Na + c · S + 1)).

Since the constant term is comparatively small, this simplifies to O(e · Na · (Na + c · S)).
Usually c · S is much larger than Na; for example, in the experiments we selected S as 1500 while the maximum Na was only 60. In addition, c is comparatively smaller than S. Therefore we may simplify the complexity to O(e · Na · S). Thus, the time complexity of ILA is linear in the number of attributes and in the number of examples, and the size of the hypothesis space (S) affects the processing time linearly as well.
5. Evaluation of ILA-2

For the evaluation of ILA-2 we have mainly used two parameters: classifier size and accuracy. Classifier size is the total number of conditions in the rules of the classifier; for decision tree algorithms it refers to the number of leaf nodes in the decision tree, i.e., the number of regions into which the tree divides the data. Accuracy is the estimated accuracy on test data. We have used the hold-out method to estimate the future prediction accuracy on unseen data.

We have used nineteen different training sets from the UCI repository (Merz and Murphy, 1997). Table 7 summarizes the characteristics of these datasets. In order to test the algorithms' ability to classify unseen examples, a simple practice is to reserve a portion of the training dataset as a separate test set, which is not used in building the classifiers. In the experiments we have employed the test sets associated with these training sets in the UCI repository to estimate the accuracy of the classifiers.

Table 7

In selecting the penalty factor (PF) values of 1 and 5, we have considered the two ends and the middle of the PF spectrum. At the higher end, we observe that ILA-2 performs like basic ILA; for this reason we have not included higher PF values in the experiments. The
lower end of the spectrum is zero, which eliminates the meaning of the PF. Therefore, PF = 1 is included in the experiments to show the results for the most relaxed (error-tolerant) case. The PF = 5 case is selected as it represents a moderate setting, corresponding to a sensitivity value of 0.83 in training.

We ran three different versions of the ILA algorithm on these datasets: ILA-2 with the penalty factor set to 1 and to 5, and the basic ILA algorithm. In addition, we ran three decision tree algorithms and two rule induction algorithms: ID3 (Quinlan, 1986), C4.5, C4.5rules (Quinlan, 1993), OC1 (Murthy, Kasif, and Salzberg, 1994), and CN2 (Clark and Niblett, 1989). Algorithms other than ILA-2 were run with the default settings supplied with their systems.

The estimated accuracies of the generated classifiers on the test sets are given in Table 8, on which our comparative interpretation is based. The ILA algorithms have higher accuracies than the OC1 algorithm on thirteen of the nineteen domains' test sets. Compared to CN2, ILA-2 performs better on eleven of the nineteen domains' test sets, and the two have similar accuracies on two other domains. The ILA algorithms performed marginally better than C4.5 over the nineteen test sets, producing higher accuracies on ten domains' test sets.

Table 8

Table 9 shows the sizes of the classifiers generated by the same algorithms for the same data sets. The results in the table show that ILA-2, i.e., ILA with the novel evaluation function, is comparable to the C4.5 algorithms in terms of generated classifier size. When the penalty factor is set to 1, ILA-2 usually produced the smallest classifiers for the evaluation sets. In regard to the results in Table 9, it may be worth pointing out that ILA-2 solves the over-fitting problem of basic (certain) ILA in a fashion similar to the way C4.5 solves the over-fitting problem of
ID3 (Quinlan, 1986). The sizes of the classifiers generated by the corresponding classification methods show this relationship clearly.

Table 9
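For concreteness, the following small Python sketch shows how the two evaluation parameters used in this section can be computed for a rule set in the format of the earlier sketch from Section 2.1; it reuses the matches() helper defined there, and the names are illustrative assumptions rather than the authors' evaluation code.

```python
# A minimal sketch of the two evaluation parameters used in Section 5:
# classifier size (total number of conditions) and hold-out accuracy.
# Rules are (description, class) pairs as in the Section 2.1 sketch.

def classifier_size(rules):
    """Classifier size: total number of conditions over all rules."""
    return sum(len(description) for description, _ in rules)

def holdout_accuracy(rules, test_set, default_class):
    """Hold-out accuracy: fraction of unseen examples classified correctly,
    using the first matching rule and falling back to the default class."""
    correct = 0
    for example in test_set:
        predicted = next((cls for description, cls in rules
                          if matches(example, description)), default_class)
        correct += int(predicted == example[-1])
    return correct / len(test_set)
```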
6. Conclusion

We introduced an extended version of ILA, namely ILA-2, which is a supervised rule induction algorithm. ILA-2 has additional features that are not available in basic ILA. A faster pass criterion, called FastILA, reduces the processing time by employing a greedy rule generation strategy; it is useful in situations where reduced processing time is more important than the size of the generated classifier. The main contribution of our work is the evaluation metric utilized in ILA-2 for evaluating descriptions. With this metric, users can reflect their preferences via a penalty factor to tune (or control) the performance of the ILA-2 system with respect to the nature of the domain at hand. This provides a valuable advantage over most current inductive learning algorithms. Finally, using a number of machine learning and real-world data sets, we have shown that the performance of ILA-2 is comparable to that of the well-known inductive learning algorithms CN2, OC1, ID3, and C4.5.

As further work, we plan to embed different feature subset selection (FSS) approaches into the system as a pre-processing step in order to yield better performance. With FSS, the search space requirements and the processing time will probably be reduced owing to the elimination of irrelevant attribute-value combinations at the very beginning of the rule extraction process.
References

Clark, P. and T. Niblett. 1989. The CN2 Induction Algorithm. Machine Learning 3: 261-283.

Deogun, J.S., V.V. Raghavan, A. Sarkar, and H. Sever. 1997. Data Mining: Research Trends, Challenges, and Applications. In Rough Sets and Data Mining: Analysis of Imprecise Data, (Eds. T.Y. Lin and N. Cercone), 9-45. Boston, MA: Kluwer Academic Publishers.

Fayyad, U.M. 1996. Data Mining and Knowledge Discovery: Making Sense Out of Data. IEEE Expert 11(5): 20-25.

Fayyad, U.M., and K.B. Irani. 1993. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, (Ed. R. Bajcsy), 1022-1027. Philadelphia, PA: Morgan Kaufmann.

Kohavi, R., D. Sommerfield, and J. Dougherty. 1996. Data Mining Using MLC++: A Machine Learning Library in C++. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, 234-245.

Langley, P. 1996. Elements of Machine Learning. San Francisco: Morgan Kaufmann Publishers.

Matheus, C.J., P.K. Chan, and G. Piatetsky-Shapiro. 1993. Systems for Knowledge Discovery in Databases. IEEE Trans. on Knowledge and Data Engineering 5(6): 903-912.

Merz, C.J., and P.M. Murphy. 1997. UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.

Murthy, S.K., S. Kasif, and S. Salzberg. 1994. A System for Induction of Oblique Decision Trees. Journal of Artificial Intelligence Research 2: 1-32.
Quinlan, J.R. 1983. Learning Efficient Classification Procedures and their Application to Chess End Games. In Machine Learning: An Artificial Intelligence Approach, (Eds. R.S. Michalski, J.G. Carbonell, and T.M. Mitchell), 463-482. Palo Alto, CA: Tioga.

Quinlan, J.R. 1986. Induction of Decision Trees. Machine Learning 1: 81-106.

Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Philadelphia, PA: Morgan Kaufmann.

Rissanen, J. 1986. Stochastic Complexity and Modeling. Annals of Statistics 14: 1080-1100.

Simoudis, E. 1996. Reality Check for Data Mining. IEEE Expert 11(5): 26-33.

Thornton, C.J. 1992. Techniques in Computational Learning: An Introduction. London: Chapman and Hall.

Ting, K.M. 1994. Discretization of Continuous-valued Attributes and Instance-based Learning. Technical Report 491, University of Sydney, Australia.

Tolun, M.R., and S.M. Abu-Soud. 1998. ILA: An Inductive Learning Algorithm for Rule Discovery. Expert Systems with Applications 14: 361-370.

Zadeh, L.A. 1994. Soft Computing and Fuzzy Logic. IEEE Software, November: 48-56.
Table 1. Object Classification Training Set (Thornton, 1992).
Example no. | Size   | Color | Shape  | Class
1           | medium | blue  | brick  | yes
2           | small  | red   | wedge  | no
3           | small  | red   | sphere | yes
4           | large  | red   | wedge  | no
5           | large  | green | pillar | yes
6           | large  | red   | pillar | no
7           | large  | green | sphere | yes
Table 2. The First Set of Descriptions

No | Description    | True Positive | False Negative
1  | size = medium  | 1 | 0
2  | color = blue   | 1 | 0
3  | shape = brick  | 1 | 0
4  | size = small   | 1 | 1
5  | color = red    | 1 | 3
6  | shape = sphere | 2 | 0   √
7  | size = large   | 2 | 2
8  | color = green  | 2 | 0
9  | shape = pillar | 1 | 1

(√ marks the description selected for rule generation.)
Table 3. The Second Set of Descriptions

No | Description    | True Positive | False Negative
1  | size = medium  | 1 | 0   √
2  | color = blue   | 1 | 0
3  | shape = brick  | 1 | 0
4  | size = large   | 1 | 2
5  | color = green  | 1 | 0
6  | shape = pillar | 1 | 1
Table 4. The Third Set of Descriptions

No | Description    | True Positive | False Negative
1  | size = large   | 1 | 2
2  | color = green  | 1 | 0   √
3  | shape = pillar | 1 | 1
Table 5. Results for Splice Dataset with Different Values of Penalty Factors

Penalty Factor | Number of rules | Average number of conditions | Total number of conditions | Accuracy on training data | Accuracy on test data
1   | 13 | 2.0 | 26  | 82.4%  | 73.4%
2   | 31 | 2.3 | 71  | 88.7%  | 81.6%
3   | 38 | 2.3 | 86  | 94.7%  | 87.6%
4   | 50 | 2.5 | 125 | 96.1%  | 87.2%
5   | 53 | 2.4 | 128 | 97.9%  | 86.9%
7   | 63 | 2.5 | 158 | 98.7%  | 85.4%
10  | 66 | 2.5 | 167 | 99.6%  | 88.5%
30  | 87 | 2.6 | 228 | 100.0% | 71.7%
ILA | 91 | 2.6 | 240 | 100.0% | 67.9%
Table 6. Effect of Fast-ILA Option in Terms of Four Different Parameters

Training set | Initial rules: ILA | Initial rules: ILA-2 | Final rules: ILA | Final rules: ILA-2 | Estimated accuracy (%): ILA | Estimated accuracy (%): ILA-2 | Time (s): ILA | Time (s): ILA-2
Lenses      | 6   | 9    | 6   | 5   | 50    | 62.5  | 1    | 1
Monk1       | 23  | 30   | 21  | 22  | 100   | 94.4  | 1    | 1
Monk2       | 61  | 101  | 48  | 48  | 78.5  | 81.3  | 3    | 2
Monk3       | 23  | 37   | 23  | 23  | 88.2  | 86.3  | 1    | 1
Mushroom    | 24  | 55   | 16  | 15  | 100   | 100   | 1476 | 1105
Parity5+5   | 35  | 78   | 15  | 14  | 50.8  | 50.8  | 9    | 5
Tic-tac-toe | 26  | 31   | 26  | 26  | 98.1  | 98.1  | 90   | 52
Vote        | 33  | 185  | 27  | 22  | 96.3  | 94.8  | 31   | 25
Zoo         | 9   | 55   | 9   | 9   | 91.2  | 85.3  | 1    | 0
Splice      | 93  | 825  | 91  | 76  | 67.9  | 97.7  | 1569 | 745
Coding      | 141 | 1073 | 108 | 112 | 68.7  | 100   | 1037 | 345
Promoter    | 14  | 226  | 14  | 13  | 100   | 100   | 17   | 10
Totals      | 488 | 2703 | 404 | 383 | 85.25 | 90.58 | 4236 | 2290
Table 7. The Characteristic Features of the Tested Data Sets

Name        | Number of attributes | Number of examples in training data | Number of class values | Number of examples in test data
Lenses      | 4  | 16   | 3 | 8
Monk1       | 6  | 124  | 2 | 432
Monk2       | 6  | 169  | 2 | 432
Monk3       | 6  | 122  | 2 | 432
Mushroom    | 22 | 5416 | 2 | 2708
Parity5+5   | 10 | 100  | 2 | 1024
Tic-tac-toe | 9  | 638  | 2 | 320
Vote        | 16 | 300  | 2 | 135
Zoo         | 16 | 67   | 7 | 34
Splice      | 60 | 700  | 3 | 3170
Coding      | 15 | 600  | 2 | 5000
Promoter    | 57 | 106  | 2 | 40
Australia   | 14 | 460  | 2 | 230
Crx         | 15 | 465  | 2 | 187
Breast      | 10 | 466  | 2 | 233
Cleve       | 13 | 202  | 2 | 99
Diabetes    | 8  | 512  | 2 | 256
Heart       | 13 | 180  | 2 | 90
Iris        | 10 | 100  | 3 | 50
Table 8. Estimated Accuracies of Various Learning Algorithms on Selected Domains

Domain Name | ILA-2 PF=1 | ILA-2 PF=5 | ILA  | ID3  | C4.5 pruned | OC1  | C4.5rules | CN2
Lenses      | 62.5 | 50   | 50   | 62.5 | 62.5 | 37.5 | 62.5 | 62.5
Monk1       | 100  | 100  | 100  | 81.0 | 75.7 | 91.2 | 93.5 | 98.6
Monk2       | 59.7 | 66.7 | 78.5 | 69.9 | 65.0 | 96.3 | 66.2 | 75.4
Monk3       | 100  | 87.7 | 88.2 | 91.7 | 97.2 | 94.2 | 96.3 | 90.7
Mushroom    | 98.2 | 100  | 100  | 100  | 100  | 99.9 | 99.7 | 100
Parity5+5   | 50.0 | 51.1 | 51.2 | 50.8 | 50.0 | 52.4 | 50.0 | 53.0
Tic-tac-toe | 84.1 | 98.1 | 98.1 | 80.9 | 82.2 | 85.6 | 98.1 | 98.4
Vote        | 97.0 | 96.3 | 94.8 | 94.1 | 97.0 | 96.3 | 95.6 | 95.6
Zoo         | 88.2 | 91.2 | 91.2 | 97.1 | 85.3 | 73.5 | 85.3 | 82.4
Splice      | 73.4 | 86.9 | 67.9 | 89.0 | 90.4 | 91.2 | 92.7 | 84.5
Coding      | 70.0 | 70.7 | 68.7 | 65.7 | 63.2 | 65.9 | 64.0 | 100
Promoter    | 97.5 | 97.5 | 100  | 100  | 95.0 | 87.5 | 97.5 | 100
Australia   | 83.0 | 76.5 | 82.6 | 81.3 | 87.0 | 84.8 | 88.3 | 82.2
Crx         | 80.2 | 78.1 | 75.4 | 72.5 | 83.0 | 78.5 | 84.5 | 80.0
Breast      | 95.7 | 96.1 | 95.3 | 94.4 | 95.7 | 95.7 | 94.4 | 97.0
Cleve       | 70.3 | 76.2 | 76.2 | 64.4 | 77.2 | 79.2 | 82.2 | 68.3
Diabetes    | 71.5 | 73.8 | 65.6 | 62.5 | 69.1 | 70.3 | 73.4 | 70.7
Heart       | 60.0 | 82.2 | 84.4 | 75.6 | 83.3 | 78.9 | 84.4 | 77.8
Iris        | 96.0 | 94.0 | 96.0 | 94.0 | 92.0 | 96.0 | 92.0 | 94.0
Table 9. Size of the Classifiers Generated by Various Algorithms

Domain Name | ILA-2 PF=1 | ILA-2 PF=5 | ILA  | ID3  | C4.5 pruned | C4.5rules
Lenses      | 9   | 13   | 13   | 9    | 7   | 8
Monk1       | 14  | 37   | 58   | 92   | 18  | 23
Monk2       | 9   | 115  | 188  | 176  | 31  | 35
Monk3       | 5   | 48   | 63   | 42   | 12  | 25
Mushroom    | 9   | 13   | 22   | 29   | 30  | 11
Parity5+5   | 2   | 67   | 81   | 107  | 23  | 17
Tic-tac-toe | 15  | 88   | 88   | 304  | 85  | 66
Vote        | 18  | 35   | 69   | 67   | 7   | 8
Zoo         | 11  | 16   | 17   | 21   | 19  | 14
Splice      | 26  | 128  | 240  | 201  | 81  | 88
Coding      | 64  | 256  | 319  | 429  | 109 | 68
Promoter    | 7   | 18   | 27   | 41   | 25  | 12
Australia   | 13  | 69   | 116  | 130  | 30  | 30
Crx         | 12  | 39   | 111  | 129  | 58  | 32
Breast      | 5   | 16   | 38   | 37   | 19  | 20
Cleve       | 8   | 55   | 64   | 74   | 27  | 20
Diabetes    | 6   | 22   | 33   | 165  | 27  | 19
Heart       | 1   | 25   | 19   | 57   | 33  | 26
Iris        | 5   | 16   | 4    | 9    | 7   | 5
Totals      | 238 | 1079 | 1570 | 2119 | 648 | 527
Figure 1. ILA Inductive Learning Algorithm
Figure 2. Penalty Factor vs. Sensitivity Values
Figure 3. Accuracy Values on Training and Test Data for the Splice Dataset.
For each class attribute value perform {
    1. Set j, which keeps the size of descriptions, to 1.
    2. While there are unclassified examples in the current class and j is less than or equal to the number of attributes {
        1. Generate the set of all descriptions in the current class for unclassified examples using the current description size.
        2. Update occurrence counts of all descriptions in the current set.
        3. Find description(s) passing the goodness measure.
        4. If there is any description that has passed the goodness measure {
            1. Assert rule(s) using the 'good' description(s).
            2. Mark items covered by the new rule(s) as classified.
        }
        Else increment description size j by 1.
    }
}
Figure 1.
Figure 2. [Plot of sensitivity (y-axis) versus penalty factor (x-axis).]
Figure 3. [Plot of accuracy (%) on training data and on test data versus penalty factor (1, 3, 5, 10, and ILA).]