ILA-2: An Inductive Learning Algorithm over Uncertain Data

Mehmet R. Tolun
Department of Computer Engineering, Middle East Technical University, Inonu Bulvari, Ankara, Turkey 06531
{[email protected]}

Hayri Sever
Department of Computer Science, Hacettepe University, Beytepe, Ankara, Turkey 06530
{[email protected]}

Mahmut Uludag
Artificial Intelligence Group, TÜBİTAK Marmara Research Center, PK 21, Gebze, Kocaeli, Turkey 41470
{[email protected]}
Abstract

In this paper we describe the ILA-2 rule induction algorithm from the machine learning domain. ILA-2 is an improved version of a novel inductive learning algorithm, namely ILA. We first describe the basic ILA algorithm and then present how it was improved. We also compare ILA-2 to a range of induction algorithms, including ILA. According to the empirical comparisons, ILA-2 appears to be comparable to the CN2 and C4.5 algorithms in terms of the accuracy and size of the output classifiers.
Keywords: Data Mining, Knowledge Discovery, Machine Learning, Inductive Learning, Rule Induction.
1. Introduction

A data-mining process involves extracting valid, previously unknown, potentially useful, and comprehensible patterns from large databases. As described in~\cite{fayyad96,simoudis96}, this process is typically made up of selection and sampling, preprocessing and cleaning, transformation and reduction, data mining, and evaluation steps. The first step in the data-mining process is to select a target data set from a database (or a data warehouse) and possibly to sample the target data. The preprocessing and data cleaning step handles noise and unknown values, as well as accounting for missing data fields, time sequence information, and so forth. The data reduction and transformation step involves finding relevant features depending on the goal of the task, applying certain transformations on the data such as converting one type of data to another (e.g., changing nominal values into numeric ones, discretizing continuous values), and/or defining new attributes. In the mining step, the user may apply one or more knowledge discovery techniques to the transformed data to extract valuable patterns. Finally, the evaluation step involves interpreting the result (or discovered pattern) with respect to the goal/task at hand. Note that the data-mining process is not linear and involves a variety of feedback loops, because any one step can result in changes in preceding or succeeding steps. Furthermore, the nature of a real-world data set,
which may contain noisy, incomplete, dynamic, redundant, continuous, and missing values, makes every step critical on the path from data to knowledge~\cite{RagHay96,MCS93}.
One of the methods used in the data-mining step is inductive learning, which is mostly concerned with finding general descriptions of a concept from a set of training examples. Practical data-mining tools generally employ a number of inductive learning algorithms. For example, Silicon Graphics' data mining and visualization product MineSet uses MLC++ as a base for its induction and classification algorithms (Kohavi et al. 1996). This paper focuses on establishing causal relationships from values of attributes to class labels via a heuristic search that starts with values of individual attributes and goes on to consider pairwise, triple, or further combinations of attribute values in sequence until the example set is covered.
Once a new learning algorithm has been introduced, it is not unusual to see additional papers appear that modify the algorithm to improve it in various ways. These improvements are important in establishing a method and clarifying when it is and is not useful. In this paper, we propose three modifications to improve upon such an algorithm, namely the Inductive Learning Algorithm (ILA) introduced by Tolun and Abu-Soud (1997). ILA is a consistent inductive algorithm that operates on a non-conflicting example set. It extracts a rule set that covers all instances in a given example set. The rules are in a suitable form for data exploration; namely, a description of each class in the simplest way that enables it to be distinguished from the other classes. The rule set is ordered in a modular fashion, which enables the user to focus on a single rule at a time. The induction bias used in ILA is to extract a rule from a set of promising rules for a class if and only if, among the candidates, the rule covers the greatest number of positive examples (and no negative examples) in proportion to the size of its description (i.e., the number of conjuncts). To implement this bias, ILA works in an iterative fashion, each iteration searching for a rule that covers a large number of training examples of a single class. Having found a rule, ILA removes the examples it covers from the training set by marking them and appends the rule at the end of its rule set. In other words, ILA works on a rules-per-class basis: for each class, rules are induced to separate examples in that class from examples in all the remaining classes. This produces an ordered list of rules rather than a decision tree.

In the following, we present our extensions to ILA. The first modification we propose is the ability to deal with uncertain data. Two different sources of uncertainty can be distinguished. One of these is noise, defined as nonsystematic errors in gathering or entering data. This may happen because of incorrect recording or transcription of data, or because of incorrect measurement or perception at an earlier stage. The second situation occurs when the descriptions of examples are insufficient to induce certain rules. We say in this paper that a data set is inconsistent iff the descriptions of its examples are not sufficient to induce certain rules. This phenomenon is also known as incomplete data, to point out the fact that some relevant features are missing, so that nonconflicting class descriptions cannot be extracted. In real-world problems this often constitutes the greatest source of error, because data that has been organized and collected around the needs of organizational activities is frequently incomplete from the viewpoint of the
knowledge discovery task. Under such circumstances, the knowledge discovery model should have the capability of providing approximate decisions with some confidence level. As opposed to incomplete data, the given data set may contain redundant or insignificant attributes with respect to the problem at hand. This case might arise in several situations. For example, combining relational tables to gather a relevant data set may result in redundant attributes that the user is not aware of, since un-normalized relational tables may involve redundant features in their contents. Fortunately, there exist many near-optimal solutions, or optimal solutions in special cases, with reasonable time complexity, that eliminate insignificant (or redundant) attributes from a given attribute set by using weights for either individual attributes or combinations of attributes. These types of algorithms are known as feature selection (or reduction) algorithms. The importance of feature selection in a broader sense is not only to reduce the search space, but also to speed up both concept learning and the classification of objects, and to improve the quality of classification~\cite{RagHay95c,KiR92,AlD91}. In ILA-2, a feature subset is selected in a pre-processing step, where features are selected based on the gain ratio criterion of C4.5. The Feature Subset Selection (FSS) property of the ILA-2 system has two available modes: in mode-1, the features having negative values according to the gain ratio criterion are eliminated, whereas in mode-2 the threshold is the average gain.

ILA-2 is an extension of ILA with respect to the modifications stated above. We have empirically compared ILA-2 with ILA using real-world data sets. The results show that ILA-2 is better than ILA in terms of accuracy in classifying unseen instances, size of the classifiers, and learning time. Test results with unseen examples also show that ILA-2 is comparable to both the CN2 and C4.5 algorithms.

The organization of this paper is as follows. In the following section, we briefly summarize the inductive learning algorithms that were compared to our own algorithm: ID3, C4.5, C4.5rules, OC1, and CN2. Section 3 introduces the ILA algorithm; the execution of the algorithm on an example task is also presented. In Section 4, the modifications to the ILA algorithm are described in detail. In Section 5, the time complexity analysis of ILA-2 is presented. In the last section, ILA-2 is empirically compared to five well-known algorithms in twelve different domains.

2. Induction Algorithms

In the 1980s, the best-known algorithm that generates a consistent decision tree with respect to an example set was Quinlan's ID3 algorithm (Quinlan, 1983). ID3 was derived from the Concept Learning System (CLS) algorithm described by Hunt, Marin & Stone (1966). It added two new features to improve the CLS algorithm. First, an information-theoretic splitting heuristic was used to enable small and efficient decision trees to be constructed. Second, a windowing process was incorporated that enabled the algorithm to cope with large training sets (Quinlan, 1986). With these advantages ID3 became a mainstream symbolic learning approach, and a number of derivatives have been proposed by many researchers. For example, ID4 incrementally builds a decision tree based on individually observed instances by maintaining positive and negative instance counts of every attribute that could be a test attribute (Schlimmer and Fisher, 1986).
ID5 provides an incremental method for building ID3-type decision trees but differs from ID4 in its method for replacing the test attribute (Utgoff, 1988), while GID3 and GID3* do not branch on each value of the chosen attribute, in order to reduce the unnecessary sub-division of data (Irani, Cheng, Fayyad, and Qian, 1993).

The basic structure of ID3 is iterative; at each step, a new node is added to the decision tree by partitioning the training examples based on their values along a single, most-informative attribute. The most-informative attribute is the one which minimizes the expected information for the tree. The expected information with the attribute A as root is obtained as the weighted average (Shavlik, Mooney and Towell, 1991):

$$E(A) = - \sum_{i=1}^{V} \frac{S_i}{S} \sum_{j=1}^{N} \frac{k_{ji}}{S_i} \log_2 \frac{k_{ji}}{S_i}$$
where V is the number of values of attribute A, $k_{ji}$ is the number of examples in the jth category with the ith value of attribute A, S is the total number of examples, $S_i$ is the number of examples with the ith value of attribute A, and N is the number of categories. The information gained by branching on A is calculated as gain(A) = info(S) - E(A). Here, info(S) is the information needed to classify examples given only the class totals as a whole, and is defined as the entropy of the set S. The gain criterion selects a test to maximize this information gain. In other words, the value of this information measure depends on the likelihood of the various possible attribute values. If they are equally likely (so that the probabilities of occurrence of the attribute values are equal), there is the greatest amount of uncertainty and the information gained will be greatest. The less equal the probabilities, the less information there is to be gained. Along this line, each resulting partition is processed recursively, unless it contains examples of only a single category, in which case a leaf is produced and labeled with the category.

The information gain criterion that determines the splitting attribute acts as a hill-climbing heuristic, which tends to minimize the size of the resulting decision tree. One problem with the gain criterion is that it has a strong bias in favor of tests with many outcomes (Quinlan, 1993). This can be seen by considering a hypothetical medical diagnosis task in which one of the attributes contains a patient identification. Partitioning any set of training cases on the values of this attribute will lead to subsets each containing just one case. Since all of these subsets necessarily contain cases of a single class, the information gain from using this attribute to partition the set of training cases is maximal. However, such a division is quite useless from the point of view of prediction. Another problem with ID3 is that the decision tree produced overfits the training examples (also known as the problem of small splits), because it performs stepwise splitting that attempts to optimize at each individual split rather than on an overall basis (McKee, 1995). This leads to decision trees that are too specific.
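To make the measures above concrete, the following fragment is a minimal sketch (ours, not the original ID3 code) that computes info(S), E(A), and gain(A) for one nominal attribute, with the data represented as (attribute value, class label) pairs:

    from collections import Counter, defaultdict
    from math import log2

    def entropy(class_counts):
        # info of a set, given a mapping class -> count
        total = sum(class_counts.values())
        return -sum((k / total) * log2(k / total)
                    for k in class_counts.values() if k > 0)

    def expected_info(pairs):
        # E(A): entropy of each partition induced by a value of A,
        # weighted by the partition's relative size S_i / S
        partitions = defaultdict(Counter)
        for value, label in pairs:
            partitions[value][label] += 1
        total = len(pairs)
        return sum((sum(p.values()) / total) * entropy(p)
                   for p in partitions.values())

    def gain(pairs):
        # gain(A) = info(S) - E(A)
        info_s = entropy(Counter(label for _, label in pairs))
        return info_s - expected_info(pairs)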
From a statistical point of view, very small (or overfitted) groups are quite likely to be chance occurrences and are therefore unreliable for predicting new sets of data.

The C4.5 algorithm is a descendant of ID3 which addresses the drawbacks of ID3 stated above. First, the bias inherent in the gain criterion was rectified by a sort of normalization in which the apparent gain attributable to tests with many outcomes is adjusted (Quinlan, 1993). The information content of a message pertaining to a case, which indicates the outcome of the test rather than the class to which the case belongs, is defined as:
$$\mathrm{split\_info}(A) = - \sum_{i=1}^{n} \frac{S_i}{S} \log_2 \frac{S_i}{S}$$

This represents the potential information generated by dividing S into n subsets, whereas the information gain measures the information relevant to classification that arises from the same division. The proportion of information generated by the split that is useful is then expressed as

gain_ratio(X) = gain(X) / split_info(X)

If the split is near-trivial, split information will be small and this ratio will be unstable. To avoid this, the gain ratio criterion selects a test to maximize the ratio above, subject to the constraint that the information gain must be greater than the average gain over all tests examined. The second problem of ID3, i.e., the over-fitting problem, is solved by pruning. C4.5 prunes by using the upper bound of a confidence interval on the re-substitution error estimate; since nodes with fewer instances have a wider confidence interval, they are removed if the difference in error between them and their parents is not significant.

C4.5rules generates rule-based classifiers from decision trees (Quinlan, 1993). The conversion process initially creates one rule for each leaf of the tree. It builds a rule by tracing back through the tree and collecting all the tests into a conjunction, which becomes the antecedent of the rule; the class label is the consequent of the rule. Rules are then usually generalized by dropping one or more of these conditions. In simplifying the rules, a greedy strategy is used that considers eliminating each of the conditions in turn: the error rate of the rule is estimated with each of the conditions deleted, and the condition whose removal decreases the error rate the most is eliminated. This continues as long as the error rate can be reduced. One problem with the C4.5rules conversion process is that the resulting rule set might contain overlapping rules, and it might not cover all possible cases. Thus one needs to decide which rule to apply when a case is covered by more than one rule, and one needs to add a default rule that covers the cases not otherwise covered. The solution in C4.5rules is to look for subsets of rules that cover each class in the data, using the Minimum Description Length (MDL) principle for guidance in finding small subsets (Quinlan, 1994).
Then an order for these class rule subsets is determined and a default class is chosen. The default class is the one with the most instances not covered by any rule. If there are more than a few rules, the consideration of subsets is not exhaustive (Quinlan & Cameron-Jones 1995). C4.5rules carries out a series of greedy searches, starting first with no rules, then a randomly chosen 10% of the rules, then 20%, and so on; each search attempts to improve the current subset by adding or deleting a single rule until no further improvement is possible. The best subset found in any of these searches is retained. As an option, simulated annealing may also be used to search for the best subset.

Other algorithms include OC1 (Murthy et al. 1994), a system for the induction of oblique decision trees suitable for domains where attributes have numeric values. Oblique decision trees are trees in which each node may contain a (linear) multivariate test on the attributes of the data; they are a natural extension of the well-known axis-parallel trees. OC1 also constructs standard axis-parallel trees, which contain tests of just one attribute at each node. Using multivariate tests at each node of a decision tree has both advantages and disadvantages: the resulting trees may be smaller and/or more accurate, but they may be more time-consuming to induce than univariate trees. OC1 attempts to divide the d-dimensional attribute space into homogeneous regions, i.e., regions that contain examples from just one category. The goal of adding new nodes to a tree is to split up the sample space so as to minimize the impurity of the training set. The system works with a large class of impurity measures such as information gain, the Gini index, and the twoing rule (Murthy et al., 1994). OC1 uses Breiman et al.'s (1984) cost-complexity pruning as the default pruning method. This method requires a separate pruning set. The idea behind it is to create a set of trees of decreasing size from the original, complete tree. All these trees are used to classify the pruning set, and accuracy is estimated from that. Cost-complexity pruning then chooses the smallest tree whose accuracy is within k standard errors of the best accuracy obtained.

The CN2 algorithm induces an ordered list of classification rules from examples using entropy as its search heuristic (Clark and Niblett, 1989). The algorithm is suitable for domains where there might be noise. CN2 consists of two main procedures: a search algorithm performing a beam search for a good rule, and a control algorithm for repeatedly executing the search. To avoid selecting highly specific rules, CN2 uses a significance test which ensures that the distribution of examples among the classes covered by a rule is significantly different from that which would occur by chance. In this way, many rules covering only a few examples are eliminated. Later, using the Laplacian error estimate as a heuristic, the algorithm's performance was improved by Clark and Boswell (1991). The Laplace expected accuracy estimate is given by the formula

LaplaceAccuracy = (Nc + 1) / (Ntot + k)

where k is the number of classes in the domain, Nc is the number of examples in the predicted class c covered by the rule, and Ntot is the total number of examples covered by the rule. When generating a rule list, the predicted class c for a rule is simply the class with the most covered examples.
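For example, under this estimate a rule that covers 10 examples, 8 of which belong to the predicted class in a three-class domain, scores (8 + 1)/(10 + 3) ≈ 0.69, while a rule covering a single example of the predicted class scores only (1 + 1)/(1 + 3) = 0.5 even though it is perfectly accurate on the training data; the estimate thus discounts rules with small coverage.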
3. The ILA Inductive Learning Algorithm

Having reviewed some induction algorithms in the previous section, we can now turn to the ILA inductive learning algorithm, which was extended in this study to cover uncertain data. ILA generates a set of classification rules from a collection of training examples. An example is described in terms of a fixed set of attributes, each with its own set of possible values. The algorithm works in an iterative fashion, each iteration searching for a rule that covers a large number of training examples of a single class. Having found a rule, ILA removes the examples it covers from the training set by marking them and appends the rule at the end of its rule set. In other words, the algorithm works on a rules-per-class basis: for each class, rules are induced to separate examples in that class from examples in all the remaining classes. This produces an ordered list of rules rather than a decision tree. Details of the ILA algorithm are given in Figure 1.
For each class attribute value perform {
  1. Set size for descriptions (j) to 1.
  2. While there are unclassified examples in the current class and j is less than or equal to the number of attributes: {
     1. Generate the set of all descriptions in the current class for unclassified examples using the current description size.
     2. Update occurrence counts of all descriptions in the current set.
     3. Find description(s) passing the goodness measure.
     4. If there is any description that has passed the goodness measure: {
        1. Assert rule(s) using the 'good' description(s).
        2. Mark items covered by the new rule(s) as classified.
        }
        Else increment description size j by 1.
     }
}
Figure 1. The ILA inductive learning algorithm

ILA constructs production rules in a general-to-specific way, i.e., starting off with the most general rule possible and producing specific rules whenever it is deemed necessary. The advantages of ILA can be stated as follows:

• The rules are in a suitable form for data exploration; namely, a description of each class in the simplest way that enables it to be distinguished from the other classes.
• The rule set is ordered in a modular fashion, which enables the user to focus on a single rule at a time. Direct rule extraction is preferred over decision trees, as the latter are hard to interpret, particularly when the number of nodes is large.

3.1 Description of the Algorithm by use of an Example

In describing ILA we shall make use of a simple training set. Consider the training set for object classification given in Table 1, consisting of seven examples with three attributes and a class attribute with two possible values.
Table 1. Object Classification Training Set (Thornton, 1992).

Example no.  Size    Color  Shape   Class
1            medium  blue   brick   yes
2            small   red    sphere  yes
3            small   red    wedge   no
4            large   green  pillar  yes
5            large   red    pillar  no
6            large   green  sphere  yes
7            large   red    wedge   no
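As an illustration of how the counts in the tables below can be obtained, the following sketch (ours, not the original implementation) encodes the training set of Table 1 and counts, for every attribute-value pair, the positive and negative examples of a target class that it covers; it reproduces the counts of Table 2 below for the class yes:

    from collections import Counter

    # Training set of Table 1: (size, color, shape, class)
    examples = [
        ("medium", "blue",  "brick",  "yes"),
        ("small",  "red",   "sphere", "yes"),
        ("small",  "red",   "wedge",  "no"),
        ("large",  "green", "pillar", "yes"),
        ("large",  "red",   "pillar", "no"),
        ("large",  "green", "sphere", "yes"),
        ("large",  "red",   "wedge",  "no"),
    ]
    attributes = ("size", "color", "shape")

    def count_size1_descriptions(examples, target):
        # true positives / false positives of every size-1 description
        tp, fp = Counter(), Counter()
        for *values, label in examples:
            for attr, value in zip(attributes, values):
                if label == target:
                    tp[(attr, value)] += 1
                else:
                    fp[(attr, value)] += 1
        return tp, fp

    tp, fp = count_size1_descriptions(examples, "yes")
    # ILA's criterion: among descriptions with fp == 0, take one
    # with maximal tp; here ("shape", "sphere") with tp = 2 -> Rule 1.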
Let us trace the execution of the ILA algorithm on this training set. After reading the object data, the algorithm starts with the first class (yes) and generates hypotheses in the form of descriptions. A description is a conjunction of attribute-value pairs; descriptions are used to form the left-hand sides of rules in the rule generation step.

Table 2. First Set of Descriptions

Description     True Positives  False Positives
size = medium   1               0
color = blue    1               0
shape = brick   1               0
size = small    1               1
color = red     1               3
shape = sphere  2               0   √
size = large    2               2
color = green   2               0
shape = pillar  1               1
For each description, the number of true positives (positive examples covered) and false positives (negative examples covered) is found. Descriptions 6 and 8 are the ones with the most true positives and no covered negatives. Since description 6 comes first, it is selected by default, and the following rule is generated from it.
Rule 1: IF shape = sphere -> class is yes
After the rule is extracted, instances 2 and 6, covered by this rule, are marked as classified. These instances are not used in hypothesis generation afterwards. Next, the algorithm generates the descriptions shown in Table 3.
Table 3. Second Set of Descriptions

Description     True Positives  False Positives
size = medium   1               0   √
color = blue    1               0
shape = brick   1               0
size = large    1               2
color = green   1               0
shape = pillar  1               1
In the second hypothesis space there are four descriptions of equal quality: 1, 2, 3, and 5. The system selects the first description to generate a new rule. Rule 2: IF size = medium -> class is yes
Item 1, covered by this rule, is marked as classified, and the next set of descriptions is generated.

Table 4. Third Set of Descriptions

Description     True Positives  False Positives
size = large    1               2
color = green   1               0   √
shape = pillar  1               1
This time only one description satisfies the ILA quality criterion, and it is used to generate the following rule. Rule 3: IF color = green -> class is yes
Example 4, covered by this rule, is marked as classified. Now all examples of the current class (yes) are marked as classified, and the algorithm continues with the next class (no). The fourth set of descriptions, extracted from the class-no examples, is shown in Table 5.
Table 5. Fourth Set of Descriptions

Description     True Positives  False Positives
size = small    1               1
color = red     3               1
shape = wedge   2               0   √
size = large    2               2
shape = pillar  1               1
Only one description, shape = wedge, passes the ILA quality criterion; therefore, the following rule is extracted. Rule 4: IF shape = wedge -> class is no
Examples 3 and 7 are covered by this rule and marked as classified.

Table 6. Fifth Set of Descriptions

Description     True Positives  False Positives
size = large    1               2
color = red     1               1
shape = pillar  1               1
This time no description has enough quality to be used in rule generation, so the description size is incremented by one.

Table 7. Sixth Set of Descriptions

Description                      True Positives  False Positives
color = red AND size = large     1               0   √
size = large AND shape = pillar  1               1
color = red AND shape = pillar   1               0
The first and third descriptions pass the ILA quality criterion; the first one is selected for rule generation by default. Rule 5: IF color = red and size = large -> class is no
After this last rule is generated, item 5 is marked. The algorithm now stops, because all the examples in the training set have been marked as classified, i.e., all the examples are covered by the current rule set.

4. Extensions to ILA

Two main problems of ILA are over-fitting and long learning times. The over-fitting problem is due to ILA's bias toward generating a classifier that is consistent with the training data. However, training data often includes noisy examples, which causes over-fitting in the generated classifiers. We developed a novel heuristic function that avoids this bias in the presence of noisy examples. We also propose two modifications to make ILA faster: one allows more than one rule to be generated in a single iteration, and the other is feature subset selection using the gain ratio criterion as the feature evaluation metric.

The ILA algorithm and its extensions have been implemented using the source code of the C4.5 programs. Therefore, in addition to the extensions stated above, the algorithm has also been enhanced with some features of C4.5, such as rule sifting and default class selection (Quinlan, 1993). During the classification process, the ILA system first extracts a set of rules using the ILA algorithm. In the next step, the sets of extracted rules for the classes are ordered to minimize false positive errors, and then a default class is chosen. The default class is the one with the most instances not covered by any rule; ties are resolved in favor of more frequent classes. Once the class order and default class have been established, the rule set is subjected to a post-pruning process: if there are one or more rules whose omission would actually reduce the number of classification errors on the training cases, the first such rule is discarded and the set is checked again. This last step allows a final global scrutiny of the rule set as a whole in the context in which it will be used.

4.1 Novel Evaluation Function

The ILA algorithm generates classifiers with a minimum number of certain rules that are consistent with the training data, provided that there exist no two examples that have the same values for all the features but conflicting classes. This approach has the drawback that when noise is present in the training data, over-fitting occurs, i.e., unnecessarily long rules are generated. Over-fitting usually decreases the performance of classifiers, especially in classifying unseen examples. To prevent over-fitting, one must usually relax the requirement that the induced rules be consistent with all the training data. In order to solve the over-fitting problem of ILA, we have developed a new heuristic function that tolerates noise while also taking into account preferences defined by the user. In this section we present a detailed description of the evaluation function.

In general, an evaluation function's score for a description should increase in proportion to both the number of positive instances covered, denoted by p, and the number of negative instances not covered, denoted by nc. In order to normalize the score, a simple metric takes into account the total number of positive instances, P, and negative instances, N, which is given in (Langley, 1996) as
(p + nc) / (P + N), where the resulting value ranges between 0 (when no positives and all negatives are covered) and 1 (when all positives and no negatives are covered). This ratio may be used to measure the overall classification accuracy of a description on the training data. Since P and N are constant, this measure can actually be reduced to p - n, where n is the number of negatives covered and is equal to N - nc.

We may now turn to the description evaluation metric used in ILA, which can be expressed as follows: the description should not occur in any of the negative examples of the current class AND must be the one with the maximum occurrence in positive examples of the current class. Since this metric assumes no uncertainty to be present in the training data, it searches a given description space to extract a rule set that classifies the training data perfectly. It is, however, a well-known fact that an application targeting real-world domains should address how to handle the uncertainty present in real-world data. Generally, uncertainty-tolerant classification requires relaxing the constraint that the induced descriptions must classify the consistent part of the training data (Clark and Niblett, 1989), which is equivalent to saying that classification methods should generate almost-true rules (...). Considering this point, an uncertainty-tolerant version of the ILA metric has been developed. This idea is also supported by one of the guiding principles of soft computing: "Exploit the tolerance for imprecision, uncertainty, and partial truth to achieve tractability, robustness, and low solution cost" (Zadeh, 1994).

Using the ideas presented above, a new quality criterion is established, in which the quality of a description D is calculated by the following heuristic:

Quality(D) = p - PF * n

Here, p is the number of correctly identified positive examples, n is the number of wrongly identified negative examples, and PF is a penalty factor, a user-defined minimum for the proportion of p to n. The penalty factor is related to the well-known sensitivity measure used for accuracy estimation. Sensitivity (Sn) is usually defined as the proportion of correctly predicted items to the number of items covered:

Sn = p / (p + n)

At the decision boundary Quality(D) = 0, i.e., when p = PF * n, sensitivity may be equivalently rewritten in terms of the penalty factor as

Sn = PF / (PF + 1)

User-defined PF values may be converted to the sensitivity measure using the above equation.
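The heuristic and the conversion between the penalty factor and sensitivity can be sketched as follows (a minimal illustration; the identifier names are ours):

    def quality(p, n, pf):
        # Quality(D) = p - PF * n: positives covered minus
        # penalty-weighted negatives covered
        return p - pf * n

    def sensitivity_from_pf(pf):
        # Sn = PF / (PF + 1): sensitivity at the decision
        # boundary Quality(D) = 0
        return pf / (pf + 1.0)

    # E.g., with PF = 5 the description covering p = 70, n = 2
    # scores 60 and beats p = 5, n = 0 (score 5), whereas the
    # original ILA criterion rejects any description with n > 0.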
Figure 2. Penalty factor vs. sensitivity values

As seen in Figure 2, the sensitivity value approaches one as the penalty factor increases. The advantage of the new heuristic may be seen in an example case. Suppose we have two descriptions, one with p = 70 and n = 2, the other with p = 5 and n = 0. The ILA quality criterion selects the second description. The soft criterion, however, selects the first one, which is intuitively more predictive than the second. When the penalty factor approaches the number of training examples, the selection made with the new formula is the same as with the ILA criterion. On the other hand, a zero penalty factor means that the number of negative training examples has no importance and only the number of positive examples is looked at, which is quite an optimistic choice.

Properties of classifiers generated with different values of the penalty factor for the splice data set are given in Table 8. As seen in Table 8, the number of rules and the average number of conditions per rule increase as the penalty factor increases; ILA-2 (ILA with a penalty factor) always constructs smaller classifiers than ILA.

Table 8. Results for the splice data set with different values of the penalty factor

Penalty  Number    Average number  Total number   Accuracy on    Accuracy on
factor   of rules  of conditions   of conditions  training data  test data
1        13        2.0             26             82.4%          73.4%
2        31        2.3             71             88.7%          81.6%
3        38        2.3             86             94.7%          87.6%
4        50        2.5             125            96.1%          87.2%
5        53        2.4             128            97.9%          86.9%
7        63        2.5             158            98.7%          85.4%
10       66        2.5             167            99.6%          88.5% √
30       87        2.6             228            100.0%         71.7%
ILA      91        2.6             240            100.0%         67.9%
As seen in Figure 3, for all reasonable values of the penalty factor (1-30) we get better results than ILA in terms of estimated accuracy. For this data set, the maximum estimated accuracy is obtained when the penalty factor is 10; increasing the penalty factor further does not give better classifiers. However, when we look at the accuracy on the training data, we see a steady increase in accuracy as the penalty factor increases.
Figure 3. Comparison of accuracy on training and test data for the splice data set.

4.2 Faster Pass Criteria

In each iteration, after an exhaustive search for good descriptions in the search space, ILA selects only one description to generate a new rule. This is an expensive way of extracting rules. If there exists more than one description of the same (best) quality, all of these descriptions may instead be used for rule generation. This second approach tends to decrease the processing time; on the other hand, it might result in redundant rules and an increase in the size of the output rule set. This idea is implemented in the ILA system, and the option that activates this feature is named FastILA. For example, in the case of the promoter data set, FastILA reduced the processing time from 17 seconds to 10 seconds, while the number of final rules decreased by one and the total number of conditions by two. The experiments show that if the size of the classification rule set is not extremely important and less processing time is desirable, then the FastILA option is the more suitable choice.

Table 9. Effect of the FastILA option in terms of evaluation parameters
Training set  Number of initial rules  Number of final rules  Total size (conditions)  Time (seconds)
              ILA      ILA-2           ILA      ILA-2         ILA      ILA-2           ILA      ILA-2
splice        93       825             91       76            240      206             1569     745
coding        141      1073            108      112           319      338             1037     345
promoter      14       226             14       13            27       25              17       10
As seen in Table 9, ILA-2 (ILA with the fast pass criteria) generates a much higher number of initial rules, 10 to 15 times more than ILA, because the faster pass criteria permit more than one rule to be asserted at once. The sifting process, however, removes all unnecessary rules.
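The difference between the two pass criteria can be sketched as follows (our illustration, assuming each candidate description has already been scored by the quality measure of Section 4.1):

    def select_descriptions(scored, fast_ila=False):
        # scored: mapping description -> quality score.
        # Standard ILA asserts one rule per pass; FastILA asserts a
        # rule for every description tied at the best score.
        best = max(scored.values())
        winners = [d for d, q in scored.items() if q == best]
        return winners if fast_ila else winners[:1]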
4.3 Feature Selection

Practical algorithms in supervised machine learning degrade in performance when faced with many features that are not necessary for predicting the desired output. Feature selection, a pre-pruning process in inductive learning, is the problem of choosing a small subset of features that is necessary and sufficient to describe the target concept(s). It is well known that searching for the smallest subset of features in the feature space takes time bounded by $O(2^{l}J)$, where $l$ is the number of features and $J$ is the computational effort required to evaluate each subset. This type of exhaustive search is appropriate only if $l$ is small and $J$ is computationally inexpensive. Greedy approaches like stepwise backward/forward techniques~\cite{Jam85,Mod93}, dynamic programming~\cite{Chang73}, and branch and bound algorithms~\cite{narendra77} are non-exhaustive and efficient search techniques, which can be applied with some feature selection criterion. For near-optimal solutions, or optimal solutions in special cases, weights of either individual features or combinations of features are computed with respect to some feature selection criteria (or measures), such as the Bhattacharyya coefficient, divergence, or the Kolmogorov variational distance in statistics~\cite{Devijver82,Miller90}, and Shannon's entropy criterion, classification accuracy, or classification quality based on the {\it dice coefficient} in pattern recognition and machine learning~\cite{RagHay95c,Fayyad92,kohavi94}.

The FOCUS algorithm and its versions (Almuallim and Dietterich, 1994) exploit, for example, the notion of probably approximately correct learning (Blumer et al., 1989) to implement the induction bias of selecting a minimum subset of features for a given example set. The FOCUS algorithm approximates some unknown target concept with externally defined accuracy and confidence parameters, provided that the concept is binary and the attribute values are drawn from a binary domain. Furthermore, it assumes, as a special case, that the given features are independent in their relevancy to the unknown concept, so that greedy algorithms using the stepwise forward technique can be applied to reduce the size (or complexity) of the search space. It should be noted, however, that finding a minimum subset of features might be too restrictive a requirement for practical induction algorithms, as reported in~\cite{kohavi94}, whose authors used the accuracy measure of induced classifiers based on the rough set model to find useful feature subsets. A similar idea is pursued by Choubey et al. in their study on a family of sequential backward selection algorithms of polynomial time complexity for improving upper classifiers -- one of the classification methods in rough set theory (Choubey et al., 1996). They proved that the sequential backward selection algorithm finds a small subset of relevant features that are ideally sufficient and necessary to define the target concepts with respect to a given threshold, where the threshold value indicates the acceptable degradation of classification quality.

Filter and wrapper methods are the two usual approaches to the elimination of irrelevant features. Filter methods filter out irrelevant attributes before the induction process starts; ILA-2 uses this approach. In ILA-2, a feature subset is selected in a preprocessing step, where features are selected based on the gain ratio criterion of C4.5.
The Feature Subset Selection (FSS) property of the ILA-2 system has two available modes. In mode-1, features having negative values according to the gain ratio criterion are eliminated; in mode-2 the threshold is the average gain, so mode-2 generally marks more features as irrelevant. For example, there are 57 features in the promoter data set; in FSS mode-1, 18 of the features are selected as relevant, whereas in mode-2 only 7 of the features are found to be relevant.

Table 10. Properties of classifiers by FSS options
Training set  Full features                 FSS mode-1                    FSS mode-2
              features  size  accuracy      features  size  accuracy      features  size  accuracy
splice        60        240   67.9%         18        226   72.2%         9         184   83.7%
coding        15        319   68.7%         8         259   64.5%         0         0     50%
promoter      57        27    100%          18        31    100%          7         23    100%
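The two modes can be sketched as follows (our illustration; gain_ratios is assumed to map each feature to its gain ratio score, computed as in Section 2):

    def select_features(gain_ratios, mode):
        if mode == 1:
            # mode-1: eliminate features with negative gain ratio
            return {f for f, g in gain_ratios.items() if g >= 0.0}
        # mode-2: keep only features scoring above the average,
        # which generally marks more features as irrelevant
        avg = sum(gain_ratios.values()) / len(gain_ratios)
        return {f for f, g in gain_ratios.items() if g > avg}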
5. Time Complexity Analysis

In the time complexity analysis of ILA, only the upper bounds of the critical components of the algorithm are provided, because the overall complexity of any algorithm is domain-dependent. Let us define the following notation:

- e is the number of examples in the training set.
- Na is the number of attributes.
- c is the number of class attribute values.
- j is the number of attribute-value pairs in a description.
- S is the number of descriptions in the current set of test descriptions.

The ILA-2 algorithm comprises two main loops. The outer loop is performed once for each class attribute value, contributing a factor of O(c). Inside it, the size of the descriptions for the hypothesis space (also known as the version space) is first set to one. A description is a combination of attribute-value pairs; descriptions are used as the left-hand sides of the IF-THEN rules in the rule generation step. Then an inner loop is executed until all examples of the current class are covered by the generated rules or the maximum description size is reached, contributing a factor of O(Na). Inside the inner loop, first a set of temporary descriptions of the given size is generated from the training examples of the current class, taking O((e/c) * Na) time. Then occurrences of these descriptions are counted, O(e * S). Next, good descriptions maximizing the ILA quality measure are searched for, O(S). If some good descriptions are found, they are used to generate new rules, and the examples matching these new rules are marked as classified, O(e/c). When no description passes the quality criterion, the description size for the next iteration is incremented by one. The steps of the algorithm are given in short notation as follows:

for (1 to c) {
  for (1 to Na) {
    Hypothesis generation;         // O((e/c) * Na)
    Occurrence counting;           // O(e * S)
    Evaluation of descriptions;    // O(S)
    Marking covered instances;     // O(e/c)
  }
}
The overall time complexity is therefore

$$O\left(c \cdot N_a \cdot \left(\tfrac{e}{c} N_a + eS + S + \tfrac{e}{c}\right)\right) = O\left(e N_a^2 + c N_a S (e+1) + e N_a\right).$$

Replacing e + 1 with e, the complexity becomes

$$O\left(e N_a^2 + c N_a e S + e N_a\right) = O\left(e N_a (N_a + cS + 1)\right).$$

As the constant term is negligible, this is $O(e N_a (N_a + cS))$. Usually $cS$ is much larger than $N_a$ (e.g., in our experiments we selected S as 1500 while the maximum Na was only 60), and c is also comparatively smaller than S, so we may simplify the complexity to

$$O(e \cdot N_a \cdot S).$$

The time complexity of ILA is thus linear in the number of attributes and in the number of examples; the size of the hypothesis space (S) also affects the processing time linearly.

6. Evaluation of the Inductive Learning Algorithm (ILA-2)

The evaluation of a learning algorithm is a complex task. One way it can be assessed is in terms of its performance on specific tasks which are assumed to be representative of the range of tasks the system is intended to perform (Cameron-Jones and Quinlan, 1994). For the evaluation of ILA-2 we have mainly used two parameters: classifier size and accuracy. Classifier size is the total number of conditions of the rules in the classifier; for decision tree algorithms, classifier size refers to the number of leaf nodes in the decision tree, i.e., the number of regions into which the data is divided by the tree. Accuracy is the estimated accuracy on test data. We have used the hold-out method to estimate the future prediction accuracy on unseen data (...). We used twelve different training sets from the UCI repository (Merz and Murphy, 1997); Table 11 summarizes the characteristics of these data sets. In order to test the ability of the algorithms to classify unseen examples, a standard practice is to reserve a portion of the data set as a separate test set, which is not used in building the classifiers. We obtained the test sets related to these data sets from the UCI repository and used them to estimate the accuracy of the classifiers.
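Hold-out estimation itself is straightforward; as a sketch (ours), given a classifier and the reserved test set:

    def holdout_accuracy(classify, test_set):
        # fraction of reserved test examples whose predicted
        # class matches the true class
        correct = sum(1 for example, label in test_set
                      if classify(example) == label)
        return correct / len(test_set)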
Table 11. Characteristics of the Data Sets

Domain name  Number of   Examples in    Class frequencies                  Examples in  Class frequencies
             attributes  training data  in training data                   test data    in test data
Lenses       4           16             1: 13%, 2: 13%, 3: 75%             8            1: 25%, 2: 37.5%, 3: 37.5%
Monk1        6           124            no: 50%, yes: 50%                  432          no: 50%, yes: 50%
Monk2        6           169            0: 62%, 1: 38%                     432          0: 67.1%, 1: 32.9%
Monk3        6           122            no: 51%, yes: 49%                  432          no: 47.2%, yes: 52.8%
mushroom     22          5416           edible: 52%, poisonous: 48%        2708         edible: 51.1%, poisonous: 48.9%
parity5+5    10          100            0: 45%, 1: 55%                     1024         0: 50%, 1: 50%
Tic-tac-toe  9           638            positive: 66%, negative: 34%       320          positive: 63.4%, negative: 36.6%
vote         16          300            democrat: 61%, republican: 39%     135          democrat: 61%, republican: 39%
zoo          16          67             1: 43%, 2: 1%, 3: 1%, 4: 3%,       34           1: 35%, 2: 18%, 3: 12%, 4: 12%,
                                        5: 1%, 6: 9%, 7: 10%                            5: 9%, 6: 6%, 7: 9%
splice       60          700            EI: 30%, IE: 30%, Neither: 40%     3170         EI: 24%, IE: 24%, Neither: 52%
coding       15          600            coding: 50%, non-coding: 50%       5000         coding: 50%, non-coding: 50%
promoter     57          106            promoter: 50%, non-promoter: 50%   40           promoter: 57%, non-promoter: 43%
We ran three different versions of the ILA algorithm on these data sets: ILA-2 with the penalty factor set to 1 and to 5, and the basic ILA algorithm. In addition, we ran three decision tree algorithms and two rule induction algorithms: ID3 (Quinlan, 1986), C4.5 (Quinlan, 1994), OC1 (Murthy et al. 1994), C4.5rules (Quinlan, 1994), and CN2 (Clark and Niblett, 1989). Algorithms other than ILA-2 were run using the default settings supplied with their systems. The estimated accuracies of the generated classifiers on the test sets are given in Table 12, on which our comparative interpretation is based.
Table 12. Estimated accuracies of the algorithms

Domain name  ILA + PF=1  ILA + PF=5  ILA   ID3   C4.5 pruned  OC1   C4.5rules  CN2
Lenses       62.5        50          50    62.5  62.5         37.5  62.5       62.5
Monk1        100         100         100   81.0  75.7         91.2  93.5       98.6
Monk2        59.7        66.7        78.5  69.9  65.0         96.3  66.2       75.4
Monk3        100         87.7        88.2  91.7  97.2         94.2  96.3       90.7
mushroom     98.2        100         100   100   100          99.9  99.7       100
parity5+5    50.0        51.1        51.2  50.8  50.0         52.4  50.0       53
Tic-tac-toe  84.1        98.1        98.1  80.9  82.2         85.6  98.1       98.4
vote         97.0        96.3        94.8  94.1  97.0         96.3  95.6       95.6
zoo          88.2        91.2        91.2  97.1  85.3         73.5  85.3       82.4
splice       73.4        86.9        67.9  89.0  90.4         91.2  92.7       84.5
coding       70.0        70.7        68.7  65.7  63.2         65.9  64.0       100
promoter     97.5        97.5        100   100   95.0         87.5  97.5       100
The ILA algorithms have higher accuracies than the C4.5 and OC1 algorithms on six and nine of the twelve domains' test sets, respectively. Compared to CN2, ILA-2 performs better in six of the twelve domains' test sets, and they have similar accuracies in three domains. Table 13 shows the sizes of the output classifiers generated by the same algorithms for the same data sets. The results in the table show that ILA-2 (ILA with the novel evaluation function) is comparable to the C4.5 algorithms in terms of generated classifier size. When the penalty factor is set to 1, soft ILA produces smaller classifiers than the C4.5 algorithms for seven of the twelve domains. With regard to the results in Table 13, it may be worth pointing out an interesting finding: C4.5 and ILA-2 solve the over-fitting problems of ID3 and of the original (certain-rule) ILA, respectively. The sizes of the classifiers generated by the corresponding classification methods show these relationships clearly.
Table 13. Size of the Classifiers Generated by Various Algorithms

Domain name  Penalty factor 1  Penalty factor 5  Pure ILA  ID3  C4.5 pruned  C4.5rules
Lenses       9                 13                13        9    7            8
Monk1        14                37                22        92   18           23
Monk2        9                 115               81        176  31           35
Monk3        5                 48                88        42   12           25
mushroom     9                 13                69        29   30           11
parity5+5    2                 67                17        107  23           17
Tic-tac-toe  15                88                —         304  85           66
vote         18                35                —         67   7            8
zoo          11                16                —         21   19           14
splice       26                128               240       201  81           88
coding       64                256               319       429  109          68
promoter     7                 18                27        41   25           12
7. Conclusion

ILA-2 is a supervised rule induction algorithm for classifying symbolic data. In this paper, using a number of machine learning and real-world data sets, we have shown that the performance of ILA-2 is comparable with that of well-known algorithms, namely CN2, OC1, ID3, and C4.5.

The main contribution of our work is the evaluation metric utilized in ILA-2 for selecting the description(s) of a rule. In other words, users can reflect their preferences, via the penalty factor, to tune (or control) the performance of the ILA-2 system with respect to the nature of the domain at hand, which provides a valuable advantage over most inductive learning algorithms.

Acknowledgments

All of the data sets were obtained from the University of California, Irvine repository of machine learning databases and domain theories, managed by Patrick M. Murphy. We thank Ross Quinlan and Peter Clark for the implementations of C4.5 and CN2, and Ronny Kohavi for MLC++, which we used to execute the OC1, CN2, and ID3 algorithms.
References

Ali, K.M., and Pazzani, M.J. (1993). "HYDRA: A Noise-tolerant Relational Concept Learning Algorithm", Proceedings of the 13th International Joint Conference on Artificial Intelligence, (Ed. R. Bajcsy), Philadelphia, PA: Morgan Kaufmann, pp.1064-1070.

Almuallim, H., and Dietterich, T.G. (1991). "Learning with Many Irrelevant Features", Proceedings of AAAI-91, vol. 2, pp.547-552.

Almuallim, H., and Dietterich, T.G. (1994). "Learning Boolean Concepts in the Presence of Many Irrelevant Features", Artificial Intelligence, 69, pp.279-305.

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1989). "Learnability and the Vapnik-Chervonenkis Dimension", J. ACM, 36(4), pp.929-965.

Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and Regression Trees. Monterey, Calif.: Wadsworth and Brooks.

Cameron-Jones, R.M., and Quinlan, J.R. (1994). "Efficient Top-down Induction of Logic Programs", ACM SIGART Bulletin, 5(1), pp.33-42.

Carter, C., and Catlett, J. (1987). "Assessing Credit Card Applications Using Machine Learning", IEEE Expert, 2(3), pp.71-79.

Catlett, J. (1991). "On Changing Continuous Attributes into Ordered Discrete Attributes", Proceedings of EWSL-91, (Ed. Y. Kodratoff), Berlin, Germany: Springer-Verlag, pp.164-178.

Cestnik, B., Kononenko, I., & Bratko, I. (1987). "ASSISTANT 86: A Knowledge-Elicitation Tool for Sophisticated Users", in I. Bratko & N. Lavrac (eds.), Progress in Machine Learning, Wilmslow, UK: Sigma Press, pp.31-45.

Ching, J.Y., Wong, A.K.C., and Chan, K.C.C. (1995). "Class-Dependent Discretization for Inductive Learning from Continuous and Mixed-Mode Data", IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7), pp.641-651.

Choubey, K.S., Deogun, J.S., Raghavan, V.V., and Sever, H. (1996). "A Comparison of Feature Selection Algorithms in the Context of Rough Classifiers", Proceedings of the Fifth International Conference on Fuzzy Systems (FUZZ-IEEE'96), New Orleans, LA, USA, pp.1122-1128.

Clark, P. & Niblett, T. (1989). "The CN2 Induction Algorithm", Machine Learning, 3, pp.261-283.

Clark, P. & Boswell, R. (1991). "Rule Induction with CN2: Some Recent Improvements", Proceedings of EWSL-91, (Ed. Y. Kodratoff), Berlin, Germany: Springer-Verlag, pp.151-163.
Deogun, J.S., Raghavan, V.V., Sarkar, A., and Sever, H. (1997). "Data Mining: Research Trends, Challenges, and Applications", Proceedings of ACM CSC '95, (Ed. T.Y. Lin), Kluwer Academic Publishers.

Dougherty, J., Kohavi, R., and Sahami, M. (1995). "Supervised and Unsupervised Discretization of Continuous Features", in A. Prieditis and S. Russell (eds.), Proc. of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, San Francisco.

Fayyad, U.M., and Irani, K.B. (1993). "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning", Proceedings of the 13th International Joint Conference on Artificial Intelligence, (Ed. R. Bajcsy), Philadelphia, PA: Morgan Kaufmann, pp.1022-1027.

Fayyad, U.M., et al. (1996). Advances in Knowledge Discovery and Data Mining, AAAI Press.

Holsheimer, M., & Siebes, A. (1994). "Data Mining - The Search for Knowledge in Databases", (Report No. CS-R9406). CWI, Amsterdam, The Netherlands.

Hunt, E.B., Marin, J., & Stone, P.J. (1966). Experiments in Induction. New York, London: Academic Press.

Irani, K.B., Cheng, J., Fayyad, U.M., and Qian, Z. (1993). "Applying Machine Learning to Semiconductor Manufacturing", IEEE Expert, 8(1), pp.41-47.

Kohavi, R., and Sommerfield, D. (1995). "Feature Subset Selection Using the Wrapper Method: Overfitting and Dynamic Search Space Topology", First Int. Conference on Knowledge Discovery and Data Mining, KDD-95.

Langley, P. (1996). Elements of Machine Learning, San Francisco: Morgan Kaufmann Publishers.

McKee, T.E. (1995). "Predicting Bankruptcy via Induction", Journal of Information Technology, 10, pp.26-36.

Merz, C.J., & Murphy, P.M. (1997). UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science.

Michalski, R.S., & Larson, J.B. (1978). "Selection of Most Representative Training Examples and Incremental Generation of VL1 Hypothesis: The Underlying Methodology and the Descriptions of Programs ESEL and AQ11" (Report No. 867). Urbana, Illinois: Department of Computer Science, University of Illinois.
Michalski, R.S. (1983). "A Theory and Methodology of Inductive Learning". In R.S. Michalski, J.G. Carbonell & T.M. Mitchell, Machine Learning, an Artificial Intelligence Approach, Palo Alto, CA: Tioga.

Michalski, R.S., Mozetic, I., Hong, J., & Lavrac, N. (1986). "The Multipurpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains", Proc. of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA: Morgan Kaufmann, pp.1041-1045.

Murthy, S.K., Kasif, S., & Salzberg, S. (1994). "A System for Induction of Oblique Decision Trees", Journal of Artificial Intelligence Research, 2, pp.1-32.

Pham, D.T. & Aksoy, M.S. (1995). "RULES: A Simple Rule Extraction System", Expert Systems with Applications, 8(1), pp.59-65.

Quinlan, J.R. (1983). "Learning Efficient Classification Procedures and their Application to Chess End Games". In R.S. Michalski, J.G. Carbonell & T.M. Mitchell, Machine Learning, an Artificial Intelligence Approach, Palo Alto, CA: Tioga, pp.463-482.

Quinlan, J.R. (1986). "Induction of Decision Trees", Machine Learning, 1, pp.81-106.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Philadelphia, PA: Morgan Kaufmann.

Quinlan, J.R. (1994). "The Minimum Description Length Principle and Categorical Theories", Proceedings of the Eleventh International Conference on Machine Learning, pp.233-241.

Quinlan, J.R., & Cameron-Jones, R.M. (1995). "Oversearching and Layered Search in Empirical Learning", Proceedings of IJCAI-95.

Rivest, R.L. (1987). "Learning Decision Lists", Machine Learning, 2, pp.229-246.

Salzberg, S. (1995). "On Comparing Classifiers: A Critique of Current Research and Methods", Technical Report JHU-95/06, Department of Computer Science, Johns Hopkins University.

Schlimmer, J.C. & Fisher, D. (1986). "A Case Study of Incremental Concept Induction", Proc. of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA: Morgan Kaufmann, pp.496-501.

Thornton, C.J. (1992). Techniques in Computational Learning - An Introduction, London: Chapman & Hall.

Tolun, M.R., & Abu-Soud, S.M. (1997). "ILA-1: An Inductive Learning Algorithm for Rule Discovery" (Technical Report 97-04). Department of Computer Engineering, Middle East Technical University, Ankara, Turkey.
Uludağ, M. (1997). "Application of Rule Induction Algorithms to DNA Sequence Analysis", M.S. Thesis, Middle East Technical University, Dept. of Computer Engineering.

Utgoff, P.E. (1994). "An Improved Algorithm for Incremental Induction of Decision Trees", Machine Learning: Proc. of the Eleventh International Conference, pp.318-325.

Uthurusamy, R., Fayyad, U.M., and Spangler, S. (1991). "Learning Useful Rules from Inconclusive Data", Knowledge Discovery in Databases, (Eds. G. Piatetsky-Shapiro and W.J. Frawley), AAAI/MIT Press, Cambridge, MA, pp.141-157.

Zadeh, L.A. (1994). "Soft Computing and Fuzzy Logic", IEEE Software, pp.48-56.

Zhang, J. (1990). "A Method that Combines Inductive Learning with Exemplar-Based Learning", Proc. of the Second International Conference on Tools for Artificial Intelligence, San Jose, CA, pp.31-37.