Self-Organizing Cases to Find Paradigms

Juan José del Coz, Oscar Luaces, José Ramón Quevedo, Jaime Alonso, José Ranilla and Antonio Bahamonde

Centro de Inteligencia Artificial, Universidad de Oviedo at Gijón (www.aic.uniovi.es)
Campus de Viesques, E-33271 Gijón, España
{juanjo, oluaces, quevedo, jalonso, ranilla, antonio}@aic.uniovi.es
Abstract. Case-based information systems can be seen as lazy machine learning algorithms: they select a number of training instances and then classify unseen cases according to the most similar stored instance. One of the main disadvantages of these systems is the high number of patterns retained. In this paper, a new method for extracting just a small set of paradigms from a set of training examples is presented. Additionally, we provide the set of attributes describing the representative examples that are relevant for classification purposes. Our algorithm builds the Kohonen self-organizing map attached to the training set and then computes the coverage of each map node. Finally, a heuristic procedure selects both the paradigms and the dimensions (or attributes) to be considered when measuring similarity in future classification tasks.
1 Introduction
Let us consider an information system able to classify any kind of event based on similar situations observed in the past; this is usually called a case-based reasoning (CBR) system. The competence of such a system relies on the quality of the solutions provided, so the author should select a representative set of paradigms whose behavior he or she would like to clone for future classifications. Additionally, a wise procedure to compute similarities should also be included. However, the actual performance of these systems also depends on the efficiency of the retrieval process needed to select an item of knowledge to be used in a particular problem-solving task.

Moreover, there is a kind of qualitative reward for small sets of representative cases: the system can provide reasonable explanations. In fact, the whole information system can be understood by a human being whenever the amount of explicit knowledge available is manageable. In addition, the classifications provided by the system for unseen cases can be endowed with a reference to the most similar stored paradigm, which must necessarily be representative of a densely populated region of cases of the same class. If we had a vast set of stored cases, we would be recording a lot of individual peculiarities, and therefore explanations would be confusing and misleading; the excess of detail produces noise.

In this paper we present an effective way of building a competent, efficient and explainable case-based information system. The output is not only a reasonably small and representative subset of the initial examples, but also an explicit function for computing similarities with future problems.
We shall assume that the initial examples, as well as unseen cases, are described by a set of attributes or features whose values can be either numeric or symbolic (nominal); additionally, the training data are endowed with a class, which must be a symbolic label. The output will be a subset of these data specifying, in each case, which attributes are relevant to be taken as models for future classifications; in other words, we conclude with a set of rules whose conditions are parts of some well-chosen training examples. The evaluation procedure of these rules will follow a minimum distance criterion, as in a nearest neighbor (NN) learning algorithm [1], [5], [13], [14], [15].

The solutions available in the literature range from the NN algorithm, which stores the whole training set, to Aha's IBx family [1], where a serious effort is made to reduce the number of stored instances. However, the number of retained cases is still of the same order as the initial training set. Additionally, some randomized methods have also been proposed; see for instance [16].

Our method is conceptually based on Smyth's analysis in [17]. He suggests a model for filtering a set of cases taking into account both the competence and the efficiency of the remaining cases. The model relies on the definition of the coverage of an instance, which is essentially an O(n³) operation, where n is the number of training instances. Hence, as the typical size of real-world problems makes a straightforward implementation intractable, we follow a heuristic method. The first step is to arrange the training data in a Kohonen [9] self-organizing map; then we compute the coverage of the labeled nodes of the map. In order to weigh the promising classification quality of the original examples gathered in a node, we use a function called the impurity level. This is a measure inspired by Aha's IB3 [1] that was developed and very successfully used in FAN [13], [14], a machine learning system that produces classification rules. In this context, an initial selection takes place: we separate the nodes with the best impurity levels until their coverages involve a previously fixed proportion of the original data.

Now we try to discover the attributes that are really important for defining the similarities of the examples or cases with our first draft of paradigms. The approach followed is not the standard one [19]: we think that some attributes may be irrelevant in certain regions of the decision space, but may be very important elsewhere. Thus we aim to compute a kind of relevance degree of each attribute in the surroundings of each paradigm candidate. The mechanism devised for this purpose is a Kohonen-like self-adjusting algorithm inspired by the philosophy of the most popular attribute filters [19]. Since the distances can now be calculated taking into account the carefully computed relevance degrees, we can proceed to decide (in a linear algorithm) which is the best set of examples: the final paradigms. To finish the job, we need only determine the set of relevant attributes, which is done following another linear procedure very similar to the one used to select the paradigms.

Given that our algorithm can be seen as a machine learning program, a number of experimental results are presented in order to show its actual performance when dealing with a set of well-known benchmarks taken from the UCI repository [3]. A section with some concluding remarks closes the paper.
2 The Coverage of a Training Example
Let us consider a set T of training examples. Following Smyth [17], the coverage of an example e ∈ T can be defined as the set of target problems that e can solve. Intuitively, the larger the coverage of an example, the more representative it is. The first difficulty to be faced when computing these sets for each e ∈ T is obvious, since the target problems are not available. However, we can assume the representativeness of our training material, and then use T as a faithful sample of the target problems; otherwise, we could not even begin to try to learn anything from T. A rational and computable definition of coverage is the following: an example e' ∈ T is within the coverage of e if it is nearer to e than to any other example of T, excluding all the examples of the same class as e, and excluding e' itself, of course. In symbols,

\[
\mathrm{coverage}(e) := \{\, e' \in T : d(e,e') < \min\{\, d(e',e'') : e'' \in \mathrm{to\_consider}(e,e') \,\} \,\}
\]
\[
\mathrm{to\_consider}(e,e') := T \setminus \big( \{e'\} \cup \{\, x : \mathrm{class}(x) = \mathrm{class}(e) \,\} \big) \tag{1}
\]
To compute the distance between examples, we use

\[
d(e,e') := \sum_{att \,\in\, \text{Attributes}} \big|\, att(e) - att(e') \,\big|^{2} \tag{2}
\]
where att(e) stands for the value of example e for attribute att.
Fig. 1. Graphical interpretation of coverage. In a training set with two classes (squares and triangles), the coverages of the solid black examples are the shaded areas
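To make the definitions concrete, the following is a minimal Python sketch of formulas (1) and (2); the function and variable names are ours, not the authors', and examples are assumed to be (vector, class) pairs with numeric attribute values.

    def distance(e, e2):
        # Formula (2): sum of squared attribute differences
        return sum(abs(a - b) ** 2 for a, b in zip(e, e2))

    def coverage(e, e_class, examples):
        # Formula (1): indices of examples e2 that are nearer to e than to any
        # example outside {e2} and outside the class of e
        covered = []
        for i, (e2, c2) in enumerate(examples):
            to_consider = [x for j, (x, cx) in enumerate(examples)
                           if j != i and cx != e_class]
            if not to_consider:
                continue
            if distance(e, e2) < min(distance(e2, x) for x in to_consider):
                covered.append(i)
        return covered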
The problem when computing the coverage of an example is its complexity: it is a cubic operation on the size of T. So, in order to reduce the problem, we use the Kohonen self-organizing map [9] of the original set T. In this way, each node of the map gathers a cluster of examples and inherits the class of the majority. Therefore, instead of computing the coverage of the examples of T, we compute the coverage of each labeled node of the map with respect to the other nodes of the map. Hence, the complexity becomes acceptable. Since our algorithm should work autonomously, we must establish the dimensions of the map to be built. The experiments carried out showed that square maps give the best results whenever the total number of nodes is about one fifth of the size of the original training set. Therefore, although the user can modify this behavior, the algorithm sets the map dimensions itself; as a result of our experience, we establish limits of 3 x 3 and 20 x 20 to avoid degenerate or enormous maps.
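As an illustration, here is a minimal sketch of the sizing rule just described (square map, roughly one node per five training examples, clamped between 3 x 3 and 20 x 20); the helper name and the exact rounding are our assumptions, not the published implementation.

    import math

    def som_side(n_training_examples, min_side=3, max_side=20):
        # square map with about one fifth as many nodes as training examples
        side = round(math.sqrt(n_training_examples / 5.0))  # rounding is our assumption
        return max(min_side, min(max_side, side))

    # e.g. 150 iris examples -> a 5 x 5 map (25 nodes, about one fifth of 150)
    print(som_side(150))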
3 The Impurity Level of a Set of Decision Patterns
We now want to separate a subset of nodes whose coverages include all (or almost all) labeled nodes of the map. But there are many ways to do this; for instance, we can cover the map with very small pieces, or allow a lot of misclassifications. What is more, our real interest is in the original training data; map nodes are just auxiliary devices. Again we are facing a complex problem whose solution will be heuristically devised by means of the impurity level.

Attached to each map node there is a set of original examples. So, the coverage of a map node can be thought of as a group of training cases arranged like a decision pattern for future problems, and what we are interested in is computing a measure of the classificatory quality of their performance on this task. Since we are assuming a nearest neighbor principle, we shall prefer decision sets with a high density of training cases of the same class inside, because they will probably behave well when classifying cases in the surroundings. Thus, what we must really compute is the degree of similarity of the observed decisions taken in the given training cases. Usually most of them will have the same class attached, so our measure is based on the confidence interval of the proportion p of examples of the majority class, as in Aha's IB3 [1]. The formula used here is [18]

\[
\frac{p + \dfrac{z^2}{2n} \pm z \sqrt{\dfrac{p(1-p)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}} \tag{3}
\]
where z can be found in a normal distribution table according to the confidence level (95% by default), and n is the number of original examples in the coverage considered. In this context, we are interested in coverages whose confidence intervals are narrow and near to 1.0; in other words, with many examples of the same class. But, in order to appreciate the difficulty of collecting such a decision set, we can compare the confidence interval of a coverage set, [left(cs), right(cs)], with the interval obtained for the random rule of the same class, [left(class), right(class)]; that is, the rule that for any case always concludes that class as its classification decision. Finally, the impurity level [13], [14] of a coverage set cs of a node labeled by class is given by

\[
\mathrm{impurity\text{-}level}(cs) := 100 \cdot \frac{\mathrm{right}(class) - \mathrm{left}(cs)}{\mathrm{right}(cs) - \mathrm{left}(cs)} \tag{4}
\]
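A minimal sketch of formulas (3) and (4), under our reading of them; the function names are ours, p is the proportion of the majority class among the n original examples gathered by the coverage, and p_class, n_class describe the random rule of the same class over the whole training set.

    import math

    def confidence_interval(p, n, z=1.96):
        # Formula (3): score confidence interval for a proportion p over n examples
        # (z = 1.96 for the default 95% confidence level)
        centre = p + z * z / (2 * n)
        spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        denom = 1 + z * z / n
        return (centre - spread) / denom, (centre + spread) / denom

    def impurity_level(p_cs, n_cs, p_class, n_class, z=1.96):
        # Formula (4): compare the interval of the coverage set cs with the
        # interval of the random rule that always predicts its class
        left_cs, right_cs = confidence_interval(p_cs, n_cs, z)
        left_class, right_class = confidence_interval(p_class, n_class, z)
        return 100.0 * (right_class - left_cs) / (right_cs - left_cs)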
4 An Initial Selection of Paradigms
Once we have computed the coverage sets of all map nodes and their impurity levels, according to the uniformity and abundance of classes of the original training examples represented, we proceed to make an initial selection of paradigms as follows. We order the map nodes according to their impurity levels (the lower, the better) and pick the best one. All map nodes included in its coverage are then excluded from selection, since we consider that they are already represented. The process continues with the remaining nodes until the coverage of the chosen nodes covers a given proportion of the original set; by default, 90%. However, we are looking for representative original examples, so each selected map node is finally replaced by the nearest original example of the same class.
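A minimal sketch of this greedy selection, with names of our own choosing; nodes are assumed to carry precomputed impurity levels, coverage sets (as node indices) and counts of the original examples they gather.

    def initial_selection(nodes, n_training, target_fraction=0.9):
        # greedy pick of map nodes by increasing impurity level until the chosen
        # nodes and their coverages involve target_fraction of the original examples;
        # each node is a dict with 'impurity', 'coverage' and 'n_examples'
        chosen, represented, covered = [], set(), 0
        for i in sorted(range(len(nodes)), key=lambda i: nodes[i]['impurity']):
            if i in represented:
                continue
            chosen.append(i)
            # the node and every node in its coverage are now represented
            for j in [i] + list(nodes[i]['coverage']):
                if j not in represented:
                    represented.add(j)
                    covered += nodes[j]['n_examples']
            if covered >= target_fraction * n_training:
                break
        return chosen  # each node is later replaced by the nearest original example of its class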
5 Attribute Relevance
To compute the similarities of future cases, and to explain the classification decisions, it is of core importance to measure the relevance of the attributes that describe our examples. There are many proposals in the literature concerning the solution of this problem: [1], [15], [16], [19]. The aim is always to obtain a list of attributes whose values are irrelevant throughout the whole space for determining the class of present and future examples. The approach of the present paper tries to compute attribute relevance for each paradigmatic example: just as when classification rules are induced, the attributes present in the conditions are not always the same.

The algorithm that we use to compute the relevance of the attributes in the surroundings of a paradigm candidate is a Kohonen-like self-adjusting device inspired by the philosophy of the most popular attribute filters. To spell out the details, we first need to change the formula for computing distances between examples: each summand is modified by the corresponding weight. Given that these weights depend on both the attribute and the paradigm, our distances are no longer a metric on the example space; instead, we have just a family of auxiliary functions which are useful for computing the most similar paradigm to a given example from among a set of candidate paradigms.

\[
d(e, paradigm) := \sum_{att \,\in\, \text{Attributes}} \mathrm{weight}(att, paradigm) \cdot \big|\, att(e) - att(paradigm) \,\big|^{2} \tag{5}
\]
The weights are initialized with a neutral value: 0.5 in all cases. The self-adjusting algorithm tries to favor (bring near to 1.0) the relevance (weight) of an attribute in a paradigm when

• the covered examples have small differences along this attribute and the classification is correct, or
• there are big differences and incorrect classifications.

In all other cases, we move the weight away from 1.0. To determine the proximity we shall use a sigmoid

\[
\sigma(x) := \frac{2}{1 + e^{(50x - 5)}} - 1 \tag{6}
\]
whose intended semantics (see Figure 2) is that we obtain σ(difference)

• near to 1.0 when the difference is small,
• near to -1.0 when the difference is big,
• near to 0 for medium differences.
Fig. 2. Sigmoid used to quantify the proximity
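A one-line Python sketch of the sigmoid (6); the attribute difference is assumed to be normalized to the interval [0, 1].

    import math

    def sigma(x):
        # Formula (6): close to 1 for small differences, close to -1 for big ones
        return 2.0 / (1.0 + math.exp(50.0 * x - 5.0)) - 1.0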
To be precise, the pseudocode of our algorithm to compute the relevance of attributes is the following:

    /* Initialize considering trivial weights */
    for each representative example re and for each attribute att do
        weight(att, re) := 0.5;
    t := 0;   /* the number of examples processed */
    for each example ex to be processed do {
        t := t + 1;
        find the nearest representative example: nre;
        for each attribute att do {   /* update weights */
            att_dif := |att(ex) - att(nre)|;
            if class(ex) = class(nre) then
                weight(att, nre) := max{0, weight(att, nre) + h(t, att_dif) * (1 - weight(att, nre))}
            else
                weight(att, nre) := max{0, weight(att, nre) - h(t, att_dif) * (1 - weight(att, nre))};
        };
    };

The examples to be processed are the original training examples, previously randomly sorted. We repeat the whole set as many times as needed to make the number of examples seen by the algorithm greater than upper_limit_examples, which is 2,000 by default. The function h is called (following [9]) the neighborhood function, and is given by
\[
h(t, d) := \alpha(t) \cdot \sigma(d), \qquad \alpha(t) := 0.05 \cdot \left(1 - \frac{t}{\text{\#\_examples\_treated}}\right) \tag{7}
\]

Thus, the intention of h is to take into account the influence of examples over attribute weights: at the beginning of the process, h outputs higher values according to a learning-rate factor quantified by α. The parameter #_examples_treated stands for the number of examples that are effectively going to be considered.
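For concreteness, here is a minimal Python sketch of this adjustment, reusing the sigma function sketched after Figure 2; the function name, the data layout (paradigms and examples as (vector, class) pairs with normalized numeric attributes) and the exact computation of #_examples_treated are our assumptions, not the authors' implementation.

    import math

    def update_weights(weights, paradigms, examples, upper_limit_examples=2000):
        # weights[p][a]: relevance of attribute a for paradigm p, initialized to 0.5
        passes = math.ceil(upper_limit_examples / len(examples))
        total = passes * len(examples)       # our reading of #_examples_treated
        t = 0
        for _ in range(passes):
            for ex, ex_class in examples:    # examples are assumed randomly sorted
                t += 1
                # nearest paradigm under the weighted distance of formula (5)
                p = min(range(len(paradigms)),
                        key=lambda i: sum(weights[i][a] * abs(ex[a] - paradigms[i][0][a]) ** 2
                                          for a in range(len(ex))))
                alpha = 0.05 * (1 - t / total)            # formula (7)
                for a in range(len(ex)):
                    h = alpha * sigma(abs(ex[a] - paradigms[p][0][a]))
                    step = h if ex_class == paradigms[p][1] else -h
                    weights[p][a] = max(0.0, weights[p][a] + step * (1 - weights[p][a]))
        return weights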
6 Final Selection of Paradigms and their Relevant Attributes
Although each of these decisions depends heavily on the others, we first select a set of paradigmatic examples, and then their relevant attributes. In fact, the previous self-adjusting mechanism for computing attribute weights proceeds in a massively parallel way that now allows this sequential procedure. Thus, we consider the list of paradigm candidates ordered according to their impurity level, as described in Section 4. The idea is to drop paradigms whose presence does not improve classification when tested on the training data itself. Hence, we explore the possibility of skipping paradigms from the worst (according to its attached impurity level) to the best, and in a return run from the best to the worst. Here we use formula (5) with the weights just computed in the previous section.
Fig. 3. The paradigms chosen by our algorithm in the artificial problem presented in Figure 1 are marked in solid black. Here we have adjoining areas with a clear presence of a given class of individuals; hence, our paradigms are strategically placed on the frontiers, trying to somehow mark out their properties.
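A minimal sketch of the pruning pass just described; the names and the acceptance criterion (a removal is kept when it does not worsen the training error) are ours, and training_error is assumed to evaluate a set of paradigm indices on the training data with the weighted distance of formula (5).

    def prune_paradigms(paradigms, training_error):
        # paradigms assumed ordered from best to worst impurity level;
        # training_error(set_of_indices) measures error on the training set itself
        kept = set(range(len(paradigms)))
        best_error = training_error(kept)
        # worst-to-best pass, then a return run from best to worst
        for order in (reversed(range(len(paradigms))), range(len(paradigms))):
            for p in order:
                if p not in kept or len(kept) == 1:
                    continue
                candidate = kept - {p}
                error = training_error(candidate)
                if error <= best_error:      # skipping p does not hurt accuracy
                    kept, best_error = candidate, error
        return kept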
To return the relevant attributes of the paradigms thus obtained, we need only fix a relevancy threshold so as to discriminate which attributes to consider for future classifications. Our method here is quite simple: we examine the classification accuracy on the original training set of our paradigms for each possible relevancy cut (at most, the number of attributes times the number of paradigms). Notice that once a cut has been proposed, we reduce attribute weights so that they become 0 or 1. Hence, some paradigms may be dropped in this phase: those whose attributes all have a null weight. If a class is left without any paradigm, we include the best exemplar (according to its impurity level) with its most relevant attribute in our final result.
Fig. 4. In the well-known iris problem [6], we choose only 3 flowers, one for each class. Here we have drawn only the most representative attributes, petal length and width. It can be seen how our system deals with noise in an acknowledged benchmark for the entire machine learning community.
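A minimal sketch of the relevancy cut, again with names of our own choosing; accuracy is assumed to measure training-set accuracy when only the attributes whose binarized weight is 1 are used in formula (5), and the choice of ">=" for the cut is our assumption.

    def best_relevancy_cut(weights, accuracy):
        # try every observed weight value as a threshold; attributes at or above
        # the cut get weight 1, the rest weight 0, and the best cut is kept
        candidate_cuts = sorted({w for per_paradigm in weights for w in per_paradigm})
        best_cut, best_acc = None, -1.0
        for cut in candidate_cuts:
            binarized = [[1 if w >= cut else 0 for w in per_paradigm]
                         for per_paradigm in weights]
            acc = accuracy(binarized)
            if acc > best_acc:
                best_cut, best_acc = cut, acc
        return best_cut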
7 Experimental Results
We report some experimental results in this section to illustrate the performance of our system. The name of our system in the following tables of scores is BETS, an acronym of Best Examples in Training Set. The benchmark problems used are only a subset of Holte's problems [7]; for this reason, we should emphasize that these experiments cannot be considered definitive at all. Our intention is to show the main features of the solutions provided by our system. We thus selected a collection of learning problems presenting a wide variety of situations, which are summarized in Table 1. Since the natural context of BETS is numeric values, inherited from Kohonen's SOM, for each attribute with n different symbolic values we created, in a straightforward way, n binary attributes.

Table 1. Summary of the most relevant features of the benchmark problems used in our experiments, taken from the UCI repository [3]. We use the names of the problems that appear in [7]. For each problem, we give the size of its training set, the number of attributes with numeric and symbolic values, and the number of classes.

Problem   Training   Numeric   Symbolic   Classes
BC           286         0         9         2
HD           303         5         8         2
HE           155         6        13         2
HO           386         7        15         2
IR           150         4         0         3
LY           141         2        16         4
SO            47         0        35         4
V1           435         0        16         2
VO           435         0        15         2
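As noted above Table 1, symbolic attributes are binarized before building the map; a minimal sketch of this straightforward encoding (our own helper, not the authors' code):

    def binarize(values, column):
        # one new 0/1 attribute per distinct symbolic value of the column
        symbols = sorted(set(row[column] for row in values))
        return [[1 if row[column] == s else 0 for s in symbols] for row in values]

    # e.g. binarize([["red"], ["blue"], ["red"]], 0) -> [[0, 1], [1, 0], [0, 1]]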
The average accuracy scores of BETS and a set of well-known and established machine learning algorithms [1], [4], [5], [11], [12] are presented in Table 2. It is important to point out that all the systems, including BETS, were executed with the default values of their performance parameters. However, accuracy is not the sole dimension along which machine learning algorithms can be compared. Depending on the kind of application, and taking for granted an acceptable proportion of successful classifications, some other aspects might be considered in order to adopt a learned solution, for instance:

• The explanations: if explanations are going to be provided with the classification decisions, then the number of possibilities (the size of the learned solution) and the naturalness of the conditions (the nearness of the learned material to humanly perceived true reasons) become very important.

• Knowledge drift: when some evolution of the knowledge needed to classify a phenomenon can be assumed, then having a set of rules deeply founded on densely populated areas of evidence is very useful for maintenance tasks.

• The cost of decisions: if the decisions supplied by a learned solution entail actions with an attached cost, then the errors will transcend a simple quantitative point of view.
Table 2. Errors in a 10-fold cross-validation experiment repeated 20 times. The last row gives the average error of each algorithm across all data sets. For C4.5, we used release 8. The algorithms named Bayes and Disc. Bayes are the Bayesian classifiers in the MLC++ [8] implementation. All systems received the same train-test pairs of sets, since all the experiments were carried out with the assistance of MLC++ [8], using the seed 2032 (the in-house telephone number of our laboratory) for its random generation subsystem.
Problem    BETS    C4.5-R   C4.5    CN2     OC1     Bayes   Disc. Bayes   NN      IB3
BC         28.25   29.29    26.17   28.86   33.66   28.01   28.01         29.35   28.07
HD         21.07   20.75    23.65   21.67   20.77   17.07   16.77         24.06   17.70
HE         17.67   20.98    21.24   20.02   20.56   15.56   15.06         19.02   19.59
HO         18.03   17.44    15.75   17.64   21.63   21.11   20.50         20.96   19.12
IR          6.00    4.67     5.37    6.41    3.87    4.80    7.67          4.64    4.31
LY         24.09   23.42    23.29   19.33   21.73   20.33   18.72         19.23   20.93
SO          0.10    2.85     2.75    2.17    8.06   10.60   10.60          0.00   38.35
V1         10.65    9.90     9.90   10.52   10.99   12.71   12.71         11.06   11.26
VO          6.08    4.31     3.51    4.82    5.19    9.71    9.71          7.74    8.33
Average    14.66   14.85    14.63   14.60   16.27   15.54   15.53         15.12   18.63
Training samples are not always as faithful as one may hope, and unseen examples near dense training areas may be much more frequent than those coming from sparse training regions. Thus, we would like to emphasize the results presented in Table 3. These suggest that the solutions learned by BETS have a very high quality with regard to the points mentioned in the previous paragraph.

Table 3. Average number of paradigms or rules. We excluded instance-based systems and algorithms producing decision trees.

Problem    BETS     C4.5R    CN2
BC          9.981    9.815   57.820
HD         11.215   14.185   23.825
HE          4.835    9.290   18.855
HO          6.005    7.040   51.175
IR          3.900    5.020    7.710
LY         10.270   11.355   17.975
SO          4.010    5.000    5.000
V1          8.525   11.970   35.690
VO          5.250    7.020   19.465
Average     7.110    8.966   26.391
8 Concluding Remarks
An algorithm for discovering the most representative examples (paradigms) in a training set has been presented: BETS. The key ideas underlying it are the use of Kohonen's SOM [9], the concept of coverage [17], and a classification quality heuristic called the impurity level [13], [14]. The main accomplishments of our approach are the small number of paradigms needed to achieve a reasonable level of accuracy, and a philosophy that stresses the rewards of providing sound, natural and useful explanations in addition to classification decisions.
References

1. Aha, D. W.: A Study of Instance-Based Algorithms for Supervised Learning Tasks: Mathematical, Empirical and Psychological Evaluations. Ph.D. Dissertation, University of California at Irvine (1990)
2. Bahamonde, A., De La Cal, E. A., Ranilla, J., Alonso, J.: Self-organizing symbolic learned rules. In: Mira, J., Moreno-Díaz, R., Cabestany, J. (eds.): Biological and Artificial Computation: From Neuroscience to Technology. Lecture Notes in Computer Science, Vol. 1240. Springer-Verlag, Berlin Heidelberg New York (1997) 536-545
3. Blake, C., Keogh, E., Merz, C. J.: UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science (1998)
4. Clark, P., Niblett, T.: The CN2 Induction Algorithm. Machine Learning, 3 (1988) 261-284
5. Cover, T. M., Hart, P. E.: Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13(1) (1967) 21-27
6. Fisher, R.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(1) (1936) 179-188
7. Holte, R. C.: Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine Learning, 11 (1993) 63-91
8. Kohavi, R., John, G., Long, R., Manley, D., Pfleger, K.: MLC++: A Machine Learning Library in C++. In: Proceedings of the Sixth International Conference on Tools with Artificial Intelligence. IEEE Computer Society Press (1994) 740-743
9. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin Heidelberg New York (1995)
10. Luaces, O., Alonso, J., De La Cal, E. A., Ranilla, J., Bahamonde, A.: Machine Learning Usefulness Relies on Accuracy and Self-Maintenance. In: Pobil, A. P., Mira, J., Ali, M. (eds.): Tasks and Methods in Applied Artificial Intelligence. Lecture Notes in Artificial Intelligence, Vol. 1416. Springer-Verlag, Berlin Heidelberg New York (1998) 448-457
11. Murthy, S. K., Kasif, S., Salzberg, S.: A System for Induction of Oblique Decision Trees. Journal of Artificial Intelligence Research, 2 (1994) 1-32
12. Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
13. Ranilla, J., Mones, R., Bahamonde, A.: El nivel de impureza de una regla de clasificación aprendida a partir de ejemplos. Revista Iberoamericana de Inteligencia Artificial, 4 (1998) 4-11
14. Ranilla, J., Bahamonde, A.: FAN: Finding Accurate iNductions. Technical Report, Artificial Intelligence Center, University of Oviedo at Gijón, November (1998) [ftp://ftp.aic.uniovi.es/publications/Machine_Learning/FANprn.ZIP]
15. Salzberg, S.: Learning with Nested Generalized Exemplars. Kluwer, Boston, MA (1990)
16. Skalak, D. B.: Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms. In: Cohen, W., Hirsh, H. (eds.): Machine Learning: Proceedings of the Eleventh International Conference. Rutgers University, New Brunswick, NJ (1994) 293-301
17. Smyth, B.: Case-Based Maintenance. In: Pobil, A. P., Mira, J., Ali, M. (eds.): Tasks and Methods in Applied Artificial Intelligence. Lecture Notes in Artificial Intelligence, Vol. 1416. Springer-Verlag, Berlin Heidelberg New York (1998) 507-516
18. Spiegel, M. R.: Estadística. McGraw-Hill, México (1970)
19. Wettschereck, D., Aha, D. W., Mohri, T.: A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms. Artificial Intelligence Review, 11 (1997) 273-314