Conceptual Clustering in a First Order Logic Representation

Gilles Bisson
Laboratoire de Recherche en Informatique, URA 410 du CNRS
Equipe Inférence et Apprentissage
Université Paris-Sud, Bâtiment 490, 91405 Orsay Cedex, France
email: [email protected], phone: (33) 69-41-63-00

Abstract*: We present the Conceptual Clustering system KBG. The knowledge representation language used, both for input and output, is based on first order logic with some extensions to handle quantitative and procedural knowledge. From a set of observations and a domain theory, KBG structures this information into a directed graph of concepts. This graph is generated by an iterative use of clustering and generalization operators, both guided by similarity measures.

Keywords: Conceptual Clustering, Generalization, Similarity Measure, First Order Representation.

* This work is partially supported by the CEC through the ESPRIT-2 contract Machine Learning Toolbox (2154) and also by the French MRT through PRC-IA.
1 Introduction

Currently, most conceptual clustering systems, such as CLUSTER/2 (Michalski, Stepp 83), COBWEB (Fisher 87), CLASSIT (Gennari et al. 89) and ADECLU (Decaestecker 91), work in the Valued Propositional Logic framework. Other systems, such as LABYRINTH (Thompson, Langley 89), can deal with structured objects. However, none of these representations is powerful enough to express the learning set in many domains (Kodratoff et al. 91). Typically, this problem occurs when the examples simultaneously describe several objects whose relational structures change from one example to the other (Bisson 92b). First Order Logic (FOL) representations allow one to overcome these problems. Nevertheless, two reasons limit the use of FOL: first, it generally involves a large increase in computational complexity; second, in its classical definition, FOL does not deal efficiently with numbers. The system KBG presented in this paper is an attempt to deal with a FOL representation language in the framework of conceptual clustering. Although our aims are similar to the ones defined in the context of Concept Formation (Gennari et al. 89), some differences exist. First, in contrast to the Concept Formation approach, the learning process is not incremental and uses a bottom-up strategy based on an iterative use of
generalization and clustering operators. In that sense, our approach is close to Data Analysis, in which bottom-up clustering is the most frequently used mechanism (Diday et al. 1985). Second, the observations are not organized in a strict hierarchical structure (tree) but in a Directed Acyclic Graph (DAG). Finally, according to the attributes to predict, the system transforms this structure of concepts into a hierarchical system of rules directly usable by an inference engine (Bisson 92a). This method increases both the readability of the learned knowledge and its efficiency (Ganascia 88).

This paper is organized as follows. In section 2 we define the representation language used in KBG to express the learning set and the domain theory. The general structure of KBG is described in section 3, and the algorithm used to build the graph of concepts in sections 4 and 5. After providing an example run, we compare our approach with some related work.
2 Knowledge Representation Language

The representation language used in KBG is based on a subset of FOL without negation or functional symbols. However, as in APC (Michalski 83) and "Hordes" (Diday 89), two kinds of extensions have been implemented: the first takes quantitative information, such as numerical values, into account; the second makes it possible to create new data types and to perform procedural calls.

The atom (or positive literal) is the ground structure of our representation. An atom is built with a predicate of any arity, whose arguments are typed. We distinguish two kinds of arguments: the entities (or objects) and the values. The entities are symbols whose meaning only depends on their occurrences inside the predicates. The values have their own semantics and belong to a data type. We associate the following semantics with our language: the predicates qualify the properties of an entity, or the relations between several entities; the values quantify these properties and relations.

The initial knowledge is divided into two parts: the example set and the domain theory. The examples are composed of a conjunction of instantiated terms.
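To fix ideas, the sketches given later in this paper reuse the following encoding, which is ours and not KBG's: an example is a set of ground atoms, each atom being a tuple whose first element is the predicate name. Here is example E1 of table 1 in this form (Python is used for all sketches, although KBG itself is written in Common Lisp):

# One possible machine encoding of example E1 (illustrative, not KBG's
# own format): entities are plain symbols, values carry their own type
# (here a symbol of type SET-OF or an INTEGER).
E1 = {
    ("father", "paul", "yves"),   # relation between two entities
    ("sex", "yves", "male"),      # property quantified by a SET-OF value
    ("age", "yves", 13),          # property quantified by an INTEGER value
    ("age", "paul", 33),
    ("french", "paul"),           # unary property of an entity
}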
[Figure 1: General architecture of KBG. Components shown: inputs (type definitions, training examples, domain theory); an inference engine producing the saturated examples; the similarity computation module producing similarity tables; the clustering module producing clusters; the generalization module producing the graph of generalization and new examples; the explanation module; and the rules generator producing a new module made of a hierarchical system of rules and new concepts.]

The domain theory is divided into three parts. In the first part, the user declares the types of values used to describe the domain. These declarations are important since, during the clustering and generalization phases, the behavior of the learning processes depends on the semantics of the values. The classical types of data, such as Real, Ordered and Taxonomy, are currently implemented. However, internally, KBG does not directly call the functions related to these types, but uses a set of 8 generic functions describing the syntax and the semantics of the types. Therefore, according to the application domain, the user can define new data types without making any change in the body of the system. When defining a new type T, two important functions (or methods) are required. The first one, used by the clustering process, is the function V-SIM_T : T × T → [0..1], which evaluates the degree of "similarity" between any pair of values of the type T. The second one, used by the generalization process, is the function V-GEN_T : T × T → T, which generalizes any two values belonging to T and returns a new value (also belonging to T) that subsumes both initial values.

The second part of the domain theory is composed of the predicate declarations. A weight, expressed by an integer, can be associated with each predicate. This number expresses a relevance scale between the predicates (defining a total order on them).

The third part of the domain theory is composed of classical "If ... Then ..." rules organized in the form of modules. A module of rules can either be directly provided by the expert or have been learned by KBG in the course of a previous learning session. The premises of a rule can explicitly call procedures to perform computations (addition, square root, etc.) or to verify constraints between the variables (less, different, etc.). As for the types, the definitions of these procedures are external to the system, so the user can define his own set of functions adapted to his needs.
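As an illustration of this plug-in mechanism, here is a minimal sketch of a user-defined type providing the two required methods. The class shape and names are ours; the actual system expects 8 generic functions, of which only V-SIM and V-GEN are shown:

class AgeType:
    """Sketch of a user-defined data type: ages in [0, 99], as in the
    running example of section 4 (hypothetical API, not KBG's)."""
    MAX = 99

    def v_sim(self, v1, v2):
        # V-SIM_T : T x T -> [0..1]; here the paper's V-SIMage.
        return (self.MAX - abs(v1 - v2)) / self.MAX

    def v_gen(self, v1, v2):
        # V-GEN_T : T x T -> T; the "enlarging interval rule" (Michalski 83).
        # A value may already be an interval (lo, hi) from a previous step.
        lo1, hi1 = v1 if isinstance(v1, tuple) else (v1, v1)
        lo2, hi2 = v2 if isinstance(v2, tuple) else (v2, v2)
        return (min(lo1, lo2), max(hi1, hi2))

# AgeType().v_sim(13, 28) == 84/99 (the 0.84 of section 4);
# AgeType().v_gen(33, 58) == (33, 58), an interval subsuming both values.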
3 The system KBG

The system architecture (figure 1) is composed of six modules. The most important one is the similarity computation module (section 4), which provides the information used both to generalize and to cluster. The explanation and rule-generation modules are not presented here.

From a given example set and a domain theory, the aim of KBG is to build a hierarchical system of rules and to learn new concepts. The learned knowledge (constituting a new module) can then be added to the current domain theory and used in the course of subsequent learning sessions performed with new example sets. Thus, from an ordered sequence of learning sets illustrating several parts of an application domain, KBG iteratively learns an ordered set of modules, each module taking advantage of the knowledge previously learned.

The learning mechanism is divided into three successive steps: the saturation step, the learning step and the rule generation step. The learning step is bottom-up and based on an iterative use of the clustering and generalization operators (section 5).
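The saturation step, described further in section 5, can already be pictured as forward chaining to a fixpoint. In the following hedged sketch, each domain-theory rule is modeled as a function from the current fact set to the atoms it derives; this stands in for KBG's "If ... Then" modules and their procedural calls:

def saturate(example, rules):
    # Forward chaining to a fixpoint: repeatedly apply every rule to the
    # current fact set until no new atom can be derived.
    facts = set(example)
    while True:
        derived = set().union(*(rule(facts) for rule in rules))
        if derived <= facts:          # nothing new: fixpoint reached
            return facts
        facts |= derived

# Illustrative domain-theory rule: father(X, Y) => parent(X, Y).
def father_implies_parent(facts):
    return {("parent", a[1], a[2]) for a in facts if a[0] == "father"}

# saturate(E1, [father_implies_parent]) adds ("parent", "paul", "yves").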
4 Similarity computation

In FOL, the processes of generalization and clustering involve similar problems. As shown in (Bisson 92b), these problems can be solved in a homogeneous way by evaluating, for each pair of examples of the learning set, a similarity between the entities occurring in these examples. However, the similarity measures currently described in the literature cannot deal completely with our representation language. We need a measure able both to deal efficiently with numerical values and, above all, to take into account the overall relational structure expressed within the examples. We have detailed and justified such a similarity measure in (Bisson 92b); we briefly recall it here.
● Let E1 and E2 be any two examples of the learning set.
● Let X and Y be two entities whose similarity must be computed, such that X belongs to E1 and Y to E2.
● Let Occ(X) and Occ(Y) be respectively the lists of the occurrences of X and Y in the examples E1 and E2.

For each common occurrence Tk and T'k of X and Y (that is, X and Y appear in the same predicate at the same place R), we compute the value T-SIM(Tk, T'k). Let the atoms Tk and T'k be:

Tk  = P (A1, ..., Ai, ..., An, V1, ..., Vj, ..., Vm)
T'k = P (B1, ..., Bi, ..., Bn, U1, ..., Uj, ..., Um)

where P is the predicate of the occurrence, Ai and Bi are the entities of the atoms (i ≥ 1), and Vj and Uj are the typed values (j ≥ 0). Then:

T-SIM(Tk, T'k) = [ (1/n) Σi=1..n dist(Ai, Bi) ] × [ (1/m) Σj=1..m dist(Vj, Uj) ] × Wght(P)

with:
dist(Ai, Bi) = 1 when i = R (the positions of X and Y)
dist(Ai, Bi) = SIM(Ai, Bi) when i ≠ R
dist(Vj, Uj) = V-SIM_T(Vj, Uj) (see section 2)
(when m = 0, the value factor is taken as 1)

The similarity SIM between X and Y corresponds to the sum of all the T-SIM values evaluated for each of the common occurrences (C-O), divided by the larger of the total weights (Σ-Wght) of the predicates brought into play by these entities:

SIM(X, Y) = [ Σk=1..C-O T-SIM(Tk, T'k) ] / MAX( Σ-Wght(Occ(X)), Σ-Wght(Occ(Y)) )

As we can easily notice, to compute the similarity of one pair of entities (X, Y), one needs the similarity of the entities appearing in the same occurrences. From this standpoint, the similarity computation comes down to the problem of solving a system of equations in several unknowns.

We are going to illustrate this method on the examples E1 and E2 (see table 1). These examples contain four entities: PAUL, YVES, JOHN and ANN. The values "male" and "female" belong to a type SET-OF and the numbers to a type INTEGER. For the values appearing in AGE and SEX, we define the function V-SIM_T in the following way:

V-SIMage (V1, V2) = [ 99 - | V1 - V2 | ] / 99
V-SIMsex (V1, V2) = If V1 = V2 Then 1 Else 0

Here are the values of T-SIM found for all the pairs of entities among the predicates FATHER, AGE and SEX:
Predicate   Pair of entities   Value of T-SIM
FATHER      (paul, john)       1/2 (1 + SIM (yves, ann))
FATHER      (yves, ann)        1/2 (SIM (paul, john) + 1)
AGE         (paul, john)       V-SIMage (33, 58) = 0.75
AGE         (yves, ann)        V-SIMage (13, 28) = 0.84
AGE         (paul, ann)        V-SIMage (33, 28) = 0.95
AGE         (yves, john)       V-SIMage (13, 58) = 0.55
SEX         (yves, ann)        V-SIMsex (male, female) = 0
By defining the variables A = SIM (paul, john), B = SIM (yves, ann), C = SIM (paul, ann), D = SIM (yves, john), the similarity computation is equivalent to the system:

A = [ 1/2 (1 + B) + 0.75 ] / 3    ⇒  A = 50.5%
B = [ 1/2 (A + 1) + 0.84 + 0 ] / 3  ⇒  B = 53.4%
C = [ 0.95 ] / 3                  ⇒  C = 31.7%
D = [ 0.54 ] / 3                  ⇒  D = 18.2%
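The paper does not detail how this system is solved; since each unknown appears with a small coefficient, one simple possibility (a sketch, not necessarily KBG's actual method) is fixed-point iteration:

def v_sim_age(v1, v2):
    # The paper's V-SIMage, kept exact rather than rounded to two decimals.
    return (99 - abs(v1 - v2)) / 99

a = b = c = d = 0.0
for _ in range(50):   # the system is a contraction, so iteration converges fast
    a = (0.5 * (1 + b) + v_sim_age(33, 58)) / 3
    b = (0.5 * (a + 1) + v_sim_age(13, 28) + 0) / 3   # the 0 is the SEX term
    c = v_sim_age(33, 28) / 3
    d = v_sim_age(13, 58) / 3

print(a, b, c, d)   # ~ 0.505, 0.534, 0.316, 0.182, i.e. the 50.5%, 53.4%,
                    # 31.7%, 18.2% above (tiny differences in the last digit
                    # come from the two-decimal rounding used in the text)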
5 Learning algorithm

In KBG, the domain theory is taken into account through a saturation step. The saturation consists in using the rules of the domain theory as production rules (in forward chaining) to add to the initial examples all of the information that can be deduced at the beginning of the learning. This method is both simple and well-adapted to our similarity-based approach (Bisson 90).

The learning step is performed according to the following algorithm. It turns the set of saturated examples into a generalization graph in which each node (or concept) corresponds to a cluster of examples and is characterized by a generalization formula.

WHILE the cardinality of the learning set > 1:

1) We compute the similarity between the entities of each pair of examples of the learning set. Then, we build a similarity matrix between these examples by computing, for each pair of examples, the mean of the similarities between entities for the "best" matching. For instance, in the previous examples E1 and E2, the similarities between entities show that the best matchings are YVES with ANN (53.4%) and PAUL with JOHN (50.5%). Thereby, the similarity between the examples E1 and E2 is equal to 52%.

2) Provided with this similarity matrix, we can use a data analysis algorithm to cluster the current examples. Currently, we use a "threshold algorithm" (Diday et al. 85) which allows the clusters to overlap; consequently, the generated structure of concepts is a DAG. A minimal version of such an algorithm is sketched below.
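As announced above, here is a minimal version of such a threshold clustering. It is a simplification, not necessarily the exact algorithm of (Diday et al. 85): each example is grouped with every example at least `threshold`-similar to it, so clusters may overlap.

def threshold_clusters(examples, sim, threshold):
    # `sim` is a function giving the similarity between two examples.
    # Overlapping clusters are what makes the concept structure a DAG
    # rather than a tree.
    clusters = []
    for e in examples:
        c = {x for x in examples if x == e or sim(e, x) >= threshold}
        if c not in clusters:           # keep each distinct cluster once
            clusters.append(c)
    return clusters

# With the similarity matrix of section 6 and a 50% threshold, E1 would be
# grouped with E2, E3 and E5, while E4 would stay alone.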
E1 = father (paul, yves), sex (yves, male), age (yves, 13), age (paul, 33), french (paul)
E2 = father (john, ann), sex (ann, female), age (ann, 28), age (john, 58), american (john)
E3 = father (marc, michel), sex (michel, male), age (marc, 35), french (marc)
E4 = mother (elena, igor), brother (igor, ivan), sex (igor, male), age (elena, 30), russian (elena)
E5 = father (fred, jane), sex (jane, female), age (jane, 12), age (fred, 46), american (fred)

Table 1 : Examples
3) We generalize the clusters containing at least two examples. To generalize the examples belonging to a cluster, we use a method very close to the Structural Matching algorithm (Kodratoff, Ganascia 86). First, with the help of the computed similarities, we choose the entities to match. Second, the generalization is composed of the predicates which are simultaneously present in all the examples and whose occurrences are compatible with the matching choices previously made. The values are generalized with the functions V-GEN_T. At the end of this step, the generalized examples are removed from the learning set and replaced by new examples built by exemplifying the generalizations.
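Putting the three steps together, the following runnable sketch implements the loop in its simplest binary form (the variant also used in the example of section 6); the `sim` and `gen` parameters stand for the similarity and generalization mechanisms described above and are not KBG's actual code:

import itertools

def learn(examples, sim, gen):
    # Repeatedly merge the two most similar examples; the generalization
    # itself plays the role of the exemplified new example.
    tree = []
    examples = list(examples)
    while len(examples) > 1:
        e1, e2 = max(itertools.combinations(examples, 2),
                     key=lambda pair: sim(*pair))
        g = gen(e1, e2)                   # generalize the chosen cluster
        tree.append((g, (e1, e2)))        # record the new concept node
        examples = [e for e in examples if e not in (e1, e2)] + [g]
    return tree

# On {E1, ..., E5}, this reproduces the tree of section 6: G-1 = Gen(E2, E5),
# G-2 = Gen(E1, E3), G-3 = Gen(NE-1, NE-2), G-4 = Gen(NE-3, E4).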
6 Example of learning

In order to illustrate this learning process, we show, step by step, how the algorithm runs on the examples of table 1.

Iteration 1: We compute the similarity between the entities, then between the examples. The similarity matrix found by KBG is given in the following table:

      E1    E2    E3    E4    E5
E1    -     52%   80%   28%   57%
E2          -     35%   14%   94%
E3                -     28%   38%
E4                      -     12%
E5                            -
To simplify, we use here a clustering method which simply consists in gathering together the two examples having the highest similarity in the table and keeping the other examples alone. This algorithm builds a binary clustering tree. During the first iteration, we create the cluster containing the examples (E2, E5). The similarity computation between the entities of E2 and E5 gives the following results:

SIM (john, fred) = 95%;  SIM (john, jane) = 18%
SIM (ann, fred) = 27%;   SIM (ann, jane) = 94%

Thereby, we match the entities JOHN with FRED and ANN with JANE, which have the highest similarities. This matching leads to the generalization G-1 = Gen (E2, E5). The function V-GENage used to generalize the numbers appearing in the predicate AGE is the classical "Enlarging Interval Rule" (Michalski 83).
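To make this matching-then-generalize step concrete, here is a sketch that reconstructs G-1 from E2 and E5 under the matching chosen above (JOHN with FRED, ANN with JANE). The encoding and the helper v_gen are illustrative stand-ins for the mechanisms of section 2:

def v_gen(v1, v2):
    # Stand-in for V-GEN_T: enlarging interval rule for integers,
    # set of values for symbolic values.
    if isinstance(v1, int) and isinstance(v2, int):
        return (min(v1, v2), max(v1, v2))
    return v1 if v1 == v2 else (v1, v2)

def generalize(ex1, ex2, matching):
    # Keep each pair of same-predicate atoms whose entity arguments respect
    # the chosen matching; matched entities become variables and values are
    # generalized. (KBG also adds constraints such as NOT-EQUAL, see below.)
    result = []
    for a1 in ex1:
        for a2 in ex2:
            if a1[0] != a2[0] or len(a1) != len(a2):
                continue
            args, ok = [], True
            for x, y in zip(a1[1:], a2[1:]):
                if x in matching:                 # entity position
                    ok = ok and matching[x] == y
                    args.append("?" + x)
                else:                             # value position
                    args.append(v_gen(x, y))
            if ok:
                result.append((a1[0], *args))
    return result

E2 = {("father", "john", "ann"), ("sex", "ann", "female"),
      ("age", "ann", 28), ("age", "john", 58), ("american", "john")}
E5 = {("father", "fred", "jane"), ("sex", "jane", "female"),
      ("age", "jane", 12), ("age", "fred", 46), ("american", "fred")}

# Reconstructs G-1 up to variable names (atom order may vary):
# father(?john, ?ann), sex(?ann, female), age(?ann, (12, 28)),
# age(?john, (46, 58)), american(?john)
print(generalize(E2, E5, {"john": "fred", "ann": "jane"}))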
The predicate NOT-EQUAL is automatically added by KBG; it means that the variables ?X and ?Y are never instantiated by the same value. More generally, according to the types of the arguments treated, KBG automatically adds constraints between the variables in the generalization formula. The user decides, for each data type, which constraints to test during the generalization.

G-1 = father (?X, ?Y), age (?Y, [12,28]), sex (?Y, female), american (?X), age (?X, [46,58]), not-equal (?X, ?Y)

The generalized examples E2 and E5 are then removed from the learning set. After having exemplified the generalization formula G-1, we obtain a new example NE-1 which is added to the current learning set:

NE-1: father (x, y), american (x), age (x, [46,58]), age (y, [12,28]), sex (y, female)

Iteration 2: We update the previous similarity table by computing the similarities between NE-1 and the other examples. It is worth noting that the similarity values between NE-1 and the other examples are coherent with the similarities previously computed. For instance, we had SIM (E2, E3) = 35% and SIM (E5, E3) = 38%, and the similarity between NE-1 and E3 is equal to 37%.

        E1     E3     E4     NE-1
NE-1    55%    37%    13%    -

The best similarity is now between E1 and E3. After generalization (G-2), we obtain the new example NE-2:

NE-2: father (x, y), age (x, [33,35]), french (x), sex (y, male)

Iterations 3 & 4: They build the new examples NE-3 from Gen (NE-1, NE-2) and NE-4 from Gen (NE-3, E4):

NE-3: father (x, y), age (x, [33,58]), sex (y, (male, female))
NE-4: age (x, [30,58]), sex (y, (male, female))

Step   Current set of examples
init   E1  E2  E3  E4  E5
1      NE-1  E1  E3  E4
2      NE-1  NE-2  E4
3      NE-3  E4
4      NE-4
[Figure: the generalization tree built over the five examples. G-1 covers E2 and E5; G-2 covers E1 and E3; G-3 covers G-1 and G-2; G-4, the root, covers G-3 and E4.]
7 Related work

Most recent conceptual clustering systems differ from the older ones, such as CLUSTER/2, in the sense that they are guided more by a numerical measure than by a symbolic one. Besides the Concept Formation systems such as COBWEB evoked in the introduction, there exist several systems which, like KBG, are borderline between Learning and Data Analysis (table 2). For instance, the system "Pyramids" (Brito 91) uses the same iterative learning method as KBG. One originality of "Pyramids" is the structure of concepts learned: a pyramid corresponds to a DAG with a topological constraint, namely that there is no crossing between the edges of the graph. Therefore, this structure is both more general than a tree and remains easily readable by the user. The system developed by (Lermann et al. 91) uses a pure data analysis algorithm to cluster and a symbolic algorithm working in FOL to generalize. One difference with KBG is that the clustering and the generalization are successive.

System              Strategy               Obtained structure  Rep. language        Method used                                      Cut-off (pruning of the structure)
Concept Formation   incremental, top-down  tree                Valued Prop. Logic   operators Create, Split and Merge                COBWEB: no; CLASSIT, ADECLU: yes
(Lermann et al.)    bottom-up              tree                FOL                  sequential use of clustering and generalization  learning by heart
Pyramids            bottom-up              pyramid             Valued Prop. Logic   iterative use of clustering and generalization   learning by heart
KBG                 bottom-up              DAG                 FOL                  iterative use of clustering and generalization   pruning while building the rules

Table 2 : Comparison between several approaches
8 Conclusion

We have presented the system KBG, which organizes a set of observations in the form of a directed graph and uses a knowledge representation based on first order logic. KBG is implemented in Common Lisp and used in real-world applications (electronics and medicine) through the ESPRIT project "Machine Learning Toolbox".

Our work is now continuing in two directions. First, we are improving the feedback between KBG and the user, by providing explanations about the role played by the predicates during the entity matching. In this way, the user can better understand the reasons that lead the system to build the structure of concepts, and it will be easier for him to improve the knowledge representation. Second, both in KBG and ADECLU, the learning process is fully guided by a similarity measure. By merging the ADECLU approach and our own, it seems possible to build an incremental tool working in FOL.

Acknowledgements: I thank Yves Kodratoff, my thesis supervisor, for the support provided, and all the members of the Inference and Learning group at LRI.

References

BISSON G. 1990. A Knowledge Based Generalizer. In Proceedings of the Seventh International Conference on Machine Learning, 9-15. Austin, Texas.
BISSON G. 1992a. Transformation d'un graphe conceptuel en système de règles de diagnostic. Actes des 3èmes journées Symbolique-Numérique, 171-181.
BISSON G. 1992b. Learning in FOL with a Similarity Measure. In Proceedings of the 10th AAAI Conference (AAAI-92), San Jose.
BRITO P. 1991. Analyse de données symboliques. Pyramides d'héritage. Thèse de doctorat, Université Paris IX Dauphine.
DECAESTECKER C. 1991. Apprentissage en Classification Conceptuelle Incrémentale. Thèse de Docteur en Sciences, Université Libre de Bruxelles.
DIDAY E., LEMAIRE J. 1985. Eléments d'analyse des données. Dunod.
DIDAY E. 1989. Introduction à l'Analyse des Données Symboliques. Rapport interne INRIA n° 1074.
FISHER D.H. 1987. Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning 2, 139-172.
GANASCIA J.G. 1988. Improvement and Refinement of the Learning Bias Semantic. In Proceedings of the 8th ECAI, 384-389.
GENNARI J., LANGLEY P., FISHER D. 1989. Models of Incremental Concept Formation. Artificial Intelligence 40, 11-61.
KODRATOFF Y., ADDIS T., MANTARAS R.L., MORIK K., PLAZA E. 1991. Four Stances on Knowledge Acquisition and Machine Learning. EWSL 91, Springer-Verlag, 514-533.
KODRATOFF Y., GANASCIA J.G. 1986. Improving the Generalization Step in Learning. In Machine Learning: An Artificial Intelligence Approach, Volume 2, Morgan Kaufmann, 215-244.
LERMANN I.C., NICOLAS J., OUALI M., PETER P. 1991. Classification conceptuelle : une approche centrée sur la similarité. Induction Symbolique et Numérique, Cépaduès éditions, 153-177.
MICHALSKI R.S., STEPP R.E. 1983. Learning from Observation: Conceptual Clustering. In Machine Learning: An Artificial Intelligence Approach, Volume 1, Tioga, 331-363.
STEPP R.E., MICHALSKI R.S. 1986. Conceptual Clustering of Structured Objects: A Goal-Oriented Approach. Artificial Intelligence 28, 43-69.
THOMPSON K., LANGLEY P. 1989. Incremental Concept Formation with Composite Objects. In Proceedings of the 6th International Workshop on Machine Learning, 371-374.