Strongly Typed Inductive Concept Learning

P.A. Flach, C. Giraud-Carrier and J.W. Lloyd
Department of Computer Science, University of Bristol
Merchant Venturers Building, Woodland Road, Bristol BS8 1UB, United Kingdom
{flach,cgc,jwl}@cs.bris.ac.uk
http://www.cs.bris.ac.uk/Research/MachineLearning/

Abstract. In this paper we argue that the use of a language with a type system, together with higher-order facilities and functions, provides a suitable basis for knowledge representation in inductive concept learning and, in particular, illuminates the relationship between attribute-value learning and inductive logic programming (ILP). Individuals are represented by closed terms: tuples of constants in the case of attribute-value learning; arbitrarily complex terms in the case of ILP. To illustrate the point, we take some learning tasks from the machine learning and ILP literature and represent them in Escher, a typed, higher-order, functional logic programming language being developed at the University of Bristol. We argue that the use of a type system provides better ways to discard meaningless hypotheses on syntactic grounds and encompasses many ad hoc approaches to declarative bias.

1. Motivation and scope

Inductive concept learning consists of finding mappings of individuals (or objects) into discrete classes. Individuals and induced mappings are represented in some formal language. Historically, attribute-value languages (AVL) have been most popular in machine learning research. In an attribute-value language, individuals are described by tuples of attribute-value pairs, where each attribute represents some characteristic of the individuals (e.g., shape, colour, etc.). Although very useful, attribute-value languages are also quite restrictive. In particular, it is not possible to induce relations explicitly in that framework. In recent years, researchers have thus proposed the use of first-order logic as a more expressive representation language. In particular, the programming language Prolog has become the almost exclusive representation mechanism in inductive logic programming (ILP). The move to Prolog alleviates many of the limitations of attribute-value languages. However, the traditional application of Prolog within ILP has also caused the loss of one critical element inherent in attribute-value languages: the notion of type. Implicitly, each attribute in AVL is a type, which can take on a number of possible values. This characteristic of AVL makes it possible to construct efficient learners, since the only way to define mappings of individuals to classes consists of constructing expressions (e.g., conjunctions) that extract a particular attribute (i.e., tuple projection) and test its value against some value of that type. On the other hand, Prolog has no type system. All characteristics of individuals are captured by predicates. As a result, the way to construct mappings becomes rather unconstrained, and a number of ad hoc mechanisms (e.g., linked clauses, mode declarations, determinacy, etc.) have to be introduced to restore the tractability of the learning problem.

A major aim of this paper is to demonstrate the usefulness of a strongly typed language for the representation of individuals in inductive concept learning, where we define a strongly typed language as one having a fully-fledged type system. We also propose that the natural extension of attribute-value learning to first- and higher-order learning consists in representing individuals by terms. Hence, individuals may have arbitrary structure, including simple tuples (as in the attribute-value learning case), lists, sets and indeed any composite type. This provides us with a unified view of attribute-value learning and inductive logic programming, in which a typed language such as Escher* (a typed, higher-order, functional logic programming language being developed at the University of Bristol) acts as the unifying language (Fig. 1).

[Fig. 1 (diagram): representation languages AVL, Escher and Prolog plotted against the complexity of the terms representing individuals, from tuples of constants (A-V learning) to lists, sets and other composite terms (ILP). Caption: The relation between attribute-value learning and ILP is illuminated by viewing it through a strongly typed language such as Escher. One of the main differences lies in the complexity of the terms representing individuals.]

At Bristol, we have begun the implementation of a decision-tree learner, generalised to handle the constructs of the Escher language. Several of the illustrative examples below were run on this learner and the results produced are reported here. Full details about the learning system, its implementation, and the results of some larger-scale practical experiments will be reported elsewhere.

The paper is organised as follows. Section 2 describes Escher, the programming language which serves as a vehicle for the implementation of the aforementioned extension. Section 3 contains a number of illustrative learning tasks reformulated in Escher. In Section 4 we discuss the main implications of our approach for inductive logic programming, and Section 5 contains some conclusions.

* M.C. Escher® is a registered trademark of Cordon Art B.V., Baarn, Nederland. Used by permission. All rights reserved.

2. Elements of Escher

This section highlights the main features of the Escher language [3] and assumes familiarity with Prolog. We mainly deal with list-processing functions. It should be noted that the syntax of Escher is compatible with the syntax of Haskell (a popular and influential functional programming language). Consequently, Escher programs may look uncomfortably unfamiliar to Prolog aficionados (lowercase variables, constants starting with a capital), but we hope the reader will be able to abstract away from syntax. The Escher definition of list membership is as follows (y:z stands for the list with head y and tail z, == is the equality predicate, and || is disjunction).

    member::(a,[a])->Bool;
    member(x,[]) = False;
    member(x,y:z) = (x==y) || member(x,z);

The first statement defines the signature of member as a function taking a pair of arguments, an element and a list of elements (a is a type variable), and mapping them into the type Bool, which is a built-in type with data constructors True and False. The definition is inductive on the second argument. If this is the empty list, then the function maps to False; if it is non-empty, then the function maps to the truth-value of the disjunction in the last statement. Prolog programmers will recognise this last statement as essentially the Prolog definition of member. Here is the Escher function to concatenate two lists:

    concat::([a],[a])->[a];
    concat([],x) = x;
    concat(x:y,z) = x:concat(y,z);

As can be seen from the signature, concat is a function rather than a predicate. It corresponds to the Prolog append predicate with input-output mode append(+X,?Y,-Z). (Notice that the first argument must be instantiated in order to select the right statement.)
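Since Escher's syntax is compatible with Haskell's, these definitions can be checked almost verbatim in a Haskell system such as GHC. The following transcription is ours, not part of Escher's library: the only changes are the Eq constraint that Haskell requires for ==, and renaming concat to concat' to avoid the Prelude function of the same name.

    -- list membership: does x occur in the list?
    member :: Eq a => (a, [a]) -> Bool
    member (_, [])  = False
    member (x, y:z) = x == y || member (x, z)

    -- list concatenation (renamed to avoid clashing with Prelude.concat)
    concat' :: ([a], [a]) -> [a]
    concat' ([],  x) = x
    concat' (x:y, z) = x : concat' (y, z)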

The Escher function corresponding to append(?X,?Y,+Z), with the list to be split as the third argument, looks as follows (&& stands for conjunction):

    split::([a],[a],[a])->Bool;
    split(x,y,[]) = x==[] && y==[];
    split(x,y,u:w) = (x==[] && y==u:w) ||
                     exists \v -> x==u:v && split(v,y,w);

Notice the explicit existential quantification of local variables (‘exists \v ->’ should be read ‘there exists a v such that’). It is also possible to get the Escher equivalent of append(?X,?Y,?Z). In that case, the third argument is no longer used as the induction argument and the function is defined by a single statement.

    append::([a],[a],[a])->Bool;
    append(x,y,z) = (x==[] && y==z) ||
                    exists \u v w -> x==u:v && z==u:w && append(v,y,w);

Notice the close correspondence with the Prolog definition. Escher computations proceed by rewriting terms to simpler expressions. If s is the goal term and t is the answer term of a computation, then s==t is a logical consequence of the program. For instance, the goal term concat([1,2],[3]) reduces, via 1:concat([2],[3]) and 1:2:concat([],[3]), to the answer term [1,2,3]. The goal term split(x,y,[2,3]) reduces to the following answer term:

    (x==[] && y==[2,3]) || (x==[2] && y==[3]) || (x==[2,3] && y==[])
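The answer term enumerates every way of splitting the list [2,3]. As a quick sanity check, the same three answers can be generated in Haskell as a list of pairs; this sketch is ours and is not how Escher computes the answer term.

    -- all ways to split a list into a prefix and a suffix
    splits :: [a] -> [([a], [a])]
    splits zs = [ (take n zs, drop n zs) | n <- [0 .. length zs] ]

    -- splits [2,3] == [([],[2,3]), ([2],[3]), ([2,3],[])]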

From a Prolog perspective, Escher computations provide all answers at once. Note however that there is no backtracking, and no failure (the Escher equivalent of failure is to return the answer False). Further features of Escher include facilities for representing (extensional and intensional) sets and a full range of set-processing functions. A set is identified with a predicate, its characteristic function. As an example of an intensional set, the set comprehension

    {s | likes(Fred,s)}

where likes has the signature likes::(Person,Sport)->Bool, denotes the set of sports that Fred likes. Sets play a crucial role in the representation of ILP examples.
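The identification of sets with their characteristic functions is easy to mimic in Haskell. In the following sketch of ours, the Person and Sport types and the likes table are invented purely for illustration.

    data Sport = Tennis | Golf | Chess deriving (Eq, Show)
    type Person = String               -- invented stand-in for Person

    -- a set identified with its characteristic function
    type Set a = a -> Bool

    likes :: (Person, Sport) -> Bool   -- an illustrative likes relation
    likes ("Fred", Tennis) = True
    likes ("Fred", Golf)   = True
    likes _                = False

    -- the comprehension {s | likes(Fred,s)} as a predicate
    fredsSports :: Set Sport
    fredsSports s = likes ("Fred", s)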

From an ILP perspective, Escher provides the following advantages over Prolog. First of all, no additional constructs such as ij-determinacy or integrity constraints are needed to specify that certain relations are functional: Escher can deal with functions and predicates in a unified way. Secondly, this leads to a more uniform schema for the function that is learned: if there are several statements making up the definition of a function, these are always mutually exclusive by virtue of a particular input argument. For instance, in the definition of member above we have two cases: either the second argument is the empty list (in which case member returns False) or it isn’t, in which case we evaluate a disjunction. Notice that the Prolog definition of member does not allow clause indexing on the second argument.

The most important feature of Escher in the present context is its type system, which not only restricts possible instantiations of variables but, more importantly, constrains the hypothesis language, because every type carries with it a set of operations. For instance, projections are associated with tuple types; list membership, nth element selection and list length are associated with list types; set membership and cardinality are associated with set types; and so on. In the next section we illustrate this by representing various well-known learning tasks in Escher.

3. Representing learning tasks in Escher

We start by representing in Escher a typical attribute-value learning task: learning the concept of playing, or not playing, tennis, according to the weather [4]. In attribute-value learning, individuals are tuples (elements of a cartesian product) of atomic values. Translated to Escher, this means the definition of a tuple type, which is the domain of the function to be learned, and a data type for each attribute. In the following, the keyword data indicates the declaration of a type and the data constructors of that type, and type indicates a type synonym. We want to learn the definition of the function playTennis.

    data Outlook = Sunny | Overcast | Rain;
    data Temperature = Hot | Mild | Cool;
    data Humidity = High | Normal | Low;
    data Wind = Strong | Medium | Weak;
    type Weather = (Outlook,Temperature,Humidity,Wind);

    playTennis::Weather->Bool;

The examples each specify an input-output pair for the function to be learned:

    playTennis(Overcast,Hot,High,Weak) = True;
    playTennis(Sunny,Hot,High,Weak) = False;
    …

In this setting, our learning system finds the following definition.

    playTennis(w) = if (outlookP(w)==Sunny && humidityP(w)==High) then False
                    else if (outlookP(w)==Rain && windP(w)==Strong) then False
                    else True;

Here, outlookP is the projection function returning the value of the Outlook attribute, humidityP the projection function returning the value of the Humidity attribute, and so on. Their definitions follow in a straightforward way from the type definitions. For instance, outlookP could be defined as follows:

    outlookP::Weather->Outlook;
    outlookP(o,t,h,w) = o;

Clearly such definitions could be provided in a pre-processing stage.
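For readers who wish to experiment, here is a self-contained Haskell rendering of the induced definition together with the mechanically derived projections. This is our sketch: the text only spells out outlookP, so temperatureP is named by analogy.

    data Outlook     = Sunny | Overcast | Rain deriving (Eq, Show)
    data Temperature = Hot | Mild | Cool       deriving (Eq, Show)
    data Humidity    = High | Normal | Low     deriving (Eq, Show)
    data Wind        = Strong | Medium | Weak  deriving (Eq, Show)
    type Weather = (Outlook, Temperature, Humidity, Wind)

    -- one projection per attribute, derived from the tuple type
    outlookP :: Weather -> Outlook
    outlookP (o, _, _, _) = o

    temperatureP :: Weather -> Temperature   -- name assumed by analogy
    temperatureP (_, t, _, _) = t

    humidityP :: Weather -> Humidity
    humidityP (_, _, h, _) = h

    windP :: Weather -> Wind
    windP (_, _, _, w) = w

    -- the induced definition, written with guards instead of if-then-else
    playTennis :: Weather -> Bool
    playTennis w
      | outlookP w == Sunny && humidityP w == High = False
      | outlookP w == Rain  && windP w == Strong   = False
      | otherwise                                  = True

For example, playTennis (Overcast, Hot, High, Weak) evaluates to True, in agreement with the first training example above.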

It is worthwhile to reflect for a moment on how the hypothesis language is largely determined by the type definitions above. The induced function is of the form playTennis(w) = Body, where w is a variable of type Weather and Body is an if-then-else expression. What are the atomic expressions from which the boolean expression after the if can be built up? Initially we have only w to work on and, since w is of tuple type, the only associated operations are projections. Each projection returns a term of one of the four data types. Since the data types above are simple types without internal structure, the only operation available for, say, outlookP(w) is a test for equality with one of the constants of type Outlook.

Our next example concerns one of the Bongard problems [1]. We start with the appropriate type definitions.

    data Shape = Circle | Triangle | Square | Inside(Shape,Shape);
    data Class = Class1 | Class2;
    type Diagram = {(Shape,Int)};

    class::Diagram->Class;

A Bongard diagram consists of a set of tuples, each of which consists of a shape together with the number of times that shape occurs in the diagram. The main difference with the previous attribute-value problem is that the Shape data type has a binary constructor Inside. This means that we can now construct, for a term s of type Shape, complex conditions which introduce new variables, such as

    exists \t u -> s == Inside(t,u) && t == Circle && u == Triangle
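In plain Haskell the same condition would be expressed by pattern matching on the recursive type; the following sketch of ours makes the correspondence explicit.

    data Shape = Circle | Triangle | Square | Inside Shape Shape
      deriving (Eq, Show)

    -- does s match Inside(t,u) with t == Circle and u == Triangle?
    circleInTriangle :: Shape -> Bool
    circleInTriangle (Inside Circle Triangle) = True
    circleInTriangle _                        = False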

Here are a few examples:

    class({(Inside(Triangle,Circle),1)}) = Class1;
    class({(Circle,1),(Triangle,1),(Inside(Triangle,Circle),1)}) = Class1;
    class({(Inside(Circle,Triangle),1)}) = Class2;
    …

Our learning system finds the following definition.

    class(d) = if (exists \p -> p 'in' d &&
                   (exists \s t -> shapeP(p) == Inside(s,t) && s == Circle))
               then Class2 else Class1;

Here in is a built-in predicate for set membership (the quotes make it infix). It follows that p is of type (Shape,Int). The shapeP projection selects the first element of the pair, i.e. a shape, and the induced condition tests whether it equals the term Inside(Circle,t), for some t. Notice again how the Escher type definitions naturally generate the hypothesis space. The existential variables appear because (i) the top-level type is a set, from which we may select an element, and (ii) one of the data types defines a constructor.

Our third and final example involves mutagenicity [5]. An abstract view of a molecule is that it is a graph with atoms as nodes and bonds as edges. Below we represent this graph by the set of atoms and the set of bonds; in the next section we give an atom-centered representation. The type signature is as follows:

    data Element = Br | C | Cl | F | H | I | N | O | S;
    type Ind1 = Bool;
    type IndA = Bool;
    type Lumo = Float;
    type LogP = Float;
    type Label = Int;
    type AtomType = Int;
    type Charge = Float;
    type BondType = Int;
    type Atom = (Label,Element,AtomType,Charge);
    type Bond = ({Label},BondType);
    type Molecule = (Ind1,IndA,Lumo,LogP,{Atom},{Bond});

    mutagenic::Molecule->Bool;

Notice the use of labels to model the complex geometric structure of individuals. Labels, here simply represented as integers, are used to name parts of an individual (here: atoms) in order to be able to refer to them from other parts. The reason for labelling the atoms is that we need to record the bonds between individual atoms in the molecule. The last component of a molecule sextuple records the bond information. Here is (part of) an example, classifying one particular molecule as mutagenic.

    mutagenic(True,False,-1.246,4.23,
              {(1,C,22,-0.117),(2,C,22,-0.117),…,(26,O,40,-0.388)},
              {({1,2},7),…,({24,26},2)}) = True;

So, for instance, in this molecule there is a bond of type 7 between Carbon atom 1 and Carbon atom 2. The following is a possible (partial) induced definition based on the theories induced by Progol.

    mutagenic(m) =
        ind1P(m) == True ||
        lumoP(m) <= … ||
        (exists \a -> a 'in' atomSetP(m) && elementP(a)==C &&
                      atomTypeP(a)==26 && chargeP(a)==0.115) ||
        (exists \b1 b2 -> b1 'in' bondSetP(m) && b2 'in' bondSetP(m) &&
                          bondTypeP(b1)==1 && bondTypeP(b2)==2 &&
                          not disjoint(labelSetP(b1),labelSetP(b2))) ||
        (exists \a -> a 'in' atomSetP(m) && elementP(a)==C &&
                      atomTypeP(a)==29 &&
            (exists \b1 b2 -> b1 'in' bondSetP(m) && b2 'in' bondSetP(m) &&
                              bondTypeP(b1)==7 && bondTypeP(b2)==1 &&
                              labelP(a) 'in' labelSetP(b1) &&
                              not disjoint(labelSetP(b1),labelSetP(b2)))) ||
        …;

atomSetP is the projection onto the fifth component of a molecule; similarly, labelSetP is the projection onto the first component of a bond. The predicate disjoint returns False if its arguments intersect and True otherwise. Notice the use of labels in the fifth disjunct to denote that there is a bond b1 from atom a to another atom, and a bond b2 from that atom to a third. The definition above is a direct translation of some clauses induced by Progol; we expect our learner to produce a similar result in the form of an if-then-else statement.
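The disjoint predicate is only described in words here. A plausible rendering over finite sets, in Haskell and assuming Data.Set, would be:

    import qualified Data.Set as Set

    -- False if the arguments intersect, True otherwise
    disjoint :: Ord a => Set.Set a -> Set.Set a -> Bool
    disjoint s t = Set.null (s `Set.intersection` t)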

4. Discussion

We have seen how to use Escher as an ILP language through a number of examples. We will now discuss in more detail what we believe are the advantages of using a strongly typed language for learning. We will follow the traditional structure of an ILP task and discuss the representation of examples, hypotheses, and background knowledge.

Representation of examples. The view of attributes as types naturally leads to a representation of individuals by terms. This avoids naming individuals by constants, as is done in many ILP systems; instead, an individual is ‘named’ by the collection of all of its characteristics. The language of terms, and thus the example language, is fully determined by the type signature. Information about an individual is localised, and naming of subterms is only needed if we want to refer to them from other subterms, as in the mutagenesis example. The representation of individuals by interpretations [2] is also motivated by localisation of examples. However, we believe that representing individuals by terms offers considerably more opportunities for localisation. For instance, we could easily adapt the representation of molecules to include all information about an atom, including the bonds it has with other atoms, in one subterm. We would have to adapt the types as follows:

    type Bond = (Label,BondType);
    type Atom = (Label,Element,AtomType,Charge,{Bond});
    type Molecule = (Ind1,IndA,Lumo,LogP,{Atom});

The representation of an example then becomes

    mutagenic(True,False,-1.246,4.23,
              {(1,C,22,-0.117,{(2,7),(6,7),(7,1)}),
               (2,C,22,-0.117,{(1,7),(3,7),(8,1)}),
               …,
               (26,O,40,-0.388,{(24,2)})}) = True;
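The atom-centered form can be computed mechanically from the two-set form. The following Haskell sketch is ours; Element is approximated by a String and the Escher set braces by Data.Set.

    import qualified Data.Set as Set

    type Label    = Int
    type BondType = Int
    type Element  = String   -- stand-in for the Element data type
    type Atom  = (Label, Element, Int, Double)
    type Bond  = (Set.Set Label, BondType)
    type Atom' = (Label, Element, Int, Double, Set.Set (Label, BondType))

    -- fold the global bond set into each atom; each bond then appears
    -- under both of its endpoints
    atomCentered :: Set.Set Atom -> Set.Set Bond -> Set.Set Atom'
    atomCentered atoms bonds = Set.map attach atoms
      where
        attach (l, e, t, c) = (l, e, t, c, neighbours l)
        neighbours l = Set.fromList
          [ (l', bt) | (ls, bt) <- Set.toList bonds
                     , l `Set.member` ls
                     , l' <- Set.toList ls
                     , l' /= l ]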

While this representation has redundancy (each bond occurs twice), it has the advantage that all information pertaining to an atom is located in a single subterm. One possible advantage of individuals as interpretations is that it may be easier to represent incomplete information by leaving things unspecified, while in our individuals-as-terms approach we would have to introduce explicit null values.

Representation of hypotheses. Given a type signature f::X->Y, Escher definitions that we learn have the form

    f(x) = if E then s else t

where x is a variable of type X, s and t are either values of type Y or if-then-else expressions, and E is a boolean expression. This is a very general format, and seems to be feasible as long as the number of classes in Y is limited. Learning such function definitions means instantiating E, s and t. We believe that the use of a type system yields significant advantages when it comes to searching the hypothesis space, because it helps us to rule out useless hypotheses. More precisely, the construction of a boolean expression E is triggered by the available terms. Initially, the only variable we have is the head variable x of some type, say a set type. At this stage there are few options: we can use the card function to count the number of elements in the set, we can extract an element by means of in, or we can use background functions defined on sets. Although the language may contain many more functions, they are not considered at this stage. One possibility that will be considered is to instantiate E with (exists \y -> y 'in' x && E1). Next, we consider the construction of E1, for which our options are now slightly more numerous. On the one hand we could extract another element of x, or we may proceed to consider y. In the latter case, we consider the type of y, which again gives us a limited number of choices. For instance, if y is a tuple, the only thing we can do is to project onto one of its components.
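To make the search-space argument concrete, the following toy Haskell sketch (entirely ours, and far simpler than the real learner) generates for a term only those atomic conditions its type licenses: equality tests for enumerated types, and projections for tuple types.

    -- a toy model of types: enumerated types and tuple types
    data Ty = TEnum [String] | TTuple [Ty]

    -- candidate atomic conditions for a term of the given type,
    -- rendered as strings for illustration
    conditions :: String -> Ty -> [String]
    conditions t (TEnum cs)  = [ t ++ " == " ++ c | c <- cs ]
    conditions t (TTuple ts) =
        concat [ conditions (p ++ "(" ++ t ++ ")") ty
               | (p, ty) <- zip projNames ts ]
      where projNames = [ "proj" ++ show i | i <- [1 :: Int ..] ]

    -- the Weather tuple type of Section 3
    weather :: Ty
    weather = TTuple [ TEnum ["Sunny","Overcast","Rain"]
                     , TEnum ["Hot","Mild","Cool"]
                     , TEnum ["High","Normal","Low"]
                     , TEnum ["Strong","Medium","Weak"] ]

Here, conditions "w" weather yields exactly the twelve tests of the form proj1(w) == Sunny, proj2(w) == Hot, and so on; no equality test between values of different attributes, and no unrelated literal, is ever generated.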

Top-down refinement approaches typically consider large parts of the search space. For instance, a naive refinement operator for Prolog may generate a new body literal with all new variables. Furthermore, at any point in the refinement process all available literals in the language will be considered for inclusion. While useless literals may be discarded later in the evaluation step, for instance because they fail to produce any information gain, the fact that they are generated by the refinement operator in the first place is wasteful. As sketched above, the use of a type system avoids the introduction of many useless literals altogether. Inductive hypotheses expressed in Escher are linked by definition: there is no possibility of introducing a literal bondTypeP(b1)==7 unless the type of b1 is determined by the preceding part of the rule. The notion of a non-linked clause seems more an artefact of using a non-typed language than anything else. Similarly, the need for syntactic biases such as mode declarations is subsumed by an appropriate type system. A type system is a powerful tool for expressing and employing declarative bias.

Types and their associated functions thus strongly, but not completely, constrain the hypothesis language. An important complexity dimension is given by the use of existential variables. In the case of attribute-value learning these are not allowed, and we might say that in this case the hypothesis language is completely determined by the types. Notice that when using functions in addition to predicates it is possible, without the use of variables, to represent ‘relational’ information such as the equality of the values of two attributes. The distinction between propositional and first-order learning tasks thus depends in part on the representation formalism. In fact, we would argue that the distinction between propositional learning and first-order learning is rather artificial. The number of existential variables provides another dimension, orthogonal to the complexity of the type system, along which to express the complexity of the learning task (cf. Fig. 1).

Representation of background knowledge. The use of a type system also suggests that background knowledge comes in flavours. First of all, with each complex type one needs selector functions for extracting subterms from terms of that type. For instance, projection functions come with tuple types; a set membership function comes with set types; a list membership function comes with list types; and so on. Without selector functions, the internal structure of the type could not be employed by the learner. Secondly, there are generic, application-independent functions such as the standard set and list operations. Such functions are again closely associated with the types in the signature. The difference is that selector functions are automatically included in the hypothesis language, while this is optional for other background functions. In Escher, both selector and generic functions are provided in separate modules, and do not need to be defined by the user. What remains are auxiliary functions that are neither useful outside the context of a particular learning task, nor easily found by the learner itself. In our view, such functions represent true background knowledge. The definitions of such background functions have to be provided by the user. Notice that if an ILP system does not represent individuals by terms, a significant part of what is conventionally called ‘background knowledge’ consists of descriptions of individuals, which are really artefacts of a flattened knowledge representation approach.
In our approach this is avoided by representing all the knowledge about an individual in one term.

5. Summary and conclusion

If we were to summarise the previous discussion in one slogan, it would be: attributes are types. The reader may claim that many ILP systems use types. However, these are not the kinds of type system found in modern programming languages. Types are more than just labels attached to logical variables to prevent meaningless unifications, or to generate all possible instantiations of a variable. In a typed ILP language the type system provides many meaningful restrictions on the space of possible hypotheses. Escher is such a typed language, which in addition provides functions and higher-order constructs.

Related to the use of type systems, we would argue that individuals are properly represented by terms. This allows the learner to make full use of the type system, and it localises the information in a hierarchical way. Furthermore, we have argued that this viewpoint improves our understanding of the way in which ILP generalises attribute-value learning, namely by increasing the complexity of the terms representing individuals.

Many of the claims we make in this paper are rhetorical, and must be followed by experimental validation. We are currently working on the implementation of a decision-tree learner that employs Escher as an implementation and representation language to establish these claims.

Acknowledgements

Thanks are due to the other members of the Machine Learning Research group at the University of Bristol (Antony Bowers, Torbjørn Dahl, Claire Kennedy, Nicolas Lachiche, and René MacKinney-Romero) for their contributions to the ideas which led to this paper. Antony Bowers is implementing the learning system. Discussions with Nada Lavrac greatly helped to improve the presentation of the paper, as did the comments of the referees. This work was partially supported by EPSRC Grant GR/L21884 and by ESPRIT IV Long Term Research Project 20237 Inductive Logic Programming 2.

References

1. L. De Raedt & W. Van Laer. Inductive constraint logic. Proc. 6th Int. Workshop on Algorithmic Learning Theory, LNAI 997, pp. 80–94, 1995.
2. L. De Raedt & L. Dehaspe. Clausal discovery. Machine Learning 26(2/3):99–146, 1997.
3. J.W. Lloyd. Programming in an integrated functional and logic language. Journal of Functional and Logic Programming, 1998 (to appear).
4. T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.
5. A. Srinivasan, S. Muggleton, R. King & M. Sternberg. Mutagenesis: ILP experiments in a non-determinate biological domain. Proc. 4th Inductive Logic Programming Workshop, GMD-Studien 237, 1994.
