Instituut voor Taal- en Kennistechnologie (Institute for Language Technology and Artificial Intelligence)

Logical approaches to Machine Learning --- an overview
Peter A. Flach

Introduction

The ability to learn from observations and experience seems to be crucial for any intelligent being. Accordingly, Machine Learning plays a central role in Artificial Intelligence research. Learning is no different from many other cognitive phenomena in that it resists a general definition. A first attempt could be: learning is the ability to improve on a certain task. For instance, the task may be face recognition, weather prediction, or taking the derivative of functions. This definition has a behavioural flavour, taking the performer of the task as a black box and judging its performance only from the outside. We will make the additional assumption that the performer consists of a procedural engine and a declarative knowledge base, and that its performance is improved by improving the knowledge base rather than the engine. For instance, the knowledge needed for recognising faces might be expressed in terms of classification rules listing the characteristic features of specific faces; the ability to recognise faces can then be improved by improving the classification rules. From this viewpoint, learning is the ability to acquire new or better knowledge. This declarative view of learning does not cover systems which improve their behaviour by improving their procedural engine, such as neural nets. This is not to say that such systems don’t learn, but theirs is a fundamentally different kind of learning, which falls beyond the scope of this paper.

Logical approaches to Machine Learning assume that knowledge is represented in some logical formalism. In this paper we adopt the formalism of first-order clausal logic, restricted to Horn clauses; in other words, we adopt a Logic Programming framework. The problem of learning may then be formulated as follows: given a domain theory T and external input E, produce a domain theory T* that improves on T. The type of external input, and the way the quality of a domain theory is measured, depend on the type of learning. For instance, in concept learning the external input is given by descriptions of instances and non-instances of the concept to be learned, a domain theory is a definition of the concept, and the quality of a concept definition is determined by how well it classifies unseen instances. In Inductive Logic Programming, the external input consists of true and false facts, and a domain theory is a logic program implying those facts. In explanation-based learning, the external input is a particular instance described by the domain theory, and the result of learning is a specialised version of the domain theory that describes the same instance, but only in terms of operational (for instance, observable) predicates. We discuss each of these forms of learning in a separate section.

Concept learning

Concepts as extensions

A concept can be defined extensionally as the set of its instances: for instance, the concept ‘table’ is the set of all tables in a particular universe U. Given two sets of objects, one containing instances of the concept (the positive examples) and the other containing non-instances (the negative examples), a concept learner has to come up with a set containing every positive example and no negative example. Unless every possible object is an example, there are several possibilities for such a set: we might choose just the set of positive examples as the most ‘conservative’ guess, or the complement of the set of negative examples, or any set between these two.

Figure 1: A concept C which contains every positive example and no negative example.

Suppose that we want to learn a concept incrementally: that is, we have a current concept which agrees with the examples seen so far, and we encounter a new example. There are four possibilities:

1. the example is positive and belongs to the current concept;
2. the example is positive and does not belong to the current concept;
3. the example is negative and belongs to the current concept;
4. the example is negative and does not belong to the current concept.

In cases 1 and 4, the new example is correctly classified by the current concept, which therefore need not be changed. In case 2, the current concept should be enlarged so as to contain the new positive example, while in case 3 it should be made smaller. Enlarging a concept is also called generalisation, the opposite of which is specialisation. Generalisation and specialisation are the basic operations of incremental concept learning.

A useful view of generalisation and specialisation is obtained by considering the concept space: the set of all concepts, partially ordered by the subset relation. Concepts higher in this ordering have more instances and are thus more general; concepts lower in the ordering are more specific. A generalisation of a concept is obtained by climbing the generality ordering, and a specialisation by descending it. Given sets of positive and negative examples, the Version Space is defined as the set of all concepts containing every positive example and no negative example (Mitchell, 1982). It is easily seen that this set is convex: it has an upper and a lower bound, and every concept between those bounds belongs to the Version Space. Both bounds consist of a single concept: the lower bound is the set of all positive examples, and the upper bound is the complement of the set of negative examples.
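This extensional view can be captured directly in Prolog. The sketch below assumes that the universe, the examples, and candidate concepts are all represented as lists of object names; this list representation and the predicate names vs_bounds/5 and in_version_space/3 are illustrative choices made here, not notation from the text.

:- use_module(library(lists)).

% vs_bounds(+Universe, +Pos, +Neg, -Lower, -Upper):
% the lower bound of the Version Space is the set of positive examples,
% the upper bound is the complement of the set of negative examples.
vs_bounds(Universe, Pos, Neg, Pos, Upper) :-
    subtract(Universe, Neg, Upper).

% in_version_space(+Concept, +Pos, +Neg): a candidate concept belongs to
% the Version Space iff it contains every positive and no negative example.
in_version_space(Concept, Pos, Neg) :-
    subset(Pos, Concept),
    \+ (member(X, Neg), memberchk(X, Concept)).

% ?- vs_bounds([a,b,c,d], [a], [d], Lower, Upper).
% Lower = [a], Upper = [a,b,c].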

Concepts as intensions

In practice, concept learners don’t manipulate extensions of concepts, but rather intensions: descriptions in some formal language. Most concept description languages do not have sufficient expressive power to distinguish every possible concept extension; as a result, the concept space becomes less regular. Two things are however maintained: the generality ordering, and the convexity of the Version Space. Intensional generality is also referred to as subsumption. Often the set of all positive examples will be describable in the language, so the lower bound may still consist of a single most specific concept. Since many concept description languages allow only a limited form of negation (or no negation at all), the complement of the set of negative examples will in general not be describable in the language, and the upper bound may consist of several most general concepts. We thus obtain a picture like the one in Figure 2, where plus-signs correspond to too specific concepts (contradicted by positive examples), and minus-signs to too general concepts (contradicted by negative examples).

Figure 2: The Version Space.

Languages which are well suited to many simple concept learning problems are the so-called attribute-value languages, which describe concepts and instances in terms of the values of a number of attributes such as colour, size, and shape. An example of such an attribute-value expression is

size=large and colour=red and shape=square

denoting the concept of large red squares. The same expression is also used to describe an instance that is a large red square. Thus, attribute-value languages are unable to distinguish between instances and sets of instances. They have the same expressive power as propositional logic. Using the same representation for instances and concepts is sometimes referred to as the Single Representation Trick (Dietterich et al., 1982). As a result of this trick, testing whether an instance description is covered by a concept description (the instance is an element of the concept) is the same as testing whether one concept description is subsumed by another (the first concept is a subset of the second).

The complexity of concept learning depends strongly on the complexity of the representation language. If only conjunctive attribute-value expressions are allowed, subsumption is easily implemented by checking whether every attribute-value pair occurring in the more general expression also occurs in the more specific one. Generalisation means climbing the subsumption ordering, so we can generalise by dropping a conjunct from the expression. For instance, if the concept

size=medium and colour=red and shape=circle

should cover the positive example

size=small and colour=red and shape=circle and weight=heavy

then the concept must be generalised to

colour=red and shape=circle

Note that this is a minimal or least general generalisation (LGG). Similarly, specialisation means adding a conjunct. For instance, if the concept

colour=red and shape=circle

should not cover the negative example

size=large and colour=red and shape=circle

then it should be specialised by adding some attribute-value pair that is not present in the negative example. Note that there is no unique minimal specialisation in this case.

The expressive power of a purely conjunctive attribute-value language is quite limited. It can be extended in many ways: one could use disjunctions of conjunctions to describe concepts (disjunctive normal form), one could introduce negated attribute-value pairs, and so on. Two things should be kept in mind, however: increased expressive power results in a higher complexity of the learning task; and a limited description language is more likely to force genuine generalisation from the positive examples. For instance, in a language with unrestricted disjunction, the LGG of the positive examples is simply the disjunction of those examples. This is a trivial solution to any concept learning problem, because the positive examples are simply remembered, and no generalisation takes place. Restrictions on the description language, then, seem to be essential for learning; these restrictions are also called the inductive bias.
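The conjunctive attribute-value case is easy to implement. The following Prolog sketch assumes that a conjunctive expression is represented as a list of Attribute=Value pairs; the list representation and the predicate names subsumes_av/2 and lgg_av/3 are assumptions made for this illustration, not notation from the text.

:- use_module(library(lists)).

% subsumes_av(+General, +Specific): a conjunctive expression subsumes
% another if every attribute-value pair of the more general expression
% also occurs in the more specific one.
subsumes_av(General, Specific) :-
    forall(member(Pair, General), memberchk(Pair, Specific)).

% lgg_av(+E1, +E2, -G): the least general generalisation of two
% conjunctive expressions keeps exactly the pairs they have in common.
lgg_av(E1, E2, G) :-
    findall(Pair, (member(Pair, E1), memberchk(Pair, E2)), G).

% Reproducing the generalisation step above:
% ?- lgg_av([size=medium, colour=red, shape=circle],
%           [size=small, colour=red, shape=circle, weight=heavy], G).
% G = [colour=red, shape=circle]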

Inductive Logic Programming

As said before, attribute-value languages are quite limited in expressive power. In this section we describe approaches to inductive learning which use clausal logic as a representation language. These approaches are capable of constructing logic programs inductively, hence the term Inductive Logic Programming (ILP).

Reformulating concept learning in clausal logic

Let us start by reformulating attribute-value descriptions of concepts and instances in clausal logic. Conjunctive concepts are described by clauses like

concept1(X) :- red(X), circle(X).

where the predicate concept1 represents the concept to be learned. Note that the attributes themselves are missing from the clause. That is, we have lost information like ‘red and green are values of the same attribute, and are therefore mutually exclusive’. If we need such information, it should be explicitly stated in a background theory T. There could be several clauses defining one concept, for instance

concept2(X) :- red(X), square(X).
concept2(X) :- green(X), circle(X).

which corresponds to the attribute-value expression

(colour=red and shape=square) or (colour=green and shape=circle)

Instances are described by ground facts like

small(smallredcircle).
red(smallredcircle).
circle(smallredcircle).

smallredcircle is an instance of concept1, since concept1(smallredcircle) can be proved. In general, given a conjunction of ground facts Desc(Instance) describing Instance, a concept definition Concept(X) :- Conditions(X) classifies Instance positively if

T ^ Desc(Instance) ^ (Concept(X) :- Conditions(X)) |= Concept(Instance)

Thus, we add the description of the instance to the background theory, and we try to prove Concept(Instance) by means of the concept definition. If this proof fails, we interpret this as a negative classification (negation as failure). In case we obtain the wrong classification, the concept needs to be specialised or generalised.
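As a concrete check, the running example can be loaded directly as a Prolog program; the query in the comment below is an assumed way of testing coverage, not something given in the text.

% concept definition under consideration
concept1(X) :- red(X), circle(X).

% description of the instance, added to the background theory
small(smallredcircle).
red(smallredcircle).
circle(smallredcircle).

% ?- concept1(smallredcircle).
% succeeds, so the instance is classified positively; a failing proof would
% be read as a negative classification (negation as failure).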

Subsumption in clausal logic

The next step is to define subsumption in clausal logic, such that the clause

concept3(X) :- red(X), circle(X).

subsumes the clause

concept3(X) :- small(X), red(X), circle(X).

The second clause is more specific, since it contains all the literals of the first clause plus one more. However, this is not the only case of subsumption. Consider the following two clauses:

concept4(X) :- square(X), triangle(Y), same_colour(X,Y).
concept4(X) :- square(X), triangle(t), same_colour(X,t).

The first clause describes the set of squares that have the same colour as some triangle; the second clause describes the set of squares that have the same colour as one particular triangle t. Obviously, the second set is contained in the first, and the first clause describes a more general concept than the second. Combining these two cases, we arrive at a form of logical subsumption called theta-subsumption: Clause1 theta-subsumes Clause2 if there is a substitution theta that can be applied to Clause1, such that every literal in the resulting clause occurs in Clause2.
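Theta-subsumption is decidable and easy to implement. The sketch below assumes that a definite clause is written as Head-BodyList, so that concept3(X) :- red(X), circle(X) becomes concept3(X)-[red(X), circle(X)]; this encoding, the predicate name theta_subsumes/2, and the use of numbervars/3 to freeze the variables of the more specific clause are implementation choices made here, not prescribed by the text.

:- use_module(library(lists)).

% theta_subsumes(+Clause1, +Clause2): Clause1 theta-subsumes Clause2.
theta_subsumes(H1-B1, H2-B2) :-
    \+ \+ ( copy_term(c(H1,B1,H2,B2), c(H1a,B1a,H2a,B2a)),
            numbervars(c(H2a,B2a), 0, _),   % freeze the variables of Clause2
            H1a = H2a,                      % the heads must unify ...
            subset_unify(B1a, B2a) ).       % ... and Clause1's body must map into Clause2's

subset_unify([], _).
subset_unify([L|Ls], Body) :-
    member(L, Body),                        % unification constructs the substitution theta
    subset_unify(Ls, Body).

% ?- theta_subsumes(concept3(X)-[red(X), circle(X)],
%                   concept3(Y)-[small(Y), red(Y), circle(Y)]).
% succeeds.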

Note that if Clause1 theta-subsumes Clause2, then also Clause1 |= Clause2. The reverse, however, is not always true. Consider the following two clauses:

list([V|W]) :- list(W).
list([X,Y|Z]) :- list(Z).

Given the empty list, the first clause constructs lists of any given length, while the second constructs lists of even length. All lists constructed by the second clause are also constructed by the first, which is therefore more general. However, there is no substitution that can be applied to the first clause to yield the second (such a substitution would have to map W both to [Y|Z] and to Z, which is impossible). In order to cover cases like this, we can define logical subsumption as logical implication: Clause1 subsumes Clause2 if Clause1 |= Clause2.

This, however, introduces two problems. One is that this is a semantic definition, which suggests neither a procedure to check subsumption, nor a procedure to generalise clauses. The second problem is that LGGs are not always unique if subsumption is defined as logical implication. Consider the two clauses

list([A,B|C]) :- list(C).
list([P,Q,R|S]) :- list(S).

Under logical implication, these clauses have two LGGs: one is list([X|Y]) :- list(Y), and the other is list([X,Y|Z]) :- list(V). Under theta-subsumption, only the latter is an LGG. Note that the first LGG looks in fact more plausible! Due to its computational advantages, theta-subsumption is often preferred over logical implication in Inductive Logic Programming (for details, see Niblett, 1988).

There are two generalisation steps under theta-subsumption: (1) drop a (negative) literal; (2) replace a term by a variable. To illustrate the second method of generalisation, consider the following two ground facts:

element(1,[1]).
element(z,[z,y,x]).

The following clause is their LGG:

element(X,[X|Y]).

This clause is obtained by replacing terms by variables. That is, it is obtained from the first fact by applying the inverse substitution {1 --> X, [] --> Y}, and from the second fact by means of {z --> X, [y,x] --> Y}. (In general, only some occurrences of terms at specified positions are replaced by variables.) The dual operation of specialising a clause proceeds by (1) adding a (negative) literal; (2) replacing a variable by a term (that is, applying a substitution).
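The replace-a-term-by-a-variable step can be implemented as anti-unification of terms. The sketch below computes the LGG of two terms, threading a mapping of already-generalised term pairs so that the same pair is always replaced by the same variable; the predicate name lgg_term and the m/3 mapping entries are naming choices made for this illustration.

:- use_module(library(lists)).

% lgg_term(+T1, +T2, -G): least general generalisation (anti-unification).
lgg_term(T1, T2, G) :-
    lgg_term(T1, T2, G, [], _).

lgg_term(T1, T2, G, Map, Map) :-            % identical terms generalise to themselves
    T1 == T2, !,
    G = T1.
lgg_term(T1, T2, G, Map0, Map) :-           % same functor: generalise the arguments
    nonvar(T1), nonvar(T2),
    T1 =.. [F|Args1], T2 =.. [F|Args2],
    length(Args1, N), length(Args2, N), !,
    lgg_args(Args1, Args2, GArgs, Map0, Map),
    G =.. [F|GArgs].
lgg_term(T1, T2, V, Map, Map) :-            % a pair seen before gets the same variable
    member(m(S1,S2,V0), Map), S1 == T1, S2 == T2, !,
    V = V0.
lgg_term(T1, T2, V, Map, [m(T1,T2,V)|Map]). % otherwise introduce a fresh variable

lgg_args([], [], [], Map, Map).
lgg_args([A|As], [B|Bs], [G|Gs], Map0, Map) :-
    lgg_term(A, B, G, Map0, Map1),
    lgg_args(As, Bs, Gs, Map1, Map).

% ?- lgg_term(element(1,[1]), element(z,[z,y,x]), G).
% G = element(X,[X|Y])   (up to renaming of variables).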

Induction by refinement

Subsumption determines a partial ordering of the clause space: the set of clauses that can be formulated from a given vocabulary of predicates, functors and constants. The above specialisation and generalisation operations can be used to search this space. An ILP system avant la lettre, Shapiro’s Model Inference System (Shapiro, 1981, 1983), searched this space in a breadth-first, top-down manner, applying minimal specialisations or refinements only. Shapiro calls the clause space, ordered by theta-subsumption, the refinement graph. As an example, suppose that we want to induce the definition of element. We start with the most general definition of this predicate:

element(X,Y).

This theory states that everything is an element of everything. As long as the examples are positive, this theory does not need to be changed. When the first negative example is encountered, we know that the theory is too general, and we have to minimally specialise its only clause. This can be done in several ways: we can apply substitutions like {Y --> X} or {Y --> [V|W]}, or we can add a negative literal element(Y,X), yielding the clause element(X,Y) :- element(Y,X). Part of the refinement graph is depicted in Figure 3.

Figure 3: Part of the refinement graph for element.

Most of the possible refinements will be false in the intended domain, and will be refuted by future examples. For instance, if we specialise element(X,Y) to element(X,X), we might later encounter the negative example element(1,1). Since the graph is searched in a breadth-first manner, we will eventually arrive at the clause element(X,[Y|Z]). Although this clause is also false in the intended interpretation, it is the parent of the following two clauses:

element(X,[X|Z]).
element(X,[Y|Z]) :- element(X,Z).

These clauses are true in the intended interpretation, so they will never be refuted by a negative example. Moreover, they cover all positive examples, so no clause needs to be added. It can be shown that MIS will eventually converge to a correct program, although it is not known in advance when. MIS is also able to refine clauses by adding literals from a user-defined background theory, although it will not order the clauses relative to this background theory. The next section discusses a method that can generalise relative to a background theory.
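A minimal specialisation step of the kind MIS applies can be sketched as a refinement operator in Prolog. The sketch below is restricted to the vocabulary of the element example (the list constructors and the predicate element/2); the clause representation (Head :- Body) with Body = true for an empty body, and the predicate name refine/2, are assumptions made for this illustration.

:- use_module(library(lists)).

% refine(+Clause, -Refined): Refined is a minimal specialisation of Clause.
refine(Clause0, Refined) :-
    copy_term(Clause0, (H :- B)),        % work on a fresh copy of the clause
    term_variables((H :- B), Vars),
    refinement(Vars, H, B, Refined).

% (1) unify two distinct variables of the clause
refinement(Vars, H, B, (H :- B)) :-
    select(X, Vars, Rest), member(Y, Rest), X = Y.
% (2) instantiate a variable with a term from the vocabulary
refinement(Vars, H, B, (H :- B)) :-
    member(X, Vars), ( X = [] ; X = [_|_] ).
% (3) add a body literal built from existing variables
refinement(Vars, H, B, (H :- B1)) :-
    member(A1, Vars), member(A2, Vars),
    ( B == true -> B1 = element(A1, A2) ; B1 = (B, element(A1, A2)) ).

% ?- refine((element(X,Y) :- true), R).
% enumerates, among others, element(X,X) :- true, element(X,[]) :- true,
% element(X,[_|_]) :- true, and element(X,Y) :- element(Y,X).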

Inverse resolution

Until now, we have only considered subsumption between two single clauses. However, if we want to learn clausal theories consisting of several clauses, we must also define subsumption between theories. For instance, consider the theory

concept(X) :- small(X), triangle(X).
polygon(X) :- triangle(X).

This theory is logically implied by the following theory:

concept(X) :- polygon(X).
polygon(X) :- triangle(X).

since any model of the second theory is a model of the first theory. However, it is not the case that the clause concept(X) :- small(X), triangle(X) is logically implied by the clause concept(X) :- polygon(X). It seems that subsumption between theories cannot be reduced to subsumption between clauses. To solve this problem, we introduce the notion of relative subsumption between clauses. (This definition extends subsumption as logical implication between clauses; there is also a definition of relative subsumption based on theta-subsumption, but we will not consider it here.)

Clause1 subsumes Clause2 relative to T if T ^ Clause1 |= Clause2.

Suppose T contains the clause

polygon(X) :- triangle(X).

Then we have

T ^ (concept(X) :- polygon(X)) |= (concept(X) :- small(X), triangle(X))

and thus concept(X) :- polygon(X) subsumes concept(X) :- small(X), triangle(X) relative to T.

Inverse resolution (Muggleton & Buntine, 1988) is an approach which operates on theories rather than clauses. The name of this approach reflects its basic idea, which is to invert the resolution rule of inference. Suppose we already know that the following clause C1 is true in the intended interpretation:

X < s(X).

that is, any number is smaller than its successor. Furthermore, suppose that we encounter a new positive example C:

Y < s(s(Y)).

(In this framework, examples don’t need to be ground.) We now want to find a clause C2 which, together with C1, logically implies C, such that we can add the more general clause C2 to the theory instead of C. Obviously, C2 satisfies this condition if it resolves with C1 to yield C. A trivial solution is obtained by treating resolution propositionally, without variable substitutions: this yields the clause

Y < s(s(Y)) :- X < s(X).

which, although it is true in the intended interpretation, is not very meaningful, since head and body don’t share any variables. We therefore look for terms in the body of this clause of which instances occur in the head. There are several possibilities: Y can be seen as an instance of X, s(s(Y)) can be seen as an instance of X, and s(s(Y)) can be seen as an instance of s(X). There are no inherent criteria for choosing one of these, but suppose we make the last choice; that is, we build the substitution theta1 = {X --> s(Y)} and apply it to the clause, which yields

Y < s(s(Y)) :- s(Y) < s(s(Y)).

This clause is even more specific than the previous one. We generalise it by replacing the occurrences of s(s(Y)) with a new variable, such that the final result is

Y < Z :- s(Y) < Z.

From Figure 4 it can be seen that C can indeed be derived from C1 and C2 in one resolution step.
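This resolution step can be checked mechanically. The sketch below resolves a body literal of one Horn clause against the head of another, with clauses written as Head-BodyList; this encoding and the predicate name resolve/3 are choices made for the sketch, not part of the original text.

:- use_module(library(lists)).

% resolve(+C2, +C1, -Resolvent): resolve a body literal of C2 against the
% head of C1, after standardising the two clauses apart.
resolve(C2, C1, H2-B) :-
    copy_term(C2, H2-B2),
    copy_term(C1, H1-B1),
    select(L, B2, Rest),      % pick a body literal of C2 ...
    L = H1,                   % ... that unifies with the head of C1
    append(B1, Rest, B).      % the resolvent keeps the remaining literals

% ?- resolve((Y < Z)-[s(Y) < Z], (X < s(X))-[], C).
% C = (A < s(s(A)))-[], i.e. the clause C of the text, up to renaming.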

Figure 4: C2 is constructed from C1 and C by absorption.

The operator just illustrated is called the absorption operator. It is non-deterministic, in the sense that it can produce several C2’s for the same C and C1, thus defining a search space. If we aim at finding a theory which is small, we can use some measure of the size of the theory as a heuristic. Alternative generalisation operators can be devised by combining several resolution steps. One operator which is particularly interesting is called intra-construction; it combines two resolution steps back-to-back. Consider the following two clauses A and C1:

min(X,[Y|Z]) :- X < Y, min(X,Z).
X < s(X).

One resolution step results in the clause B1:

min(X,[s(X)|Z]) :- min(X,Z).

Similarly, applying resolution to the clauses A and C2:

min(X,[Y|Z]) :- X < Y, min(X,Z).
X < s(s(X)).

yields the clause B2:

min(X,[s(s(X))|Z]) :- min(X,Z).

The two resolution steps are illustrated in Figure 5.
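Using the resolve/3 sketch introduced above for Figure 4 (again an assumption of this sketch, not notation from the original text), the two resolution steps can be reproduced as queries:

% ?- resolve(min(X,[Y|Z])-[X < Y, min(X,Z)], (U < s(U))-[], B1).
% B1 = min(A,[s(A)|C])-[min(A,C)]          % the clause B1 above
%
% ?- resolve(min(X,[Y|Z])-[X < Y, min(X,Z)], (U < s(s(U)))-[], B2).
% B2 = min(A,[s(s(A))|C])-[min(A,C)]        % the clause B2 above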

Figure 5: An illustration of intra-construction.

The intra-construction operator now constructs A, C1 and C2, given B1 and B2. The interesting thing is that A, C1 and C2 contain a predicate symbol which is not present in B1 and B2 (in this example, the predicate <); intra-construction thus amounts to predicate invention.

Generalisation relative to a background theory T also underlies the Golem system (Muggleton & Feng, 1990): a hypothesis H must satisfy T ^ H |= E for the positive examples E, or equivalently H |= (E :- T). If we represent T by its set of ground atomic consequences (its least Herbrand model), the righthand side corresponds to a set of ground Horn clauses. Golem randomly picks two of these and conjectures their LGG as the hypothesis H. Golem places certain syntactic restrictions on the background theory and the induced clauses.

Explanation-based learning

Inductive learning takes as input an incomplete domain theory and a set of examples, and produces a more general domain theory that explains the classification of the examples. Here, an explanation is a deductive proof. We will now discuss a technique called explanation-based learning (EBL), which analyses and generalises such explanations, so that they can be applied to similar but different examples. Furthermore, the generalised explanation is operationalised, in the sense that it will only contain predicates used in the description of the example. As such, EBL is very similar to a programming technique called partial evaluation (Van Harmelen & Bundy, 1988). Consider the following domain theory.

% humans like nice animates,
% and pets like their owner
likes(X,Y) :- human(X), nice_animate(Y).
likes(X,Y) :- pet_of(X,Y).

% friendly humans and pets are
% nice animates
nice_animate(X) :- human(X), friendly(X).
nice_animate(X) :- pet(X).
pet(X) :- pet_of(X,Someone).

% instances
human(peter).
human(yvonne).
friendly(yvonne).
pet_of(picasso,peter).

The fact likes(peter,yvonne) is a logical consequence of this theory. A proof tree is given in Figure 6.

Figure 6: Proof tree for likes(peter,yvonne).

Note that a verbal explanation for this fact would be something like ‘Peter likes Yvonne because they’re both humans, and Yvonne is friendly’, that is, listing only specific properties of the instances involved. Such properties are called operational predicates, and an operational explanation can be obtained by collecting the leaves of the explanation tree. This is typically done by a specialised meta-interpreter, resulting in the conjunctive explanation

human(peter), human(yvonne), friendly(yvonne)

Note that this explanation is not only valid for Peter and Yvonne, but for any pair of humans of which the second is friendly. EBL therefore produces the generalised explanation

likes(X,Y) :- human(X), human(Y), friendly(Y)

This clause is a version of the domain theory that can be used to prove goals like likes(peter,yvonne) more efficiently. Therefore, this form of generalisation is also called speed-up learning. It should be noted that this clause follows logically from the domain theory, and therefore doesn’t produce new knowledge. Rather, explanation-based learning is a form of learning by experience: solving a goal results in a version of the domain theory that can be used to solve similar goals more efficiently. We add that operationality does not need to be restricted to leaves of the proof tree: any predicate in the background theory can be deemed operational. Matwin and Szpakowicz (this issue) apply EBL to the problem of knowledge acquisition from text, where the narrative part of an expository text is used to build the domain theory, while examples in the text serve as instances for which to build and generalise an explanation.
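The specialised meta-interpreter mentioned above can be sketched in a few clauses. In this sketch the domain theory is stored as rule(Head, BodyList) facts and the operational predicates are declared explicitly; these representation choices and the predicate names ebl/2, prove/3 and rule/2 are assumptions made for the illustration, not part of the original text.

:- use_module(library(lists)).

% The domain theory as rule(Head, BodyList) facts.
rule(likes(X,Y),      [human(X), nice_animate(Y)]).
rule(likes(X,Y),      [pet_of(X,Y)]).
rule(nice_animate(X), [human(X), friendly(X)]).
rule(nice_animate(X), [pet(X)]).
rule(pet(X),          [pet_of(X,_)]).

% Operational predicates and the instance descriptions.
operational(human(_)).  operational(friendly(_)).  operational(pet_of(_,_)).
human(peter).  human(yvonne).  friendly(yvonne).  pet_of(picasso,peter).

% ebl(+Goal, -GeneralisedClause): prove Goal against the instances and, in
% parallel, build a generalised clause whose body collects the operational
% leaves of the generalised proof.
ebl(Goal, (GenGoal :- GenBody)) :-
    functor(Goal, F, N), functor(GenGoal, F, N),
    prove(Goal, GenGoal, Leaves),
    list_to_conj(Leaves, GenBody).

prove(Goal, GenGoal, [GenGoal]) :-
    operational(Goal), !,
    call(Goal).                              % solve operational goals directly
prove(Goal, GenGoal, Leaves) :-
    rule(GenGoal, GenBody),                  % a rule for the generalised goal ...
    copy_term(rule(GenGoal, GenBody), rule(Goal, Body)),   % ... instantiated for the specific goal
    prove_all(Body, GenBody, Leaves).

prove_all([], [], []).
prove_all([G|Gs], [GenG|GenGs], Leaves) :-
    prove(G, GenG, L1),
    prove_all(Gs, GenGs, L2),
    append(L1, L2, Leaves).

list_to_conj([L], L) :- !.
list_to_conj([L|Ls], (L, Conj)) :-
    list_to_conj(Ls, Conj).

% ?- ebl(likes(peter,yvonne), Clause).
% Clause = (likes(X,Y) :- human(X), human(Y), friendly(Y))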

Concluding remarks

We have given an overview of several logical approaches to Machine Learning, concentrating on the main ideas rather than technical details. We have given the main references for those who want to find out more. In addition, we point to the first collection of papers on Inductive Logic Programming (Muggleton, 1992b).

References

T.G. Dietterich, B. London, K. Clarkson & G. Dromey (1982), ‘Learning and inductive inference’. In P. Cohen & E.A. Feigenbaum (eds.), The Handbook of Artificial Intelligence, Vol. III, William Kaufmann.
F. Van Harmelen & A. Bundy (1988), ‘Explanation-based generalization = partial evaluation’, Artificial Intelligence 36, 401--412.
S. Lapointe & S. Matwin (1992), ‘Sub-unification: a tool for efficient induction of recursive programs’. In D. Sleeman & P. Edwards (eds.), Proc. Ninth International Conference on Machine Learning, Morgan Kaufmann, 273--281.
T.M. Mitchell (1982), ‘Generalization as search’, Artificial Intelligence 18:2, 203--226.
S. Muggleton & W. Buntine (1988), ‘Machine invention of first-order predicates by inverting resolution’. In J. Laird (ed.), Proc. Fifth International Conference on Machine Learning, Morgan Kaufmann, San Mateo, 339--352.
S. Muggleton & C. Feng (1990), ‘Efficient induction of logic programs’, Proc. First Conference on Algorithmic Learning Theory, Ohmsha, Tokyo.
S. Muggleton (1992a), ‘Inverting implication’, Proc. ECAI Workshop on Logical Approaches to Machine Learning.
S. Muggleton (ed.) (1992b), Inductive Logic Programming, Academic Press.
T. Niblett (1988), ‘A study of generalisation in logic programs’. In D. Sleeman (ed.), Proc. European Working Sessions on Learning, Pitman, London, 131--138.
J.R. Quinlan (1986), ‘Induction of decision trees’, Machine Learning 1, 81--106.
C. Rouveirol & J.-F. Puget (1989), ‘A simple solution for inverting resolution’. In K. Morik (ed.), Proc. European Working Sessions on Learning, Pitman, London, 201--210.
E.Y. Shapiro (1981), Inductive inference of theories from facts, Technical Report 192, Computer Science Department, Yale University.
E.Y. Shapiro (1983), Algorithmic program debugging, MIT Press.
