Machine Learning, 1–29

© Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

First-Order Bayesian Classification with 1BC

PETER A. FLACH

[email protected]

Department of Computer Science, University of Bristol, United Kingdom

NICOLAS LACHICHE

[email protected]

LSIIT - IUT de Strasbourg Sud, France

Abstract. In this paper we present 1BC, a first-order Bayesian Classifier. Our approach is to view individuals as structured objects, and to distinguish between structural predicates referring to parts of individuals (e.g. atoms within molecules), and properties applying to the individual or one or several of its parts (e.g. a bond between two atoms). We describe an individual in terms of elementary features consisting of zero or more structural predicates and one property; these features are considered conditionally independent following the usual naive Bayes assumption. 1BC has been implemented in the context of the first-order descriptive learner Tertius, and we describe several experiments demonstrating the viability of our approach.

Keywords: inductive logic programming, naive Bayes, first-order logic

1. Motivation and scope

In this paper we present 1BC, a first-order Bayesian Classifier. While the propositional Bayesian Classifier makes the naive Bayes assumption of statistical independence of elementary features (one attribute taking on a particular value) given the class value, it is not immediate which elementary features to use in the first-order case, where features may be constructed from arbitrary numbers of literals. A classification task consists in classifying new individuals given some examples, and therefore requires a clear notion of individuals. Our approach is to view individuals as structured objects, and to distinguish between structural predicates referring to parts of individuals (e.g. atoms within molecules), and properties applying to the individual or one or several of its parts (e.g. a bond between two atoms). An elementary first-order feature then consists of zero or more structural predicates and one property.
The naive Bayesian classifier has proved to be very useful [13, 7]. A Bayesian classifier looks for the most likely class value c_0 of an individual i given its description d. Applying Bayes' theorem, this can be calculated as:

c_0 = argmax_c P(c|d) = argmax_c P(d|c)P(c) / P(d) = argmax_c P(d|c)P(c)    (1)

The key term here is P(d|c), the likelihood of the given description as a function of the class. A Bayesian classifier estimates these likelihoods from training data. Clearly the description d of an individual plays a key role here: if it is too detailed it will not cover many other individuals and its likelihood will therefore be hard to estimate, while if it is too general there may be too much variance in the class value among the individuals satisfying it to make a reliable prediction. Choosing the right level of description is a pre-requisite for making the Bayesian classifier work. While declarative bias is an important ingredient


for any kind of symbolic machine learning, it is more so in a Bayesian classifier because the choice of representation has not only logical but also statistical impact.
In an attribute-value representation, the individual is described by its values a_1,...,a_n for a fixed set of attributes A_1,...,A_n. Determining P(d|c) here requires an estimate of the joint probability P(A_1=a_1,...,A_n=a_n | C=c), usually abbreviated to P(a_1,...,a_n|c). This joint probability distribution is problematic for two reasons:

1. its size is exponential in the number of attributes n;
2. it requires a complete training set, with several examples for each possible description.

These problems vanish if we can assume that all attributes are independent given the class:

P(a_1,...,a_n|c) = P(a_1|c) × ... × P(a_n|c)    (2)
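To make Equations (1) and (2) concrete, here is a minimal sketch of a propositional naive Bayes classifier (hypothetical Python, not part of the original paper; 1BC itself is implemented in C, and the function and variable names below are ours):

from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    # cond_counts[i][(value, c)] = number of class-c examples with attribute i = value
    cond_counts = defaultdict(Counter)
    for attrs, c in examples:
        for i, v in enumerate(attrs):
            cond_counts[i][(v, c)] += 1
    return class_counts, cond_counts

def classify(attrs, class_counts, cond_counts):
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n                                 # P(c)
        for i, v in enumerate(attrs):
            score *= cond_counts[i][(v, c)] / nc       # P(a_i | c), unsmoothed
        if score > best_score:
            best, best_score = c, score
    return best                                        # argmax_c P(c) prod_i P(a_i | c)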

The assumption in Equation (2) is usually called the naive Bayes assumption. Even in cases where the assumption is clearly false, the naive Bayes classifier can give good results [7]. This can be explained by considering that we are not interested in P(d|c) per se, but merely in calculating the most likely class. Thus, if argmax_c P(a_1,a_2|c)P(c) = argmax_c P(a_1|c)P(a_2|c)P(c) for all values a_1 and a_2 of attributes A_1 and A_2, the naive Bayes assumption will not result in a loss of predictive accuracy, even if the actual probabilities are different. Roughly speaking, this requires that the combined attribute A_1 ∧ A_2 correlates with the class in a similar way as the individual attributes A_1 and A_2. For instance, a case where the naive Bayes classifier clearly fails is where a_1 and a_2 separately correlate positively with class c, but their conjunction correlates negatively with it.
In first-order representations, individuals have internal structure which cannot be described by a fixed set of attributes. The features to be used in hypotheses (i.e., boolean or multivalued properties of individuals) are not directly present in the dataset, but have to be constructed [11]. The situation is not dissimilar to image processing, where images are presented in the form of low-level pixel matrices, from which higher-level features such as contours and textures have to be generated before the image can be successfully processed. The question is thus: if an individual is described by a list, which high-level features should be used? Should we employ the head-tail representation of lists, so that e.g. any statement involving the n-th element of the list requires n head-tail splits, or do we use an element's position? Is the number of occurrences of an item in a list relevant? Clearly, there are no domain-independent answers to these questions, and the representation has to be engineered to meet the requirements of the domain at hand. The approach we propose in this paper can be outlined as follows.

- Feature construction is guided by the structure of the individual. For instance, if an individual is a set of tuples as in the multiple instance problem [5], features first extract a tuple from the set, and then state a property of the tuple, such as being equal to a given tuple.
- Representation engineering is facilitated by varying the declarative bias. For instance, we may decide to further decompose the tuple in a similar way as in attribute-value representations. Rather than requiring a different dataset, we adjust the bias specification and re-run the system.


- In the spirit of the naive Bayes classifier, the system should be simple. In particular, the basic naive Bayes classifier takes a strict top-down approach by considering all given attributes, determining their relative importance by Bayesian inference. Similarly, our first-order Bayesian classifier takes all features within the given bias into account. At the end of the paper we discuss possible improvements to this simple top-down scheme.

The outline of the paper is as follows. A major difference with the attribute-value case is that we have to consider probability distributions over structured terms such as lists and sets. There are several possibilities, which are considered in Section 2. In particular, we propose a probability distribution over sets which, unlike the usual bitvector distributions, only depends on its elements and not on the elements in its complement. In Section 3 we discuss the more general topic of decomposing probability distributions over nested terms. Section 4 presents the first-order feature language employed by 1BC. Section 5 discusses implementation details, including the mechanism for specifying declarative bias. In Section 6 we demonstrate the viability of our approach by means of several experiments. Section 7 considers related work, and Section 8 concludes.

2. Probability distributions over lists and sets

In this section we assume an alphabet A = {x_1,...,x_n} of atomic objects (e.g. integers or characters), and we consider the question: how to define probability distributions ranging over lists and sets of elements from A?

2.1. Probability distributions over lists

We can define a uniform probability distribution over lists if we consider only finitely many of them, say, up to and including length L. There are (n^(L+1)-1)/(n-1) of those for n > 1, so under a uniform distribution every list has probability (n-1)/(n^(L+1)-1) for n > 1, and probability 1/(L+1) for n = 1. Clearly, such a distribution does not depend on the internal structure of the lists, treating each of them as equiprobable.
A slightly more interesting case includes a probability distribution over lengths of lists. This has the additional advantage that we can define distributions over all (infinitely many) lists over A. For instance, we can use the geometric distribution over list lengths: P_τ(l) = τ(1-τ)^l, with parameter τ denoting the probability of the empty list. Of course, we can use other infinite distributions, or arbitrary finite distributions, as long as they sum up to 1. The geometric distribution corresponds to the head-tail representation of lists.
We then need, for each list length l, a probability distribution over lists of length l. We can again assume a uniform distribution: since there are n^l lists of length l, we would assign probability n^(-l) to each of them. Combining the two distributions over list lengths and over lists of fixed length, we assign probability τ((1-τ)/n)^l to any list of length l. Such a distribution only depends on the length of the list, not on the elements it contains.
We can also assume a probability distribution P_A over the alphabet, and use this to define a non-uniform distribution over lists of length l. For instance, among the lists of length 3, list [a,b,c] would have probability P_A(a)P_A(b)P_A(c), and so would its 5 permutations.


Combining PA and Pτ thus gives us a distribution over lists which depends on the length and the elements of the list, but ignores their positions or ordering. Definition 1 (Distribution over lists). The following defines a probability distribution over lists: l

Pli ([x j1 ; : : : ; x jl ]) = τ(1 ? τ)l ∏ PA (x ji ) i=1

where 0 < τ  1 is a parameter determining the probability of the empty list. Introducing an extended alphabet A0 = fε; x1 ; : : : ; xn g and a renormalised distribution PA (ε) = τ and PA (xi ) = (1 ? τ)PA (xi ), we have Pli ([x j1 ; : : : ; x jl ]) = PA (ε) ∏li=1 PA (x ji ). That is, under Pli we can view each list as an infinite tuple of finitely many independently chosen elements of the alphabet, followed by the stop symbol ε representing an infinite empty tail. 0

0

0

0

EXAMPLE: Consider an alphabet A = {a, b, c}, and suppose that the probability of each element occurring is estimated as P_A(a) = .2, P_A(b) = .3, and P_A(c) = .5. Taking τ = (1-.2)(1-.3)(1-.5) = .28 (see Definition 3), we have P_A'(a) = (1-.28) × .2 = .14, P_A'(b) = .22, and P_A'(c) = .36, and P_li([a]) = .28 × .14 = .04, P_li([b]) = .06, P_li([c]) = .10, P_li([a,b]) = .28 × .14 × .22 = .009, P_li([a,c]) = .014, P_li([b,c]) = .022, and P_li([a,b,c]) = .28 × .14 × .22 × .36 = .003. We also have, e.g., P_li([a,b,b,c]) = .28 × .14 × .22 × .22 × .36 = .0007.
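A minimal sketch of Definition 1 (hypothetical Python; the function and variable names are ours), which reproduces the numbers of the example above:

def p_list(xs, p_alpha, tau):
    """Pli: probability of the list xs under Definition 1.
    p_alpha maps alphabet elements to probabilities, tau is the probability of the empty list."""
    p = tau
    for x in xs:
        p *= (1 - tau) * p_alpha[x]   # renormalised P_A'(x)
    return p

p_alpha = {'a': 0.2, 'b': 0.3, 'c': 0.5}
tau = 0.8 * 0.7 * 0.5                                         # 0.28, as in the example
print(round(p_list(['a', 'b', 'c'], p_alpha, tau), 3))        # 0.003
print(round(p_list(['a', 'b', 'b', 'c'], p_alpha, tau), 4))   # 0.0007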

If we want to include the ordering of the list elements in the distribution, we can assume a distribution P_{A^2} over pairs of elements of the alphabet, so that [a,b,c] would have probability P_{A^2}(ab)P_{A^2}(bc) among lists of length 3. To include lists of length 0 and 1, we can add special start and stop symbols to the alphabet. Such a distribution would take some aspects of the ordering into account, but note that [a,a,b,a] and [a,b,a,a] would still obtain the same probability, because they consist of the same pairs (aa, ab, and ba), and they both start and end with a. Obviously we can continue this process with triples, quadruples etc., but note that this is both increasingly computationally expensive and unreliable if the probabilities must be estimated from data.
A different approach is obtained by taking not ordering but position into account. For instance, we can have three distributions P_{A,1}, P_{A,2} and P_{A,3+} over the alphabet, for positions 1, 2, and 3 and higher, respectively. Among lists of length 4, the list [a,b,c,d] would get probability P_{A,1}(a)P_{A,2}(b)P_{A,3+}(c)P_{A,3+}(d); so would the list [a,b,d,c].
In summary, all except the most trivial probability distributions over lists involve (i) a distribution over lengths, and (ii) distributions over lists of fixed length. The latter take the list elements, ordering and/or position into account.

2.2. Probability distributions over sets and multisets

A multiset (also called a bag) differs from a list in that its elements are unordered, but multiple elements may occur. Assuming some arbitrary total ordering on the alphabet, each multiset has a unique representation such as {[a,b,b,c]}. Each multiset can be mapped


to the equivalence class of lists consisting of all permutations of its elements. For instance, the previous multiset corresponds to 12 lists. Now, given any probability distribution over lists that assigns equal probability to all permutations of a given list, this provides us with a method to turn such a distribution into a distribution over multisets. In particular, we can employ P_li above, which defines the probability of a list, among all lists with the same length, as the product of the probabilities of their elements.

Definition 2 (Distribution over multisets). For any multiset s, let l stand for its cardinality, and let k_i stand for the number of occurrences of the i-th element of the alphabet. The following defines a probability distribution over multisets:

P_ms(s) = (l! / (k_1! ... k_n!)) P_li(s) = l! τ ∏_i P_A'(x_i)^(k_i) / k_i!

where τ is a parameter giving the probability of the empty multiset. Here, l!/(k_1! ... k_n!) stands for the number of permutations of a list with possible duplicates.
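A sketch of Definition 2 (hypothetical Python, assuming the multiset is given as a list of its elements); the example that follows can be checked against it:

from math import factorial
from collections import Counter

def p_multiset(ms, p_alpha, tau):
    """Pms: probability of a multiset, given as a list of its elements."""
    counts = Counter(ms)
    l = sum(counts.values())
    p = factorial(l) * tau
    for x, k in counts.items():
        p *= ((1 - tau) * p_alpha[x]) ** k / factorial(k)   # P_A'(x)^k / k!
    return p

p_alpha = {'a': 0.2, 'b': 0.3, 'c': 0.5}
tau = 0.28
print(round(p_multiset(['a', 'b', 'b', 'c'], p_alpha, tau), 3))   # 0.008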

EXAMPLE: Continuing the previous example, we have P_ms({[a]}) = .04, P_ms({[b]}) = .06, P_ms({[c]}) = .10 as before. However, P_ms({[a,b]}) = .02, P_ms({[a,c]}) = .03, P_ms({[b,c]}) = .04, P_ms({[a,b,c]}) = .02, and P_ms({[a,b,b,c]}) = .008.

This method, of defining a probability distribution over a type by virtue of that type being isomorphic to a partition of another type for which a probability distribution is already defined, is more generally applicable. Although in the above case we assumed that the distribution over each block in the partition is uniform, so that we only have to count its number of elements, this is not a necessary condition. Indeed, blocks in the partition can be infinite, as long as we can derive an expression for their cumulative probability. We will now proceed to derive a probability distribution over sets from distribution P_li over lists in this manner.
Consider the set {a, b}. It can be interpreted to stand for all lists of length at least 2 which contain (i) at least a and b, and (ii) no other element of the alphabet besides a and b. The cumulative probability of lists satisfying the second condition is easily calculated.

LEMMA 1 Consider a subset S of l elements from the alphabet, with cumulative probability P_A'(S) = ∑_{x_i ∈ S} P_A'(x_i). The cumulative probability of all lists of length at least l containing only elements from S is

f(S) = τ (P_A'(S))^l / (1 - P_A'(S)).

Proof: We can delete all elements in S from the alphabet and replace them by a single element x_S with probability P_A'(S). The lists we want consist of l or more occurrences of x_S. Their cumulative probability is

∑_{j ≥ l} τ (P_A'(S))^j = τ (P_A'(S))^l / (1 - P_A'(S)).


f(S) as defined in Lemma 1 is not a probability, because the construction in the proof includes lists that do not contain all elements of S. For instance, if S = {a, b} it includes lists containing only a's or only b's. More generally, for arbitrary S the construction includes lists over every possible subset of S, which have to be excluded in the calculation of the probability of S. In other words, the calculation of the probability of a set iterates over its subsets.

PROPOSITION 1 (SUBSET-DISTRIBUTION OVER SETS) Let S be a non-empty subset of l elements from the alphabet, and define

P_ss(S) = ∑_{S' ⊆ S} (-P_A'(S'))^(l-l') f(S')

where l' is the cardinality of S', and f is as defined in Lemma 1. Furthermore, define P_ss(∅) = τ. P_ss(S) is a probability distribution over sets.

EXAMPLE: Continuing the previous example, we have P_ss(∅) = τ = .28, P_ss({a}) = f({a}) = τ P_A'({a}) / (1 - P_A'({a})) = .05, P_ss({b}) = .08, and P_ss({c}) = .16. Furthermore, P_ss({a,b}) = f({a,b}) - P_A'({a}) × f({a}) - P_A'({b}) × f({b}) = .03, P_ss({a,c}) = .08, and P_ss({b,c}) = .15. Finally, P_ss({a,b,c}) = f({a,b,c}) - P_A'({a,b}) × f({a,b}) - P_A'({a,c}) × f({a,c}) - P_A'({b,c}) × f({b,c}) + (P_A'({a}))^2 × f({a}) + (P_A'({b}))^2 × f({b}) + (P_A'({c}))^2 × f({c}) = .18.
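A sketch of Lemma 1 and Proposition 1 (hypothetical Python; it enumerates the subsets of S explicitly, which is only practical for small sets), reproducing two values from the example above:

from itertools import combinations

def p_prime(S, p_alpha, tau):
    """Cumulative renormalised probability P_A'(S) of the elements of S."""
    return sum((1 - tau) * p_alpha[x] for x in S)

def f(S, p_alpha, tau):
    """Lemma 1: cumulative probability of lists of length >= |S| over elements of S only."""
    p = p_prime(S, p_alpha, tau)
    return tau * p ** len(S) / (1 - p)

def p_subset(S, p_alpha, tau):
    """Proposition 1: subset-distribution Pss over sets."""
    S = frozenset(S)
    if not S:
        return tau
    total = 0.0
    for size in range(1, len(S) + 1):
        for sub in combinations(S, size):
            total += (-p_prime(sub, p_alpha, tau)) ** (len(S) - size) * f(sub, p_alpha, tau)
    return total

p_alpha = {'a': 0.2, 'b': 0.3, 'c': 0.5}
tau = 0.28
print(round(p_subset({'a', 'b'}, p_alpha, tau), 2))        # 0.03
print(round(p_subset({'a', 'b', 'c'}, p_alpha, tau), 2))   # 0.18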

P_ss takes only the elements occurring in a set into account, and ignores the remaining elements of the alphabet. For instance, the set {a,b,c} will have the same probability regardless of whether there is one more element d in the alphabet with probability p, or 10 more elements with cumulative probability p. This situation is analogous to lists. The following distribution, on the other hand, defines the probability of a set in terms of both its members and its non-members.

Definition 3 (Bitvector-distribution over sets).

P_bv(S) = ∏_{x ∈ S} P_A(x) ∏_{y ∉ S} (1 - P_A(y))

That is, a set is viewed as a bitvector over the alphabet, each bit being independent of the others.

EXAMPLE: Continuing the previous example, we have P_bv(∅) = P_bv({c}) = .8 × .7 × .5 = .28, P_bv({a}) = P_bv({a,c}) = .07, P_bv({b}) = P_bv({b,c}) = .12, and P_bv({a,b}) = P_bv({a,b,c}) = .03. Notice how, at the expense of {a,b,c}, the singleton subsets obtain higher probabilities, in particular {c} (c being the most frequent element of the alphabet).
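For comparison, a sketch of the bitvector distribution of Definition 3 (hypothetical Python), reproducing two of the values above:

def p_bitvector(S, p_alpha):
    """Pbv: each alphabet element is an independent bit, present with probability P_A(x)."""
    p = 1.0
    for x, px in p_alpha.items():
        p *= px if x in S else (1 - px)
    return p

p_alpha = {'a': 0.2, 'b': 0.3, 'c': 0.5}
print(round(p_bitvector(set(), p_alpha), 2))        # 0.28
print(round(p_bitvector({'a', 'b'}, p_alpha), 2))   # 0.03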

P_bv doesn't take the probability of the empty list or set as a parameter. We have therefore used τ = P_bv(∅) in the preceding examples. However, in general the bitvector-distribution requires a distribution over each bit, rather than a distribution over the alphabet. One can


Colour  Shape     Number  Class        Colour  Class    Shape     Class    Number  Class
blue    circle    2       ⊕            blue    ⊕        circle    ⊕        2       ⊕
red     triangle  2       ⊕            red     ⊕        triangle  ⊕        2       ⊕
blue    triangle  3       ⊖            blue    ⊖        triangle  ⊖        3       ⊖

Figure 1. Propositional naive Bayes from the database perspective.

think of the former as being obtained from the latter through a parameter σ, so that the distribution associated with element x_i is (σP_A(x_i), 1 - σP_A(x_i)). The last example was constructed so that σ = 1. To differentiate between the distribution P_A and the distribution over a particular bit, we will sometimes use the notation P_{x_i}(+) and P_{x_i}(-) for the latter.
The two distributions over sets, P_ss and P_bv, differ in the independence assumptions they make. Which one is better thus depends on the domain. Notice that P_ss is expensive to compute for large sets, while P_bv is expensive for large universes.

3. Probability distributions over structured objects

In this section, we will discuss more generally the problem of probability distributions over structured objects, and how these may be broken down into simpler distributions by making certain independence assumptions. We will then show, in the following sections, how the 1BC approach fits into the picture.
Two perspectives will be helpful in our analysis: the term perspective and the database perspective. In the term perspective, our structured objects are viewed as terms in a typed programming language [9]; thus, we can say that an individual is described by a set of tuples, corresponding to the type 'powerset of a cartesian product'. By a slight abuse of language, we will usually refer to such a type as a set of tuples as well. In the database perspective, a dataset is described by a database of relations with foreign keys, and an individual is described by the sub-database of tuples referring to that individual or parts thereof. The process of converting a term representation into a database representation is called flattening, and is described in more detail elsewhere. Here, it suffices to point out that a tuple term (i.e. with a type constructed by cartesian product) will correspond to a single row in a table, while set and list terms will correspond to 0 or more rows.

3.1. Decomposition of tuples: propositional naive Bayes

Approaching the propositional naive Bayes classifier from the database perspective, we see that it amounts to breaking up a table with n + 1 columns into n binary tables (Figure 1). This is of course, in general, a transformation which results in loss of information: we lose the associations between attributes that exist in the dataset. The naive Bayes assumption is made to reduce computational cost and to make the probability estimates more reliable, but we cannot achieve this without loss of logical and statistical knowledge.

EXAMPLE: In Figure 1, each individual is described by a 3-tuple (Colour,Shape,Number). Following the naive Bayes assumption, the dataset in the left-hand table gets


(WKf, WKr)  (WRf, WRr)  (BKf, BKr)  Class
(1, 1)      (3, 4)      (8, 8)      legal
(1, 1)      (3, 4)      (1, 2)      illegal
(3, 6)      (7, 8)      (8, 8)      illegal

WKf  WKr  Class      WRf  WRr  Class      BKf  BKr  Class
1    1    legal      3    4    legal      8    8    legal
1    1    illegal    3    4    illegal    1    2    illegal
3    6    illegal    7    8    illegal    8    8    illegal

Figure 2. Level-1 decomposition of a KRK database.


Example  Colour  Shape      Number        Example  Class
e1       blue    circle     2             e1       ⊕
e1       red     triangle   2             e2       ⊖
e1       blue    triangle   3             e3       ⊕
e2       green   rectangle  3
e3       blue    circle     2
e3       green   rectangle  3

Figure 3. A multiple instance problem. Here, the type of an individual is {(Colour,Shape,Number)}, i.e. a set of 3-tuples. Classes are assigned to sets rather than tuples, requiring a separate table.

decomposed into the three tables on the right. Conditional probabilities of a particular combination of attribute values occurring given the class are estimated from the smaller relations. For instance, we have P(blue|⊕) = .5, P(circle|⊕) = .5, and P(2|⊕) = 1, and hence P((blue,circle,2)|⊕) is estimated at .25, instead of .5 which would be estimated from the left table.

We thus see that in propositional naive Bayes, each n-tuple describing a single individual gets decomposed into n 1-tuples (we ignore the class value here, since it will obviously be included in any decomposed table). We can generalise this by allowing nested tuples: e.g., in the King-Rook-King problem a board can be described as a 3-tuple of positions of each of the pieces, where a position is a pair (file,rank). We now have the choice between a level-1 decomposition (Figure 2), in which we estimate probabilities of positions (i.e. a 3-tuple of pairs gets decomposed into 3 pairs), and a level-2 decomposition in which we estimate probabilities of files and ranks (i.e. a board is now described by 6 1-tuples).

3.2. Decomposition of sets: the multiple instance problem

Such nesting of terms also occurs in the multiple instance problem [5], where each example is described by a variable number of attribute-value vectors. From the type perspective, this corresponds to a set of tuples (Figure 3).


Example  HasBlueCircle2  HasRedTriangle2  HasBlueTriangle3  HasGreenRectangle3  Class
e1       +               +                +                 -                   ⊕
e2       -               -                -                 +                   ⊖
e3       +               -                -                 +                   ⊕

Figure 4. Representation of a set of tuples as a bitvector.

How do we estimate P(e1|⊕)? A simple approach would be to just use the table on the right. However, this does not generalise, as we would have no means to evaluate the probability of a set we haven't seen before. Also, the estimates would be extremely poor unless we see the same sets many times. We thus have to make use of the internal structure of the set, i.e., which tuples it contains. For instance, we can represent the set as a bitvector, treating each bit as statistically independent of the others (Figure 4).

EXAMPLE: If we want to estimate P(e1|⊕) from the data in Figure 4, we can use P_bv or P_ss. For the former, we first estimate P_{x_i}(+|⊕). We use Laplace's correction because of the small sample size, and obtain P_HasBlueCircle2(+|⊕) = .75, P_HasRedTriangle2(+|⊕) = .5, P_HasBlueTriangle3(+|⊕) = .5, and P_HasGreenRectangle3(+|⊕) = .5. Consequently, P_bv(e1|⊕) = .75 × .5 × .5 × (1 - .5) = .094. For the subset-distribution, we first estimate P_A'(x_i|⊕). Considering only the plusses in Figure 4, we have P_A'(HasBlueCircle2|⊕) = .4, P_A'(HasRedTriangle2|⊕) = .2, P_A'(HasBlueTriangle3|⊕) = .2, and P_A'(HasGreenRectangle3|⊕) = .2. (Alternatively, we could have obtained this distribution from the first set of probabilities, with σ = 2.25; this would have resulted in slightly different probabilities because of the Laplace correction.) We can use the bitvector-distribution to estimate τ; again with Laplace correction we obtain τ = P_bv(∅) = .031. Putting everything together we obtain P_ss(e1|⊕) = .037.

In this example we had to estimate probabilities of tuples. In the spirit of the naive Bayes classifier we can choose to decompose the tuples as well. Effectively, this means that we decompose the tuples first, turning a set of tuples into a tuple of sets (Figure 5). Subsequently, we can estimate probabilities of sets by treating them as bitvectors (Figure 6).

EXAMPLE: For the bitvector-distribution, we obtain the following estimates from Figure 6 using Laplace's correction: P_HasBlue(+|⊕) = P_HasCircle(+|⊕) = P_Has2(+|⊕) = P_Has3(+|⊕) = .75, and P_HasRed(+|⊕) = P_HasGreen(+|⊕) = P_HasTriangle(+|⊕) = P_HasRectangle(+|⊕) = .5. Consequently, P_bv(e1|⊕) = .020. For the subset-distribution we obtain the following estimates: P_A'(HasBlue|⊕) = P_A'(HasCircle|⊕) = P_A'(Has2|⊕) = P_A'(Has3|⊕) = .17, and P_A'(HasRed|⊕) = P_A'(HasGreen|⊕) = P_A'(HasTriangle|⊕) = P_A'(HasRectangle|⊕) = .08. Furthermore, τ = P_bv(∅) = .00024. Putting everything together we obtain P_ss(e1|⊕) = .00013. It should be kept in mind that, in such a small toy example, the estimates are not very reliable. A good estimate for τ is clearly important: if we take τ = 1/2^8 = .0039, we obtain P_ss(e1|⊕) = .0021.

In summary, in order to apply Bayesian classification to a multiple instance problem we can perform the following decompositions:


Example  Colour       Example  Shape          Example  Number
e1       blue         e1       circle         e1       2
e1       red          e1       triangle       e1       2
e1       blue         e1       triangle       e1       3
e2       green        e2       rectangle      e2       3
e3       blue         e3       circle         e3       2
e3       green        e3       rectangle      e3       3

Figure 5. Decomposition of a set of tuples into a tuple of (multi)sets. The class assignment to sets is the same as in Figure 3.
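A sketch of the two decomposition steps of Figures 5 and 6 (hypothetical Python; the tuples are those of example e1 in Figure 3, and the feature names mirror those used in the figures):

from collections import Counter

e1 = [('blue', 'circle', 2), ('red', 'triangle', 2), ('blue', 'triangle', 3)]
attributes = ('Colour', 'Shape', 'Number')

# Figure 5: the set of tuples becomes a tuple of multisets, one per attribute.
multisets = {a: Counter(t[i] for t in e1) for i, a in enumerate(attributes)}
print(multisets['Colour'])          # Counter({'blue': 2, 'red': 1})

# Figure 6: each multiset becomes a bitvector (Figure 7 would keep the counts instead).
# Only values occurring in e1 are listed; absent alphabet values would get False.
bitvector = {'Has%s' % str(v).capitalize(): count > 0
             for ms in multisets.values() for v, count in ms.items()}
print(bitvector)                    # {'HasBlue': True, 'HasRed': True, ..., 'Has3': True}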

Example  HasBlue  HasRed  HasGreen  HasCircle  HasTriangle  HasRectangle  Has2  Has3  Class
e1       +        +       -         +          +            -             +     +     ⊕
e2       -        -       +         -          -            +             -     +     ⊖
e3       +        -       +         +          -            +             +     +     ⊕

Figure 6. Decomposition of the sets in Figure 5 into bitvectors. Effectively, each set is now described by an 8-tuple.
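The following sketch (hypothetical Python) reproduces the level-1 bitvector estimate of the example above, under the class assignment of Figure 3 in which e1 and e3 share the ⊕ class:

features = ['HasBlueCircle2', 'HasRedTriangle2', 'HasBlueTriangle3', 'HasGreenRectangle3']
# Bitvectors of Figure 4 for the two ⊕ examples.
e1 = {'HasBlueCircle2': 1, 'HasRedTriangle2': 1, 'HasBlueTriangle3': 1, 'HasGreenRectangle3': 0}
e3 = {'HasBlueCircle2': 1, 'HasRedTriangle2': 0, 'HasBlueTriangle3': 0, 'HasGreenRectangle3': 1}
positives = [e1, e3]

def laplace(n_true, n_total):
    """Laplace-corrected estimate of P(+|class) for a boolean feature."""
    return (n_true + 1) / (n_total + 2)

p_plus = {f: laplace(sum(e[f] for e in positives), len(positives)) for f in features}
# p_plus == {'HasBlueCircle2': 0.75, 'HasRedTriangle2': 0.5, ...} as in the text

p_bv_e1 = 1.0
for f in features:
    p_bv_e1 *= p_plus[f] if e1[f] else (1 - p_plus[f])
print(round(p_bv_e1, 3))   # 0.094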

Example  HasBlue  HasRed  HasGreen  HasCircle  HasTriangle  HasRectangle  Has2  Has3  Class
e1       2        1       0         1          2            0             2     1     ⊕
e2       0        0       1         0          0            1             0     1     ⊖
e3       1        0       1         1          0            1             1     1     ⊕

Figure 7. Decomposition of the multisets in Figure 5 into cardinality-vectors.

level-0: not to decompose at all, i.e. only count the number of occurrences of a particular set (right table of Figure 3);

level-1: to treat the set as a bitvector and decompose the bitvector but not the tuples (Figure 4);

level-2: to decompose the set of tuples into a tuple of sets (Figure 5), and to further decompose the sets thus obtained into bitvectors with independent components (Figure 6).

The main point about this analysis is that it demonstrates the kind of features that enter the picture as soon as one starts decomposing naive-Bayes style. The next section presents a first-order language for representing such decomposed features. The analysis also demonstrates that, in general, one has a choice with regard to the depth of the decomposition. Although the preceding analysis only considered the multiple instance problem, i.e. a set of tuples, a similar analysis can be applied to hierarchies of arbitrary depth.
Before closing this section we note that cardinality-vectors provide an alternative to the bitvector representation, i.e. counting the number of times a certain attribute-value is present (Figure 7). Like in the bitvector case, this doesn't prescribe the


calculation of probabilities: we can either estimate a distribution over the alphabet of features, and use the cardinalities as the k_i in Definition 2; or we can use the cardinalities as simple nominal values, governed by a distribution for each feature.

4. A first-order language for Bayesian Classifiers

In this section we elaborate the representation formalism employed by 1BC. Our approach is best understood by thinking of individuals as structured objects represented by first-order terms in a strongly typed language [9]. Our actual implementation makes use of a flattened, function-free Prolog representation similar to the database representation from the previous section.

4.1. Individuals as terms

As in the propositional case, we will assume that the domain provides a well-defined notion of an individual, e.g. a patient in a medical domain, a molecule in mutagenicity prediction, or a board position in chess. Associated with each individual is its description (everything that is known about it except its classification) and its classification. In a first-order representation the description of an individual can be expressed by a single structured term.
In the attribute-value case this term is a tuple (element of a cartesian product) of attribute values (constants). For instance, in a medical domain each patient could be represented by a five-tuple specifying name, age, sex, weight, and blood pressure of the patient. The first-order case generalises this by allowing other complex types at the top-level (e.g. sets, lists), and by allowing intermediate levels of complex subtypes before the atomic enumerated types are reached.

EXAMPLE: Consider Michalski's east- and westbound trains learning problem. We start with a number of propositional attributes:

Shape = {rectangle, u_shaped, bucket, hexa, ...}
Length = {double, short}
Roof = {flat, jagged, peaked, arc, open}
Load = {circle, hexagon, triangle, ...}

A car is a 5-tuple describing its shape, length, number of wheels, type of roof, and its load:

Car = Shape × Length × Integer × Roof × Load

And finally we define a train as a set of cars:

Train = 2^Car

Here is a term representing a train with 2 cars: {(u_shaped, short, 2, open, triangle), (rectangle, short, 2, flat, circle)}.

In this example an individual is represented by a set of tuples of constants, rather than by a tuple of constants as in the propositional case. Notice that the above type signature represents another instance of the multiple instance problem [5]. It has been argued that the multiple instance problem represents most of the complexity of upgrading to a first-order representation [3]. However, we would like to stress that the above representation does not prevent a deeper nesting of types.
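As an illustration of the term view (hypothetical Python, not 1BC's actual input format), the two-car train above could be represented as a set of typed 5-tuples:

from collections import namedtuple

# Car = Shape x Length x Integer x Roof x Load
Car = namedtuple('Car', ['shape', 'length', 'wheels', 'roof', 'load'])

# Train = 2^Car: a train is a set of cars
train = frozenset({
    Car('u_shaped', 'short', 2, 'open', 'triangle'),
    Car('rectangle', 'short', 2, 'flat', 'circle'),
})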


4.2. First-order features

As 1BC employs the flattened database perspective rather than the strongly typed term perspective, type signatures like the above are represented in our language through first-order features. First-order features use additional, existentially quantified variables that express non-determinacy: e.g., "containing a carbon atom" could be a first-order feature of molecules. Here and elsewhere, a feature denotes a statement about an individual of the domain.
Since a first-order language is much more expressive than an attribute-value language (which is essentially propositional), the question now is what elementary features to use in the naive Bayesian formula. In the propositional case we simply use all features of the form A_i = a_i. If we use Prolog as representation language, as is common in inductive logic programming, the obvious choice seems to be to use single literals as elementary features, but the problem here is that many literals express properties of parts of individuals such as atoms. The above feature would typically be represented by a conjunction of two Prolog literals, one that links a molecule to one of its atoms, and a second to express a property of that atom. In this section we develop a precise definition of elementary first-order features. We employ Prolog notation, with variables starting with capitals and constants starting with lowercase letters, and commas indicating conjunction between literals.
As explained above, a domain is thought of as a hierarchy of complex types. Instead of specifying this type hierarchy directly, the teacher associates with each complex type a structural predicate that can be used to refer to some of its subterms.

Definition 4 (Structural predicate). A structural predicate is a binary predicate associated with a complex type, representing the relation between that type and one of its parts. A functional structural predicate, or structural function, refers to a unique subterm, while a non-determinate structural predicate is non-functional.

For instance, for an n-dimensional cartesian product, each of the n projection functions can be represented by a structural predicate (since it is a 1-to-1 relation, it is usually omitted). For a list type and a set type, list membership and set membership are structural predicates (notice that these are not functions, but 1-to-many relations). For example, the following conjunction of literals refers to the load L of some car C of a train T:

train2car(T,C),car2load(C,L).

car2load is functional and train2car is non-determinate. Our non-determinate structural predicates are similar to the structural literals of [22], however they are not required to be transitive.

Definition 5 (Property). A property is a predicate characterising a subset of a type. A parameter is an argument of a property which is always instantiated. If a property has no parameter (or only one instantiation of its parameters), it is boolean, otherwise it is multivalued. A property is propositional if it has only one argument which is not a parameter, and relational if it has more than one.

A property is for instance the length of a car, short(C), or the shape of a load, load(L,triangle). The former is boolean, while the latter is multivalued if the second argument of load is a parameter. We will also indicate multivalued properties with a hash, as in load(L,#LShape),


to denote the argument holding the value of the multivalued property. Our parameters correspond to valued arguments of [20]. We assume that the value of parameters depends functionally on the instantiation of the remaining arguments. The property shape(C,rectangle) is propositional, while bond(Atom1,Atom2,#BondType) is relational.
The distinction between structural predicates and properties is best understood in the context of a representation of individuals by terms: structural predicates refer to subterms (and introduce new variables), properties treat subterms as atomic (and consume variables). However, the same distinction can be made when using a flattened representation of the examples, which is in fact what we use in the 1BC system. Flattening requires introducing a name for all relevant subterms. For instance, the set e1 of Figure 3 could have the following flattened representation:

example(e1).
set2tuple(e1,t1). isTuple(t1,blue,circle,2).
set2tuple(e1,t2). isTuple(t2,red,triangle,2).
set2tuple(e1,t3). isTuple(t3,blue,triangle,3).

Here, se2tuple is a structural predicate and isTuple is a property with three parameters (representing the actual 3-tuple). Which predicates are structural and which are properties is part of the definition of the hypothesis language, as it cannot always be detected automatically, especially when the representation is flattened and no types are defined. 1BC’s declarative bias is discussed in the next section. Properties and structural predicates are used to define first-order features. Definition 6 (First-order feature). A global variable is a variable of the complex type describing the domain of interest.2 A first-order feature of an individual is a conjunction of structural predicates and properties where:

   

each structural predicate introduces a new existentially quantified local variable, and uses either the global variable or one of the local variables introduced by other structural predicates; properties do not introduce new local variables; each variable is used by either a structural predicate or a property; the feature is minimal, in the sense that it cannot be split into smaller features which only share the global variable.

A feature is functional if all structural predicates are functional, otherwise it is non-determinate. A functional feature is boolean if it contains a single property and this property is boolean, otherwise it is multivalued. A non-determinate feature is always boolean. A feature is elementary if it contains a single property. For instance, the following is a first-order feature: train2car(T,C),car2load(C,L),load(L,triangle).


Note that all variables are existentially quantified except the global variable, which is free. This feature is true of any train which has a car which has a triangular load. It is boolean because of the non-determinate structural predicate train2car. An example of a multivalued feature is

train2firstcar(T,C),shape(C,#Shape).

where Shape indicates a parameter (i.e., a possible value of the multivalued feature). The first car of a given train has only one shape. On the other hand, the feature train2car(T,C),shape(C,Shape) is boolean rather than multivalued. The same train can have a car with a rectangular shape, and another car with a non-rectangular shape.
The following are not first-order features of trains:

load(L,triangle).
train2car(T,C).
train2car(T,C1),train2car(T,C2),short(C1).
train2car(T,C1),train2car(T,C2),short(C1),long(C2).

The first condition does not use a global train variable (it could be a feature of loads, however); the second and third have an unused local variable; and the last one is non-minimal.
To understand the distinction between elementary and non-elementary first-order features, consider the following features:

train2car(T,C),length(C,short).
train2car(T,C),roof(C,open).
train2car(T,C),length(C,short),roof(C,open).

The first feature is true of trains having a short car. The second feature is true of trains having an open car. The third feature is true of trains having a short, open car. From the naive Bayes perspective this third feature is non-elementary, as we assume (justifiably or otherwise) that the probability of a car being short is independent of the probability of a car being open, given the class value. Therefore our first-order Bayesian classifier needs to have access to the first and second feature, but not the third. Notice that properties may express relations between subterms, e.g.

mol2atom(M,A1),mol2atom(M,A2),bond(A1,A2,2)

would be an elementary first-order feature which describes the case of a molecule containing two atoms with a bond between them.

5. 1BC

In this section we describe 1BC, a first-order Bayesian Classifier based on the theory described in the preceding sections. After giving some implementation details, we describe 1BC's declarative bias mechanism in detail, particularly how it can be used to vary the depth of the decomposition.
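Before turning to the implementation, here is a small sketch of how the elementary features above can be evaluated on a train (hypothetical Python; 1BC itself evaluates features on the flattened Prolog representation, so this is only an illustration of the semantics):

# A train as a list of cars, as in the Section 4.1 example.
train = [{'shape': 'u_shaped', 'length': 'short', 'roof': 'open'},
         {'shape': 'rectangle', 'length': 'short', 'roof': 'flat'}]

def has_short_car(train):
    # train2car(T,C),length(C,short): existential over the non-determinate structural predicate
    return any(car['length'] == 'short' for car in train)

def has_open_car(train):
    # train2car(T,C),roof(C,open)
    return any(car['roof'] == 'open' for car in train)

def has_short_open_car(train):
    # train2car(T,C),length(C,short),roof(C,open): the non-elementary conjunction,
    # which the naive Bayes classifier deliberately does not use as a single feature
    return any(car['length'] == 'short' and car['roof'] == 'open' for car in train)

print(has_short_car(train), has_open_car(train), has_short_open_car(train))   # True True True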


5.1. Implementation

1BC has been implemented in C in the context of the first-order descriptive learner Tertius [10]. Let us first briefly describe Tertius' abilities which are used in 1BC. Tertius is able to deal with extensional explicit knowledge (i.e. the truth value of all ground facts is given), with extensional knowledge under the Closed World Assumption (i.e. all true ground facts are given), or with intensional knowledge (i.e. truth values are derived using either Prolog's SLD-resolution or a theorem prover). Tertius can also deal with (weakly) typed predicates, that is, each argument of a predicate belongs to a named type and the set of constants belonging to one type defines its domain. Moreover, if a domain is continuous, Tertius allows one to discretise it into several intervals of width one standard deviation, centered on the mean.
Given some knowledge concerning the domain, Tertius returns a list of interesting sets of literals. It performs a top-down search, starting with the empty set and iteratively refining it. In order to avoid considering the same clauses several times (and their refinements!), the refinement steps (i.e. adding a literal, unifying two variables, and instantiating a variable) are ordered. Once a particular refinement step is applied, none of its predecessors are applicable anymore. The search space can be seen as a generalisation of set-enumeration trees [19] to first-order logic. Since there might be an infinite number of refinements, the search is restricted to a maximum number of literals and of variables. Other language biases are the declaration of structural predicates and properties, the distinction between functional and non-determinate structural predicates, and the use of parameters, as explained in the previous section. Elementary first-order features are generated by constraining Tertius to generate only hypotheses containing exactly one property and no unnecessary structural predicates. The features can optionally be read from a file.
In the naive Bayes formula, the features A_i = a_i are replaced by elementary first-order features f. Each conditional probability P(f|c) of the feature value f given the value c of the class is then estimated from the training data. Writing n(f ∧ c) for the number of individuals satisfying f and Cl = c, n(c) for the number of individuals satisfying Cl = c, and F for the number of values of the feature (the number of possible values for a multivalued feature, 2 for a boolean or a non-determinate feature), the Laplace estimate P(f|c) = (n(f ∧ c) + 1) / (n(c) + F) is used in order to avoid null probabilities in the product in Equation (2).

5.2. Declarative bias

The following is an example of the declarations required by 1BC in order to deal with the multiple instance problem in Figure 3.

--INDIVIDUAL
example 1 set cwa
--STRUCTURAL
set2tuple 2 1:set *:tuple cwa
--PROPERTIES
isTuple 4 tuple #colour #shape #number cwa


The first two lines define a single global variable of type set. The next two lines define a structural predicate: set2tuple is a many-to-one relation, indicating that one set may contain many tuples, but each tuple belongs to exactly one set. Finally, the last two lines define a property. The types of the second, third and fourth arguments of the property isTuple are preceded by #, indicating that they are parameters which must always be instantiated in rules. Jointly, they represent the tuple, with the first argument being the tuple's name. The label cwa means that the Closed-World Assumption is used for these predicates.
These declarations are similar to mode declarations, albeit on the predicate level rather than the argument level. ILP systems such as Progol [14] and Warmr [4] use mode declarations such as train2car(+Train,-Car), indicating that the first argument is an input argument and should use a variable already occurring previously in the current hypothesis. They however do not distinguish between structural predicates and properties.

5.3. Decomposition by varying the declarative bias

In this section, we demonstrate how 1BC can be used to estimate a distribution of probabilities over first-order terms as described in Sections 2 and 3. First, since it uses a flattened representation, 1BC assumes that all relevant subterms (except the tuples, which are represented by parameters) are named. For instance, the first tuple of set e1 (Figure 3) is a t1 tuple; set e3 also has a t1 tuple. A property is introduced for each sub-type of the individual. For instance, isTuple(T,blue,circle,2) is the property stating that tuple T is a t1.
If there were a lot of examples, some of them identical, we could introduce a property on the global variable (i.e., level-0 decomposition, cf. right table of Figure 3). It would be possible to state that the examples e101 and e347 are both e1 sets, isSet(e101,e1) and isSet(e347,e1). It would then be possible to get a better estimate of P(e1|⊕). In such a case, the following (purely propositional) bias declaration would be given to 1BC:

--INDIVIDUAL
example 1 set cwa
--PROPERTIES
isSet 2 set #set_value cwa

However, if there are too many possible sets, they are unlikely to occur often enough, and it is difficult to assign a set value to each example; in that case a level-1 decomposition of the probability of the set (cf. Figure 4) can be obtained with the following declaration:

--INDIVIDUAL
example 1 set cwa
--STRUCTURAL
set2tuple 2 1:set *:tuple cwa
--PROPERTIES
isTuple 4 tuple #colour #shape #number cwa


If the use of features describing the tuples contained in the set, such as P(set2tuple(S,T),isTuple(T,blue,circle,2)|⊕), leads to a good accuracy, the decomposition can be stopped there. Otherwise a level-2 decomposition has to be considered (cf. Figure 6). The decomposition of a tuple could be seen as using a structural predicate to project a tuple along its components and then considering some properties of its components, for instance:

set2tuple(S,T),tuple2colour(T,C),isColour(C,blue).

But decomposing a tuple can be done without introducing a structural predicate. Since it is a one-to-one relation, each property of some component of a tuple can be seen as a property of the tuple itself. So we will consider them as properties of the tuple instead, and write for instance:

set2tuple(S,T),hasColour(T,blue).

Such a distribution of probabilities of the tuples can be obtained in 1BC with the following declaration:

--INDIVIDUAL
example 1 set cwa
--STRUCTURAL
set2tuple 2 1:set *:tuple cwa
--PROPERTIES
hasColour 2 tuple #colour cwa
hasShape 2 tuple #shape cwa
hasNumber 2 tuple #number cwa

The 1BC bias declaration allows an optional maximum number of occurrences of a predicate in a feature. This also allows us to 'switch off' particular predicates, and to combine all declarations for different decompositions. Here is for instance the declaration for the deepest decomposition:

--INDIVIDUAL
example 1 set cwa
--STRUCTURAL
set2tuple 2 1:set *:tuple cwa
--PROPERTIES
isSet 2 set #set_value 0 cwa
isTuple 4 tuple #colour #shape #number 0 cwa
hasColour 2 tuple #colour cwa
hasShape 2 tuple #shape cwa
hasNumber 2 tuple #number cwa

1BC requires that the user provides a flattened representation containing all ground facts for each of these predicates. In fact, the decomposition of isTuple(T1,blue,circle,2) into hasColour(T1,blue), hasShape(T1,circle) and hasNumber(T1,2) could be encoded by background knowledge. While 1BC does provide an interface to


Prolog, this slows the system down considerably. In the examples described in the next section we therefore always saturate the dataset with any background knowledge that may be required. Similarly, the dataset needs to contain ground facts for isSet if one wants to use a level-0 decomposition. This is obviously inconvenient and would be unnecessary in an unflattened representation. On the other hand, this may be a source of inefficiency as the learning system needs to decide whether two different examples consist of the same set or list. We believe that the user should have the choice: if it is obvious that there are several occurrences of the same term or subterm, then it makes sense to name those occurrences; otherwise the probability estimates on such terms are unlikely to be reliable, hence they have to be decomposed, and their names do not matter.

6. Experiments

In this section we describe experimental results on several domains: chess-board illegal positions, Alzheimer's disease, mutagenesis, finite-element mesh design and diterpene structure elucidation. The experiments aim at showing that (i) the decomposition of structured individuals improves the reliability of estimates, hence the accuracy, and (ii) a first-order naive Bayesian classifier achieves an accuracy similar to other first-order learners on those domains. Experiments were carried out using sets and tuples only. The bitvector decomposition was used. Those choices of representation and of decomposition are consistent with the representations used on those domains by other ILP systems. Additional settings are described whenever they are used, in particular the declarative bias guiding the decomposition process.

6.1. Illegal Chess Endgame Positions

This experiment concerns the chess endgame domain White King and Rook versus Black King (KRK) [16]. The classification task is to distinguish between illegal and legal board positions. Following the results reported in [12], we used 5 training sets of 100 board positions each, and a test set of 5000 positions. Table 1 gives the accuracy over the training set, and the accuracy over the test set, averaged over the 5 training sets.
At level-0 an example consists of a '6-tuple' boardeq(Board,#AllPositions): file and rank of the three pieces. Here, the 6-tuple is represented simply by a constant naming it. A first-order feature is for instance:

boardeq(B,763738).

Accuracy is given on the first line (level-0) of Table 1. The last line of Table 1 reports the accuracy of the majority class as a guideline.
Two level-1 decompositions can be performed: the 6-tuple can either be decomposed into the positions of each piece, i.e. three pairs (as in Figure 2), or into the files/ranks of all pieces, i.e. two 3-tuples. The 1BC representation employs a structural function board2WK to refer to the position of the White King (similarly for the other two pieces) and two structural functions board2files and board2ranks to refer to the files and ranks of


Table 1. Results in the KRK-illegal domain.

Decomposition          Training accuracy  Test accuracy
Level-0                100% sd. 0%        66.3% sd. 0.0%
Level-1 Position       94.2% sd. 1.3%     57.3% sd. 2.1%
Level-1 Files/Ranks    100% sd. 0%        68.3% sd. 0.9%
Level-1 Relational     81.2% sd. 2.8%     80.7% sd. 1.3%
Level-2 Propositional  79.0% sd. 3.5%     56.2% sd. 1.4%
Level-2 Relational     93.8% sd. 3.6%     88.3% sd. 2.8%
Majority class         64.0% sd. 3.0%     66.3%

the three pieces. A propositional property poseq is used to refer to the absolute position of a piece. Level-1 decomposition into separate positions considers features like:

board2WK(B,K),poseq(K,76).
board2WR(B,R),poseq(R,37).
board2BK(B,K),poseq(K,38).

Accuracy is given on the second line (level-1 positions) of Table 1: 94.2% on the training sets, and 57.3% on the test set. Level-1 decomposition into files, on one hand, and ranks, on the other hand, relies on features like:

board2files(B,F),fileseq(F,678).
board2ranks(B,R),rankseq(R,733).

Accuracy is given on the third line (level-1 files/ranks) of Table 1: 100% on the training sets, and 68.3% on the test set.
On this domain propositional features consider absolute files and ranks of the three pieces. However, relational features considering relative positions of pieces should be expressive enough to predict whether a position of the board is illegal. Furthermore, relational features should cover more instances than propositional ones, hence estimates should be more reliable. First we consider a refinement of level-1 positions with the relational property adjpos(Pos1,Pos2), i.e. Pos2 is one step away from Pos1 either vertically, horizontally, or diagonally. Features that are used are for instance:

board2WR(B,R),board2BK(B,K),adjpos(R,K).

This representation is a relational alternative to the level-1 position decomposition using background knowledge. Accuracy is given on the fourth line (Level-1 Relational) of Table 1: 81.2% on the training sets, and 80.7% on the test set.
Level-2 decomposition consists in splitting all files and ranks. The 1BC representation employs two structural functions pos2rank and pos2file to translate a position into rank and file. We have two propositional properties rankeq and fileeq equating rank/file with a number. Features at level-2 are exemplified by:

board2WK(B,K),pos2rank(K,R),rankeq(R,1).
board2BK(B,K),pos2file(K,F),fileeq(F,8).


Accuracy is given on the fifth line (level-2 propositional) of Table 1: 79.0% on the training sets, and 56.2% on the test set.
Three relational properties adj(RF1,RF2), eq(RF1,RF2) and lt(RF1,RF2) can be used to compare ranks/files of two pieces. This allows one to consider an alternative relational representation of the level-2 propositional decomposition using background knowledge. Features considered here are of the form:

board2WK(A,B),board2WR(A,C),pos2rank(B,D),pos2rank(C,E),lt(E,D).
board2WR(A,B),board2BK(A,C),pos2file(B,D),pos2file(C,E),adj(E,D).

Accuracy is given on the sixth line (Level-2 Relational) of Table 1: 93.8% on the training sets, and 88.3% on the test set.
The propositional representation clearly overfits the training sets and leads to a poor accuracy on the test set. Moreover, the decomposition of the 6-tuple into 6 independent properties decreases the accuracy. However, the use of relational properties increases the accuracy both on the training set and the test set, therefore avoiding overfitting. The results also show that KRK-illegal is a difficult domain for a Bayesian classifier, since the best result reported in [12] was 98.1% on the test set, achieved by LINUS using a propositional rule learner. Nevertheless, the experiment clearly demonstrates that the use of first-order relational features instead of propositional features considerably improves the performance of the Bayesian classifier.

6.2. Alzheimer's disease

This dataset is about drugs against Alzheimer's disease [1]. It aims at comparing four desirable properties of such drugs:

- low toxicity
- high acetyl cholinesterase inhibition
- good reversal of scopolamine-induced memory deficiency
- inhibition of amine reuptake

The aim isn’t to predict whether a molecule is good or bad, but rather whether the molecule is better or worse than another molecule for each of the four properties above. All molecules considered in this dataset have the structure of the tacrine molecule and they differ only by some substitutions on the ring structures (Figure 8). New compounds are created by substituting chemicals for R, X6 and X7 . The X substitution is represented by a substituent and its position. The R substitution consists of one or more alkyl groups linked to one or more benzene rings. The linkage can either be direct, or through N or O atoms, or through a CH bond. The R substitution is represented by its number of alkyl groups, a set of pairs of position and substituent, a number of ring substitutions, and a set of pairs of ring position and substituent. Since one example consists of two molecules, all properties occur twice. A level-0 decomposition of such an individual is too complex for 1BC, moreover probability estimates are unlikely to be reliable. At level-1 the X substitutions are represented by their positions


Figure 8. Template of the tacrine molecule.

and substituents, i.e. a 4-tuple. The same representation is used for R substitutions and for ring substitutions. Finally the number of ring substitutions in each molecule is taken into account. Features considered are for instance:

mol2alkgroup(C,A),alkgroupeq(A,01).
mol2x_subst(C,X),x_substeq(X,67fcl).
mol2r_subst(C,R),r_substeq(R,32aro2double_alk1).
mol2ring_subst(C,R),ring_substeq(R,23ch3och3).
mol2ring_substitution(C,R),ring_substitutioneq(R,05).

Accuracies are summarised on the first line (level-1 propositional) of Table 2. All experiments in this domain are carried out using a 10-fold cross validation. The last line shows the best accuracies reported by Boström and Asker using up-to-date rule inducers [1]. It should be noted that they applied several rule inducers, and none of them got the best accuracy on all four targets.

Table 2. Results in the Alzheimer's disease domain.

System                 Amine   Toxic   Choline  Scopolamine
Level-1 Propositional  66.1%   74.4%   71.5%    74.5%
Level-1 Relational     73.6%   70.7%   70.7%    74.7%
Level-2 Propositional  70.6%   73.9%   69.1%    72.5%
Level-2 Relational     71.0%   74.1%   67.6%    71.9%
Level-3 Propositional  69.7%   73.4%   69.6%    71.7%
Level-3 Relational 1   70.2%   73.5%   67.8%    71.6%
Level-3 Relational 2   68.9%   73.3%   68.8%    72.2%
Best rule induction    86.1%   81.9%   75.5%    61.0%

Properties alkgroupeq and ring_substitutioneq are propositional: they consider the absolute numbers of alkyl groups and of substitutions in the ring. Relational properties comparing those numbers can be used to consider features like:
mol2alkgroup(C,A),greater_alk_group(A,1).
mol2ring_substitution(C,R),greater_ring_substitutions(R,-1).
The resulting accuracies are given on the second line (Level-1 Relational) of Table 2.


Level-2 decomposition considers the positions and the substituents separately. New features considered instead of those using x_substeq, r_substeq and ring_substeq are for instance:
mol2x_subst(C,X),x_subst2pos(X,P),poseq(P,67).
mol2x_subst(C,X),x_subst2group(X,G),groupeq(G,cf3och3).
mol2r_subst(C,R),r_subst2pos(R,P),poseq(P,23).
mol2r_subst(C,R),r_subst2substs(R,S),substseq(S,single_alk2o).
mol2ring_subst(C,R),ring_subst2pos(R,P),poseq(P,52).
mol2ring_subst(C,R),ring_subst2group(R,G),groupeq(G,ch3f).
Either the propositional or the relational properties of the numbers of alkyl groups and of the numbers of ring substitutions can be considered. Accuracies are reported on the third (Level-2 Propositional) and fourth (Level-2 Relational) lines of Table 2.
Level-3 decomposition considers the properties of the X and ring substituents (polarity, size, hydrogen-bond donor, hydrogen-bond acceptor, pi-donor, pi-acceptor, polarisability, sigma effect) instead of the groups themselves:
mol2x_subst(C,X),x_subst2group(X,G),group2polar(G,P),polareq(P,polar5polar2).
mol2x_subst(C,X),x_subst2group(X,G),group2h_acceptor(G,H),h_acceptoreq(H,h_acc0h_acc0).
mol2ring_subst(C,R),ring_subst2flex(R,F),flexeq(F,flex1flex0).
mol2ring_subst(C,R),ring_subst2sigma(R,S),sigmaeq(S,sigma5sigma3).

Again, either the propositional or the relational properties of the numbers of alkyl groups and of the numbers of ring substitutions can be considered. Accuracies are reported on the fifth (Level-3 Propositional) and sixth (Level-3 Relational 1) lines of Table 2. All the above properties of the substituents are propositional: they consider the absolute values of their respective characteristics. Relational properties comparing those values can be used instead, in order to get features of the form:
mol2x_subst(C,X),x_subst2group(X,G),group2polar(G,P),greater_polar(P,0).
mol2ring_subst(C,R),ring_subst2pi_doner2(R,P),greater_pi_doner(P,-1).

Accuracies are reported on the seventh (Level-3 Relational 2) line of Table 2. On this domain relational properties cannot be distinguished from the propositional ones from the accuracy point of view: at each level of decomposition there are two targets whose accuracy improves and two whose accuracy degrades when going from propositional to relational properties. From the decomposition perspective, each decomposition step decreases the accuracy for three out of the four targets, whether propositional or relational properties are considered. This shows that the loss of information due to the decomposition is not balanced by an increase in the reliability of the probability estimates. Overall, the accuracies on the first three targets are slightly lower than the best reported accuracies; however, 1BC achieves a considerably better result than the best rule inducer in predicting the good reversal of scopolamine-induced memory deficiency.

6.3. Mutagenesis

This problem concerns identifying mutagenic compounds [21, 17]. We considered the "regression friendly" dataset. In these experiments, we used the atom and bond structure of the molecule as one setting, added the lumo and logp properties to get a second setting, and finally added the boolean indicators Ia and I1 as well to get a third setting.


The latter four properties are propositional and are treated as separate properties; the decomposition is applied only to the atom-bond structure. A molecule is, of course, a graph. In order to apply the decomposition process, we represent molecules as sets of atoms, with bonds as relational properties of atoms. The set is represented by a structural predicate mol2atom(M,A) linking an atom A to its molecule M. Each atom is described by its element, its type, and its charge. A bond is represented by a multivalued property bond(A1,A2,#BondType) relating the two atoms involved in the bond and its type.
Level-0 decomposition would consider molecules as a whole. Since there are only 188 example molecules, it is impossible to get reliable estimates at that level. At level-1, the elements of the set, i.e. the atoms of the molecule, are considered. An atom is represented by its element, its type, and its charge, i.e. a 3-tuple. An example of a first-order feature used at level-1 decomposition is:
mol2atom(M,A),atomeq(A,c22-0.117).
The accuracies are reported on the first line (Level-1) of Table 3. All accuracies are evaluated using a 10-fold cross-validation.
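For concreteness, the atom-and-bond representation and the level-1 feature above can be written as the following Prolog sketch. The molecule and atom constants, the particular values, and the wrapper predicate f_level1 are invented for illustration; they are not taken from the actual dataset, and the encoding of the 3-tuple as a single compound value is an assumption.

% hypothetical fragment of one molecule with two atoms and one bond
mol2atom(m1, a1).         mol2atom(m1, a2).
atom2element(a1, c).      atom2type(a1, 22).      atom2charge(a1, -0.117).
atom2element(a2, o).      atom2type(a2, 40).      atom2charge(a2, -0.388).
bond(a1, a2, 7).

% level-1: each atom carries its (element, type, charge) 3-tuple as one value
atomeq(a1, c22-0.117).
atomeq(a2, o40-0.388).

% the level-1 feature quoted above, wrapped as a predicate over molecules
f_level1(M) :- mol2atom(M, A), atomeq(A, c22-0.117).

The query ?- f_level1(m1). succeeds because atom a1 matches the 3-tuple; 1BC estimates, for each class, the probability that a molecule satisfies such a feature.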

Table 3. Results in the mutagenesis domain. The last three lines are taken from [17]. Regression did not use the atoms and bonds knowledge.

System        Atoms and bonds only    Plus lumo and logp    Plus I1 and Ia
Level-1       81.4%                   82.4%                 85.1%
Level-2       80.3%                   82.4%                 87.2%
Progol        N/A                     88%                   88%
Regression    N/A                     85%                   89%
Default       N/A                     66%                   66%

Level-2 decomposition consists in splitting the 3-tuple representing an atom into three properties. Possible first-order features are:
mol2atom(M,A),atom2element(A,E),elementeq(E,c).
mol2atom(M,A),atom2type(A,T),typeeq(T,22).
mol2atom(M,A),atom2charge(A,C),chargeq(C,-0.117).
Note that we did not discretise the charge of the atoms. The accuracies are reported on the second line of Table 3. The bonds occur as separate features like:
mol2atom(M,A),mol2atom(M,B),bond(A,B,7).
(A Prolog sketch of these level-2 features is given at the end of this subsection.) Whereas the decomposition of the atoms decreases the accuracy when only the atoms and bonds are considered, it increases the accuracy when all indicators are included. Overall, the best results of 1BC are slightly worse than those of Progol and regression.
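The following sketch shows how the level-2 features read against the same hypothetical molecule fragment (the facts are repeated so that the fragment stands alone); the f_... wrapper names are again purely illustrative.

% hypothetical facts, as in the previous sketch
mol2atom(m1, a1).         mol2atom(m1, a2).
atom2element(a1, c).      atom2type(a1, 22).      atom2charge(a1, -0.117).
atom2element(a2, o).      atom2type(a2, 40).      atom2charge(a2, -0.388).
bond(a1, a2, 7).

% level-2 features: element, type and charge are matched separately
f_has_carbon(M)  :- mol2atom(M, A), atom2element(A, c).
f_has_type22(M)  :- mol2atom(M, A), atom2type(A, 22).
f_has_charge(M)  :- mol2atom(M, A), atom2charge(A, -0.117).
% the bond feature ranges over two atoms of the same molecule
f_has_bond7(M)   :- mol2atom(M, A), mol2atom(M, B), bond(A, B, 7).

Each such feature is a boolean property of the molecule as a whole, and its per-class frequency can be estimated from the training data.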

6.4. Finite Element Mesh Design

This domain is about finite element methods in engineering. The task is to predict how many elements should be used to model each edge of a structure [6]. The target predicate is mesh(Edge,Number), where the number of elements in the mesh model can vary between 1 and 17. Each edge is described by a 3-tuple of properties, edge(Edge,Type,Support,Load). The number of elements also depends on the surrounding edges. The dataset provides information about the neighbours, and also about opposite and equal edges. Since the properties of the edges are multivalued, the structural predicates neighbour_xy(E1,E2), neighbour_yz(E1,E2), neighbour_zx(E1,E2), opposite(E1,E2) and equal(E1,E2) provide the functional representation of the topological relations that is necessary to define multivalued features.
There are two dimensions in this domain. One dimension is whether the surrounding edges are considered. The second dimension is the representation of the edge itself. The decomposition process is applied to the second dimension only, and the two settings of the first dimension are considered as different experiments.
Let us first consider one edge only. At level-0, the example consists of one edge, i.e. a 3-tuple. An example of a first-order feature is:
edge(E,short,fixed,cont_loaded).
The accuracy achieved by 1BC at level-0 is 63.0%. A level-1 decomposition consists in splitting the 3-tuple into three properties type(Edge,Type), support(Edge,Support) and load(Edge,Load). First-order features considered are then of the form:
type(E,circuit_hole).
support(E,two_side_fixed).
load(E,not_loaded).
With such a decomposition, the accuracy decreases slightly to 61.9%.
A second run of experiments considers one edge together with the edges related to it through the topological relations as one example (a Prolog sketch of such features is given at the end of this subsection). At level-0, first-order features look like:
edge(E,short,fixed,cont_loaded).
neighbour_xy(N,E),edge(N,usual,free,cont_loaded).
The accuracy at level-0 is 82.4%. At level-1, the tuples corresponding to the edge and its related edges are split. Features become similar to:
support(E,one_side_fixed).
neighbour_xy(E,N),type(N,half_circuit_hole).
At level-1 decomposition the accuracy decreases to 67.3%. This domain confirms that even if the estimates become more reliable with decomposition, information is lost by breaking down the structure; here it is better to stop at level-0. The best accuracy achieved by 1BC using related edges and level-0 decomposition, 82.4%, is close to Golem's accuracy of 84.9% [6].
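The following minimal Prolog sketch illustrates the level-1 features over an edge and one of its neighbours; the edge constants e1 and e2 and their property values are invented for the example, and the f_... wrapper predicates are not part of 1BC.

% hypothetical edges with their level-1 properties
type(e1, short).                 support(e1, fixed).    load(e1, cont_loaded).
type(e2, half_circuit_hole).     support(e2, free).     load(e2, not_loaded).
neighbour_xy(e1, e2).

% a level-1 feature of the edge itself
f_fixed(E) :- support(E, fixed).
% a level-1 feature of a neighbouring edge, reached through the structural
% predicate neighbour_xy
f_nb_xy_half_circuit_hole(E) :- neighbour_xy(E, N), type(N, half_circuit_hole).

Both features succeed for e1; at level-0 the whole 3-tuple describing the neighbour would have to match instead.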


6.5. Diterpene structure elucidation

The last dataset is concerned with diterpenes, which are one of a few fundamental classes of natural products, with about 5000 members known [8]. Structure elucidation of diterpenes from 13C-NMR spectra (nuclear magnetic resonance) can be separated into three main stages: (1) identification of residues (ester and/or glycosides), (2) identification of the diterpene skeleton, and (3) arrangement of the residues on the skeleton. This dataset is concerned with the second stage, the identification of the skeleton. A skeleton is a unique connection of 20 carbon atoms, each with a specific atom number and, normalized to a pure skeleton molecule without residues, a certain multiplicity (s, d, t or q) measuring the number of hydrogens directly connected to a particular carbon atom: s stands for a singlet, which means there is no proton (i.e., hydrogen) connected to the carbon; d stands for a doublet with one proton connected to the carbon; t stands for a triplet with two protons, and q for a quartet with three protons bound to the carbon atom.
The data contain information on 1503 diterpenes with known structure. Two representations are available: a propositional one where the atom numbers are known, and a relational one without this information. In order to compare our results with [8], the accuracy is evaluated with a 3-fold cross-validation, and three settings are considered: each representation separately, and then both together.
The propositional representation consists of a 4-tuple associated with each compound. Level-0 features are for instance:
prop(D,2486).
1BC's accuracy using such features is 78.0%. Level-1 decomposition consists in splitting the 4-tuple into four separate properties. Level-1 features are of the form:
prop2first(D,F),firsteq(F,2).
prop2second(D,S),secondeq(S,4).
prop2third(D,T),thirdeq(T,8).
prop2fourth(D,F),fourtheq(F,6).
The accuracy decreases to 77.4%, showing that the loss of information due to breaking up the 4-tuple is not balanced by an increase in the reliability of the estimates. Indeed, with 1503 examples, and given the simplicity of the features, the estimates of level-0 features are already quite reliable.
The relational representation consists of 20 2-tuples describing the carbon atoms. Level-0 was not considered since it would treat the set as a whole. A level-1 decomposition considers all elements separately. Features are for instance:
hasatom(D,A),atomeq(A,q19.3).
hasatom(D,A),atomeq(A,t35.5).
Level-1 accuracy is 65.9%. Level-2 decomposition considers features such as:
hasatom(D,A),atom2mult(A,M),multeq(M,q).
hasatom(D,A),atom2second(A,S),secondeq(S,35.5).
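As a sketch, the relational representation and the two decomposition levels can be written as the following Prolog fragment. The diterpene and atom constants, the assumed decomposition of a value such as q19.3 into multiplicity q and frequency 19.3, and the f_... wrappers are invented for illustration only.

% hypothetical fragment: one diterpene and two of its twenty carbon atoms
hasatom(d1, a1).           hasatom(d1, a2).
% level-1 treats each atom as a single (multiplicity, frequency) value
atomeq(a1, 'q19.3').       atomeq(a2, 't35.5').     % quoted to form single atoms
% level-2 splits that value into two separate properties
atom2mult(a1, q).          atom2second(a1, 19.3).
atom2mult(a2, t).          atom2second(a2, 35.5).

% level-1 feature: the compound contains an atom with the exact value q19.3
f_level1(D) :- hasatom(D, A), atomeq(A, 'q19.3').
% level-2 features: multiplicity and frequency are matched independently
f_has_quartet(D) :- hasatom(D, A), atom2mult(A, q).
f_has_second(D)  :- hasatom(D, A), atom2second(A, 35.5).

Level-2 features are more general, which makes their estimates more reliable but discards the coupling between multiplicity and frequency; on this dataset the loss outweighs the gain, as the following accuracies show.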


The accuracy drops to 57.6%. In a third run of experiments, both the propositional and the relational representations are used. Features are therefore the union of the level-1 (resp. level-2) features considered in the two previous settings. Level-1 accuracy is 80.0% and level-2 accuracy is 72.8%. The results with the propositional only, the relational only, and both propositional and relational representations are summarised and compared with results from other systems in Table 4.

Table 4. Results for the Diterpene dataset.

System     Prop.    Rel.     Prop. and Rel.
Level-0    78.0%    N/A      N/A
Level-1    77.4%    65.9%    80.0%
Level-2    77.4%    57.6%    72.8%
C4.5       78.5%    N/A      N/A
FOIL       70.1%    46.5%    78.3%
RIBL       79.0%    86.5%    91.2%

On this domain 1BC achieves its best result with propositional and relational data using level-1 decomposition, but the simplest approach (propositional data without decomposition) achieves only slightly less. 1BC cannot match the first-order instance-based learner RIBL, but it beats FOIL in all but one case and performs comparably to C4.5 on the propositional data at all decomposition levels. This indicates an important property of a first-order learner, namely the ability to perform well on propositional data.

7. Related work

While many propositional learners have been upgraded to first-order logic in ILP, the case of the Bayesian classifier poses a problem that has not been satisfyingly solved before, namely how to distinguish between elementary and non-elementary first-order features. Treating each Prolog literal as a feature, as is done in LINUS [12], is not a solution, because many literals do not contain a reference to an individual, and thus the relative frequency associated with that literal cannot be attributed to an individual. Our approach gives a clear picture of how an individual-centred first-order representation upgrades attribute-value learning, namely by allowing relational and non-determinate features. In this respect, this work extends previous work on the relationship between propositional and first-order learning [3, 9, 2, 20, 22]. Pompe and Kononenko also describe an application of naive Bayesian classifiers in a first-order context [18]. However, in their approach the naive Bayesian formula is used in a post-processing step to combine the predictions of several, independently learnt first-order rules. As far as we are aware, this work is the first to describe a first-order naive Bayesian learner.
Most of the probability distributions over lists and sets discussed in Section 2 have been used in some context or other. For instance, the bitvector and multiset distributions are widely used in document classification (the 'bag of words' representation).


Distributions over lists, including the assumption of a geometric length distribution, are commonly used in bio-informatics for sequence analysis. Such distributions can also naturally be defined by stochastic logic programs [15]. On the other hand, since sets are not directly representable in Prolog, our subset distribution Pss does not seem to be definable by an SLP. This distribution is interesting because it defines the probability of a set in terms of its extension only, independently of its complement. We are not aware of other work that achieves this.

8. Conclusions

In this paper we presented the theory and implementation of a first-order Bayesian classifier. Its main ingredients are: various techniques for estimating probability distributions over sets, techniques for decomposing distributions over nested structured objects to various levels, and a powerful bias mechanism. We also tested our approach experimentally, and found that in some domains decomposition clearly improved accuracy, while in others the estimates turned out to be less reliable. This is to be expected: there is no reason why decomposition should work in all domains, and the naive assumption of independent features may be even harder to maintain in first-order domains than it is in propositional domains. Our approach should be seen as an initial study rather than the last word on the subject.
One can argue that any individual-centred first-order learning problem, as defined in this paper, can be transformed to attribute-value learning by introducing an attribute for any first-order feature in the hypothesis language [11]. However, one should distinguish between transformation of the hypothesis language and transformation of the data (propositionalisation): 1BC does the first but not the second. On the other hand, 1BC (like its propositional counterpart) operates in a strictly top-down manner: the features it considers are those suggested by the language bias, not those suggested by the data. This suggests various lines for future research. One is to let the system decide which features to use, for instance by testing correlation with the class. We could also let 1BC decide to what level decomposition is required to reach a certain significance of the estimates. Another data-driven feature could be dynamic discretisation, as happens in many decision tree learners.4

Acknowledgments

Thanks are due to Henrik Boström and Sašo Džeroski for providing us with the Alzheimer and Diterpene datasets, respectively. Part of this work was supported by the Esprit IV Long Term Research Project 20237 Inductive Logic Programming 2, and by the Esprit V project IST-1999-11495 Data Mining and Decision Support for Business Competitiveness: Solomon Virtual Enterprise.

Notes

1. Some properties of individuals or their parts are more conveniently represented as background knowledge.
2. If the top-level type is a tuple, 1BC allows the use of several global variables, each referring to a component.
3. The system can be downloaded for academic purposes from http://www.cs.bris.ac.uk/Research/MachineLearning/1BC/.


4. Discretisation was not used for the experiments described in this paper.

References

1. H. Boström and L. Asker. Combining divide-and-conquer and separate-and-conquer for efficient and effective rule induction. In S. Džeroski and P. Flach, editors, Proceedings of the 9th International Workshop on Inductive Logic Programming, volume 1634 of Lecture Notes in Artificial Intelligence, pages 33–43. Springer-Verlag, 1999.
2. M. Botta, A. Giordana, and R. Piola. FONN: Combining first order logic with connectionist learning. In Proceedings of the 14th International Conference on Machine Learning, pages 46–56. Morgan Kaufmann, 1997.
3. L. De Raedt. Attribute value learning versus inductive logic programming: The missing links (extended abstract). In D. Page, editor, Proceedings of the 8th International Conference on Inductive Logic Programming, volume 1446 of Lecture Notes in Artificial Intelligence, pages 1–8. Springer-Verlag, 1998.
4. L. Dehaspe and L. De Raedt. Mining association rules in multiple relations. In S. Džeroski and N. Lavrač, editors, Proceedings of the 7th International Workshop on Inductive Logic Programming, volume 1297 of Lecture Notes in Artificial Intelligence, pages 125–132. Springer-Verlag, 1997.
5. T.G. Dietterich, R.H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.
6. B. Dolšak, I. Bratko, and A. Jezernik. Finite element mesh design: An engineering domain for ILP application. In S. Wrobel, editor, Proceedings of the 4th International Workshop on Inductive Logic Programming, volume 237 of GMD-Studien, pages 305–320. Gesellschaft für Mathematik und Datenverarbeitung MBH, 1994.
7. P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130, 1997.
8. Sašo Džeroski, Steffen Schulze-Kremer, Karsten R. Heidtke, Karsten Siems, Dietrich Wettschereck, and Hendrik Blockeel. Diterpene structure elucidation from 13C NMR spectra with inductive logic programming. Applied Artificial Intelligence, 12(5):363–383, July-August 1998. Special Issue on First-Order Knowledge Discovery in Databases.
9. P.A. Flach, C. Giraud-Carrier, and J.W. Lloyd. Strongly typed inductive concept learning. In D. Page, editor, Proceedings of the 8th International Conference on Inductive Logic Programming, volume 1446 of Lecture Notes in Artificial Intelligence, pages 185–194. Springer-Verlag, 1998.
10. P.A. Flach and N. Lachiche. Confirmation-guided discovery of first-order rules with Tertius. Machine Learning. In print. The Tertius system can be downloaded for academic purposes from http://www.cs.bris.ac.uk/Research/MachineLearning/Tertius/.
11. P.A. Flach and N. Lavrač. The role of feature construction in inductive learning. Submitted.
12. N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
13. T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.
14. S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4):245–286, 1995.
15. S. Muggleton. Stochastic logic programs. In L. De Raedt, editor, Advances in Inductive Logic Programming, pages 254–264. IOS Press, 1996.
16. S. Muggleton, M. Bain, J. Hayes-Michie, and D. Michie. An experimental comparison of human and machine learning formalisms. In Proceedings of the 6th International Workshop on Machine Learning, pages 113–118. Morgan Kaufmann, 1989.
17. S. Muggleton, A. Srinivasan, R. King, and M. Sternberg. Biochemical knowledge discovery using Inductive Logic Programming. In H. Motoda, editor, Proceedings of the First Conference on Discovery Science, Berlin, 1998. Springer-Verlag.
18. U. Pompe and I. Kononenko. Naive Bayesian classifier within ILP-R. In L. De Raedt, editor, Proceedings of the 5th International Workshop on Inductive Logic Programming, pages 417–436. Department of Computer Science, Katholieke Universiteit Leuven, 1995.
19. R. Rymon. Search through systematic set enumeration. In Proc. Third Int. Conf. on Knowledge Representation and Reasoning, pages 539–550. Morgan Kaufmann, 1992.


20. M. Sebag. A stochastic simple similarity. In D. Page, editor, Proceedings of the 8th International Conference on Inductive Logic Programming, volume 1446 of Lecture Notes in Artificial Intelligence, pages 95–105. Springer-Verlag, 1998.
21. A. Srinivasan, S. Muggleton, R.D. King, and M.J.E. Sternberg. Mutagenesis: ILP experiments in a non-determinate biological domain. In S. Wrobel, editor, Proceedings of the 4th International Workshop on Inductive Logic Programming, volume 237 of GMD-Studien, pages 217–232. Gesellschaft für Mathematik und Datenverarbeitung MBH, 1994.
22. J.-D. Zucker and J.-G. Ganascia. Learning structurally indeterminate clauses. In D. Page, editor, Proceedings of the 8th International Conference on Inductive Logic Programming, volume 1446 of Lecture Notes in Artificial Intelligence, pages 235–244. Springer-Verlag, 1998.
