Parameter Learning of Logic Programs for Symbolic-statistical Modeling

Journal of Artificial Intelligence Research 15 (2001) 391-454

Submitted 6/18; published 12/01

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

Taisuke Sato Yoshitaka Kameya

[email protected] [email protected]

Dept. of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology 2-12-1 Ookayama Meguro-ku Tokyo Japan 152-8552

Abstract

We propose a logical/mathematical framework for statistical parameter learning of parameterized logic programs, i.e. definite clause programs containing probabilistic facts with a parameterized distribution. It extends the traditional least Herbrand model semantics in logic programming to distribution semantics, a possible world semantics with a probability distribution which is unconditionally applicable to arbitrary logic programs including ones for HMMs, PCFGs and Bayesian networks. We also propose a new EM algorithm, the graphical EM algorithm, that runs for a class of parameterized logic programs representing sequential decision processes where each decision is exclusive and independent. It runs on a new data structure called support graphs describing the logical relationship between observations and their explanations, and learns parameters by computing inside and outside probability generalized for logic programs. The complexity analysis shows that when combined with OLDT search for all explanations for observations, the graphical EM algorithm, despite its generality, has the same time complexity as existing EM algorithms, i.e. the Baum-Welch algorithm for HMMs, the Inside-Outside algorithm for PCFGs, and the one for singly connected Bayesian networks that have been developed independently in each research field. Learning experiments with PCFGs using two corpora of moderate size indicate that the graphical EM algorithm can significantly outperform the Inside-Outside algorithm.

1. Introduction

Parameter learning is common in various fields from neural networks to reinforcement learning to statistics. It is used to tune up systems for their best performance, be they classifiers or statistical models. Unlike these numerical systems described by mathematical formulas, however, symbolic systems, typically programs, do not seem amenable to any kind of parameter learning. Indeed, there has been little literature on parameter learning of programs. This paper is an attempt to incorporate parameter learning into computer programs. The reason is twofold. Theoretically we wish to add the ability of learning to computer programs, which the authors believe is a necessary step toward building intelligent systems. Practically it broadens the class of probability distributions, beyond traditionally used numerical ones, which are available for modeling complex phenomena such as gene inheritance, consumer behavior, natural language processing and so on.

c 2001 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.


The type of learning we consider here is statistical parameter learning applied to logic programs.1 We assume that facts (unit clauses) in a program are probabilistically true and have a parameterized distribution.2 Other clauses, non-unit definite clauses, are always true as they encode laws such as "if one has a pair of blood type genes a and b, one's blood type is AB". We call logic programs of this type parameterized logic programs and use them for statistical modeling in which ground atoms3 provable from the program represent our observations such as "one's blood type is AB" and the parameters of the program are inferred by performing ML (maximum likelihood) estimation on the observed atoms. The probabilistic first-order framework sketched above is termed statistical abduction (Sato & Kameya, 2000) as it is an amalgamation of statistical inference and abduction where probabilistic facts play the role of abducibles, i.e. primitive hypotheses.4 Statistical abduction is powerful in that it not only subsumes diverse symbolic-statistical frameworks such as HMMs (hidden Markov models, Rabiner, 1989), PCFGs (probabilistic context free grammars, Wetherell, 1980; Manning & Schutze, 1999) and (discrete) Bayesian networks (Pearl, 1988; Castillo, Gutierrez, & Hadi, 1997) but gives us the freedom of using arbitrarily complex logic programs for modeling.5 The semantic basis for statistical abduction is distribution semantics introduced by Sato (1995). It defines a parameterized distribution, actually a probability measure, over the set of possible truth assignments to ground atoms and enables us to derive a new EM algorithm6 for ML estimation called the graphical EM algorithm (Kameya & Sato, 2000). Parameter learning in statistical abduction is done in two phases, search and EM learning. Given a parameterized logic program and observations, the first phase searches for all explanations for the observations. Redundancy in the first phase is eliminated by tabulating partial explanations using OLDT search (Tamaki & Sato, 1986; Warren, 1992; Sagonas, Swift, & Warren, 1994; Ramakrishnan, Rao, Sagonas, Swift, & Warren, 1995; Shen, Yuan, You, & Zhou, 2001). It returns a support graph which is a compact representation of the discovered explanations. In the second phase, we run the graphical EM algorithm on the support graph

1. In this paper, logic programs mean definite clause programs. A definite clause program is a set of definite clauses. A definite clause is a clause of the form A ← L1, ..., Ln (0 ≤ n) where A, L1, ..., Ln are atoms. A is called the head, L1, ..., Ln the body. All variables are universally quantified. It reads: if L1 and ··· and Ln hold, then A holds. In case of n = 0, the clause is called a unit clause. A general clause is one whose body may contain negated atoms. A program including general clauses is sometimes called a general program (Lloyd, 1984; Doets, 1994).
2. Throughout this paper, for familiarity and readability, we will somewhat loosely use "distribution" as a synonym for "probability measure".
3. In logic programming, the adjective "ground" means no variables contained.
4. Abduction means inference to the best explanation for a set of observations. Logically, it is formalized as a search for an explanation E such that E, KB ⊢ G where G is an atom representing our observation, KB a knowledge base and E a conjunction of atoms chosen from abducibles, i.e. a class of formulas allowed as primitive hypotheses (Kakas, Kowalski, & Toni, 1992; Flach & Kakas, 2000). E must be consistent with KB.
5. Existing symbolic-statistical modeling frameworks have restrictions and limitations of various types compared with arbitrary logic programs (see Section 7 for details). For example, Bayesian networks do not allow recursion. HMMs and PCFGs, stochastic grammars, allow recursion but lack variables and data structures. Recursive logic programs are allowed in Ngo and Haddawy's (1997) framework but they assume domains are finite and function symbols seem prohibited.
6. "EM algorithm" stands for a class of iterative algorithms for ML estimation with incomplete data (McLachlan & Krishnan, 1997).


and learn the parameters of the distribution associated with the program. Redundancy in the second phase is removed by the introduction of inside and outside probability for logic programs computed from the support graph. The graphical EM algorithm has accomplished, when combined with OLDT search for all explanations, the same time complexity as the specialized ones, e.g. the Baum-Welch algorithm for HMMs (Rabiner, 1989) and the Inside-Outside algorithm for PCFGs (Baker, 1979), despite its generality. What is surprising is that, when we conducted learning experiments with PCFGs using real corpora, it outperformed the Inside-Outside algorithm by orders of magnitude in terms of time for one iteration to update parameters. These experimental results enhance the prospect for symbolic-statistical modeling by parameterized logic programs of even more complex systems than stochastic grammars, whose modeling has been difficult simply because of the lack of an appropriate modeling tool and their sheer complexity. The contributions of this paper therefore are

- distribution semantics for parameterized logic programs which unifies existing symbolic-statistical frameworks,

- the graphical EM algorithm (combined with tabulated search), a general yet efficient EM algorithm that runs on support graphs, and

- the prospect suggested by the learning experiments for modeling and learning complex symbolic-statistical phenomena.

The rest of this paper is organized as follows. After preliminaries in Section 2, a probability space for parameterized logic programs is constructed in Section 3 as a mathematical basis for the subsequent sections. We then propose a new EM algorithm, the graphical EM algorithm, for parameterized logic programs in Section 4. Complexity analysis of the graphical EM algorithm is presented in Section 5 for HMMs, PCFGs, pseudo PCSGs and sc-BNs.7 Section 6 contains experimental results of parameter learning with PCFGs by the graphical EM algorithm using real corpora that demonstrate the efficiency of the graphical EM algorithm. We state related work in Section 7, followed by conclusions in Section 8. The reader is assumed to be familiar with the basics of logic programming (Lloyd, 1984; Doets, 1994), probability theory (Chow & Teicher, 1997), Bayesian networks (Pearl, 1988; Castillo et al., 1997) and stochastic grammars (Rabiner, 1989; Manning & Schutze, 1999).

2. Preliminaries

Since our subject intersects logic programming and EM learning, which are quite different in nature, we present their preliminaries separately.

2.1 Logic Programming and OLDT

In logic programming, a program DB is a set of definite clauses8 and the execution is a search for an SLD refutation of a given goal G. The top-down interpreter recursively selects the

7. Pseudo PCSGs (probabilistic context sensitive grammars) are a context-sensitive extension of PCFGs proposed by Charniak and Carroll (1994). sc-BN is a shorthand for a singly connected Bayesian network (Pearl, 1988).
8. We do not deal with general logic programs in this paper.


next goal and unfolds it (Tamaki & Sato, 1984) into subgoals using a nondeterministically chosen clause. The computed result by the SLD refutation, i.e. a solution, is an answer substitution (variable binding) θ such that DB ⊢ Gθ.9 Usually there is more than one refutation for G, and the search space for all refutations is described by an SLD tree which may be infinite depending on the program and the goal (Lloyd, 1984; Doets, 1994). More often than not, applications require all solutions. In natural language processing for instance, a parser must be able to find all possible parse trees for a given sentence as every one of them is syntactically correct. Similarly in statistical abduction, we need to examine all explanations to determine the most likely one. All solutions are obtained by searching the entire SLD tree, and there is a choice of the search strategy. In Prolog, the standard logic programming language, backtracking is used to search for all solutions in conjunction with a fixed search order for goals (textually left-to-right) and clauses (textually top-to-bottom) due to the ease and simplicity of implementation. The problem with backtracking is that it forgets everything up to the previous choice point, and hence it is quite likely to prove the same goal again and again, resulting in exponential search time. One way to avoid this problem is to store computed results and reuse them whenever necessary. OLDT is such an instance of a memoizing scheme (Tamaki & Sato, 1986; Warren, 1992; Sagonas et al., 1994; Ramakrishnan et al., 1995; Shen et al., 2001). Reuse of proved subgoals in OLDT search often drastically reduces search time for all solutions, especially when refutations of the top goal include many common sub-refutations. Take as an example a logic program coding an HMM. For a given string s, there exist exponentially many transition paths that output s. OLDT search applied to the program, however, only takes time linear in the length of s to find all of them, unlike the exponential time taken by Prolog's backtracking search. What does OLDT have to do with statistical abduction? From the viewpoint of statistical abduction, reuse of proved subgoals, or equivalently, structure sharing of sub-refutations for the top-goal G brings about structure sharing of explanations for G, in addition to the reduction of search time mentioned above, thereby producing a highly compact representation of all explanations for G.
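The effect of memoization on all-solutions search can be sketched in a few lines of Python. The sketch below is only an illustration of the idea behind tabling, not of OLDT itself: a toy, HMM-like chain of subgoals is counted once per distinct subgoal instead of once per derivation. The rule set and function names are ours, not the paper's.

from functools import lru_cache

# Toy example: from position s the process can take one of two steps, and any
# sequence of N steps is a distinct "explanation".  There are 2**N explanations,
# but only N + 1 distinct subgoals path(s).

N = 30

def count_naive(s):
    # Backtracking-style search: re-proves the same subgoal over and over,
    # taking time proportional to the number of explanations (2**N).
    if s == N:
        return 1
    return count_naive(s + 1) + count_naive(s + 1)

@lru_cache(maxsize=None)   # tabling: each subgoal is proved once and reused
def count_tabled(s):
    if s == N:
        return 1
    return count_tabled(s + 1) + count_tabled(s + 1)

print(count_tabled(0))     # 1073741824 explanations, found with only O(N) calls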

2.2 EM Learning

Parameterized distributions such as the multinomial distribution and the normal distribution provide convenient modeling devices in statistics. Suppose a random sample x1, ..., xT of size T on a random variable X, drawn from a distribution P(X = x | θ) parameterized by an unknown θ, is observed. The value of θ is determined by ML estimation as the MLE (maximum likelihood estimate) of θ, i.e. as the maximizer of the likelihood ∏_{1≤i≤T} P(xi | θ). Things get much more difficult when data are incomplete. Think of a probabilistic relationship between a non-observable cause X and an observable effect Y, such as one between diseases and symptoms in medicine, and assume that Y does not uniquely determine the cause X. Then Y is incomplete in the sense that Y does not carry enough information to completely determine X. Let P(X = x, Y = y | θ) be a parameterized joint distribution over X and Y. Our task is to perform ML estimation on θ under the condition that X is

9. By a solution we ambiguously mean both the answer substitution θ itself and the proved atom Gθ, as one gives the other.


non-observable while Y is observable. Let y1, ..., yT be a random sample of size T drawn from the marginal distribution P(Y = y | θ) = Σ_x P(X = x, Y = y | θ). The MLE of θ is obtained by maximizing the likelihood ∏_{1≤i≤T} P(yi | θ) as a function of θ. While the mathematical formulation looks alike in both cases, the latter, ML estimation with incomplete data, is far more complicated and direct maximization is practically impossible in many cases. People therefore looked to indirect approaches to tackle the problem of ML estimation with incomplete data, to which the EM algorithm has been a standard solution (Dempster, Laird, & Rubin, 1977; McLachlan & Krishnan, 1997). It is an iterative algorithm applicable to a wide class of parameterized distributions including the multinomial distribution and the normal distribution such that the MLE computation is replaced by the iteration of two easier, more tractable steps. At the n-th iteration, it first calculates the value of the Q function introduced below using the current parameter value θ^(n) (E-step)10:

    Q(θ | θ^(n)) def= Σ_x P(x | y, θ^(n)) ln P(x, y | θ).    (1)

Next, it maximizes Q(θ | θ^(n)) as a function of θ and updates θ^(n) (M-step):

    θ^(n+1) = argmax_θ Q(θ | θ^(n)).    (2)

Since the old value θ^(n) and the updated value θ^(n+1) do not necessarily coincide, the E-steps and M-steps are iterated until convergence, during which the (log) likelihood is assured to increase monotonically (McLachlan & Krishnan, 1997). Although the EM algorithm merely performs local maximization, it is used in a variety of settings due to its simplicity and relatively good performance. One must notice however that the EM algorithm is just a class name, taking a different form depending on distributions and applications. The development of a concrete EM algorithm such as the Baum-Welch algorithm for HMMs (Rabiner, 1989) and the Inside-Outside algorithm for PCFGs (Baker, 1979) requires individual effort for each case.

10. The Q function is related to ML estimation as follows. We assume here that only one datum, y, is observed. From Jensen's inequality (Chow & Teicher, 1997) and the concavity of the ln function, it follows that

    Σ_x P(x | y, θ^(n)) ln P(x | y, θ) − Σ_x P(x | y, θ^(n)) ln P(x | y, θ^(n)) ≤ 0

and hence that

    Q(θ | θ^(n)) − Q(θ^(n) | θ^(n))
      = Σ_x P(x | y, θ^(n)) ln P(x | y, θ) − Σ_x P(x | y, θ^(n)) ln P(x | y, θ^(n)) + ln P(y | θ) − ln P(y | θ^(n))
      ≤ ln P(y | θ) − ln P(y | θ^(n)).

Consequently, we have Q(θ | θ^(n)) ≥ Q(θ^(n) | θ^(n)) ⇒ ln P(y | θ) ≥ ln P(y | θ^(n)) ⇒ P(y | θ) ≥ P(y | θ^(n)).
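As a concrete illustration of the E-step (1) and M-step (2), the following sketch runs EM on a deliberately small incomplete-data problem: a mixture of two biased coins where the coin identity X is hidden and only the outcome Y is observed. The model, parameter names and data are made up solely to show the update loop; they are not part of the framework developed in this paper.

# EM for a two-component coin mixture: X in {0,1} is the hidden coin,
# Y in {H,T} the observed outcome.  Parameters: mix = P(X=1),
# p[k] = P(Y=H | X=k).  Each iteration is one E-step plus one M-step.

def em(data, mix=0.5, p=(0.3, 0.7), iters=50):
    p = list(p)
    for _ in range(iters):
        # E-step: posterior responsibility P(X=1 | y, theta^(n)) for each y.
        resp = []
        for y in data:
            like1 = mix * (p[1] if y == 'H' else 1 - p[1])
            like0 = (1 - mix) * (p[0] if y == 'H' else 1 - p[0])
            resp.append(like1 / (like1 + like0))
        # M-step: closed-form maximizer of Q for this model (expected-count ratios).
        mix = sum(resp) / len(data)
        h1 = sum(r for r, y in zip(resp, data) if y == 'H')
        h0 = sum(1 - r for r, y in zip(resp, data) if y == 'H')
        p[1] = h1 / sum(resp)
        p[0] = h0 / sum(1 - r for r in resp)
    return mix, p

print(em(list("HHTHHHTTHT")))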

3. Distribution Semantics

In this section, we introduce parameterized logic programs and define their declarative semantics. The basic idea is as follows. We start with a set F of probabilistic facts (atoms) and a set R of non-unit definite clauses. Sampling from F determines a set F′ of true atoms, and the least Herbrand model of F′ ∪ R determines the truth value of every atom in DB = F ∪ R. Hence every atom can be considered as a random variable, taking on 1 (true) or 0 (false). In what follows, we formalize this process and construct the underlying probability space for the denotation of DB.

3.1 Basic Distribution PF

Let DB = F ∪ R be a definite clause program in a first-order language L with countably many variables, function symbols and predicate symbols, where F is a set of unit clauses (facts) and R a set of non-unit clauses (rules). In the sequel, unless otherwise stated, we consider for simplicity DB as the set of all ground instances of the clauses in DB, and assume that F and R consist of countably infinite ground clauses (the finite case is similarly treated). We then construct a probability space for DB in two steps. First we introduce a probability space over the Herbrand interpretations11 of F, i.e. the truth assignments to ground atoms in F. Next we extend it to a probability space over the Herbrand interpretations of all ground atoms in L by using the least model semantics (Lloyd, 1984; Doets, 1994).

Let A1, A2, ... be a fixed enumeration of atoms in F. We regard an infinite vector ω = ⟨x1, x2, ...⟩ of 0s and 1s as a Herbrand interpretation of F in such a way that for i = 1, 2, ..., Ai is true (resp. false) if and only if xi = 1 (resp. xi = 0). Under this isomorphism, the set of all possible Herbrand interpretations of F coincides with the Cartesian product:

    Ω_F def= ∏_{i=1}^{∞} {0, 1}_i.

We construct a probability measure P_F over the sample space Ω_F 12 from a collection of finite joint distributions P_F^(n)(A1 = x1, ..., An = xn) (n = 1, 2, ...; xi ∈ {0, 1}, 1 ≤ i ≤ n) such that

    0 ≤ P_F^(n)(A1 = x1, ..., An = xn) ≤ 1
    Σ_{x1,...,xn} P_F^(n)(A1 = x1, ..., An = xn) = 1                                             (3)
    Σ_{x_{n+1}} P_F^(n+1)(A1 = x1, ..., A_{n+1} = x_{n+1}) = P_F^(n)(A1 = x1, ..., An = xn).

The last equation is called the compatibility condition. It can be proved (Chow & Teicher, 1997) from the compatibility condition that there exists a probability space (Ω_F, ℱ, P_F) where P_F is a probability measure on ℱ, the minimal σ-algebra containing open sets of Ω_F, such that for any n,

    P_F(A1 = x1, ..., An = xn) = P_F^(n)(A1 = x1, ..., An = xn).

11. A Herbrand interpretation interprets a function symbol uniquely as a function on ground terms and assigns truth values to ground atoms. Since the interpretation of function symbols is common to all Herbrand interpretations, given L, they have a one-to-one correspondence with truth assignments to ground atoms in L. So we do not distinguish them.
12. We regard Ω_F as a topological space with the product topology such that each {0, 1} is equipped with the discrete topology.


We call P_F a basic distribution.13

The choice of P_F^(n) is free as long as the compatibility condition is met. If we want all interpretations to be equiprobable, we should set P_F^(n)(A1 = x1, ..., An = xn) = 1/2^n for every ⟨x1, ..., xn⟩. The resulting P_F is a uniform distribution over Ω_F just like the one over the unit interval [0, 1]. If, on the other hand, we stipulate that no interpretation except ω0 = ⟨c1, c2, ...⟩ should be possible, we put, for each n,

    P_F^(n)(A1 = x1, ..., An = xn) = 1 if ∀i xi = ci (1 ≤ i ≤ n), and 0 otherwise.

Then P_F places all probability mass on ω0 and gives probability 0 to the rest.

Define a parameterized logic program as a definite clause program14 DB = F ∪ R where F is a set of unit clauses, R is a set of non-unit clauses such that no clause head in R is unifiable with a unit clause in F, and a parameterized basic distribution P_F is associated with F. A parameterized P_F is obtained from a collection of parameterized joint distributions satisfying the compatibility condition. Generally, the more complex the P_F^(n)'s are, the more flexible P_F is, but at the cost of tractability. The choice of parameterized finite distributions made by Sato (1995) was simple:

    P_F^(2n)(ON_1 = x1, OFF_2 = x2, ..., OFF_{2n} = x_{2n} | θ1, ..., θn)
      = ∏_{i=1}^{n} P_bs(ON_{2i−1} = x_{2i−1}, OFF_{2i} = x_{2i} | θi)

where

    P_bs(ON_{2i−1} = x_{2i−1}, OFF_{2i} = x_{2i} | θi) = 0        if x_{2i−1} = x_{2i}
                                                       = θi       if x_{2i−1} = 1, x_{2i} = 0     (4)
                                                       = 1 − θi   if x_{2i−1} = 0, x_{2i} = 1.

P_bs(ON_{2i−1} = x_{2i−1}, OFF_{2i} = x_{2i} | θi) (1 ≤ i ≤ n) represents a probabilistic binary switch, i.e. a Bernoulli trial, using two exclusive atoms ON_{2i−1} and OFF_{2i} in such a way that either one of them is true on each trial but never both. θi is a parameter specifying the probability that the switch i is on. The resulting P_F is a probability measure over the infinite product of independent binary outcomes. It might look too simple but is expressive enough for Bayesian networks, Markov chains and HMMs (Sato, 1995; Sato & Kameya, 1997).
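To make the binary-switch distribution (4) concrete, the following sketch samples one truth assignment to the atoms ON_1, OFF_2, ..., ON_{2n−1}, OFF_{2n} under independent switch parameters. It is a minimal illustration of the finite case only; the function and variable names are ours, not the paper's.

import random

def sample_switches(theta):
    # theta[i] is the probability that switch i is on, i.e. that ON_{2i+1} is
    # true and OFF_{2i+2} is false (a Bernoulli trial; exactly one of the pair
    # is true on each trial, never both).
    assignment = {}
    for i, t in enumerate(theta):
        on = random.random() < t
        assignment["ON_%d" % (2 * i + 1)] = 1 if on else 0
        assignment["OFF_%d" % (2 * i + 2)] = 0 if on else 1
    return assignment

print(sample_switches([0.9, 0.5, 0.1]))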

3.2 Extending PF to PDB

In this subsection, we extend P_F to a probability measure P_DB over the possible worlds for L, i.e. the set of all possible truth assignments to ground atoms in L, through the least

13. This naming of P_F, despite its being a probability measure, partly reflects the observation that it behaves like an infinite joint distribution P_F(A1 = x1, A2 = x2, ...) for an infinite random vector ⟨A1, A2, ...⟩ of which P_F^(n)(A1 = x1, ..., An = xn) (n = 1, 2, ...) are marginal distributions. Another reason is intuitiveness. These considerations apply to P_DB defined in the next subsection as well.
14. Here clauses are not necessarily ground.


Herbrand model (Lloyd, 1984; Doets, 1994). Before proceeding, however, we need a couple of notations. For an atom A, define A^x by

    A^x = A   if x = 1
    A^x = ¬A  if x = 0.

Next take a Herbrand interpretation ν ∈ Ω_F of F. It makes some atoms in F true and others false. Let F_ν be the set of atoms made true by ν. Then imagine a definite clause program DB′ = R ∪ F_ν and its least Herbrand model M_DB′ (Lloyd, 1984; Doets, 1994). M_DB′ is characterized as the least fixed point of the mapping T_DB′(·) below:

    T_DB′(I) def= { A | there is some A ← B1, ..., Bk ∈ DB′ (0 ≤ k) such that {B1, ..., Bk} ⊆ I }

where I is a set of ground atoms.15 Or equivalently, it is inductively defined by

    I_0 = ∅,   I_{n+1} = T_DB′(I_n),   M_DB′ = ∪_n I_n.

Taking into account the fact that M_DB′ is a function of ν ∈ Ω_F, we henceforth employ a functional notation M_DB(ν) to denote M_DB′.

Turning back, let A1, A2, ... be again an enumeration, but of all ground atoms in L.16 Form Ω_DB, similarly to Ω_F, as the Cartesian product of denumerably many {0, 1}'s and identify it with the set of all possible Herbrand interpretations of the ground atoms A1, A2, ... in L, i.e. the possible worlds for L. Then extend P_F to a probability measure P_DB over Ω_DB as follows. Introduce a series of finite joint distributions P_DB^(n)(A1 = x1, ..., An = xn) for n = 1, 2, ... by

    [A1^{x1} ∧ ··· ∧ An^{xn}]_F def= { ν ∈ Ω_F | M_DB(ν) ⊨ A1^{x1} ∧ ··· ∧ An^{xn} }
    P_DB^(n)(A1 = x1, ..., An = xn) def= P_F([A1^{x1} ∧ ··· ∧ An^{xn}]_F).

Note that the set [A1^{x1} ∧ ··· ∧ An^{xn}]_F is P_F-measurable and that, by definition, the P_DB^(n)'s satisfy the compatibility condition

    Σ_{x_{n+1}} P_DB^(n+1)(A1 = x1, ..., A_{n+1} = x_{n+1}) = P_DB^(n)(A1 = x1, ..., An = xn).

Hence there exists a probability measure P_DB over Ω_DB which is an extension of P_F such that

    P_DB(A1 = x1, ..., An = xn) = P_F(A1 = x1, ..., An = xn)

15. I defines, mutually, a Herbrand interpretation such that a ground atom A is true if and only if A ∈ I. A Herbrand model of a program is a Herbrand interpretation that makes every ground instance of every clause in the program true.
16. Note that this enumeration enumerates ground atoms in F as well.


for any finite atoms A1, ..., An in F and for every binary vector ⟨x1, ..., xn⟩ (xi ∈ {0, 1}, 1 ≤ i ≤ n). Define the denotation of the program DB = F ∪ R w.r.t. P_F to be P_DB. The denotational semantics of parameterized logic programs defined above is called distribution semantics. As remarked before, we regard P_DB as a kind of infinite joint distribution P_DB(A1 = x1, A2 = x2, ...). Mathematical properties of P_DB are listed in Appendix A where our semantics is proved to be an extension of the standard least model semantics in logic programming to possible world semantics with a probability measure.
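The least-model construction behind P_DB can be made concrete for a finite ground program: the sketch below iterates the mapping T_DB′ from the empty interpretation until a fixed point is reached, for one sampled set of true facts F_ν. The toy rule set is loosely modeled on the blood type program of Figure 1, but it is our own illustration of the fixpoint computation, not part of the semantics itself.

def least_model(rules, facts):
    # rules: list of (head, [body atoms]) ground definite clauses.
    # facts: set of ground atoms sampled true (F_nu).
    clauses = list(rules) + [(a, []) for a in facts]
    model = set()
    while True:
        new = {h for h, body in clauses if all(b in model for b in body)}
        if new == model:
            return model          # M_DB(nu): least fixed point reached
        model = new

rules = [("gene(father,a)", ["msw(gene,father,a)"]),
         ("gene(mother,o)", ["msw(gene,mother,o)"]),
         ("gtype(a,o)", ["gene(father,a)", "gene(mother,o)"]),
         ("btype('A')", ["gtype(a,o)"])]
facts = {"msw(gene,father,a)", "msw(gene,mother,o)"}
print(least_model(rules, facts))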

3.3 Programs as Distributions

Distribution semantics views parameterized logic programs as expressing distributions. Traditionally distributions have been expressed by using mathematical formulas, but the use of programs as (discrete) distributions gives us far more freedom and flexibility than mathematical formulas in the construction of distributions because they have recursion and arbitrary composition. In particular a program can contain infinitely many random variables as probabilistic atoms through recursion, and hence can describe stochastic processes that potentially involve infinitely many random variables such as Markov chains and derivations in PCFGs (Manning & Schutze, 1999).17 Programs also enable us to procedurally express complicated constraints on distributions such as "the sum of occurrences of alphabets a or b in an output string of an HMM must be a multiple of three". This feature, procedural expression of arbitrarily complex (discrete) distributions, seems quite helpful in symbolic-statistical modeling.

Finally, providing mathematically sound semantics for parameterized logic programs is one thing, and implementing distribution semantics in a tractable way is another. In the next section, we investigate conditions on parameterized logic programs which make probability computation tractable, thereby making them usable as a means for large scale symbolic-statistical modeling.

4. Graphical EM Algorithm

According to the preceding section, a parameterized logic program DB = F ∪ R in a first-order language L with a parameterized basic distribution P_F(· | θ) over the Herbrand interpretations of ground atoms in F specifies a parameterized distribution P_DB(· | θ) over the Herbrand interpretations for L. In this section, we develop, step by step, an efficient EM algorithm for the parameter learning of parameterized logic programs by interpreting P_DB as a distribution over the observable and non-observable events. The new EM algorithm is termed the graphical EM algorithm. It is applicable to arbitrary logic programs satisfying certain conditions described later, provided the basic distribution is a direct product of multi-ary random switches, which is a slight complication of the binary ones introduced in Section 3.1.

From this section on, we assume that DB consists of usual definite clauses containing (universally quantified) variables. Definitions and changes relating to this assumption are

17. An infinite derivation can occur in PCFGs. Take a simple PCFG {p : S → a, q : S → SS} where S is a start symbol, a a terminal symbol, p + q = 1 and p, q > 0. In this PCFG, S is rewritten either to a with probability p or to SS with probability q. The probability of the occurrence of an infinite derivation is calculated as max{0, 1 − (p/q)}, which is non-zero when q > p (Chi & Geman, 1998).


listed below. For a predicate p, we introduce iff(p), the iff definition of p, by

    iff(p) def= ∀x (p(x) ↔ ∃y1(x = t1 ∧ W1) ∨ ··· ∨ ∃yn(x = tn ∧ Wn)).

Here x is a vector of new variables of length equal to the arity of p, p(ti) ← Wi (1 ≤ i ≤ n, 0 ≤ n) is an enumeration of the clauses about p in DB, and yi is a vector of the variables occurring in p(ti) ← Wi. Then define comp(R) as follows.

    head(R) def= { B | B is a ground instance of a clause head appearing in R }
    iff(R)  def= { iff(p) | p appears in a clause head in R }
    Eq      def= { f(x) = f(y) → x = y | f is a function symbol }
              ∪ { f(x) ≠ g(y) | f and g are different function symbols }
              ∪ { t ≠ x | t is a term properly containing x }
    comp(R) def= iff(R) ∪ Eq

Eq, Clark's equational theory (Clark, 1978), deductively simulates unification. Likewise comp(R) is a first-order theory which deductively simulates SLD refutation with the help of Eq by replacing a clause head atom with the clause body (Lloyd, 1984; Doets, 1994).

We here introduce some definitions which will be frequently used. Let B be an atom. An explanation for B w.r.t. DB = F ∪ R is a conjunction S such that S, R ⊢ B and, viewed as the set of its conjuncts, S ⊆ F holds and no proper subset of S satisfies this. The set of all explanations for B is called the support set for B and is designated by ψ_DB(B).18

4.1 Motivating Example

First of all, we review distribution semantics by a concrete example. Consider the following program DBb = Fb ∪ Rb in Figure 1 modeling how one's blood type is determined by blood type genes probabilistically inherited from the parents.19 The first four clauses in Rb state that a blood type is determined by a genotype, i.e. a pair of blood type genes a, b and o. For instance, btype('A'):- (gtype(a,a) ; gtype(a,o) ; gtype(o,a)) says that one's blood type is A if his (her) genotype is ⟨a,a⟩, ⟨a,o⟩ or ⟨o,a⟩. These are propositional rules. Succeeding clauses state general rules in terms of logical variables. The fifth clause says that regardless of the values of X and Y, the event gtype(X,Y) (one's having genotype ⟨X,Y⟩) is caused by two events, gene(father,X) (inheriting gene X from the father) and gene(mother,Y) (inheriting gene Y from the mother). gene(P,G):- msw(gene,P,G) is a clause connecting rules in Rb with probabilistic facts in Fb. It tells us that the gene G is inherited from a parent P if a choice represented by msw(gene,P,G)20 is made.

18. This definition of a support set differs from the one used by Sato (1995) and Kameya and Sato (2000).
19. When we implicitly emphasize the procedural reading of logic programs, Prolog conventions are employed (Sterling & Shapiro, 1986). Thus, ";" stands for "or", "," for "and", and ":-" for "implied by", respectively. Strings beginning with a capital letter are (universally quantified) variables, but quoted ones such as 'A' are constants. The underscore is an anonymous variable.
20. msw is an abbreviation of "multi-ary random switch" and msw(·,·,·) expresses a probabilistic choice from finite alternatives. In the framework of statistical abduction, msw atoms are abducibles from which explanations are constructed as a conjunction.

Rb = { btype('A'):- ( gtype(a,a) ; gtype(a,o) ; gtype(o,a) ),
       btype('B'):- ( gtype(b,b) ; gtype(b,o) ; gtype(o,b) ),
       btype('O'):- gtype(o,o),
       btype('AB'):- ( gtype(a,b) ; gtype(b,a) ),
       gtype(X,Y):- gene(father,X), gene(mother,Y),
       gene(P,G):- msw(gene,P,G) }
Fb = { msw(gene,father,a), msw(gene,father,b), msw(gene,father,o),
       msw(gene,mother,a), msw(gene,mother,b), msw(gene,mother,o) }

Figure 1: ABO blood type program DBb

The genetic knowledge that the choice of G is by chance and made from {a, b, o} is expressed by specifying a joint distribution P_Fb as follows:

    P_Fb(msw(gene,t,a) = x, msw(gene,t,b) = y, msw(gene,t,o) = z | θa, θb, θo) def= θa^x θb^y θo^z

where x, y, z ∈ {0, 1}, x + y + z = 1, θa, θb, θo ∈ [0, 1], θa + θb + θo = 1 and t is either father or mother. Thus θa is the probability of inheriting gene a from a parent. Statistical independence of the choice of gene, once from the father and once from the mother, is expressed by putting

    P_Fb(msw(gene,father,a) = x, msw(gene,father,b) = y, msw(gene,father,o) = z,
         msw(gene,mother,a) = x′, msw(gene,mother,b) = y′, msw(gene,mother,o) = z′ | θa, θb, θo)
      = P_Fb(x, y, z | θa, θb, θo) P_Fb(x′, y′, z′ | θa, θb, θo).

In this setting, the atoms representing our observations are obs(DBb) = {btype('A'), btype('B'), btype('O'), btype('AB')}. We observe one of them, say btype('A'), and infer a possible explanation S, i.e. a minimal conjunction of abducibles msw(gene,·,·) such that S, Rb ⊢ btype('A'). S is obtained by applying a special SLD refutation procedure to the goal btype('A') which preserves the msw atoms resolved upon in the refutation. Three explanations are found:

    S1 = msw(gene,father,a) ∧ msw(gene,mother,a)
    S2 = msw(gene,father,a) ∧ msw(gene,mother,o)
    S3 = msw(gene,father,o) ∧ msw(gene,mother,a)

So ψ_DBb(btype('A')), the support set for btype('A'), is {S1, S2, S3}. The probability of each explanation is respectively computed as P_Fb(S1) = θa² and P_Fb(S2) = P_Fb(S3) = θaθo. From Proposition A.2 in Appendix A, it follows that P_DBb(btype('A')) = P_DBb(S1 ∨ S2 ∨ S3) = P_Fb(S1 ∨ S2 ∨ S3) and that

    P_DBb(btype('A') | θa, θb, θo) = P_Fb(S1) + P_Fb(S2) + P_Fb(S3) = θa² + 2θaθo.

Here we used the fact that S1, S2 and S3 are mutually exclusive as the choice of a gene is exclusive. The parameters θa, θb and θo are determined by ML estimation performed on a random sample of btype such as {btype('A'), btype('O'), btype('AB')} as follows:

    ⟨θa, θb, θo⟩ = argmax_{⟨θa,θb,θo⟩} P_DBb(btype('A')) P_DBb(btype('O')) P_DBb(btype('AB'))
                 = argmax_{⟨θa,θb,θo⟩} (θa² + 2θaθo) θo² (2θaθb).

This program contains neither function symbols nor recursion, though our semantics allows for them. Later we will see an example containing both, a program for an HMM (Rabiner & Juang, 1993).
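As a check on the numbers above, the small sketch below enumerates the explanations of each observable atom of the blood type program and evaluates the observation probabilities and the likelihood of a sample for given parameter values. It is an illustrative calculation only; the explanation lists are transcribed from the text and the parameter values are made up.

def p_btype(theta):
    # P_DBb(btype(T)) for each blood type T, summing mutually exclusive explanations.
    genotype = {"A": [("a", "a"), ("a", "o"), ("o", "a")],
                "B": [("b", "b"), ("b", "o"), ("o", "b")],
                "O": [("o", "o")],
                "AB": [("a", "b"), ("b", "a")]}
    # each explanation msw(gene,father,X) ^ msw(gene,mother,Y) has probability theta_X * theta_Y
    return {t: sum(theta[x] * theta[y] for x, y in pairs)
            for t, pairs in genotype.items()}

theta = {"a": 0.3, "b": 0.2, "o": 0.5}
probs = p_btype(theta)
print(probs["A"])          # 0.39 = theta_a^2 + 2*theta_a*theta_o

likelihood = 1.0
for t in ["A", "O", "AB"]: # a sample of observed blood types
    likelihood *= probs[t]
print(likelihood)          # (theta_a^2 + 2*theta_a*theta_o) * theta_o^2 * 2*theta_a*theta_b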

b o

a

b o

b

b

b

4.2 Four Simplifying Conditions

in Figure 1 is simple and probability computation is easy. This is not generally the case. Since our primary interest is learning, especially ecient parameter learning of parameterized logic programs, we hereafter concentrate on identifying what property of a program makes probability computation easy like DBb, thereby makes ecient parameter learning possible. To answer this question precisely, let us formulate the whole modeling process. Suppose there exist symbolic-statistical phenomena such as gene inheritance for which we hope to construct a probabilistic computational model. We rst specify a target predicate p whose ground atom p(s) represents our observation of the phenomena. Then to explain the empirical distribution of p, we write down a parameterized logic program DB = F [ R having a basic distribution PF with parameter  that can reproduce all observable patterns of p(s). Finally, observing a random sample p(s1); : : : ; p(sT ) of ground atoms of p, we adjust  by ML estimation, i.e. by maximizing the likelihood L() = Tt=1 PDB (p(st) j ) so that PDB (p(1) j ) approximates as closely to the empirically observed distribution of p as possible. At rst sight, this formulation looks right, but in reality it is not. Suppose two events p(s) and p(s0 ) (s 6= s0) are observed. We put L() = PDB (p(s) j  )PDB (p(s0) j ). But this cannot be a likelihood at all simply because in distribution semantics, p(s) and p(s0 ) are two di erent random variables, not two realizations of the same random variable. A quick remedy is to note that in the case of blood type program DBb where obs(DBb) = fbtype('A'); btype('B'); btype('O'); btype('AB')g are observable atoms, only one of them is true for each observation, and if some atom is true, others must be false. In other words, these atoms collectively behave as a single random variable having the distribution PDB whose values are obs(DBb ). Keeping this in mind, we introduce the following condition. Let obs(DB) ( head(R)) be a set of ground atoms which represent observable events. We call them observable atom s. DBb

Q

b

Uniqueness condition: 0 PDB (G ^ G ) = 0

for any G 6= G0 2 obs(DB), and

402

P

G2obs(DB) PDB

(G) = 1.

The uniqueness condition enables us to introduce a new random variable Y_o representing our observation. Fix an enumeration G1, G2, ... of the observable atoms in obs(DB) and define Y_o by21

    Y_o(ω) = k  iff  ω ⊨ G_k,  for ω ∈ Ω_DB (k ≥ 1).    (5)

Let G_{k1}, G_{k2}, ..., G_{kT} ∈ obs(DB) be a random sample of size T. Then L(θ) = ∏_{t=1}^{T} P_DB(G_{kt} | θ) = ∏_{t=1}^{T} P_DB(Y_o = kt | θ) qualifies as the likelihood function w.r.t. Y_o.

The second condition concerns the reduction of probability computation to addition. Take again the blood type example. The computation of P_DBb(btype('A')) is decomposed into a summation because the explanations in the support set are mutually exclusive. So we introduce

Exclusiveness condition:
For every G ∈ obs(DB) and the support set ψ_DB(G), P_DB(S ∧ S′) = 0 for any S ≠ S′ ∈ ψ_DB(G).

Using the exclusiveness condition (and Proposition A.2 in Appendix A), we have

    P_DB(G) = Σ_{S ∈ ψ_DB(G)} P_F(S).

From a modeling point of view, it means that while a single event, or a single observation, G, may have several (or even infinitely many) explanations ψ_DB(G), only one of ψ_DB(G) is allowed to be true for each observation. Now introduce Ψ_DB, i.e. the set of all explanations relevant to obs(DB), by

    Ψ_DB def= ∪_{G ∈ obs(DB)} ψ_DB(G)

and fix an enumeration S1, S2, ... of the explanations in Ψ_DB. It follows from Proposition A.2, the uniqueness condition and the exclusiveness condition that P_DB(Si ∧ Sj) = 0 for i ≠ j and

    Σ_{S ∈ Ψ_DB} P_DB(S) = Σ_{G ∈ obs(DB)} Σ_{S ∈ ψ_DB(G)} P_DB(S)
                         = Σ_{G ∈ obs(DB)} P_DB(G)
                         = 1.

So, under the uniqueness condition and the exclusiveness condition, we are able to introduce yet another random variable X_e, representing an explanation for G, defined by

    X_e(ω) = k  iff  ω ⊨ S_k,  for ω ∈ Ω_DB.    (6)

The third condition concerns termination.

21. Σ_{G ∈ obs(DB)} P_DB(G) = 1 only guarantees that the measure of { ω | ω ⊨ G_k for some k (≥ 1) } is one, so there can be some ω satisfying no G_k's. In such a case, we put Y_o(ω) = 0. But values on a set of measure zero do not affect any part of the discussion that follows. This also applies to the definition of X_e in (6).

Finite support condition:
For every G ∈ obs(DB), ψ_DB(G) is finite.

P_DB(G) is then computed from the support set ψ_DB(G) = {S1, ..., Sm} (0 ≤ m), with the help of the exclusiveness condition, as a finite summation Σ_{i=1}^{m} P_F(Si). This condition prevents an infinite summation that is hardly computable.

The fourth condition simplifies the probability computation to multiplication. Recall that an explanation S for G ∈ obs(DB) is a conjunction a1 ∧ ··· ∧ am of some abducibles {a1, ..., am} ⊆ F (1 ≤ m). In order to reduce the computation of P_F(S) = P_F(a1 ∧ ··· ∧ am) to the multiplication P_F(a1) ··· P_F(am), we assume

Distribution condition:
F is a set F_msw of ground atoms with a parameterized distribution P_msw specified below.

Here an atom msw(i,n,v) is intended to simulate a multi-ary random switch whose name is i and whose outcome is v on trial n. It is a generalization of primitive probabilistic events such as coin tossing and dice rolling.

1. F_msw consists of probabilistic atoms msw(i,n,v). The arguments i, n and v are ground terms called the switch name, the trial-id and a value (of the switch i), respectively. We assume that a finite set V_i of ground terms called the value set of i is associated with each i, and that v ∈ V_i holds.

2. Write V_i as {v1, v2, ..., vm} (m = |V_i|). Then one of the ground atoms {msw(i,n,v1), msw(i,n,v2), ..., msw(i,n,vm)} becomes exclusively true (takes on value 1) on each trial. With each i, parameters θ_{i,v} ∈ [0, 1] such that Σ_{v ∈ V_i} θ_{i,v} = 1 are associated. θ_{i,v} is the probability of msw(i,·,v) being true (v ∈ V_i).

3. For all ground terms i, i′, n, n′, v ∈ V_i and v′ ∈ V_{i′}, the random variable msw(i,n,v) is independent of msw(i′,n′,v′) if n ≠ n′ or i ≠ i′.

In other words, we introduce a family of parameterized finite distributions P_(i,n) such that

    P_(i,n)(msw(i,n,v1) = x1, ..., msw(i,n,vm) = xm | θ_{i,v1}, ..., θ_{i,vm})
      def= θ_{i,v1}^{x1} ··· θ_{i,vm}^{xm}  if Σ_{k=1}^{m} xk = 1,  and 0 otherwise    (7)

where m = |V_i| and xk ∈ {0, 1} (1 ≤ k ≤ m), and define P_msw as their infinite product

    P_msw def= ∏_{i,n} P_(i,n).

Under this condition, we can compute P_msw(S), the probability of an explanation S, as a product of parameters. Suppose msw(i_j,n_j,v_j) and msw(i_{j′},n_{j′},v_{j′}) are different conjuncts in an explanation S = msw(i1,n1,v1) ∧ ··· ∧ msw(ik,nk,vk). If either i_j ≠ i_{j′} or n_j ≠ n_{j′} holds, they are independent by construction. Else if i_j = i_{j′} and n_j = n_{j′} but v_j ≠ v_{j′}, they are not independent but P_msw(S) = 0 by construction. As a result, whichever condition may hold, P_msw(S) is computed from the parameters.
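The computation of P_msw(S) described above amounts to a few lines of code: multiply the parameter of each msw conjunct, returning 0 as soon as two conjuncts name the same switch and trial with different values. The sketch below is a minimal illustration of this rule, with made-up switch names and parameter values.

def pmsw(explanation, theta):
    # P_msw(S) for S given as a list of (switch, trial, value) triples;
    # theta[(switch, value)] is the parameter of that switch outcome.
    prob, chosen = 1.0, {}
    for i, n, v in explanation:
        if chosen.setdefault((i, n), v) != v:
            return 0.0                 # same switch and trial, different values: exclusive atoms
        prob *= theta[(i, v)]          # independent across switch names and trials
    return prob

theta = {("gene", "a"): 0.3, ("gene", "b"): 0.2, ("gene", "o"): 0.5}
print(pmsw([("gene", "father", "a"), ("gene", "mother", "o")], theta))  # 0.15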

4.3 Modeling Principle

Up to this point, we have introduced four conditions, the uniqueness condition, the exclusiveness condition, the finite support condition and the distribution condition, to simplify probability computation. The last one is easy to satisfy. We just adopt F_msw together with P_msw. So, from here on, we always assume that F_msw has the parameterized distribution P_msw introduced in the previous subsection. Unfortunately the rest are not satisfied automatically. According to our modeling experience, however, it is only mildly difficult to satisfy the uniqueness condition and the exclusiveness condition as long as we obey the following modeling principle.

Modeling principle: DB = F_msw ∪ R describes a sequential decision process (modulo auxiliary computations) that uniquely produces an observable atom G ∈ obs(DB), where each decision is expressed by some msw atom.22

Translated into the programming level, it says that we must take care when writing a program so that for any sample F′ from P_msw, there uniquely exists a goal G (G ∈ obs(DB)) which has a successful refutation from DB′ = F′ ∪ R. We can confirm the principle with the blood type program DBb = Fb ∪ Rb. It describes a process of gene inheritance, and for an arbitrary sample Fb′ from P_msw, say Fb′ = {msw(gene,father,a), msw(gene,mother,o)}, there exists a unique goal, btype('A') in this case, that has a successful SLD refutation from Fb′ ∪ Rb. The idea behind this principle is that a decision process always produces some result (an observable atom), and different decision processes must differ at some msw, thereby entailing mutually exclusive observable atoms. So the uniqueness condition and the exclusiveness condition will be automatically satisfied. Satisfying the finite support condition is more difficult as it is virtually equivalent to writing a program DB for which the all-solution search for G (G ∈ obs(DB)) always terminates. Apparently we have no general solution to this problem, but as far as specific models such as HMMs, PCFGs and Bayesian networks are concerned, it can be met. All programs for these models satisfy the finite support condition (and the other conditions as well).

22. Decisions made in the process are a finite subset of F_msw.

Sato & Kameya

algorithm to non-exclusive observations O such that O P (O)  1 where the uniqueness condition is seemingly destroyed. Let us see the MAR condition in action with a simple example. Imagine we walk along a road in front of a lawn. We occasionally observe their state such as \the road is dry but the lawn is wet". Assume that the lawn is watered by a sprinkler running probabilistically. The program DBrl = Rrl [ Frl in Figure 2 describes a sequential process which outputs an observation observed(road(x),lawn(y)) (\the road is x and the lawn is y") where x; y 2 fwet; dryg. Rrl = { observed(road(X),lawn(Y)):P

Frl

=

msw(rain,once,A), ( A = yes, X = wet, Y = wet ; A = no, msw(sprinkler,once,B), ( B = on, X = dry, Y = wet ; B = off, X = dry, Y = dry ) ). } { msw(rain,once,yes), msw(rain,once,no), msw(sprinkler,once,on), msw(sprinkler,once,off) }

Figure 2: DBrl The basic distribution over Frl is speci ed like PF (1) in Subsection 4.1, so we omit it. msw(rain,once,A) in the program determines whether it rains (A = yes) or not (A = no), whereas msw(sprinkler,once,B) determines whether the sprinkler works ne (B = on) or not (B = off). Since for each sampled values of A = a (a 2 fyes; nog) and B = b (b 2 fon; offg), there uniquely exists an observation observed(road(x),lawn(y)) (x; y 2 fwet; dryg), there is a many-to-one mapping  : (a; b) = hx; yi. In other words, we can apply the EM algorithm to the observations observed(road(x),lawn(y)) (x; y 2 fwet; dryg). What would happen if we observe exclusively either a state of the road or that of the lawn? Logically, this means we observe 9y observed(road(x),lawn(y)) or 9x observed(road(x),lawn(y)). Apparently the uniqueness condition is not met, because 9y observed(road(wet),lawn(y)) and 9x observed(road(x),lawn(wet)) are compatible (they are true when it rains). Despite the non-exclusiveness of the observations, we can still apply the EM algorithm to them under the MAR condition, which in this case translates into that we observe either the lawn or the road randomly regardless of their state. We now brie y check other conditions. Basically they can be relaxed at the cost of increased computation. Without the exclusiveness condition for instance, we would need an additional process of transforming the support set DB (G) for a goal G into a set of exclusive explanations. For instance, if G has explanations fmsw(a,n,v); msw(b,m,w)g, we have to transform it into fmsw(a,n,v); :msw(a,n,v) ^ msw(b,m,w)g and so on.23 Clearly, this transformation is exponential in the number of msw atoms and eciency concern leads to assuming the exclusiveness condition. The nite support condition is in practice equivalent to the condition that the SLD tree for G is nite. So relaxing this condition might induce in nite computation. b

23. :msw(a,n,v ) is further transformed to a disjunction of exclusive msw atoms like 406

W

6

0 2 msw(a,n,v ).

v 0 =v;v 0 Va


Relaxing the distribution condition and accepting probability distributions other than serve to expand the horizon of the applicability of parameterized logic programs. In particular the introduction of parameterized joint distributions P (v1; : : : ; vk ) like Boltzmann distributions over switches msw1 ; : : : ; mswk where v1; : : : ; vk are values of the switches, makes them correlated. Such distributions facilitate writing parameterized logic programs for complicated decision processes in which decisions are not independent but interdependent. Obviously, on the other hand, they increase learning time, and whether the added

exibility of distributions deserves the increased learning time or not is yet to be seen. Pmsw

4.5 Naive Approach to EM Learning

In this subsection, we derive a concrete EM algorithm for parameterized logic programs DB = Fmsw [ R assuming that they satisfy the uniqueness condition, the exclusiveness condition and the nite support condition. To start, we introduce Yo, a random variable representing our observations according to (5) based on a xed enumeration of observable atoms in obs(DB). We also introduce another random variable Xe representing their explanations according to (6) based on some xed enumeration of explanations in 9DB . Our understanding is that Xe is non-observable while Yo is observable, and they have a joint distribution PDB (Xe = x; Yo = y j ) where  denotes relevant parameters. It is then immediate, following (1) and (2) in Section 2, to derive a concrete EM algorithm from the Q function de ned by Q( j 0 ) def = x PDB (x j y;  0) ln PDB (x; y j ) whose input is a random sample of observable atoms and whose output is the MLE of . In the following, for the sake of readability, we substitute an observable atom G (G 2 obs(DB)) for Yo = y and write PDB (G j ) instead of PDB (Yo = y j ). Likewise we substitute an explanation S (S 2 9DB ) for Xe = x and write PDB (S; G j ) instead of PDB (Xe = x; Yo = y j ). Then it follows from the uniqueness condition that 0 if S 62 DB (G) PDB (S; G j ) = Pmsw (S j ) if S 2 DB (G): We need yet another notation here. For an explanation S, de ne the count of msw(i,n,v) in S by i;v (S ) def = jf n j msw(i,n,v) 2 S gj : We have done all preparations now. Suppose we make some observations G = G1 ; : : : ; GT where Gt 2 obs(DB) (1  t  T ). Put I def = fi j msw(i,n,v) 2 S 2 DB (Gt); 1  t  T g  def = fi;v j msw(i,n,v) 2 S 2 DB (Gt); 1  t  T g: I is a set of switch names that appear in some explanation for one of the Gt 's and  denotes parameters associated with these switches.  is nite due to the nite support condition. P

(

407


Various probabilities and the Q function are computed by using Proposition A.2 in Appendix A together with our assumptions as follows. PDB (Gt j ) = PDB = Pmsw (S j  ) (8) DB (Gt ) 

_

Pmsw (S j  )

=

Q( j 0 ) def =

= where  (i; v;  ) def =

Y

t=1 S29DB i2I;v2Vi T X t=1

X

S2

i;v (S ) i;v

i2I;v2Vi T X X

X



DB

(Gt )

PDB (S j Gt ; 0 ) ln PDB (S; Gt j )

 (i; v; 0 )ln i;v 

1

PDB (Gt j ) S 2

X

i2I;v2Vi

X

DB

(Gt )

! (i; v; 0 ) 0 (i; v;  ) ln P 0 0 v0 2Vi  (i; v ;  )

(9)

Pmsw (S j )i;v (S )

Here we used Jensen's inequality to obtain (9). Note that PDB (Gt j )01 S2 (G ) Pmsw (S j  )i;v (S ) is the expected count of msw(i,1,v) in an SLD refutation of Gt. Speaking of the likelihood function L() = Tt=1 PDB (Gt j ), it is already shown in Subsection 2.2 (footnote) that Q( j 0 )  Q(0 j 0 ) implies L()  L(0 ). Hence from (9), we reach the procedure learn-naive( ,G) below that nds the MLE of the parameters. The array variable [i; v] stores (i; v; ) under the current . P

DB

t

Q

DB

1: 2: 3: 4: 5: 6:

procedure

learn-naive(DB; G )

begin

Initialize T with appropriate values and " with a small positive number ; (0) := t=1 ln PDB (Gt j  ); % Compute the log-likelihood. P

repeat foreach

i 2 I; v 2 Vi do

[i; v ] :=

T X

1

PDB (Gt j  ) S 2 foreach i 2 I; v 2 Vi do [i; v] i;v := P 0; 0 v 2Vi  [i; v ] m := m +P1; (m) := Tt=1 ln PDB (Gt j ) until (m) 0 (m01) < "

7: 8: 9: 10: 11: 12: end

t=1

X

DB

(Gt )

Pmsw (S j  )i;v (S );

% Update the parameters. % Compute the log-likelihood again. % Terminate if converged.

This EM algorithm is simple and correctly calculates the MLE of , but the calculation of PDB (Gt j ) and [i; v](Line 3, 6 and 10) may su er a combinatorial explosion of explanations. That is, j DB (Gt)j often grows exponentially in the complexity of the model. For instance, j DB (Gt)j for an HMM with N states is O(N L), exponential in the length L of an input/output string. Nonetheless, suppressing the explosion to realize ecient computation in a polynomial order is possible, under suitable conditions, by avoiding multiple computations of the same subgoal as we see next. 408

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

4.6 Inside Probability and Outside Probability for Logic Programs

In this subsection, we generalize the notion of inside probability and outside probability (Baker, 1979; Lari & Young, 1990) to logic programs. Major computations in learn-naive( ,G) are those of two terms in Line 6, PDB (Gt j ) and S2 (G ) Pmsw(S j )i;v (S). Computational redundancy lurks in the naive computation of both terms. We show it by an example. Suppose there is a propositional program DBp = Fp [ Rp where Fp = fa; b; c; d; mg and DB

P

DB

8 > > > > >
> > > > :

f f g g h

t

a^g b^g c d^h m:

(10)

Here f is an observable atom. We assume that a, b, c, d and m are independent and also that fa; bg and fc; dg are pair-wise exclusive. Then the support set for f is calculated as DB (f) = fa ^ c; a ^ d ^ m; b ^ c; b ^ d ^ m g: Hence, in light of (8), we may compute PDB (f) as PDB (f) = PF (a ^ c) + PF (a ^ d ^ m) + PF (b ^ c) + PF (b ^ d ^ m): (11) This computation requires 6 multiplications (because PF (a ^ c) = PF (a)PF (c) etc.) and 3 additions. On the other hand, it is possible to compute PDB (f) much more eciently by factoring out common computations. Let A be a ground atom. De ne the inside probability (A) of A as (A) def = PDB (A j ):24 (12) Then by applying Theorem A.1 in Appendix A to comp(Rp) ` f $ (a ^ g) _ (b ^ g); g $ c _ (d ^ h); h $ m (13) which unconditionally holds in our semantics, and by using the independent and the exclusiveness assumption made on Fp, the following equations about inside probability are derived. (f) = (a) (g) + (b) (g) (g) = (c) + (d) (h) (14) (h) = (m) PDB (f)(= (f)) is obtained by solving (14) about (f), for which only 3 multiplications and 2 additions are required. It is quite straightforward to generalize (14) but before proceeding, look at a program DBq = fmg [ fg:-m ^ m; g:-mg where g is an observable atom and m the only msw atom. We have g $ (m ^ m) _ m in our semantics, but to compute P (g) = P (m)P (m) + P (m) is clearly wrong as it ignores the fact that clause bodies for g, i.e. m^m and m are not mutually exclusive, and atoms in the clause body m^m are not independent (here P (1) = PDB (1)). Similarly, if we set a = b = c = d = m, the equation (14) will be totally incorrect. p

p

p

p

p

p

p

p

p

p

p

8 > < > :

p

q

24. Note that if A is a fact in F , (A) = Pmsw (A j ).

409


We therefore add, temporarily in this subsection, two assumptions on top of the exclusiveness condition and the nite support condition so that equations like (14) become mathematically correct. The rst assumption is that \clause" bodies are mutually exclusive i.e. if there are two clauses B W and B W 0 , PDB (W ^ W 0 j ) = 0, and the second assumption is that body atoms are independent, i.e. if A B1 ^ 1 1 1 ^ Bk is a rule, PDB (B1 ^ 11 1 ^ Bk j ) = PDB (B1 j ) 11 1 PDB (Bk j ) holds. Please note that \clause" used in this subsection has a special meaning. It is intended to mean G  where G is a goal and  is a tabled explanation for G obtained by OLDT search both of which will be explained in the next subsection.25 In other words, these additional conditions are not imposed on a source program but on the result of OLDT search. So clauses for auxiliary computations do not need to satisfy them. Now suppose clauses about A occur in DB like A A

B1;1 ^ 1 1 1 ^ B1;i1

11 1

BL;1 ^ 11 1 ^ BL;iL

where Bh;j (1  h  L; 1  j  ih) is an atom. Theorem A.1 in Appendix A and the above assumptions ensure i i (A) = (B1;j ) + 1 11 + (BL;j ): (15) 1 Y

L Y

j =1

j =1

(15) suggests that (Gt) can be considered as a function of (A) if these equations about inside probabilities are hierarchically organized in such a way that (Gt) belongs to the top layer and any (A) appearing on the left hand side only refers to (B)'s which belong to the lower layers. We refer to this condition as the acyclic support condition. Under the acyclic support condition, equations of the form (15) have a unique solution, and the computation of PDB (G j ) via inside probabilities allows us to take advantage of reusing intermediate results stored as (A), thereby contributing to faster computation of PDB (Gt j ). Next we tackle a more intricate problem, the computation of S2 (G ) Pmsw(S j )i;v (S ). Since the sum equals n msw(i,n,v )2S 2 (G ) Pmsw (S j ), we concentrate on the computation of (Gt; m) def = Pmsw (S j  ) P

P

DB

P

DB

t

t

X

( )

m2S 2 DB Gt

where m = msw(i,n,v). First we note that if an explanation S contains m like S = a1 ^1 11^ ah ^ m, then we have (S ) = (a1 ) 1 1 1 (ah) (m). So (Gt ; m) is expressed as (Gt; m) = (Gt ; m) (m)

(16) where (Gt; m) = @@ (G(m;)m) and (Gt; m) does not depend on (m). Generalizing this observation to arbitrary ground atoms, we introduce the outside probability of ground atom A w.r.t. Gt by (Gt) (Gt; A) def = @ @ (A) t

25. The logical relationship (13) corresponds to (20) where f, g and h are table atoms. 410

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

assuming the same conditions as inside probability. In view of (16), the problem of computing (Gt; m) is now reduced to that of computing (Gt; m), which is recursively computable as follows. Suppose A occurs in the ground program DB like B1 BK

A ^ W1;1 ; 11 1 ; B1

11 1

A ^ WK;1 ; 11 1 ; BK

A ^ W1;i1 A ^ WK;iK :

As (Gt) is a function of (B1 ); : : : ; (BK ) by our assumption, the chain rule of derivatives leads to @ (Gt ) @ (A ^ WK;i ) @ (Gt ) @ (A ^ W1;1 ) + 11 1 + (Gt ; A) = @ (B1 ) @ (A) @ (BK ) @ (A) and hence to26 (Gt; Gt ) = 1 (17) i i (Gt ; A) = (Gt ; B1 ) (W1;j ) + 11 1 + (Gt ; BK ) (WK;j ): (18) 











K

1 X

K X

j =1

j =1

Therefore if all inside probabilities have already been computed, outside probabilities are recursively computed from the top (17) using (18) downward along the program layers. In the case of DBp with f and m being chosen atoms, we compute (f; f) = 1 (f; g) = (a) + (b) (19) (f; h) = (f; g) (d) (f; m) = (f; h): From (19), the desired sum (f; m) is calculated as (f; m) = (f; m) (m) = ( (a) + (b)) (d) (m) which requires only two multiplications and one addition compared to four multiplications and one addition in the naive computation. Gains obtained by computing inside and outside probability may be small for this case, but as the problem size grows, they become enormous, and compensate enough for additional restrictions imposed on the result of OLDT search. 8 > > > < > > > :

4.7 OLDT Search

To compute inside and outside probability recursively like (15) or (17) and (18), we need at programming level a tabulation mechanism for structure-sharing of partial explanations 26. Because of the independence assumption on body atoms, Wh;j (1  h  K; 1  j  ih ) and A are independent. Therefore @ (A ^ Wh;j ) = @ (A) (Wh;j ) = (W ): h;j @ (A) @ (A) 411

Sato & Kameya

between subgoals. We henceforth deal with programs DB in which a set table(DB) of table predicate s are declared in advance. A ground atom containing a table predicate is called a table atom. The purpose of table atoms is to store their support sets and eliminate the need of recomputation, and by doing so, to construct hierarchically organized explanations made up of the table atoms and the msw atoms. Let DB = Fmsw [ R be a parameterized logic program which satis es the nite support condition and the uniqueness condition. Also let G1 ; G2; : : : ; GT be a random sample of observable atoms in obs(DB). We make the following additional assumptions.

Assumptions:

For each t (1  t  T ), there exists a nite set f1t; : : : ; Kt g of table atoms associated t (0  k  K ; 1  j  m ) such that with conjunctions Sk;j t k t

e

comp(R)





` Gt $ S0t;1 _ 1 1 1 _ S0t;m ^ 1t $ S1t;1 _ 11 1 _ S1t;m ^ 1 11 ^ Kt t $ SKt t ;1 _ 11 1 _ SKt t ;mKt e



e

e

0

e





e

1

e

(20)



where t (0  k  K ; 1  j  m ) is, as a set, a subset of F  each Sk;j t msw [ fk+1 ; : : : ; K g k (acyclic support condition). As a convention, we put 0 = Gt and call respectively t def t (k  0) a t-explanation DB = f0 ; 1t; : : : ; Kt g the set of table atoms for Gt and Sk;j for kt .27 The set of all t-explanations for k is denoted by DB (kt ) and we consider DB (1) as a function of table atoms. t ^ St ) =  t-explanations are mutually exclusive, i.e. for each k (0  k  Kt ), PDB (Sk;j k;j 0 0 (1  j 6= j 0  mk ) (t0exclusiveness condition). t (0  k  K ; 1  j  m ) is a conjunction of independent atoms (independent  Sk;j t k condition).28 These assumptions are aimed at ecient probability computation. Namely, the acyclic support condition makes dynamic programming possible, the t-exclusiveness condition reduces PDB (A _ B) to PDB (A)+ PDB (B) and the independent condition reduces PDB (A ^ B ) to PDB (A)PDB (B). There is one more point concerning eciency however. Note that the t 29 imcomputation in dynamic programming proceeds following the partial order on DB posed by the acyclic support condition and access to the table atoms will be much simpli ed t respecting the said partial if they are linearly ordered. We therefore topologically sort DB t order and call the linearized DB satisfying the three assumptions (the acyclic support condition, the t-exclusiveness condition and the independent condition) a hierarchical system t = h t ;  t ; : : : ;  t i ( = G ) assuming of t-explanations for Gt . We write it as DB 0 t DB (1) is 0 1 K 30 implicitly given. Once a hierarchical system of t-explanations for Gt is successfully built e

t

e

t

e

e

e

e

e

t

e

27. Pre x \t-" is an abbreviation of \tabled-". 28. The independence mentioned here only concerns positive propositions. For B1 ; B2 2 head(DB), we say B1 and B2 are independent if PDB (B1 ^ B2 j ) = PDB (B1 j )PDB (B2 j ) for any . 29. i precedes j if and only if the top-down execution of i w.r.t. DB invokes j directly or indirectly. 30. So now it holds that if i precedes j then i < j . 412

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

from the source program, equations on inside probability and outside probability such as (14) and (19) are automatically derived and solved in time proportional to the size of the equations. It plays a central role in our approach to ecient EM learning. One way to obtain such t-explanations is to use OLDT search (Tamaki & Sato, 1986; Warren, 1992), a complete refutation method for logic programs. In OLDT search, when a goal G is called for the rst time, we set up an entry for G in a solution table and store its answer substitutions G there. When a call to an instance G0 of G occurs later, we stop solving G0 and instead try to retrieve an answer substitution G stored in the solution table by unifying G0 with G. To record the remaining answer substitutions of G, we prepare a lookup table for G0 and hold a pointer to them. For self-containedness, we look at details of OLDT search using a sample program DBh = Fh [Rh in Figure 431 which depicts an HMM32 in Figure 3. This HMM has two states fs0; s1g. At a state transition, it probabilistically chooses the next destination from fs0; s1g a,b s1

s0

a,b

a,b

a,b

Figure 3: Two state HMM Fh

Rh

=

=

8 < : 8 > > > > > > > > > > > > > > < > > > > > > > > > > > > > > :

f1: values(init, [s0,s1]). f2: values(out(_),[a,b]). f3: values(tr(_), [s0,s1]).

h1: hmm(Cs):msw(init,once,Si), hmm(1,Si,Cs). h2: hmm(T,S,[C|Cs]):- T=3.

% % % % % % % % %

To generate a string (chars) Cs, Set initial state to Si, and then Enter the loop with clock = 1. Loop: Output C in state S. Transit from S to NextS. Put the clock ahead. Continue the loop (recursion). Finish the loop if clock > 3.

Figure 4: Two state HMM program DBh 31. f1, f2, f3, h1, h2 and h3 are temporary marks, not part of the program. 32. An HMM de nes a probability distribution over strings in the given set of alphabets, and works as a stochastic string generator (Rabiner & Juang, 1993) such that an output string is a sample from the de ned distribution. 413

Sato & Kameya

and also an alphabet from fa; bg to emit. Note that to specify a fact set Fh and the associated distribution compactly, we introduce here a new notation values(i,[v1,...,vm]). It declares that Fh contains msw atoms of the form msw(i,n,v) (v 2 fv1 ; : : : ; vmg) whose distribution is P(i;n) given by (7) in Subsection 4.2. For example, (f3), values(tr( ),[s0,s1]) introduces msw(tr(t),n,v) atoms into the program such that t can be any ground term, v 2 fs0; s1g and for a ground term n, they have a distribution x y P(tr(t);n) (msw(tr(t),n,s0) = x; msw(tr(t),n,s1) = y j i;s0 ; i;s1) = i;s 0 i;s1 where i = tr(t), x; y 2 f0; 1g and x + y = 1. This program runs like a Prolog program. For a non-ground top-goal hmm(S), it functions as a stochastic string generator returning a list of alphabets such as [a,b,a] in the variable S as follows. The top-goal calls clause (h1) and (h1) selects the initial state by executing subgoal msw(init,once,Si)33 which returns in Si an initial state probabilistically chosen from fs0, s1g. The second clause (h2) is called from (h1) with ground S and ground T. It makes a probabilistic choice of an output alphabet C by asking msw(out(S),T,C) and then determines NextS, the next state, by asking msw(tr(S),T,NextS). (h3) is there to stop the transition. For simplicity, the length of output strings is xed to three. This way of execution is termed as sampling execution because it corresponds to a random sampling from PDB . If the top-goal is ground like hmm([a,b,a]), it works as an acceptor, i.e. returning success (yes) or failure (no). If all explanations for hmm([a,b,a]) are sought for, we keep all msw atoms resolved upon during the refutation as a conjunction (explanation), and repeat this process by backtracking until no more refutation is found. If we need t-explanations however, backtracking must be abandoned because sharing of partial explanations through t-explanations, the purpose of t-explanations itself, becomes impossible. We therefore instead use OLDT search for all h

t1: t2: t3: t4: t4': : t7: t8: t9:

top_hmm(Cs,Ans):- tab_hmm(Cs,Ans,[]). tab_hmm(Cs,[hmm(Cs)|X],X):- hmm(Cs,_,[]). tab_hmm(T,S,Cs,[hmm(T,S,Cs)|X],X):- hmm(T,S,Cs,_,[]). e_msw(init,T,s0,[msw(init,T,s0)|X],X). e_msw(init,T,s1,[msw(init,T,s1)|X],X). hmm(Cs,X0,X1):- e_msw(init,once,Si,X0,X2), tab_hmm(1,Si,Cs,X2,X1). hmm(T,S,[C|Cs],X0,X1):T=3.

Figure 5: Translated program of DBh 33. If msw(i,n,V) is called with ground i and ground n, V, a logical variable, behaves like a random variable. It is instantiated to some term v with probability i;v selected from the value set Vi declared by a values atom. If, on the other hand, V is a ground term v when called, the procedural semantics of msw(i,n,v) is equal to that of msw(i,n,V) ^ V = v. 414

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

t-explanation search. In the case of the HMM program for example, to build a hierarchical system of t-explanations for hmm([a,b,a]) by OLDT search, we rst declare hmm=1 and hmm=3 as table predicate.34 So a t-explanation will be a conjunction of hmm=1 atoms, hmm=3 atoms and msw atoms. We then translate the program into another logic program, analogously to the translation of de nite clause grammars (DCGs) in Prolog (Sterling & Shapiro, 1986). We add two arguments (which forms a D-list) to each predicate for the purpose of accumulating msw atoms and table atoms as conjuncts in a t-explanation. The translation applied to DBh yields the program in Figure 5. In the translated program, clause (t1) corresponds to the top-goal hmm(l) with an input string l, and a t-explanation for the table atom hmm(l) will be returned in Ans. (t2) and (t3) are auxiliary clauses to add to the callee's D-list a table atom of the form hmm(l) and hmm(t,s,l) respectively (t: time step, s: state). In general, if p=n is a table predicate in the original program, p=(n + 2) becomes a table predicate in the translated program and an auxiliary predicate tab p=(n +2) is inserted to signal the OLDT interpreter to check the solution table for p=n, i.e. to check if there already exist t-explanations for p=n. Likewise clauses (t4) and (t4') are a pair corresponding to (f1) which insert msw(init,T,1) to the callee's D-list with T = once. Clauses (t7), (t8) and (t9) respectively correspond to (h1), (h2) and (h3). hmm([a,b,a]):[hmm([a,b,a])] [ [msw(init,once,s0), hmm(1,s0,[a,b,a])], [msw(init,once,s1), hmm(1,s1,[a,b,a])] ] hmm(1,s0,[a,b,a]):[hmm(1,s0,[a,b,a])] [ [msw(out(s0),1,a), msw(tr(s0),1,q0), hmm(2,s0,[b,a])], [msw(out(s0),1,a), msw(tr(s0),1,s1), hmm(2,s1,[b,a])] ] hmm(1,s1,[a,b,a]):[hmm(1,s1,[a,b,a])] [ [msw(out(s1),1,a), msw(tr(s1),1,s0), hmm(2,s0,[b,a])], [msw(out(s1),1,a), msw(tr(s1),1,s1), hmm(2,s1,[b,a])] ] hmm(2,s0,[b,a]):[hmm(2,s0,[b,a])] [ [msw(out(s0),2,b), msw(tr(s0),2,s0), hmm(3,s0,[a])], [msw(out(s0),2,b), msw(tr(s0),2,s1), hmm(3,s1,[a])] ] hmm(2,s1,[b,a]):[hmm(2,s1,[b,a])] [ [msw(out(s1),2,b), msw(tr(s1),2,s0), hmm(3,s0,[a])], [msw(out(s1),2,b), msw(tr(s1),2,s1), hmm(3,s1,[a])] ] hmm(3,s0,[a]):[hmm(3,s0,[a])] [ [msw(out(s0),3,a), msw(tr(s0),3,s0), hmm(4,s0,[])], [msw(out(s0),3,a), msw(tr(s0),3,s1), hmm(4,s1,[])] ] hmm(3,s1,[a]):[hmm(3,s1,[a])] [ [msw(out(s1),3,a), msw(tr(s1),3,s0), hmm(4,s0,[])], [msw(out(s1),3,a), msw(tr(s1),3,s1), hmm(4,s1,[])] ] hmm(4,s0,[]):[hmm(4,s0,[])] [[]] hmm(4,s1,[]):[hmm(4,s1,[])] [[]]

Figure 6: Solution table 34. In general, p=n means a predicate p with arity n. So although hmm=1 and hmm=3 share the predicate name hmm, they are di erent predicates. 415

Sato & Kameya

Then after translation, we apply OLDT search to top hmm([a,b,a],Ans) while noting (i) the added D-list does not in uence the OLDT procedure, and (ii) we associate with each solution of a table atom in the solution table a list of t-explanations. The resulting solution table is shown in Figure 6. The rst row reads that a call to hmm([a,b,a]) occurred and entered the solution table and its solution, hmm([a,b,a]) (no variable binding generated), has two t-explanations, msw(init,once,s0) ^ hmm(1,s0,[a,b,a]) and msw(init,once,s1) ^ hmm(1,s1,[a,b,a]). The remaining task is the topological sorting of the table atoms stored in the solution table respecting the acyclic support condition. This can be done by using depth- rst search (trace) of t-explanations from the top-goal for example. Thus we obtain a hierarchical system of t-explanations for hmm([a,b,a]).

4.8 Support Graphs

Looking back, all we need to compute inside and outside probability is a hierarchical system of t-explanations, which essentially is a boolean combination of primitive events (msw atoms) and compound events (table atoms) and as such can be more intuitively representable as a graph. For this reason, and to help visualizing our learning algorithm, we introduce a new data-structure termed support graphs, though the new EM algorithm in the next subsection itself is described solely by the hierarchical system of t-explanations. As illustrated in Figure 7 (a), the support graph for Gt is a graphical representation of t = h t ;  t ; : : : ;  t i ( t = G ) for G in (20). the hierarchical system of t-explanations DB t t 0 1 0 K It consists of totally ordered disconnected subgraphs, each of which is labeled with the t (0  k  K ). A subgraph labeled  t comprises two corresponding table atom kt in DB t k special nodes (the start node and the end node) and explanation graphs, each corresponding t in t to a t-explanation Sk;j DB (k ) (1  j  mk ). t is a linear graph in which a node is labeled either with a An explanation graph of Sk;j t . They are called a table node and a switch table atom  or with a switch msw(1,1,1) in Sk;j node respectively. Figure 7 (b) is the support graph for hmm([a,b,a]) obtained from the solution table in Figure 6. Each table node labeled  refers to the subgraph labeled  , so data-sharing is achieved through the distinct table nodes referring to the same subgraph. t

e

e

e

e

4.9 Graphical EM Algorithm

We describe here an ecient EM learning algorithm termed the graphical EM algorithm (Figure 8) introduced by Kameya and Sato (2000), that runs on support graphs. Suppose we have a random sample G = G1; : : : ; GT of observable atoms. Also suppose support graphs for Gt (1  t  T ), i.e. hierarchical systems of t-explanations satisfying the acyclic support condition, the t-exclusiveness condition and the independent condition, have been successfully constructed from a parameterized logic program DB satisfying the uniqueness condition and the nite support condition. The graphical EM algorithm re nes learn-naive( ,G ) by introducing two subroutines, get-inside-probs( , G ) to compute inside probabilities and get-expectations( , G ) to compute outside probabilities. They are called from the main routine learn-gEM( ,G ). When learning, we prepare four arrays for each support graph for Gt in G :  P [t;  ] for the inside probability of  , i.e. ( ) = PDB ( j ) (see (12)) DB

DB

DB

DB

416

Parameter Learning of Logic Programs for Symbolic-statistical Modeling τk

(a)

explanation graph

msw

Gt: start

τk

msw

end

msw

msw

msw

: τk : start

end msw

msw

:

(b) msw(init,once,s0)

hmm(1,s0,[a,b,a])

hmm([a,b,a]):

start

end

msw(init,once,s1)

msw(out(s0),1,a)

hmm(1,s1,[a,b,a])

msw(tr(s0),1,s0)

hmm(2,s0,[b,a])

hmm(1,s0,[a,b,a]):

end

start

msw(out(s0),1,a)

msw(tr(s0),1,s1)

hmm(2,s1,[b,a])

msw(out(s1),1,a)

msw(tr(s1),1,s0)

hmm(2,s0,[b,a])

hmm(1,s1,[a,b,a]):

end

start

msw(out(s1),1,a)

msw(tr(s1),1,s1)

hmm(2,s1,[b,a])

Figure 7: A support graph (a) in general form, (b) for Gt = hmm([a,b,a]) in the HMM program DBh. A double-circled node refers to a table node.  Q[t;  ] for the outside probability of  w.r.t. Gt , i.e. (Gt;  ) (see (17) and (18))  R[t; ; S ] for the explanation probability of S (2 DB (kt )), i.e. PDB (S j ) e

e

417

e

e

Sato & Kameya 1: procedure learn-gEM (DB; G ) 2: begin 3: Select some  as initial 4: 5: 6: 7: 8: 9:

1: procedure get-inside-probs (DB; G ) 2: begin 3: for t := 1 to T do begin 4: Let 0t = Gt; 5: for k := Kt downto 0 do begin 6: P [t; kt ] := 0; 7: foreach Se 2 eDB (kt ) do begin 8: Let Se = fA1 ; A2 ; : : : ; AjSejg; 9: R[t; kt ; Se] := 1; 10: for l := 1 to jSej do 11: if Al = msw(i,1,v ) then 12: R[t; kt ; Se] 3 = i;v 13: else R[t; kt ; Se] 3 = P [t; Al ]; 14: P [t; kt ]+= R[t; kt ; Se] 15: end /* foreach Se */ 16: end /* for k */ 17: end /* for t */ 18: end.

parameters;

get-inside-probs (DB; G); P (0) := Tt=1 ln P [t; Gt ]; repeat

get-expectations (DB; G ); i 2 I; v 2 Vi do  [i; vP ] := T [t; i; v]=P [t; G ]; t t=1 foreach i 2 I; v P 2 Vi do i;v := [i; v]= v0 2Vi  [i; v0 ]; get-inside-probs (DB; G ); m := m + 1; P (m) := Tt=1 ln P [t; Gt ] until (m) 0 (m01) < " foreach

10: 11: 12: 13: 14: 15: 16: end.

1: procedure get-expectations (DB; G ) begin 2: for t := 1 to T do begin 3: foreach i 2 I; v 2 Vi do  [t; i; v] := 0; 4: Let 0t = Gt; Q[t; 0t] := 1:0; 5: for k := 1 to Kt do Q[t; kt ] := 0; 6: for k := 0 to Kt do 7: foreach Se 2 eDB (kt ) do begin 8: Let Se = fA1 ; A2 ; : : : ; AjSejg; 9: for l := 1 to jSej do 10: if Al = msw(i,1,v ) then  [t; i; v] += Q[t; kt ] 1 R[t; kt ; Se] 11: else Q[t; Al ] += Q[t; kt ] 1 R[t; kt ; Se]=P [t; Al ] 12: end /* foreach Se */ 13: end /* for t */ 14: end.

Figure 8: graphical EM algorithm.  [t; i; v] for the expected count of msw(i,1,v), i.e.

P

S2

DB

(Gt) Pmsw (S j )i;v (S )

and call the procedure learn-gEM( ,G) in Figure 8. The main routine learn-gEM( ,G) initially computes all inside probabilities (Line 4) and enters a loop in which get-expectations( ,G ) is called rst to compute the expected count [t; i; v] of msw(i,1,v) and parameters are updated (Line 11). Inside probabilities are renewed by using the updated parameters before entering the next loop (Line 12). DB

DB

DB

418

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

The subroutine get-inside-probs( ,G ) computes the inside probability ( ) = PDB ( j ) (and stores it in P [t;  ]) of a table atom  from the bottom layer up to the topmost layer 0 = Gt (Line 4) of the hierarchical system of t-explanations for Gt (see (20) in Subsection 4.6). It takes a t-explanation S in DB (kt ) one by one (Line 7), decomposes S into conjuncts and multiplies their inside probabilities which are either known (Line 12) or already computed (Line 13). The other subroutine get-expectations( ,G ) computes outside probabilities following the recursive de nitions (17) and (18) in Subsection 4.6 and stores the outside probability (Gt;  ) of a table atom  in Q[t;  ]. It rst sets the outside probability of the top-goal 0 = Gt to 1:0 (Line 4) and computes the rest of outside probabilities (Line 6) going down the layers of the t-explanation for Gt described by (20) in Subsection 4.6. (Line 10) adds Q[t; kt ] 1 R[t; kt ; S ] = (Gt ; kt ) 1 (S ) to [t; i; v], the expected count of msw(i,1,v), as a contribution of msw(i,1,v) in S through kt to [t; i; v]. (Line 11) increments the outside probability Q[t; Al ] = (Gt; Al ) of Al according to the equation (18). Notice that Q[t; kt ] has already been computed and R[t; kt ; S]=P [t; Al ] = (W ) for S = Al ^ W . As shown in Subsection 4.5, learn-naive( ,G ) is the MLE procedure, hence the following theorem holds. Theorem 4.1 Let DB be a parameterized logic program, and G = G1; : : : ; GT a ranDB

e

e

e

DB

e

e

e

e

e

DB

dom sample of observable atoms. Suppose the ve conditions (uniqueness, nite support (Subsection 4.2), acyclic support, t-exclusiveness and independence (Subsection 4.7)) are met. ThenQlearn-gEM (DB; G ) nds the MLE 3 which (locally) maximizes the likelihood L(G j ) = Tt=1 PDB (Gt j  ).

(Proof) Sketch.35 Since the main routine learn-gEM( ,G ) is the same as learn-naive( ,G) except the computation of [i; v] = Tt=1 [t; i; v], we show that [t; i; v] = S2 (G ) Pmsw(S j )i;v (S ) (= n msw(i,n,v )2S2 (G ) Pmsw (S j )). However, DB

P

P

DB

P

DB

[t; i; v]

DB

P

=

X

X

0kKt

t

t

X

n msw(i,n,v )2Se2 e ( t ) DB k

(Gt ; kt ) (Se)

(see (Line 10) in get-expectations(DB; G)) = (Gt; msw(i,n,v)) (msw(i,n,v)) n = (Gt; msw(i,n,v)) (see the equation (16)) n = Pmsw (S j  ): Q.E.D. X

X

X

X

n msw(i,n,v )2S2

DB

(Gt )

Here we used the fact that if S contains msw(i,n,v) like S = S0 ^ msw(i,n,v), (S) = (S 0 ) (msw(i,n,v )) holds, and hence (Gt ; kt ) (S ) = (Gt ; kt ) (S 0 ) (msw(i,n,v )) = (contribution of msw(i,n,v) in S through kt to (Gt; msw(i,n,v))) (msw(i,n,v)): e

e

e

e

e

e

e

e

35. A formal proof is given by Kameya (2000). It is proved there that under the common parameters ,  [i; v] in learn-naive(DB,G ) coincides with [i; v] in learn-gEM(DB,G ). So, the parameters are updated to the same values. Hence, starting with the same initial values, the parameters converge to the same values. 419

Sato & Kameya

The ve conditions on the applicability of the graphical EM algorithm may look hard to satisfy at once. Fortunately, the modeling principle in Section 4.3 still stands, and with due care in modeling, it is likely to lead us to a program that meets all of them. Actually, we will see in the next section, programs for standard symbolic-statistical frameworks such as Bayesian networks, HMMs and PCFGs all satisfy the ve conditions. 5. Complexity

In this section, we analyze the time complexity of the graphical EM algorithm applied to various symbolic-statistical frameworks including HMMs, PCFGs, pseudo PCSGs and Bayesian networks. The results show that the graphical EM algorithm is competitive with these specialized EM algorithms developed independently in each research eld.

5.1 Basic Property

Since the EM algorithm is an iterative algorithm and since we are unable to predict when it converges, we measure time complexity by the time taken for one iteration. We therefore estimate time per iteration on the repeat loop of learn-gEM (DB; G) (G = G1 ; : : : ; GT ). We observe that in one iteration, each support graph for Gt (1  t  T ) is scanned twice, once by get-inside-probs (DB; G ) and once by get-expectations (DB; G). In the scan, addition is performed on the t-explanations, and multiplication (possibly with division) is performed on the msw atoms and table atoms once for each. So time spent for Gt per iteration by the graphical EM algorithm is linear in the size of the support graph, i.e. the number of nodes in the support graph for Gt. Put 1tDB def = DB ( ) e

[

e

t  2DB

num def = 1max j1e t j tT DB maxsize def = max jSej: t e e 1tT;S 21DB t is the set of table atoms for G , and hence 1t is the set of all t-explanations Recall that DB t DB appearing in the right hand side of (20) in Subsection 4.7. So num is the maximum number of t-explanations in a support graph for the Gt's and maxsize the maximum size of a texplanation for the Gt's respectively. The following is obvious. e

Proposition 5.1 The time complexity of the graphical EM algorithm per iteration is linear

in the total size of support graphs, O (nummaxsize T ) in notation, which coincides with the space complexity because the graphical EM algorithm runs on support graphs.

This is a rather general result, but when we compare the graphical EM algorithm with other EM algorithms, we must remember that the input to the graphical EM algorithm is support graphs (one for each observed atom) and our actual total learning time is OLDT time + (the number of iterations) 2O(nummaxsizeT ) 420

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

where \OLDT time" denotes time to construct all support graphs for G . It is the sum of time for OLDT search and time for the topological sorting of the table atoms, but because the latter is part of the former order-wise,36 we represent \OLDT time" by time for OLDT search. Also observe that the total size of support graphs does not exceed time for OLDT search for G order-wise. To evaluate OLDT time for a speci c class of models such as HMMs, we need to know time for table operations. Observe that our OLDT search in this paper is special in the sense that table atoms are always ground when called and there is no resolution with solved goals. Accordingly a solution table is used only  to check if a goal G already has an entry in the solution table, i.e. if it was called before, and  to add a new searched t-explanation for G to the list of discovered t-explanations under G's entry. The time complexity of these operations is equal to that of table access which depends both on the program and on the implementation of the solution table.37 We rst suppose programs are carefully written in such a way that the arguments of table atoms used as indecies for table access are integers. Actually all programs used in the subsequent complexity analysis (DBh in Subsection 4.7, DBg and DBg0 in Subsection 5.3, DBG in Subsection 5.5) satisfy or can satisfy this condition by replacing non-integer terms with appropriate integers. We also suppose that the solution table is implemented using an array so that table access can be done in O(1) time.38 In what follows, we present a detailed analysis of the time complexity of the graphical EM algorithm applied to HMMs, PCFGs, pseudo PCSGs and Bayesian networks, assuming O(1) time access to the solution table. We remark by the way that their space complexity is just the total size of solution tables (support graphs). 

5.2 HMMs

The standard EM algorithm for HMMs is the Baum-Welch algorithm (Rabiner, 1989; Rabiner & Juang, 1993). An example of HMM is shown in Figure 3 in Subsection 4.7.39 Given T observations w1 ; : : : ; wT of output string of length L, it computes in O (N 2 LT ) time in each iteration the forward probability tm(q) = P (ot1 ot2 1 11 otm01; q j ) and the backward probability mt (q) = P (otm otm+1 1 1 1 otL j q; ) for each state q 2 Q, time step m (1  m  L) and a string wt = ot1 ot2 11 1 otL (1  t  T ), where Q is the set of states and N the number of states. The factor N 2 comes from the fact that every state has N possible destinations and

36. Think of OLDT search for a top-goal Gt . It searches for msw atoms and table atoms to create a solution table, while doing some auxiliary computations. Therefore its time complexity is never less than O(jthe number of msw atoms and table atoms in the support graph for Gt j), which coincides with the time we need to topologically sort table atoms in the solution table by depth- rst search from 0 = Gt . 37. Sagonas et al. (1994) and Ramakrishnan et al. (1995) discuss about the implementation of OLDT. 38. If arrays are not available, we may be able to use balanced trees, giving O(log n) access time where n is the number data in the solution table, or we may be able to use hashing, giving average O (1) time access under a certain condition (Cormen, Leiserson, & Rivest, 1990). 39. We treat here only \state-emission HMMs" which emit a symbol depending on the state. Another type, \arc-emission HMMs" in which the emitted symbol depends on the transition arc, is treated similarly. 421

Sato & Kameya

we have to compute the forward and backward probability for every destination and every state. After computing all mt (q)'s and mt (q)'s, parameters are updated. So, the total computation time in each iteration of the Baum-Welch algorithm is estimated as O(N 2LT ) (Rabiner & Juang, 1993; Manning & Schutze, 1999). To compare this result with the graphical EM algorithm, we use the HMM program DBh in Figure 4 with appropriate modi cations to L, the length of a string, Q, the state set, and declarations in Fh for the output alphabets. For a string w = o1o2 11 1 oL, hmm(n,q ,[om ; om+1 ; : : : ; oL ]) in DBh reads that the HMM is in state q 2 Q at time n and has to output [om ,om+1 ,...,oL] until it reaches the nal state. After declaring hmm=1 and hmm=3 as table predicate and translation (see Figure 5), we apply OLDT search to the goal top hmm([o1,...,oL ],Ans) w.r.t. the translated program to obtain all t-explanations for hmm([o1,...,oL]). For a complexity argument however, the translated program and DBh are the same, so we talk in terms of DBh for the sake of simplicity. In the search, we x the search strategy to multi-stage depth- rst strategy (Tamaki & Sato, 1986). We assume that the solution table is accessible in O(1) time.40 Since the length of the list in the third argument of hmm=3 decreases by one on each recursion, and there are only nitely many choices of the state transition and the output alphabet, the search terminates, leaving nitely many t-explanations in the solution table like Figure 6 that satisfy the acyclic support condition respectively. Also the sampling execution of hmm(L) w.r.t. DBh is nothing but a sequential decision process such that decisions made by msw atoms are exclusive, independent and generate a unique string, which means DBh satis es the t-exclusiveness condition, the independence condition and the uniqueness condition respectively. So, the graphical EM algorithm is applicable to the set of hierarchical systems of t-explanations for hmm(wt ) (1  t  T ) produced by OLDT search for T observations w1 ; : : : ; wT of output string. Put wt = ot1 ot2 1 11 otL. It follows from t DB = fhmm(m,q,[otm ,...,otL]) j 1  m  L + 1; q 2 Qg [ fhmm([ot1,...,otL])g h

DBh

(

(

msw(out(q),m,om ); msw(tr(q),m,q 0); hmm(m + 1,q0 ,[otm+1 ,...,otL ])



)

) = (1  m  L) that for a top-goal hmm([ot1 ,...,otL]), there are at most O(NL) calling patterns of hmm=3 and each call causes at most N calls to hmm=3, implying there occur O(NL 1 N ) = O(N 2L) calls to hmm=3. Since each call is computed once due to the tabling mechanism, we have num = O(N 2 L). Also maxsize = 3. Applying Proposition 5.1, we reach e

hmm(m,q ,[otm ,...,otL ])

q0 2 Q

Proposition 5.2 Suppose we have T strings of length L. Also suppose each table operation 2

in OLDT search is done in O (1) time. OLDT time by DBh is O(N LT ) and the graphical EM algorithm takes O(N 2 LT ) time per iteration where N is the number of states.

O(N 2LT ) is the time complexity of the Baum-Welch algorithm. algorithm runs as eciently as the Baum-Welch algorithm.41

So the graphical EM

40. O(1) is possible because in the translated program DBh in Section 4.7, we can identify a goal pattern of hmm(1,1,1,1,1) by the rst two arguments which are constants (integers). 41. Besides, the Baum-Welch algorithm and the graphical EM algorithm whose input are support graphs generated by DBh update parameters to the same value if initial values are the same. 422

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

By the way, the Viterbi algorithm (Rabiner, 1989; Rabiner & Juang, 1993) provides for HMMs an ecient way of nding the most likely transition path for a given input/output string. A similar algorithm for parameterized logic programs that determines the most likely explanation for a given goal can be derived. It runs in time linear in the size of the support graph, thereby O(N 2 L) in the case of HMMs, the same complexity as the Viterbi algorithm (Sato & Kameya, 2000).

5.3 PCFGs

We now compare the graphical EM algorithm with the Inside-Outside algorithm (Baker, 1979; Lari & Young, 1990). The Inside-Outside algorithm is a well-known EM algorithm for PCFGs (Wetherell, 1980; Manning & Schutze, 1999).42 It takes a grammar in Chomsky normal form. Given N nonterminals, a production rule in the grammar takes the form i ! j; k (1  i; j; k  N ) (nonterminals are named by numbers from 1 to N and 1 is a starting symbol) or the form i ! w where 1  i  N and w is a terminal. In each iteration, it computes the inside probability and the outside probability of every partial parse tree of the given sentence to update parameters for these production rules. Time complexity is measured by time per iteration, and is described by N , the number of nonterminals, and L, the number of terminals in a sentence. It is O(N 3 L3T ) for T observed sentences (Lari & Young, 1990). To compare the graphical EM algorithm with the Inside-Outside algorithm, we start from a propositional program DBg = Fg [ Rg below representing the largest grammar containing all possible rules i ! j; k in N nonterminals where nonterminal 1 is a starting symbol, i.e. sentence. Fg Rg

d,d0],[j ,k]) j 1  i; j; k  N; d; d0 are numbersg = fmsw([if,[msw( i,d,w ) j 1  i  N; d is a number; w is a terminalg

=

8 > < > :

S

q(i,d0,d2 ) :- msw(i,[d0,d2 ],[j ,k ]), q(j ,d0,d1 ), q(k ,d1 ,d2). n

q(i,d,d +1) :- msw(i,d,wd+1 ).





1  i; j; k  N; 0  d0 < d1 < d2  L

1  i  N; 0  d  L 0 1

9 > = > ;

o

Figure 9: PCFG program DBg DB g is an arti cial parsing program whose sole purpose is to measure the size of an OLDT tree43 created by the OLDT interpreter when it parses a sentence w1 w2 1 11 wL. So 42. A PCFG (probabilistic context free grammar) is a backbone CFG with probabilities (parameters) assigned to each production rule. For a nonterminal A having n production rules fA ! i j 1  i  ng, a P probability pi is assigned to A ! i (1  i  n) where ni=1 pi = 1. The probability of a sentence s is the sum of probabilities of each (leftmost) derivation of s. The latter is the product of probabilities of rules used in the derivation. 43. To be more precise, an OLDT structure, but in this case, it is a tree because DBg contains only constants (Datalog program) and there never occurs the need of creating a new root node. 423

Sato & Kameya (1)

Td

q(1,d,L) 2 j N

2 k N q(1,d,d+1), q(1,d+1,L)

q(1,d+1,L) (1)

[Note] q 1 i

2 k N

q(1,d,d+1), q(k,d+1,L)

q(j,d,d+1), q(1,d+1,L)

q(j,d,d+1), q(k,d+1,L)

q(k,d+1,L)

q(1,d+1,L)

q(k,d+1,L)

(k)

Td+1 q(i,d’,d’’) already appears d+1 d’ d’’ L, 1 i d’’-d’ L-d-2

Td+1

d+2 e L-1 1 j N

2 k N q(j,d,e), q(1,e,L)

q(j,d,e), q(k,e,L)

d+2 e’ e 1 i’ N, 1 j’ N,

q(i’,d,e’), q(j’,e’,e), q(1,e,L)

Td’

q(k,e,L)

N

q

N

q(j’,e’,e), q(1,e,L) q(1,e,L)

... p(i)

p(1) p(2) p(N)

Figure 10: OLDT tree for the query

q(1,d,L)

the input sentence w1 w2 11 1 wL is embedded in the program separately as msw(i,d,wd+1) (0  d  L0 1) in the second clauses of Rg (this treatment does not a ect the complexity argument). q(i,d0,d1) reads that the i-th nonterminal spans from position d0 to position d1, i.e. the substring wd +1 1 11 wd . The rst clauses q(i,d0 ,d2 ) :- msw(1,1,1), q(j ,d0 ,d1), q(k,d1 ,d2) are supposed to be textually ordered according to the lexicographic order for tuples hi; j; k; d0 ; d2 ; d1i. As a parser, the top-goal is set to q(1,0,L).44 It asks the parser to parse the whole sentence w1 w2 1 1 1 wL as the syntactic category \1" (sentence). We make an exhaustive search for this query by OLDT search.45 As before, the multistage depth- rst search strategy and O(1) time access to the solution table are assumed. Then the time complexity of OLDT search is measured by the number of nodes in the OLDT tree. Let Td(k) be the OLDT tree for q(k,d,L). Figure 10 illustrates Td(1) for d (0  d  L 0 3) where msw atoms are omitted. As can be seen, the tree has many similar subtrees, so we put them together (see Note in Figure 10). Due to the depth- rst strategy, Td(1) has a recursive structure and contains Td(1) +1 as a subtree. Nodes whose leftmost atom is not underlined are solution nodes, i.e. they solve their leftmost atoms for the rst time in the entire refutation process. The underlined atoms are already computed in the subtrees to their left.46 They only check the solution table if there are their entries (= already 0

1

44. L here is not a Prolog variable but a constant denoting the sentence length. 45. q is a table predicate. 0 00 0 46. It can be inductively proved that Td(1) +1 contains every computed q(i,d ,d ) (0  d  L 0 3; d + 1  d < d00  L; 1  i  N; d00 0 d0  L 0 d 0 2). 424

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

computed) in O(1) time. Since all clauses are ground, such execution only generates a single child node. (1) (k) We enumerate h(1) d , the number of nodes in Td but not in Td+1 (1  k  N ). From (k) 3 2 47 Figure 10, we see h(1) d = O(N (L 0 d) ). Let hd (2  k  N ) be the number of nodes in k) Td(+1 not contained in Td(1)+1. It is estimated as O(N 2 (L 0 d 0 2)). Consequently, the number (k) N 3 2 of nodes that are newly created in Td(1) is h(1) d + k=2 hd = O (N (L 0 d) ). As a result, L 0 3 3 3 48 total time for OLDT search is computed as d=0 hd = O(N L ) which is also the size of the support graph. We now consider a non-propositional parsing program DBg0 = Fg0 [ Rg0 in Figure 11 whose ground instances constitute the propositional program DBg . DBg0 is a probabilistic variant of DCG program (Pereira & Warren, 1980) in which q'/1, q'/6 and between/3 are declared as table predicate. Semantically DBg0 speci es a probability distribution over the atoms of the form fq'(l) j l is a list of terminalsg. P

P

Fg0

Rg0

t,[sj ,sk ]) j 1  i; j; k  N; t is a numberg = fmsw([sfi,msw( si ,t,w) j 1  i  N; t is a number; w is a terminalg

=

8 > > > > > > > > > > > > < > > > > > > > > > > > > :

q'(S) :- length(S,D), q'(s1 ,0,D,0, ,S-[]). q'(I,D0,D2,C0,C2,L0-L2) :- between(D0,D1,D2), msw(I,C0,[J,K]), q'(J,D0,D1,s(C0),C1,L0-L1), q'(K,D1,D2,s(C1),C2,L1-L2). q'(I,D,s(D),C0,s(C0),[W|X]-X) :- msw(I,C0,W).

Figure 11: Probabilistic DCG like program DBg0 The top-goal to parse a sentence S = [w1; : : : ; wL] is q'([w1; : : : ; wL]). It invokes q'(s1 ,0,D,0, ,[w1 ,: : :,wL ]-[]) after measuring the length D of an input sentence S by calling length=2. 49 50 In general, q'(i,d0,d2 ,c0 ,c2 ,l0-l2) works identically to q(i,d0 ,d2 ) but three arguments, c0 , c2 and l0-l2, are added. c0 supplies a unique trial-id for msws to be used in the body, c2 the latest trial-id in the current computation, and l0 -l2 a D-list holding a substring between d0 and d2 . Since the added arguments do not a ect the shape of the

47. We here focus on the subtree Td0 . j , i0 and j 0 range from 1 to N , and f(e; e0 ) j d + 2  e0 < e  L 0 1 g = O((L 0 d)2 ). Hence, the number of nodes in Td0 is O(N 3 (L 0 d)2 ). The number of nodes in Td(1) but (1) 0 neither in Td(1) = O (N 3 (L 0 d)2 ). +1 nor in Td is negligible, therefore hd (1) (1) 48. The number of nodes in TL01 and TL02 is negligible. 49. To make the program as simple as possible, we assume that an integer n is represented by a ground term

(n) z }| {



s(1 1 1s (0)1 1 1). We also assume that when D0 and D2 are ground, the goal between(D0, D1, D2) returns an integer D1 between them in time proportional to jD1 0 D0j. 50. We omit an obvious program for length(l,sn ) which computes the length sn of a list l in O(jlj) time.

sn =

def

425

Sato & Kameya

search tree in Figure 10 and the extra computation caused by length=2 is O(L) and the one by the insertion of between(D0,D1,D2) is O(NL3) respectively,51 OLDT time remains O(N 3 L3), and hence so is the size of the support graph. To apply the graphical EM algorithm correctly, we need to con rm the ve conditions on its applicability. It is rather apparent however that the OLDT refutation of any topgoal of the form q'([w1 ,: : :,wL]) w.r.t. DBg0 terminates, and leaves a support graph satisfying the nite support condition and the acyclic support condition. The t-exclusiveness condition and the independent condition also hold because the refutation process faithfully simulates the leftmost stochastic derivation of w1 11 1 wL in which the choice of a production rule made by msw(si ,sc,[sj ,sk ]) is exclusive and independent (trial-ids are di erent on di erent choices). What remains is the uniqueness condition. To con rm it, let us consider another program DBg00 , a modi cation of DBg0 such that the rst goal length(S,D) in the body of the rst clause and the rst goal between(D0,D1,D2) in the second clause of Rg0 are moved to the last position in their bodies respectively. DBg00 and DBg0 are logically equivalent, and semantically equivalent as well from the viewpoint of distribution semantics. Then think of the sampling execution by the OLDT interpreter of a top-goal q'(S) w.r.t. DBg00 where S is a variable, using the multi-stage depth- rst search strategy. It is easy to see rst that the execution never fails, and second that when the OLDT refutation terminates, a sentence [w1 ; : : : ; wL ] is returned in S, and third that conversely, the set of msw atoms resolved upon in the refutation uniquely determines the output sentence [w1 ; : : : ; wL].52 Hence, if the sampling execution is guaranteed to always terminate, every sampling from PF 00 (= PF 0 ) uniquely generates a sentence, an observable atom, so the uniqueness condition is satis ed by DBg00 , and hence by DBg0 . Then when is the sampling execution guaranteed to always terminate? In other words, when does the grammar only generate nite sentences? Giving a general answer seems dicult, but it is known that if the parameter values in a PCFG are obtained by learning from nite sentences, the stochastic derivation by the PCFG terminates with probability one (Chi & Geman, 1998). In summary, assuming appropriate parameter values, we can say that the parameterized logic program DBg0 for the largest PCFG with N nonterminal symbols satis es all applicability conditions, and the OLDT time for a sentence of length L is O(N 3 L3)53 and this is also the size of the support graph. From Proposition 5.1, we conclude g

g

Proposition 5.3 Let DB be a0 parameterized logic program representing a PCFG with N

nonterminals in the form of DBg in Figure 11, and G = G1 ; G2 ; : : : ; GT be the sampled atoms representing sentences of length L. We suppose each table operation in OLDT search is done in O(1) time. Then OLDT search for G and one iteration in learn-gEM are respectively done in O(N 3 L3T ) time.

51.

between(D0,D1,D2)

is called O(N (L 0 d)2 ) times in Td(1) . So it is called

0 O(N (L 0 d)2 ) = O(NL3 )

PL 3 d=0

times in T . 52. Because the trial-ids used in the refutation record which rule is used at what step in the derivation of w1 1 1 1 wL . 53. In DBg , we represent integers by ground terms made out of 0 and s(1) to keep the program short. If we use integers instead of ground terms however, the rst three arguments of q'(1,1,1,1,1,1) are enough to check whether the goal is previously called or not, and this check can be done in O(1) time. (1) 0

0

426

Parameter Learning of Logic Programs for Symbolic-statistical Modeling O (N 3 L 3 T )

is also the time complexity of the Inside-Outside algorithm per iteration (Lari & Young, 1990), hence our algorithm is as ecient as the Inside-Outside algorithm.

5.4 Pseudo PCSGs

PCFGs can be improved by making choices context-sensitive, and one of such attempts is pseudo PCSGs (pseudo probabilistic context sensitive grammars) in which a rule is chosen probabilistically depending on both the nonterminal to be expanded and its parent nonterminal (Charniak & Carroll, 1994). A pseudo PCSG is easily programmed. We add one extra-argument, N, representing the parent node, to the predicate q'(I,D0,D2,C0,C2,L0-L2) in Figure 11 and replace msw(I,C0,[J,K]) with msw([N,I],C0,[J,K]). Since the (leftmost) derivation of a sentence from a pseudo PCSG is still a sequential decision process described by the modi ed program, the graphical EM algorithm applied to the support graphs generated from the modi ed program and observed sentences correctly performs the ML estimation of parameters in the pseudo PCSG. A pseudo PCSG is thought to be a PCFG with rules of the form [n; i] ! [i; j ][i; k] (1  n; i; j; k  N ) where n is the parent nonterminal of i, so the arguments in the previous subsection are carried over with minor changes. We therefore have (details omitted)

Proposition 5.4 Let DB be a parameterized logic program for a pseudo PCSG with N

nonterminals as shown above, and G = G1; G2 ; : : : ; GT the observed atoms representing sampled sentences of length L. Suppose each table operation in OLDT search can be done in O(1) time. Then OLDT search for G and each iteration in learn-gEM is completed in O(N 4 L3T ) time.

5.5 Bayesian Networks

A relationship between cause C and its e ect E is often probabilistic such as the one between diseases and symptoms, and as such it is mathematically captured as the conditional probability P (E = e j C = c) of e ect e given the cause c. What we wish to know however is the inverse, i.e. the probability of a candidate cause c given evidence e, i.e. P (C = c j E = e) which is calculated by Bayes' theorem as P (E = e j C = c)P (C = c)= c0 P (E = e j C = c0)P (C = c0 ). Bayesian networks are a representational/computational framework that ts best this type of probabilistic inference (Pearl, 1988; Castillo et al., 1997). A Bayesian network is a graphical representation of a joint distribution P (X1 = x1 ; : : : ; XN = xN ) of nitely many random variables X1 ; : : : ; XN . The graph is a dag (directed acyclic graph) such as ones in Figure 12, and each node is a random variable.54 In the graph, a conditional probability table (CPT) representing P (Xi = xi j 5i = ui) (1  i  N ) is associated with each node Xi where 5i represents Xi's parent nodes and ui their values. When Xi has no parent, i.e. a topmost node in the graph, the table is just a marginal distribution P (Xi = xi). The whole joint distribution is de ned as the product of P

54. We only deal with discrete cases. 427

Sato & Kameya

A

B

A

D

C

B

C

E

D

F

E

( G 1 ) Singly-connected

F

( G 2 ) Multiply-connected

Figure 12: Bayesian networks these conditional distributions: P (X1 = x1 ; : : : ; XN

= xN )55 =

N Y i=1

P (Xi = xi j 5i = ui ):

(21)

Thus the graph G1 in Figure 12 de nes PG (a; b; c; d; e; f ) = PG (a)PG (b)PG (c j a)PG (d j a; b)PG (e j d)PG (f j d) where a, b, c, d, e and f are values of corresponding random variables A, B, C , D, E and F , respectively.56 As mentioned before, one of the basic tasks of Bayesian networks is to compute marginal probabilities. For example, the marginal distribution PG (c; d) is computed either by (22) or (23) below. PG (c; d) = PG (a)PG (b)PG (c j a)PG (d j a; b)PG (e j d)P (f j d) (22) 1

1

1

1

1

1

1

1

X

1

a;b;e;f

1

1

1

1

0

=

X @

a;b

1

10

PG1 (a)PG1 (b)PG1 (c j a)PG1 (d j a; b)A @

1 X

e;f

PG1 (e j d)PG1 (f j d)A (23)

(23) is clearly more ecient than (22). Observe that if the graph were like G2 in Figure 12, there would be no way to factorize computations like (23) but to use (22) requiring exponentially many operations. The problem is that computing marginal probabilities is NP-hard in general, and factorization such as (23) is assured only when the graph is singly connected like G1 , i.e. has no loop when viewed as an undirected graph. In such case, the computation is possible in O(jV j) time where V is the set of vertices in the graph (Pearl, 1988). Otherwise, the graph is called multiply-connected, and might need exponential time to compute marginal probabilities. In the sequel, we show the following.  For any discrete Bayesian network G de ning a distribution PG (x1 ; : : : ; xN ), there is a parameterized logic program DBG for a predicate bn(1) such that PDB (bn(x1,: : :,xN )) = PG(x1 ; : : : ; xN ). G

55. Thanks to the acyclicity of the graph, without losing generality, we may assume that if Xi is an ancestor node of Xj , then i < j holds. 56. For notational simplicity, we shall omit random variables when no confusion arises. 428

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

 For arbitrary factorizations and their order to compute a marginal distribution, there

exists a tabled program that accomplishes the same computation in the speci ed way.  When the graph is singly connected and evidence e is given, there exists a tabled program DBG such that OLDT time for bn(e) is O(jV j), and hence the time complexity per iteration of the graphical EM algorithm is O(jV j) as well. Let G be a Bayesian network de ning a joint distribution PG(x1 ; : : : ; xN ) and fPG (Xi = xi j 5i = ui ) j 1  i  N; xi 2 val(Xi ); ui 2 val(5i )g the conditional probabilities associated with G where val(Xi) is the set of Xi's possible values and val(5i) denotes the set of possible values of the parent nodes 5i as a random vector, respectively. We construct a parameterized logic program that de nes the same distribution PG (x1 ; : : : ; xN ). Our program DBG = FG [ RG is shown in Figure 13. 

FG

= f msw(par(i,ui),once,xi) j 1  i  N; ui 2 val(5i); xi 2 val(Xi) g

RG

= f bn(X1 ,: : :,XN ):-

5i),once,Xi). g

VN

i=1 msw(par(i,

Figure 13: Bayesian network program DBG FG is comprised of msw atoms of the form msw(par(i,ui ),once,xi ) whose probability is exactly the conditional probability PG (Xi = xi j 5i = ui). When Xi has no parents, ui is the empty list []. RG is a singleton, containing only one clause whose body is a conjunction of msw atoms which corresponds to the product of conditional probabilities. Note that we intentionally identify random variables X1 ; : : : ; XN with logical variables X1 ; : : : ; XN for convenience.

Proposition 5.5 DBG denotes the same distributions as G.

(Proof) Let hx1 ; : : : ; xN i be a realization of the random vector hX1; : : : ; XN i. It holds by construction that PDBG (bn(x1 ,: : :,xN ))

=

N Y h=1 N Y

Pmsw (msw(par(i,ui ),once,xi ))

= PG (Xi = xi j 5i = ui ) h=1 = PG (x1 ; : : : ; xN ): Q:E:D: In the case of G1 in Figure 12, the program becomes57 bn(A,B,C,D,E,F)

:-

msw(par('A',[]),once,A), msw(par('C',[A]),once,C), msw(par('E',[D]),once,E),

57. 0 A0 ; 0 B0 ; : : : are Prolog constants used in place of integers. 429

msw(par('B',[]),once,B), msw(par('D',[A,B]),once,D), msw(par('F',[D]),once,F).

Sato & Kameya

and the left-to-right sampling execution gives a sample realization of the random vector h A; B; C; D; E; F i. A marginal distribution is computed from bn(x1 ,: : :,xN ) by adding a new clause to DBG. For example, to compute PG (c; d), we add bn(C,D):- bn(A,B,C,D,E,F) to DBG (let the result be DB0G ) and then compute PDB0 (bn(c,d)) which is equal to PG (c; d) because PDB0 (bn(c,d)) = PDB (9 a; b; e; f bn(a,b,c,d,e,f )) = PDB (bn(a,b,c,d,e,f )) a;b;e;f = PG (c; d): Regrettably this computation corresponds to (22), not to the factorization (23). Ecient probability computation using factorization is made possible by carrying out summations in a proper order. We next sketch by an example how to carry out speci ed summations in a speci ed order by introducing new clauses. Suppose we have a joint distribution P (x; y; z; w) = 1 (x; y )2 (y; z; w)3(x; z; w) such that 1(x; y ), 2(y; z; w) and 3 (x; z; w) are respectively computed by atoms p1 (X,Y), p2 (Y,Z,W) and p3 (X,Z,W). Suppose also that we hope to compute the sum P (x) = 1(x; y ) 2 (y; z; w )3 (x; z; w) 1

1

1

G1

1

G1

X

G1

G1

1

X

X

y

z;w

!

in which we rst eliminate z; w and then y. Corresponding to each elimination, we introduce two new predicates, q(X,Y) to compute 4 (x; y) = z;w 2 (y; z; w)3 (x; z; w) and p(X) to compute P (x) = y 1 (x; y)4(x; y) as follows. P

P

p(X) q(X,Y)

::-

p1(X,Y), q(X,Y). p2(Y,Z,W), p3 (X,Z,W).

Note that the clause body of q=2 contains Z and W as (existentially quanti ed) local variables and the clause head q(X,Y) contains variables shared with other atoms. In view of the correspondence between and 9, it is easy to con rm that this program realizes the required computation. It is also easy to see by generalizing this example, though we do not prove here, that there exists a parameterized logic program that carries out the given summations in the given order for an arbitrary Bayesian network, in particular we are able to simulate VE (variable elimination, Zhang & Poole, 1996; D'Ambrosio, 1999) in our approach. Ecient computation of marginal distributions is not always possible but there is a well-known class of Bayesian networks, singly connected Bayesian networks, for which there exists an ecient algorithm to compute marginal distributions by message passing (Pearl, 1988; Castillo et al., 1997). We here show that when the graph is singly connected, we can construct an ecient tabled Bayesian network program DBG assigning a table predicate to each node. To avoid complications, we explain the construction procedure informally and concentrate on the case where we have only one interested variable. Let G be a singly P



430

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

connected graph. First we pick up a node U whose probability PG (u) is what we seek. We construct a tree G with the root node U from G, by letting other nodes dangling from U . Figure 14 shows how G1 is transformed to a tree when we select node B as the root node. B

D

A

E

F

C τ

Transformed graph G1

Figure 14: Transforming G1 to a tree Then we examine each node in G one by one. We add for each node X in the graph a corresponding clause to DBG whose purpose is to visit all nodes connected to X except the one that calls X . Suppose we started from the root node U1 in Figure 15 where evidence u is given, and have generated clause (24). Now we proceed to an inner node X (U1 calls X ). In the original graph G, X has parent nodes fU1 ; U2; U3g and child nodes fV1 ; V2 g. U3 is a topmost node in G. 

[Figure 15: General situation. The figure shows node X with parent nodes U1, U2, U3 and child nodes V1, V2 in the graph G, together with the corresponding tree Gτ.]

For node X in Figure 15, we add clause (25). When it is called from the parent node U1 with U1 being ground, we first generate possible values of U2 by calling val_U2(U2), and then call call_X_U2(U2) to visit all nodes connected to X through U2. U3 is similarly treated. After visiting all nodes in G connecting to X through the parent nodes U2 and U3 (nodes connected to U1 have already been visited), the value of the random variable X is determined by sampling the msw atom jointly indexed by 'X' and the values of U1, U2 and U3. Then we visit X's children, V1 and V2. For a topmost node U3 in the original graph, we add clause (26).

  tbn(U1) :- msw(par('U1',[]),once,U1), call_U1_X(U1).            (24)

  call_U1_X(U1) :-
      val_U2(U2), call_X_U2(U2),
      val_U3(U3), call_X_U3(U3),
      msw(par('X',[U1,U2,U3]),once,X),
      call_X_V1(X), call_X_V2(X).                                  (25)

  call_X_U3(U3) :- msw(par('U3',[]),once,U3).                      (26)

Let DBG be the final program containing clauses like (24), (25) and (26). Apparently DBG can be constructed in time linear in the number of nodes in the network. Also note that successive unfolding (Tamaki & Sato, 1984) of atoms of the form call_...(.) in the clause bodies that starts from (24) yields a program DB'G similar to the one in Figure 13 which contains msw atoms but no call_...(.)'s. As DBG and DB'G define the same distribution,58 it can be proved from Proposition 5.5 that PG(u) = PDB'G(bn(u)) = PDBG(tbn(u)) holds (details omitted). By the way, in Figure 15 we assume the construction starts from the topmost node U1 where the evidence u is given, but this is not necessary. Suppose we change to start from the inner node X. In that case, we replace clause (24) with call_X_U1(U1) :- msw(par('U1',[]),once,U1), just like (26). At the same time we replace the head of clause (25) with tbn() and add a goal call_X_U1(u) to the body, and so on. For the changed program DB''G, it is rather straightforward to prove that PDB''G(tbn()) = PG(u) holds.

58. Since distribution semantics is based on the least model semantics, and because unfold/fold transformation (Tamaki & Sato, 1984) preserves the least Herbrand model of the transformed program, unfold/fold transformation applied to parameterized logic programs preserves the denotation of the transformed program.
59. DBG is further transformed for the OLDT interpreter to collect msw atoms like the case of the HMM program.

It is true that the construction of the tabled program DBG shown here is very crude and there is a lot of room for optimization, but it suffices to show that a parameterized logic program for a singly connected Bayesian network runs in O(|V|) time where V is the set of nodes. To estimate the time complexity of OLDT search w.r.t. DBG, we declare tbn and every predicate of the form call_...(.) as a table predicate and verify the five conditions on the applicability of the graphical EM algorithm (details omitted). We now estimate the time complexity of OLDT search for the goal tbn(u) w.r.t. DBG.59 We notice that calls occur according to the pre-order scan (parents, the node, children) of the tree Gτ, and calls to call_Y_X(.) occur val(Y) times. Each call to call_Y_X(.) invokes calls to the rest of the nodes, X's parents and X's children in the graph G except the caller node, with different sets of variable instantiations, but from the second call on, every call only refers to solutions stored in the solution table in O(1) time. Thus, the number of added computation steps in OLDT search by X is bounded from above by O(val(U1)val(U2)val(U3)val(X)) in the case of Figure 15. As a result OLDT time is proportional to the number of nodes in the original graph G (and so is the size of the support graph) provided that the number of edges connecting to a node, and that of values of a random variable, are bounded from above. So we have

Proposition 5.6 Let G be a singly connected Bayesian network defining distribution PG, V the set of nodes, and DBG the tabled program derived as above. Suppose the number of edges connecting to a node, and that of values of a random variable, are bounded from above by some constant. Also suppose table access can be done in O(1) time. Then, OLDT time for computing PG(u) for an observed value u of a random variable U by means of DBG is O(|V|), and so is time per iteration required by the graphical EM algorithm. If there are T observations, time complexity is O(|V|T).
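To see where the O(|V|) bound comes from, the counting argument above can be spelled out as follows (our paraphrase of the argument in the text, not an additional result). The extra OLDT computation attributable to a node X is bounded by

  O( val(X) * Π_{U ∈ pa(X)} val(U) ),

where pa(X) denotes the set of X's parents in G. With the number of edges per node and the number of values per variable bounded by constants, each such term is at most a constant c, so the total OLDT work is Σ_{X ∈ V} O(val(X) Π_{U ∈ pa(X)} val(U)) ≤ c|V| = O(|V|); the size of the support graph and the per-iteration cost of the graphical EM algorithm are bounded in the same way.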

O(|V|) is the time complexity required to compute a marginal distribution for a singly connected Bayesian network by a standard algorithm (Pearl, 1988; Castillo et al., 1997), and also is that of the EM algorithm using it. We therefore conclude that the graphical EM algorithm is as efficient as a specialized EM algorithm for singly connected Bayesian networks.60 We must also quickly add that the graphical EM algorithm is applicable to arbitrary Bayesian networks,61 and what Proposition 5.6 says is that an explosion of the support graph can be avoided by appropriate programming in the case of singly connected Bayesian networks.

To summarize, the graphical EM algorithm, a single generic EM algorithm, is proved to have the same time complexity as specialized EM algorithms, i.e. the Baum-Welch algorithm for HMMs, the Inside-Outside algorithm for PCFGs, and the one for singly connected Bayesian networks that have been developed independently in each research field. Table 1 summarizes the time complexity of EM learning using OLDT search and the graphical EM algorithm in the case of one observation. In the first column, "sc-BNs" represents singly connected Bayesian networks. The second column shows a program to use. DBh is an HMM program in Subsection 4.7, DB'g a PCFG program in Subsection 5.3 and DBG a transformed Bayesian network program in Subsection 5.5, respectively. OLDT time in the third column is time for OLDT search to complete the search of all t-explanations. gEM in the fourth column is time in one iteration taken by the graphical EM algorithm to update parameters. We use N, M, L and V respectively for the number of states in an HMM, the number of nonterminals in a PCFG, the length of an input string and the number of nodes in a Bayesian network. The last column is a standard (specialized) EM algorithm for each model.

60. When a marginal distribution of PG for more than one variable is required, we can construct a similar tabled program that computes marginal probabilities still in O(|V|) time by adding extra arguments that convey other evidence or by embedding other evidence in the program.
61. We check the five conditions with DBG in Figure 13. The uniqueness condition is obvious as sampling always uniquely generates a sampled value for each random variable. The finite support condition is satisfied because there are only a finite number of random variables and their values. The acyclic support condition is immediate because of the acyclicity of Bayesian networks. The t-exclusiveness condition and the independence condition are easy to verify.


  Model     Program      OLDT time         gEM                  Specialized EM
  HMMs      DBh          O(N²L)            O(N²L)               Baum-Welch
  PCFGs     DB'g         O(M³L³)           O(M³L³)              Inside-Outside
  sc-BNs    DBG          O(|V|)            O(|V|)               (Castillo et al., 1997)
  general   user model   O(|OLDT tree|)    O(|support graph|)   -

Table 1: Time complexity of EM learning by OLDT search and the graphical EM algorithm

5.6 Modeling Language PRISM

We have been developing a symbolic-statistical modeling language PRISM since 1995 (URL = http://mi.cs.titech.ac.jp/prism/) as an implementation of distribution semantics (Sato, 1995; Sato & Kameya, 1997; Sato, 1998). The language is intended for modeling complex symbolic-statistical phenomena such as discourse interpretation in natural language processing and gene inheritance interacting with social rules. As a programming language, it looks like an extension of Prolog with new built-in predicates including the msw predicate and other special predicates for manipulating msw atoms and their parameters. A PRISM program is comprised of three parts, one for directives, one for modeling and one for utilities. The directive part contains declarations such as values, telling the system what msw atoms will be used in the execution. The modeling part is a set of non-unit definite clauses that define the distribution (denotation) of the program by using msw atoms. The last part, the utility part, is an arbitrary Prolog program which refers to predicates defined in the modeling part. We can use in the utility part the learn built-in predicate to carry out EM learning from observed atoms. PRISM provides three modes of execution. The sampling execution corresponds to a random sampling drawn from the distribution defined by the modeling part. The second one computes the probability of a given atom. The third one returns the support set for a given goal. These execution modes are available through built-in predicates. We must report however that while the implementation of the graphical EM algorithm with a simplified OLDT search mechanism has been under way, it is not completed yet. So currently, only Prolog search and learn-naive(DB, G) in Section 4 are available for EM learning, though we realized, partially, structure sharing of explanations in the implementation of learn-naive(DB, G). Putting computational efficiency aside however, there is no problem in expressing and learning HMMs, PCFGs, pseudo PCSGs, Bayesian networks and other probabilistic models by the current version. The learning experiments in the next section used a parser as a substitute for the OLDT interpreter, and the independently implemented graphical EM algorithm.
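As a concrete illustration of the three-part structure just described, here is a schematic sketch of a PRISM-style program (our illustration, not a program taken from the PRISM distribution; the exact surface syntax of declarations such as values and of the learn built-in may differ in the released system, and msw/3 is written with an explicit trial identifier in the style of the clauses in Subsection 5.5):

  % Directive part: declare the outcome space of the switch coin.
  values(coin, [head, tail]).

  % Modeling part: N independent tosses of the same (biased) coin.
  tosses(0, []).
  tosses(N, [X|Xs]) :-
      N > 0,
      msw(coin, N, X),      % probabilistic choice; N serves as the trial id
      N1 is N - 1,
      tosses(N1, Xs).

  % Utility part: EM learning from observed atoms, e.g.
  %   ?- learn([tosses(3, [head,head,tail]), tosses(2, [tail,head])]).

Sampling execution of tosses(3, Xs) then corresponds to drawing Xs from the distribution defined by the modeling part, while learn/1 estimates the parameters of the switch coin from the observed goals.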

6. Learning Experiments

After complexity analysis of the graphical EM algorithm for popular symbolic-probabilistic models in the previous section, we look at an actual behavior of the graphical EM algorithm with real data in this section. We conducted learning experiments with PCFGs using two corpora which have contrasting characters, and compared the performance of the graphical EM algorithm against that of the Inside-Outside algorithm in terms of time per iteration (= time for updating parameters). The results indicate that the graphical EM algorithm can outperform the Inside-Outside algorithm by orders of magnitude. Details are reported by Sato, Kameya, Abe, and Shirai (2001). Before proceeding, we review the Inside-Outside algorithm for completeness.

6.1 The Inside-Outside Algorithm

The Inside-Outside algorithm was proposed by Baker (1979) as a generalization of the Baum-Welch algorithm to PCFGs. The algorithm is designed to estimate parameters for a CFG grammar in Chomsky normal form containing rules expressed by numbers like i → j,k (1 ≤ i, j, k ≤ N for N nonterminals, where 1 is a starting symbol). Suppose an input sentence w1, ..., wL is given. In each iteration, it first computes in a bottom-up manner inside probabilities e(s, t, i) = P(i ⇒* ws ... wt) and then computes outside probabilities f(s, t, i) = P(S ⇒* w1 ... ws-1 i wt+1 ... wL) in a top-down manner for every s, t and i (1 ≤ s ≤ t ≤ L, 1 ≤ i ≤ N). After computing both probabilities, parameters are updated by using them, and this process iterates until some predetermined criterion such as a convergence of the likelihood of the input sentence is achieved. Although Baker did not give any analysis of the Inside-Outside algorithm, Lari and Young (1990) showed that it takes O(N³L³) time in one iteration and Lafferty (1993) proved that it is the EM algorithm. While it is true that the Inside-Outside algorithm has been recognized as a standard EM algorithm for training PCFGs, it is notoriously slow. Although there is not much literature explicitly stating time required by the Inside-Outside algorithm (Carroll & Rooth, 1998; Beil, Carroll, Prescher, Riezler, & Rooth, 1999), Beil et al. (1999) reported for example that when they trained a PCFG with 5,508 rules for a corpus of 450,526 German subordinate clauses whose average ambiguity is 9,202 trees/clause using four machines (167MHz Sun UltraSPARC ×2 and 296MHz Sun UltraSPARC-II ×2), it took 2.5 hours to complete one iteration. We discuss later why the Inside-Outside algorithm is slow.
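For later reference, the quantities computed in each iteration can be written down explicitly; the following is the standard textbook formulation (following Baker, 1979, and Lari & Young, 1990), restated in the notation above rather than quoted from those papers. For rules i → j,k and i → w and an input w1 ... wL,

  e(s, s, i) = P(i → ws),
  e(s, t, i) = Σ_{j,k} Σ_{r=s}^{t-1} P(i → j,k) e(s, r, j) e(r+1, t, k)          (s < t),

  f(1, L, 1) = 1,   f(1, L, i) = 0 for i ≠ 1,
  f(s, t, i) = Σ_{j,k} ( Σ_{r=t+1}^{L} P(j → i,k) f(s, r, j) e(t+1, r, k)
                       + Σ_{r=1}^{s-1} P(j → k,i) f(r, t, j) e(r, s-1, k) ).

The likelihood of the sentence is e(1, L, 1), and the expected count of a rule i → j,k used in the update is Σ_{1 ≤ s ≤ r < t ≤ L} f(s, t, i) P(i → j,k) e(s, r, j) e(r+1, t, k) / e(1, L, 1).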

6.2 Learning Experiments Using Two Corpora

We report here parameter learning of existing PCFGs using two corpora of moderate size and compare the graphical EM algorithm against the Inside-Outside algorithm in terms of time per iteration. As mentioned before, support graphs, the input to the graphical EM algorithm, were generated by a parser, i.e. MSLR parser.62 All measurements were made on a 296MHz Sun UltraSPARC-II with 2GB memory under Solaris 2.6, and the threshold for an increase of the log likelihood of input sentences was set to 10^-6 as a stopping criterion for the EM algorithms. In the experiments, we used ATR corpus and EDR corpus (each converted to a POS (part of speech)-tagged corpus). They are similar in size (about 10,000) but contrasting in their characters, sentence length and ambiguity of their grammars. The first experiment employed ATR corpus which is a Japanese-English corpus (we used only the Japanese part) developed by ATR (Uratani, Takezawa, Matsuo, & Morita, 1994). It contains 10,995 short conversational sentences, whose minimum length, average length and maximum length are respectively 2, 9.97 and 49. As a skeleton of PCFG, we employed a context free grammar Gatr comprising 860 rules (172 nonterminals and 441 terminals) manually developed for ATR corpus (Tanaka et al., 1997) which yields 958 parses/sentence. Because the Inside-Outside algorithm only accepts a CFG in Chomsky normal form, we converted Gatr into Chomsky normal form G*atr. G*atr contains 2,105 rules (196 nonterminals and 441 terminals). We then divided the corpus into subgroups of similar length like (L = 1,2), (L = 3,4), ..., (L = 25,26), each containing randomly chosen 100 sentences. After these preparations, we compared at each length the graphical EM algorithm applied to Gatr and G*atr against the Inside-Outside algorithm applied to G*atr in terms of time per iteration by running them until convergence.

62. MSLR parser is a Tomita (Generalized LR) parser developed by Tanaka-Tokunaga Laboratory in Tokyo Institute of Technology (Tanaka, Takezawa, & Etoh, 1997).


Figure 16: Time per iteration: I-O vs. gEM (ATR)

Curves in Figure 16 show the learning results, where the x-axis is the length L of an input sentence and the y-axis is average time taken by the EM algorithm in one iteration to update all parameters contained in the support graphs generated from the chosen 100 sentences (other parameters in the grammar do not change). In the left graph, the Inside-Outside algorithm plots a cubic curve labeled "I-O". We omitted a curve drawn by the graphical EM algorithm as it drew the x-axis. The middle graph magnifies the left graph. The curve labeled "gEM (original)" is plotted by the graphical EM algorithm applied to the original grammar Gatr whereas the one labeled "gEM (Chomsky NF)" used G*atr. At length 10, the average sentence length, it is measured that whichever grammar is employed, the graphical EM algorithm runs several hundred times faster (845 times faster in the case of Gatr and 720 times faster in the case of G*atr) than the Inside-Outside algorithm per iteration. The right graph shows an (almost) linear dependency of updating time by the graphical EM algorithm within the measured sentence length.

Although some difference is anticipated in their learning speed, the speed gap between the Inside-Outside algorithm and the graphical EM algorithm is unexpectedly large. The most conceivable reason is that ATR corpus only contains short sentences and Gatr is not very ambiguous, so that parse trees are sparse and the generated support graphs are small, which affects favorably the performance of the graphical EM algorithm. We therefore conducted the same experiment with another corpus which contains much longer sentences, using a more ambiguous grammar that generates dense parse trees. We used EDR Japanese corpus (Japan EDR, 1995) containing 220,000 Japanese news article sentences. It is however under the process of re-annotation, and only part of it (randomly sampled 9,900 sentences) has recently been made available as a labeled corpus. Compared with ATR corpus, sentences are much longer (the average length of the 9,900 sentences is 20, the minimum length 5, the maximum length 63) and a CFG grammar Gedr (2,687 rules, converted to a Chomsky normal form grammar G*edr containing 12,798 rules) developed for it is very ambiguous (to keep a coverage rate), having 3.0 × 10^8 parses/sentence at length 20 and 6.7 × 10^19 parses/sentence at length 38.


Figure 17: Time per iteration: I-O vs. gEM (EDR)

Figure 17 shows the curves obtained from the experiments with EDR corpus (the graphical EM algorithm applied to Gedr vs. the Inside-Outside algorithm applied to G*edr) under the same condition as ATR corpus, i.e. plotting average time per iteration to process 100 sentences of the designated length, except that the plotted time for the Inside-Outside algorithm is the average of 20 iterations whereas that for the graphical EM algorithm is the average of 100 iterations. As is clear from the middle graph, this time again, the graphical EM algorithm runs orders of magnitude faster than the Inside-Outside algorithm. At average sentence length 20, the former takes 0.255 second whereas the latter takes 339 seconds, giving a speed ratio of 1,300 to 1. At sentence length 38, the former takes 2.541 seconds but the latter takes 4,774 seconds, giving a speed ratio of 1,878 to 1. Thus the speed ratio even widens compared to ATR corpus. This can be explained by the mixed effects of O(L³), the time complexity of the Inside-Outside algorithm, and a moderate increase in the total size of support graphs w.r.t. L. Notice that the right graph shows how the total size of support graphs grows with sentence length L, as time per iteration by the graphical EM algorithm is linear in the total size of support graphs.
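(As a quick consistency check of the quoted ratios: 339/0.255 ≈ 1,329 and 4,774/2.541 ≈ 1,879, in line with the figures of roughly 1,300 to 1 and 1,878 to 1 above.)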


Since we implemented the Inside-Outside algorithm faithfully to Baker (1979) and Lari and Young (1990), there is much room for improvement. Actually Kita gave a refined Inside-Outside algorithm (Kita, 1999). There is also an implementation by Mark Johnson of the Inside-Outside algorithm downloadable from http://www.cog.brown.edu/%7Emj/. The use of such implementations may lead to different conclusions. We therefore conducted learning experiments with the entire ATR corpus using these two implementations and measured updating time per iteration (Sato et al., 2001). It turned out that both implementations run twice as fast as our naive implementation and take about 630 seconds per iteration, while the graphical EM algorithm takes 0.661 second per iteration, which is still orders of magnitude faster than the former two. Regrettably a similar comparison using the entire EDR corpus available at the moment was abandoned due to memory overflow during parsing for the construction of support graphs.

Learning experiments so far only compared time per iteration, which ignores extra time for search (parsing) required by the graphical EM algorithm. So a question naturally arises w.r.t. comparison in terms of total learning time. Assuming 100 iterations for learning ATR corpus however, it is estimated that even considering parsing time, the graphical EM algorithm combined with MSLR parser runs orders of magnitude faster than the three implementations (ours, Kita's and Johnson's) of the Inside-Outside algorithm (Sato et al., 2001). Of course this estimation does not directly apply to the graphical EM algorithm combined with OLDT search, as the OLDT interpreter will take more time than a parser and how much more time is needed depends on the implementation of OLDT search.63 Conversely, however, we may be able to take it as a rough indication of how far our approach, the graphical EM algorithm combined with OLDT search via support graphs, can go in the domain of EM learning of PCFGs.

63. We cannot answer this question right now as the implementation of OLDT search is not completed.

6.3 Examining the Performance Gap

In the previous subsection, we compared the performance of the graphical EM algorithm against the Inside-Outside algorithm when PCFGs are given, using two corpora and three implementations of the Inside-Outside algorithm. In all experiments, the graphical EM algorithm considerably outperformed the Inside-Outside algorithm despite the fact that both have the same time complexity. Now we look into what causes such a performance gap.

Simply put, the Inside-Outside algorithm is slow (primarily) because it lacks parsing. Even when a backbone CFG grammar is explicitly given, it does not take any advantage of the constraints imposed by the grammar. To see it, it might help to review how the inside probability e(s, t, A), i.e. P(nonterminal A spans from the s-th word to the t-th word) (s ≤ t), is calculated by the Inside-Outside algorithm for the given grammar:

  e(s, t, A) = Σ_{B,C s.t. A → BC in the grammar} Σ_{r=s}^{t-1} P(A → BC) e(s, r, B) e(r+1, t, C).

Here P(A → BC) is a probability associated with a production rule A → BC. Note that for a fixed triplet (s, t, A), it is usual that the term P(A → BC) e(s, r, B) e(r+1, t, C) is non-zero


only for a relatively small number of (B, C, r)'s determined from successful parses and the rest of combinations always give 0 to the term. Nonetheless the Inside-Outside algorithm attempts to compute the term in every iteration for all possible combinations of B, C and r, and this is repeated for every possible (s, t, A), resulting in a lot of redundancy. The same kind of redundancy occurs in the computation of outside probability by the Inside-Outside algorithm. The graphical EM algorithm is free of such redundancy because it runs on parse trees (a parse forest) represented by the support graph.64 It must be added, on the other hand, that superiority in learning speed of the graphical EM algorithm is realized at the cost of space complexity because while the Inside-Outside algorithm merely requires O(NL²) space for its array to store probabilities, the graphical EM algorithm needs O(N³L³) space to store the support graph, where N is the number of nonterminals and L is the sentence length. This trade-off is understandable if one notices that the graphical EM algorithm applied to a PCFG can be considered as partial evaluation of the Inside-Outside algorithm by the grammar (and the introduction of an appropriate data structure for the output).

Finally we remark that the use of parsing as a preprocess for EM learning of PCFGs is not unique to the graphical EM algorithm (Fujisaki, Jelinek, Cocke, Black, & Nishino, 1989; Stolcke, 1995). These approaches however still seem to contain redundancies compared with the graphical EM algorithm. For instance Stolcke (1995) uses an Earley chart to compute inside and outside probability, but parses are implicitly reconstructed in each iteration dynamically by combining completed items.

7. Related Work and Discussion

7.1 Related Work

The work presented in this paper is at the crossroads of logic programming and probability theory, and considering the enormous body of work done in these fields, incompleteness is unavoidable when reviewing related work. Having said that, we look at various attempts made to integrate probability with computational logic or logic programming.65 In reviewing, one can immediately notice there are two types of usage of probability. One type, the constraint approach, emphasizes the role of probability as constraints and does not necessarily seek a unique probability distribution over logical formulas. The other type, the distribution approach, explicitly defines a unique distribution by model theoretical means or proof theoretical means, to compute various probabilities of propositions.

A typical constraint approach is seen in the early work of probabilistic logic by Nilsson (1986). His central problem, the "probabilistic entailment problem", is to compute the upper and lower bound of the probability P(φ) of a target sentence φ in such a way that the bounds are compatible with a given knowledge base containing logical sentences (not necessarily logic programs) annotated with a probability. These probabilities work as constraints on

64. We emphasize that the difference between the Inside-Outside algorithm and the graphical EM algorithm is solely computational efficiency, and they converge to the same parameter values when starting from the same initial values. Linguistic evaluations of the estimated parameters by the graphical EM algorithm are also reported by Sato et al. (2001).
65. We omit literature leaning strongly toward logic. For logic(s) concerning uncertainty, see an overview by Kyburg (1994).


the possible range of P(φ). He used the linear programming technique to solve this problem, which inevitably delimits the applicability of his approach to finite domains. Later Lukasiewicz (1999) investigated the computational complexity of the probabilistic entailment problem in a slightly different setting. His knowledge base comprises statements of the form (H | G)[u1, u2] representing u1 ≤ P(H | G) ≤ u2. He showed that inferring "tight" u1, u2 is NP-hard in general, and proposed a tractable class of knowledge base called conditional constraint trees.

After the influential work of Nilsson, Frish and Haddawy (1994) introduced a deductive system for probabilistic logic that remedies "drawbacks" of Nilsson's approach, that of computational intractability and the lack of a proof system. Their system deduces a probability range of a proposition by rules of probabilistic inferences about unconditional and conditional probabilities. For instance, one of the rules infers P(α | γ) ∈ [0, y] from P(α ∨ β | γ) ∈ [x, y], where α, β and γ are propositional variables and [x, y] (x ≤ y) designates a probability range.

Turning to logic programming, probabilistic logic programming formalized by Ng and Subrahmanian (1992) and Dekhtyar and Subrahmanian (1997) was also a constraint approach. Their program is a set of annotated clauses of the form A : μ ← F1 : μ1, ..., Fn : μn where A is an atom, Fi (1 ≤ i ≤ n) a basic formula, i.e. a conjunction or a disjunction of atoms, and μj (0 ≤ j ≤ n) a sub-interval of [0, 1] indicating a probability range. A query ∃(F1 : μ1, ..., Fn : μn) is answered by an extension of SLD refutation. On formalization, it is assumed that their language contains only a finite number of constant and predicate symbols, and no function symbol is allowed. A similar framework was proposed by Lakshmanan and Sadri (1994) under the same syntactic restrictions (finitely many constant and predicate symbols but no function symbols) in a different uncertainty setting. They used annotated clauses of the form A ←c B1, ..., Bn where A and Bi (1 ≤ i ≤ n) are atoms and c = <[α, β], [γ, δ]>, a confidence level, represents a belief interval [α, β] (0 ≤ α ≤ β ≤ 1) and a doubt interval [γ, δ] (0 ≤ γ ≤ δ ≤ 1), which an expert has in the clause.

As seen above, defining a unique probability distribution is of secondary or no concern to the constraint approach. This is in sharp contrast with Bayesian networks, as the whole discipline rests on the ability of the networks to define a unique probability distribution (Pearl, 1988; Castillo et al., 1997). Researchers in Bayesian networks have been seeking a way of mixing Bayesian networks with a logical representation to increase their inherently propositional expressive power. Breese (1992) used logic programs to automatically build a Bayesian network from a query. In Breese's approach, a program is the union of a definite clause program and a set of conditional dependencies of the form P(P | Q1 ∧ ... ∧ Qn) where P and the Qi's are atoms. Given a query, a Bayesian network is constructed dynamically that connects the query and relevant atoms in the program, which in turn defines a local distribution for the connected atoms. Logical variables can appear in atoms but no function symbol is allowed. Ngo and Haddawy (1997) extended Breese's approach by incorporating a mechanism reflecting context.
They used a clause of the form P(A0 | A1, ..., An) = α ← L1, ..., Lk, where the Ai's are called p-atoms (probabilistic atoms) whereas the Lj's are context atoms disjoint from p-atoms, and computed by another general logic program (satisfying certain restrictions). Given a query, a set of evidence and context atoms, relevant ground p-atoms are identified by resolving context atoms away by SLDNF resolution, and a local Bayesian network is built to calculate the probability of the query. They proved the soundness and completeness of their query evaluation procedure under the condition that programs are acyclic66 and domains are finite.

Instead of defining a local distribution for each query, Poole (1993) defined a global distribution in his "probabilistic Horn abduction". His program consists of definite clauses and disjoint declarations of the form disjoint([h1:p1,...,hn:pn]) which specifies a probability distribution over the hypotheses (abducibles) {h1, ..., hn}. He assigned probabilities to all ground atoms with the help of the theory of logic programming, and furthermore proved that Bayesian networks are representable in his framework. Unlike previous approaches, his language contains function symbols, but the acyclicity condition imposed on the programs for his semantics to be definable seems to be a severe restriction. Also, probabilities are not defined for quantified formulas.

Bacchus et al. (1996) used a much more powerful first-order probabilistic language than clauses annotated with probabilities. Their language allows a statistically quantified term such as ||φ(x) | ψ(x)||x to denote the ratio of individuals in a finite domain satisfying φ(x) ∧ ψ(x) to those satisfying ψ(x). Assuming that every world (interpretation for their language) is equally likely, they define the probability of a sentence φ under the given knowledge base KB as the limit lim_{N→∞} #worldsN(φ ∧ KB) / #worldsN(KB), where #worldsN(ψ) is the number of possible worlds containing N individuals satisfying ψ (tolerance parameters are used in judging approximations). Although the limit does not necessarily exist and the domain must be finite, they showed that their method can cope with difficulties arising from "direct inference" and default reasoning.

In a more linguistic vein, Muggleton (1996, and others) formulated SLPs (stochastic logic programs) procedurally, as an extension of PCFGs to probabilistic logic programs. So, a clause C, which must be range-restricted,67 is annotated with a probability p like p : C. The probability of a goal G is the product of such p's appearing in its refutation, but with a modification such that if a subgoal g can invoke n clauses, pi : Ci (1 ≤ i ≤ n), at some refutation step, the probability of choosing the k-th clause is normalized to pk / Σ_{i=1}^{n} pi. More recently, Cussens (1999, 2001) enriched SLPs by introducing a special class of log-linear models for SLD refutations w.r.t. a given goal. He for example considers all possible SLD refutations for the most general goal s(X) and defines the probability P(R) of a refutation R as P(R) = Z^-1 exp(Σ_i λi ν(R, i)). Here λi is a number associated with a clause Ci and ν(R, i) is a feature, i.e. the number of occurrences of Ci in R. Z is the normalizing constant. Then, the probability assigned to s(a) is the sum of the probabilities of refutations for s(a).

66. The condition says that every ground atom A must be assigned a unique integer n(A) such that n(A) > n(B1), ..., n(Bn) holds for any ground instance of a clause of the form A ← B1, ..., Bn. Under this condition, when a program includes p(X) ← q(X, Y), we cannot write recursive clauses about q such as q(X, [H|Y]) ← q(X, Y).
67. A syntactic property that variables appearing in the head also appear in the body of a clause. A unit clause must be ground.


7.2 Limitations and Potential Problems

Approaches described so far have more or less similar limitations and potential problems. Descriptive power confined to finite domains is the most common limitation, which is due to the use of the linear programming technique (Nilsson, 1986), or due to the syntactic restrictions not allowing for infinitely many constant, function or predicate symbols (Ng & Subrahmanian, 1992; Lakshmanan & Sadri, 1994). Bayesian networks have the same limitation as well (only a finite number of random variables are representable).68 Also there are various semantic/syntactic restrictions on logic programs. For instance the acyclicity condition imposed by Poole (1993) and Ngo and Haddawy (1997) prevents the unconditional use of clauses with local variables, and the range-restrictedness imposed by Muggleton (1996) and Cussens (1999) excludes programs such as the usual membership Prolog program.

There is another type of problem, the possibility of assigning conflicting probabilities to logically equivalent formulas. In SLPs, P(A) and P(A ∧ A) do not necessarily coincide because A and A ∧ A may have different refutations (Muggleton, 1996; Cussens, 1999, 2001). Consequently in SLPs, we would be in trouble if we naively interpret P(A) as the probability of A's being true. Also assigning probabilities to arbitrary quantified formulas seems out of scope of both approaches to SLPs.

Last but not least, there is a big problem common to any approach using probabilities: where do the numbers come from? Generally speaking, if we use n binary random variables in a model, we have to determine 2^n probabilities to completely specify their joint distribution, and fulfilling this requirement with reliable numbers quickly becomes impossible as n grows. The situation is even worse when there are unobservable variables in the model such as possible causes of a disease. Apparently parameter learning from observed data is a natural solution to this problem, but parameter learning of logic programs has not been well studied.

Distribution semantics proposed by Sato (1995) was an attempt to solve these problems along the line of the global distribution approach. It defines a distribution (probability measure) over the possible interpretations of ground atoms for an arbitrary logic program in any first order language and assigns consistent probabilities to all closed formulas. Also distribution semantics enabled us to derive an EM algorithm for the parameter learning of logic programs for the first time. As it was a naive algorithm however, dealing with large problems was difficult when there are exponentially many explanations for an observation, as in HMMs. We believe that the efficiency problem is solved to a large extent by the graphical EM algorithm presented in this paper.

7.3 EM Learning

Since EM learning is one of the central issues in this paper, we separately mention work related to EM learning for symbolic frameworks. Koller and Pfeffer (1997) used, in their approach to KBMC (knowledge-based model construction), EM learning to estimate parameters labeling clauses. They express probabilistic dependencies among events by definite clauses annotated with probabilities, similarly to Ngo and Haddawy's (1997) approach, and locally build a Bayesian network relevant to the context and evidence as well as the

68. However, RPMs (recursive probability models) proposed by Pfeffer and Koller (2000) as an extension of Bayesian networks allow for infinitely many random variables. They are organized as attributes of classes and a probability measure over attribute values is introduced.


query. Parameters are learned by applying to the constructed network the specialized EM algorithm for Bayesian networks (Castillo et al., 1997). Dealing with a PCFG by a statically constructed Bayesian network was proposed by Pynadath and Wellman (1998), and it is possible to combine the EM algorithm with their method to estimate parameters in the PCFG. Unfortunately, the constructed network is not singly connected, and the time complexity of probability computation is potentially exponential in the length of an input sentence.

Closely related to our EM learning is parameter learning of log-linear models. Riezler (1998) proposed the IM algorithm in his approach to probabilistic constraint programming. The IM algorithm is a general parameter estimation algorithm from incomplete data for log-linear models whose probability function P(x) takes the form P(x) = Z^-1 exp(Σ_{i=1}^{n} λi νi(x)) p0(x), where (λ1, ..., λn) are parameters to be estimated, νi(x) the i-th feature of an observed object x and Z the normalizing constant. Since a feature can be any function of x, the log-linear model is highly flexible and includes our distribution Pmsw as a special case of Z = 1. There is a price to pay however: the computational cost of Z. It requires a summation over exponentially many terms. To avoid the cost of exact computation, approximate computation by a Monte Carlo method is possible. Whichever one may choose however, learning time increases compared to the EM algorithm for Z = 1. The FAM (failure-adjusted maximization) algorithm proposed by Cussens (2001) is an EM algorithm applicable to pure normalized SLPs that may fail. It deals with a special class of log-linear models but is more efficient than the IM algorithm. Because the statistical framework of the FAM is rather different from distribution semantics, comparison with the graphical EM algorithm seems difficult.

Being slightly tangential to EM learning, Koller et al. (1997) developed a functional modeling language defining a probability distribution over symbolic structures in which they showed "caching" of computed results leads to efficient probability computation of singly connected Bayesian networks and PCFGs. Their caching corresponds to the computation of inside probability in the Inside-Outside algorithm and the computation of outside probability is untouched.
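As a small illustration of the Z = 1 remark (our own elaboration, not a claim taken from Riezler, 1998): if the i-th feature νi(x) counts how many times the i-th probabilistic choice occurs in an explanation x and we set λi = log θi, then exp(Σ_{i=1}^{n} λi νi(x)) = Π_{i=1}^{n} θi^{νi(x)}, which is exactly the product form that Pmsw assigns to an explanation; in that case no normalization over all x is required.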

7.4 Future Directions

Parameterized logic programs are expected to be a useful modeling tool for complex symbolic-statistical phenomena. We have tried various types of modeling, besides stochastic grammars and Bayesian networks, such as the modeling of gene inheritance in the Kariera tribe (White, 1963) where the rules of bi-lateral cross-cousin marriage for four clans interact with the rules of genetic inheritance (Sato, 1998). The model was quite interdisciplinary, but the flexibility of combining msw atoms by means of definite clauses greatly facilitated the modeling process. Although satisfying the five conditions in Section 4

- the uniqueness condition (roughly, one cause yields one effect)
- the finite support condition (there are a finite number of explanations for one observation)
- the acyclic support condition (explanations must not be cyclic)


- the t-exclusiveness condition (explanations must be mutually exclusive)
- the independence condition (events in an explanation must be independent)

for the applicability of the graphical EM algorithm seems daunting, our modeling experiences so far tell us that the modeling principle in Section 4 effectively guides us to successful modeling. In return, we can obtain a declarative model described compactly by a high level language whose parameters are efficiently learnable by the graphical EM algorithm as shown in the preceding section.

One of the future directions is however to relax some of the applicability conditions, especially the uniqueness condition that prohibits a generative model from failure or from generating multiple observable events. Although we pointed out in Section 4.4 that the MAR condition in Appendix B adapted to our semantics can replace the uniqueness condition and validates the use of the graphical EM algorithm even when a complete data does not uniquely determine the observed data, just like the case of "partially bracketed corpora" (Pereira & Schabes, 1992), we feel the need to do more research on this topic. Also investigating the role of the acyclicity condition seems theoretically interesting as acyclicity is often related to the learning of logic programs (Arimura, 1997; Reddy & Tadepalli, 1998).

In this paper we only scratched the surface of individual research fields such as HMMs, PCFGs and Bayesian networks. Therefore, there remains much to be done about clarifying how experiences in each research field are reflected in the framework of parameterized logic programs. For example, we need to clarify the relationship between symbolic approaches to Bayesian networks such as SPI (Li & D'Ambrosio, 1994) and our approach. Also it is unclear how a compiled approach using the junction tree algorithm for Bayesian networks can be incorporated into our approach. Aside from exact methods, approximate methods of probability computation specialized for parameterized logic programs must also be developed. There is also a direction of improving learning ability by introducing priors instead of ML estimation to cope with data sparseness. The introduction of basic distributions that make probabilistic switches correlated seems worth trying in the near future. It is also important to take advantage of the logical nature of our approach to handle uncertainty. For example, it is already shown by Sato (2001) that we can learn parameters from negative examples such as "the grass is not wet", but the treatment of negative examples in parameterized logic programs is still in its infancy.

Concerning the development of complex statistical models based on the "programs as distributions" scheme, stochastic natural language processing which exploits semantic information seems promising. For instance, unification-based grammars such as HPSGs (Abney, 1997) may be a good target beyond PCFGs because they use feature structures which are logically describable, and the ambiguity of feature values seems to be expressible by a probability distribution. Also building a mathematical basis for logic programs with continuous random variables is a challenging research topic.


8. Conclusion

We have proposed a logical/mathematical framework for statistical parameter learning of parameterized logic programs, i.e. definite clause programs containing probabilistic facts with a parameterized probability distribution. It extends the traditional least Herbrand model semantics in logic programming to distribution semantics, a possible world semantics with a probability distribution over possible worlds (Herbrand interpretations) which is unconditionally applicable to arbitrary logic programs including ones for HMMs, PCFGs and Bayesian networks.

We also have presented a new EM algorithm, the graphical EM algorithm in Section 4, which learns statistical parameters from observations for a class of parameterized logic programs representing a sequential decision process in which each decision is exclusive and independent. It works on support graphs, a new data structure specifying a logical relationship between an observed goal and its explanations, and estimates parameters by computing inside and outside probability generalized for logic programs. The complexity analysis in Section 5 showed that when OLDT search, a complete tabled refutation method for logic programs, is employed for the support graph construction and table access is done in O(1) time, the graphical EM algorithm, despite its generality, has the same time complexity as existing EM algorithms, i.e. the Baum-Welch algorithm for HMMs, the Inside-Outside algorithm for PCFGs and the one for singly connected Bayesian networks that have been developed independently in each research field. In addition, for pseudo probabilistic context sensitive grammars with N nonterminals, we showed that the graphical EM algorithm runs in time O(N⁴L³) for a sentence of length L.

To compare the actual performance of the graphical EM algorithm against the Inside-Outside algorithm, we conducted learning experiments with PCFGs in Section 6 using two real corpora with contrasting characters. One is ATR corpus containing short sentences for which the grammar is not very ambiguous (958 parses/sentence), and the other is EDR corpus containing long sentences for which the grammar is rather ambiguous (3.0 × 10^8 at average sentence length 20). In both cases, the graphical EM algorithm outperformed the Inside-Outside algorithm by orders of magnitude in terms of time per iteration, which suggests the effectiveness of our approach to EM learning by the graphical EM algorithm. Since our semantics is not limited to finite domains or finitely many random variables but applicable to any logic programs of arbitrary complexity, the graphical EM algorithm is expected to give a general yet efficient method of parameter learning for models of complex symbolic-statistical phenomena governed by rules and probabilities.

Acknowledgments

The authors wish to thank three anonymous referees for their comments and suggestions. Special thanks go to Takashi Mori and Shigeru Abe for stimulating discussions and learning experiments, and also to Tanaka-Tokunaga Laboratory for kindly allowing them to use MSLR parser and the linguistic data.


Appendix A. Properties of PDB

In this appendix, we list some properties of PDB defined by a parameterized logic program DB = F ∪ R in a countable first-order language L.69 First of all, PDB assigns consistent probabilities70 to every closed formula φ in L by

  PDB(φ) def= PDB({ω ∈ Ω_DB | ω ⊨ φ})

while guaranteeing continuity in the sense that

  lim_{n→∞} PDB(φ(t1) ∧ ... ∧ φ(tn)) = PDB(∀x φ(x))
  lim_{n→∞} PDB(φ(t1) ∨ ... ∨ φ(tn)) = PDB(∃x φ(x))

where t1, t2, ... is an enumeration of the ground terms in L. The next proposition, Proposition A.1, relates PDB to the Herbrand model. To prove it, we need some terminology. A factor is a closed formula in prenex disjunctive normal form Q1 ... Qn M, where Qi (1 ≤ i ≤ n) is either an existential quantification or a universal quantification and M a matrix. The length of quantifications n is called the rank of the factor. Define Φ as the set of formulas made out of factors, conjunctions and disjunctions. Associate with each formula φ in Φ a multi-set r(φ) of ranks by

  r(φ) = ∅                if φ is a factor with no quantification
  r(φ) = {n}              if φ is a factor with rank n
  r(φ) = r(φ1) ⊎ r(φ2)    if φ = φ1 ∨ φ2 or φ = φ1 ∧ φ2.

Here ⊎ stands for the union of two multi-sets. For instance {1,2,3} ⊎ {2,3,4} = {1,2,2,3,3,4}. We use the multi-set ordering in the proof of Proposition A.1 because the usual induction on the complexity of formulas does not work.

Lemma A.1 Let φ be a boolean formula made out of ground atoms in L. PDB(φ) = PF({ν ∈ Ω_F | MDB(ν) ⊨ φ}).

(Proof) We have only to prove the lemma for a conjunction of atoms of the form D1^x1 ∧ ... ∧ Dn^xn (xi ∈ {0,1}, 1 ≤ i ≤ n).

  PDB(D1^x1 ∧ ... ∧ Dn^xn)
    = PDB({ω ∈ Ω_DB | ω ⊨ D1^x1 ∧ ... ∧ Dn^xn})
    = PDB(D1 = x1, ..., Dn = xn)
    = PF({ν ∈ Ω_F | MDB(ν) ⊨ D1^x1 ∧ ... ∧ Dn^xn})    Q.E.D.
Proposition A.1 Let φ be a closed formula in L. PDB(φ) = PF({ν ∈ Ω_F | MDB(ν) ⊨ φ}).

69. For definitions of Ω_F, PF, MDB(ν), Ω_DB, PDB and others used below, see Section 3.
70. By consistent, we mean probabilities assigned to logical formulas respect the laws of probability such as 0 ≤ P(A) ≤ 1, P(¬A) = 1 − P(A) and P(A ∨ B) = P(A) + P(B) − P(A ∧ B).


(Proof) Recall that a closed formula has an equivalent prenex disjunctive normal form that belongs to Φ. We prove the proposition for formulas in Φ by using induction on the multi-set ordering over {r(φ) | φ ∈ Φ}. If r(φ) = ∅, φ has no quantification, so the proposition is correct by Lemma A.1. Suppose otherwise. Write φ = G[Q1 Q2 ... Qn F], where Q1 Q2 ... Qn F indicates a single occurrence of a factor in G.71 We assume Q1 = ∃x (Q1 = ∀x is similarly treated). We also assume that bound variables are renamed to avoid name clash. Then G[∃x Q2 ... Qn F] is equivalent to ∃x G[Q2 ... Qn F] in light of the validity of (∃x A) ∧ B = ∃x (A ∧ B) and (∃x A) ∨ B = ∃x (A ∨ B) when B contains no free x.

  PDB(φ) = PDB(G[Q1 Q2 ... Qn F])
         = PDB(∃x G[Q2 ... Qn F[x]])
         = lim_{k→∞} PDB(G[Q2 ... Qn F[t1]] ∨ ... ∨ G[Q2 ... Qn F[tk]])
         = lim_{k→∞} PDB(G[Q2 ... Qn F[t1] ∨ ... ∨ Q2 ... Qn F[tk]])
         = lim_{k→∞} PF({ν ∈ Ω_F | MDB(ν) ⊨ G[Q2 ... Qn F[t1] ∨ ... ∨ Q2 ... Qn F[tk]]})
           (by induction hypothesis)
         = PF({ν ∈ Ω_F | MDB(ν) ⊨ ∃x G[Q2 ... Qn F[x]]})
         = PF({ν ∈ Ω_F | MDB(ν) ⊨ φ})    Q.E.D.

We next prove a theorem on the iff definition introduced in Section 4. Distribution semantics considers the program DB = F ∪ R as a set of infinitely many ground definite clauses such that F is a set of facts (with a probability measure PF) and R a set of rules, and no clause head in R appears in F. Put

  head(R) def= {B | B appears in R as a clause head}.

For B ∈ head(R), let B ← Wi (i = 1, 2, ...) be an enumeration of the clauses about B in R. Define iff(B), the iff (if-and-only-if) form of the rules about B in DB,72 by

  iff(B) def= B ↔ W1 ∨ W2 ∨ ...

Since MDB(ν) is a least Herbrand model, the following is obvious.

Lemma A.2 For B in head(R) and ν ∈ Ω_F, MDB(ν) ⊨ iff(B).

Theorem A.1 below is about iff(B). It states that at the general level, both sides of the iff definition p(x) ↔ ∃y1 (x = t1 ∧ W1) ∨ ... ∨ ∃yn (x = tn ∧ Wn) of p(.) coincide as random variables whenever x is instantiated to a ground term.

Theorem A.1 Let iff(B) = B ↔ W1 ∨ W2 ∨ ... be the iff form of the rules about B ∈ head(R). PDB(iff(B)) = 1 and PDB(B) = PDB(W1 ∨ W2 ∨ ...).

71. For an expression E, E[α] means that α may occur in the specified positions of E. If φ1 ∨ φ2 in E[φ1 ∨ φ2] indicates a single occurrence of φ1 ∨ φ2 in a positive boolean formula E, E[φ1 ∨ φ2] = E[φ1] ∨ E[φ2] holds.
72. This definition is different from the usual one (Lloyd, 1984; Doets, 1994) as we are here talking at ground level. W1 ∨ W2 ∨ ... is true if and only if one of the disjuncts is true.


(Proof)

  PDB(iff(B))
    = PDB({ω ∈ Ω_DB | ω ⊨ B ∧ (W1 ∨ W2 ∨ ...)}) + PDB({ω ∈ Ω_DB | ω ⊨ ¬B ∧ ¬(W1 ∨ W2 ∨ ...)})
    = lim_{k→∞} PDB({ω ∈ Ω_DB | ω ⊨ B ∧ (W1 ∨ ... ∨ Wk)})
      + lim_{k→∞} PDB({ω ∈ Ω_DB | ω ⊨ ¬B ∧ ¬(W1 ∨ ... ∨ Wk)})
    = lim_{k→∞} PF({ν ∈ Ω_F | MDB(ν) ⊨ B ∧ (W1 ∨ ... ∨ Wk)})
      + lim_{k→∞} PF({ν ∈ Ω_F | MDB(ν) ⊨ ¬B ∧ ¬(W1 ∨ ... ∨ Wk)})    (Lemma A.1)
    = PF({ν ∈ Ω_F | MDB(ν) ⊨ iff(B)})
    = PF(Ω_F)    (Lemma A.2)
    = 1

It follows from PDB(iff(B)) = 1 that PDB(B) = PDB(B ∧ iff(B)) = PDB(W1 ∨ W2 ∨ ...). Q.E.D.

We then prove a proposition useful in probability computation. Let ψDB(B) be the support set for an atom B introduced in Section 4 (it is the set of all explanations for B). In the sequel, B is a ground atom. Write ψDB(B) = {S1, S2, ...} and ∨ψDB(B) = S1 ∨ S2 ∨ ...73 Define a set Λ_B by

  Λ_B def= {ω ∈ Ω_DB | ω ⊨ B ↔ ∨ψDB(B)}.

Proposition A.2 For every B ∈ head(R), PDB(Λ_B) = 1 and PDB(B) = PDB(∨ψDB(B)).

(Proof) We first prove PDB(Λ_B) = 1, but the proof exactly parallels that of Theorem A.1 except that W1 ∨ W2 ∨ ... is replaced by S1 ∨ S2 ∨ ..., using the fact that B ↔ S1 ∨ S2 ∨ ... is true in every least Herbrand model of the form MDB(ν). Then from PDB(Λ_B) = 1, we have PDB(B) = PDB(B ∧ (B ↔ ∨ψDB(B))) = PDB(∨ψDB(B)). Q.E.D.

Finally, we show that distribution semantics is a probabilistic extension of the traditional least Herbrand model semantics in logic programming by proving Theorem A.2. It says that the probability mass is distributed exclusively over possible least Herbrand models. Define Λ as the set of least Herbrand models generated by fixing R and varying a subset of F in the program DB = F ∪ R. In symbols,

73. For a set K = {E1, E2, ...} of formulas, ∨K denotes a (possibly infinite) disjunction E1 ∨ E2 ∨ ...


  Λ def= {ω ∈ Ω_DB | ω = MDB(ν) for some ν ∈ Ω_F}.

Note that as Λ is merely a subset of Ω_DB, we cannot conclude PDB(Λ) = 1 a priori, but the next theorem, Theorem A.2, states PDB(Λ) = 1, i.e. distribution semantics distributes the probability mass exclusively over Λ, i.e. possible least Herbrand models. To prove the theorem, we need some preparations. Recalling that atoms outside head(R) ∪ F have no chance of being proved from DB, we introduce

  Λ0 def= {ω ∈ Ω_DB | ω ⊨ ¬D for every ground atom D ∉ head(R) ∪ F}.

For a Herbrand interpretation ω ∈ Ω_DB, ω|F (∈ Ω_F) is the restriction of ω to those atoms in F.

Lemma A.3 Let ω ∈ Ω_DB be a Herbrand interpretation. ω = MDB(ν) for some ν ∈ Ω_F iff ω ∈ Λ0 and ω ⊨ B ↔ ∨ψDB(B) for every B ∈ head(R).

(Proof) The only-if part is immediate from the property of the least Herbrand model. For the if-part, suppose ω satisfies the right hand side. We show that ω = MDB(ω|F). As ω and MDB(ω|F) coincide w.r.t. atoms not in head(R), it is enough to prove that they also give the same truth values to atoms in head(R). Take B ∈ head(R) and write ∨ψDB(B) = S1 ∨ S2 ∨ ... Suppose ω ⊨ B ↔ S1 ∨ S2 ∨ ... Then if ω ⊨ B, we have ω ⊨ Sj for some j, thereby ω|F ⊨ Sj, and hence MDB(ω|F) ⊨ Sj, which implies MDB(ω|F) ⊨ B. Otherwise ω ⊨ ¬B. So ω ⊨ ¬Sj for every j. It follows that MDB(ω|F) ⊨ ¬B. Since B is arbitrary, we conclude that ω and MDB(ω|F) agree on the truth values assigned to atoms in head(R) as well. Q.E.D.

Theorem A.2 PDB(Λ) = 1.

(Proof) From Lemma A.3, we have

  Λ = {ω ∈ Ω_DB | ω = MDB(ν) for some ν ∈ Ω_F} = Λ0 ∩ ⋂_{B ∈ head(R)} Λ_B.

PDB(Λ_B) = 1 by Proposition A.2. To prove PDB(Λ0) = 1, let D1, D2, ... be an enumeration of the atoms not belonging to head(R) ∪ F. They are not provable from DB = F ∪ R, and hence false in every least Herbrand model MDB(ν) (ν ∈ Ω_F). So

  PDB(Λ0) = lim_{m→∞} PDB({ω ∈ Ω_DB | ω ⊨ ¬D1 ∧ ... ∧ ¬Dm})
          = lim_{m→∞} PF({ν ∈ Ω_F | MDB(ν) ⊨ ¬D1 ∧ ... ∧ ¬Dm})
          = PF(Ω_F) = 1.

Since a countable conjunction of measurable sets of probability measure one also has probability measure one, it follows from PDB(Λ_B) = 1 for every B ∈ head(R) and PDB(Λ0) = 1 that PDB(Λ) = 1. Q.E.D.
Sato & Kameya Appendix B. The MAR (missing at random) Condition

In the original formulation of the EM algorithm by Dempster et al. (1977), it is assumed that there exists a many-to-one mapping y = (x) from a complete data x to an incomplete (observed) data y. In the case of parsing, x is a parse tree and y is the input sentence and x uniquely determines y. In this paper, the uniqueness condition ensures the existence of such a many-to-one mapping from explanations to observations. We however sometimes face a situation where there is no such many-to-one mapping from complete data to incomplete data but nonetheless we wish to apply the EM algorithm. This dilemma can be solved by the introduction of a missing-data mechanism which makes a complete data incomplete. The missing-data mechanism, m, has a distribution g (m j x) parameterized by  and y , the observed data, is described as y = m (x). It says x becomes incomplete y by m. The correspondence between x and y , i.e. fhx; y i j 9m(y = m (x))g naturally becomes many-to-many. Rubin (1976) derived two conditions on g (data are missing at random and data are observed at random) collectively called the MAR (missing at random) condition, and showed that if we assume a missing-data mechanism behind our observations that satis es the MAR condition, we may estimate parameters of the distribution over x by simply applying the EM algorithm to y, the observed data. We adapt the MAR condition to parameterized logic programs as follows. We keep a generative model satisfying the uniqueness condition that outputs goals G such as parse trees. We further extend the model by additionally inserting a missing-data mechanism m between G and our observation O like O = m (G) and assume m satis es the MAR condition. Then the extended model has a many-to-many correspondence between explanations and observations, and generates non-exclusive observations such that P (O ^ O0 ) > 0 (O 6= O0 ), which causes O P (O)  1 where P (O) = G:9m O= (G) PDB (G). Thanks to the MAR condition however, we are still allowed to apply the EM algorithm to such nonexclusive observations. Put it di erently, even if the uniqueness condition is seemingly destroyed, the EM algorithm is applicable just by (imaginarily) assuming a missing-data mechanism satisfying the MAR condition. P

P

m

References

Abney, S. (1997). Stochastic attribute-value grammars. Computational Linguistics, 23 (4), 597{618. Arimura, H. (1997). Learning acyclic rst-order horn sentences from entailment. In Proceedings of the Eighth International Workshop on Algorithmic Learning Theory. Ohmsha/Springer-Verlag. Bacchus, F., Grove, A., Halpern, J., & Koller, D. (1996). From statistical knowledge bases to degrees of belief. Arti cial Intelligence, 87, 75{143. Baker, J. K. (1979). Trainable grammars for speech recognition. In Proceedings of Spring Conference of the Acoustical Society of America, pp. 547{550.

450

Parameter Learning of Logic Programs for Symbolic-statistical Modeling

Beil, F., Carroll, G., Prescher, D., Riezler, S., & Rooth, M. (1999). Inside-Outside estimation of a lexicalized PCFG for German. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), pp. 269{276. Breese, J. S. (1992). Construction of belief and decision networks. Computational Intelligence, 8 (4), 624{647. Carroll, G., & Rooth, M. (1998). Valence induction with a head-lexicalized PCFG. In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP 3). Castillo, E., Gutierrez, J. M., & Hadi, A. S. (1997). Expert Systems and Probabilistic Network Models. Springer-Verlag. Charniak, E., & Carroll, G. (1994). Context-sensitive statistics for improved grammatical language models. In Proceedings of the 12th National Conference on Arti cial Intelligence (AAAI'94), pp. 728{733. Chi, Z., & Geman, S. (1998). Estimation of probabilistic context-free grammars. Computational Linguistics, 24 (2), 299{305. Chow, Y., & Teicher, H. (1997). Probability Theory (3rd ed.). Springer. Clark, K. (1978). Negation as failure. In Gallaire, H., & Minker, J. (Eds.), Logic and Databases, pp. 293{322. Plenum Press. Cormen, T., Leiserson, C., & Rivest, R. (1990). Introduction to Algorithms. The MIT Press. Cussens, J. (1999). Loglinear models for rst-order probabilistic reasoning. In Proceedings of the 15th Conference on Uncertainty in Arti cial Intelligence (UAI'99), pp. 126{133. Cussens, J. (2001). Parameter estimation in stochastic logic programs. Machine Learning, 44 (3), 245{271. D'Ambrosio, B. (1999). Inference in Bayesian networks. AI Magazine, summer, 21{36. Dekhtyar, A., & Subrahmanian, V. S. (1997). Hybrid probabilistic programs. In Proceedings of the 14th International Conference on Logic Programming (ICLP'97), pp. 391{405. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Royal Statistical Society, B39 (1), 1{38. Doets, K. (1994). From Logic to Logic Programming. The MIT Press. Flach, P., & Kakas, A. (Eds.). (2000). Abduction and Induction { Essays on Their Relation and Integration. Kluwer Academic Publishers. Frish, A., & Haddawy, P. (1994). Anytime deduction for probabilistic logic. Journal of Arti cial Intelligence, 69, 93{122. Fujisaki, T., Jelinek, F., Cocke, J., Black, E., & Nishino, T. (1989). A probabilistic parsing method for sentence disambiguation. In Proceedings of the 1st International Workshop on Parsing Technologies, pp. 85{94. Japan EDR, L. (1995). EDR electronic dictionary technical guide (2nd edition). Technical report, Japan Electronic Dictionary Research Institute, Ltd. 451
Kakas, A. C., Kowalski, R. A., & Toni, F. (1992). Abductive logic programming. Journal of Logic and Computation, 2(6), 719–770.
Kameya, Y. (2000). Learning and Representation of Symbolic-Statistical Knowledge (in Japanese). Ph.D. dissertation, Tokyo Institute of Technology.
Kameya, Y., & Sato, T. (2000). Efficient EM learning for parameterized logic programs. In Proceedings of the 1st Conference on Computational Logic (CL2000), Vol. 1861 of Lecture Notes in Artificial Intelligence, pp. 269–294. Springer.
Kita, K. (1999). Probabilistic Language Models (in Japanese). Tokyo Daigaku Syuppan-kai.
Koller, D., McAllester, D., & Pfeffer, A. (1997). Effective Bayesian inference for stochastic programs. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI'97), pp. 740–747.
Koller, D., & Pfeffer, A. (1997). Learning probabilities for noisy first-order rules. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI'97), pp. 1316–1321.
Kyburg, H. (1994). Uncertainty logics. In Gabbay, D., Hogger, C., & Robinson, J. (Eds.), Handbook of Logics in Artificial Intelligence and Logic Programming, pp. 397–438. Oxford Science Publications.
Lafferty, J. (1993). A derivation of the Inside-Outside Algorithm from the EM algorithm. Technical report, IBM T.J. Watson Research Center.
Lakshmanan, L. V. S., & Sadri, F. (1994). Probabilistic deductive databases. In Proceedings of the 1994 International Symposium on Logic Programming (ILPS'94), pp. 254–268.
Lari, K., & Young, S. J. (1990). The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4, 35–56.
Li, Z., & D'Ambrosio, B. (1994). Efficient inference in Bayes networks as a combinatorial optimization problem. International Journal of Approximate Reasoning, 11, 55–81.
Lloyd, J. W. (1984). Foundations of Logic Programming. Springer-Verlag.
Lukasiewicz, T. (1999). Probabilistic deduction with conditional constraints over basic events. Journal of Artificial Intelligence Research, 10, 199–241.
Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press.
McLachlan, G. J., & Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley Interscience.
Muggleton, S. (1996). Stochastic logic programs. In de Raedt, L. (Ed.), Advances in Inductive Logic Programming, pp. 254–264. IOS Press.
Ng, R., & Subrahmanian, V. S. (1992). Probabilistic logic programming. Information and Computation, 101, 150–201.
Ngo, L., & Haddawy, P. (1997). Answering queries from context-sensitive probabilistic knowledge bases. Theoretical Computer Science, 171, 147–177.
Nilsson, N. J. (1986). Probabilistic logic. Artificial Intelligence, 28, 71–87.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
Pereira, F. C. N., & Schabes, Y. (1992). Inside-Outside reestimation from partially bracketed corpora. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL'92), pp. 128–135.
Pereira, F. C. N., & Warren, D. H. D. (1980). Definite clause grammars for language analysis: a survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence, 13, 231–278.
Pfeffer, A., & Koller, D. (2000). Semantics and inference for recursive probability models. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI'00), pp. 538–544.
Poole, D. (1993). Probabilistic Horn abduction and Bayesian networks. Artificial Intelligence, 64(1), 81–129.
Pynadath, D. V., & Wellman, M. P. (1998). Generalized queries on probabilistic context-free grammars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), 65–77.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Rabiner, L. R., & Juang, B. (1993). Foundations of Speech Recognition. Prentice-Hall.
Ramakrishnan, I., Rao, P., Sagonas, K., Swift, T., & Warren, D. (1995). Efficient tabling mechanisms for logic programs. In Proceedings of the 12th International Conference on Logic Programming (ICLP'95), pp. 687–711. The MIT Press.
Reddy, C., & Tadepalli, P. (1998). Learning first-order acyclic Horn programs from entailment. In Proceedings of the 15th International Conference on Machine Learning (and Proceedings of the 8th International Conference on Inductive Logic Programming). Morgan Kaufmann.
Riezler, S. (1998). Probabilistic Constraint Logic Programming. Ph.D. thesis, Universität Tübingen.
Rubin, D. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Sagonas, K., Swift, T., & Warren, D. S. (1994). XSB as an efficient deductive database engine. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pp. 442–453.
Sato, T. (1995). A statistical learning method for logic programs with distribution semantics. In Proceedings of the 12th International Conference on Logic Programming (ICLP'95), pp. 715–729.
Sato, T. (1998). Modeling scientific theories as PRISM programs. In Proceedings of the ECAI'98 Workshop on Machine Discovery, pp. 37–45.
Sato, T. (2001). Minimum likelihood estimation from negative examples in statistical abduction. In Proceedings of the IJCAI-01 Workshop on Abductive Reasoning, pp. 41–47.
Sato, T., & Kameya, Y. (1997). PRISM: a language for symbolic-statistical modeling. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI'97), pp. 1330–1335.
Sato, T., & Kameya, Y. (2000). A Viterbi-like algorithm and EM learning for statistical abduction. In Proceedings of the UAI2000 Workshop on Fusion of Domain Knowledge with Data for Decision Support.
Sato, T., Kameya, Y., Abe, S., & Shirai, K. (2001). Fast EM learning of a family of PCFGs. Titech technical report (Dept. of CS) TR01-0006, Tokyo Institute of Technology.
Shen, Y., Yuan, L., You, J., & Zhou, N. (2001). Linear tabulated resolution based on Prolog control strategy. Theory and Practice of Logic Programming, 1(1), 71–103.
Sterling, L., & Shapiro, E. (1986). The Art of Prolog. The MIT Press.
Stolcke, A. (1995). An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2), 165–201.
Tamaki, H., & Sato, T. (1984). Unfold/fold transformation of logic programs. In Proceedings of the 2nd International Conference on Logic Programming (ICLP'84), Lecture Notes in Computer Science, pp. 127–138. Springer.
Tamaki, H., & Sato, T. (1986). OLD resolution with tabulation. In Proceedings of the 3rd International Conference on Logic Programming (ICLP'86), Vol. 225 of Lecture Notes in Computer Science, pp. 84–98. Springer.
Tanaka, H., Takezawa, T., & Etoh, J. (1997). Japanese grammar for speech recognition considering the MSLR method. In Proceedings of the Meeting of SIG-SLP (Spoken Language Processing), 97-SLP-15-25, pp. 145–150. Information Processing Society of Japan. In Japanese.
Uratani, N., Takezawa, T., Matsuo, H., & Morita, C. (1994). ATR integrated speech and language database. Technical report TR-IT-0056, ATR Interpreting Telecommunications Research Laboratories. In Japanese.
Warren, D. S. (1992). Memoing for logic programs. Communications of the ACM, 35(3), 93–111.
Wetherell, C. S. (1980). Probabilistic languages: a review and some open questions. Computing Surveys, 12(4), 361–379.
White, H. C. (1963). An Anatomy of Kinship. Prentice-Hall.
Zhang, N., & Poole, D. (1996). Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 5, 301–328.