INFORMATION FUSION, CAUSAL PROBABILISTIC NETWORK AND PROBANET II: Inference Algorithms and Probanet System Heping Pan, Daniel McMichael and Marta Lendjel Cooperative Research Centre for Sensor Signal and Information Processing SPRI Building, Technology Park, Adelaide, The Levels, SA 5095, Australia Email:
[email protected]
KEY WORDS: Information fusion, causal probabilistic network, Bayesian network, probabilistic inference, evidence propagation, causal tree propagation, junction tree propagation, graph triangulation, adaptive inference algorithms.
ABSTRACT
As an extension of an overview paper [Pan and McMichael, 1997] on information fusion and Causal Probabilistic Networks (CPN), this paper formalizes kernel algorithms for probabilistic inference upon CPNs. Information fusion is realized through updating the joint probabilities of the variables upon the arrival of new evidence or new hypotheses. Kernel algorithms for some dominant methods of inference are formalized from discontiguous, mathematics-oriented literature, with gaps filled in with regard to computability and completeness. In particular, possible optimizations of the causal tree algorithm, graph triangulation and the junction tree algorithm are discussed. Probanet has been designed and developed as a generic shell, or mother system, for CPN construction and application. The design aspects and current status of Probanet are described. A few directions for research and system development are pointed out, including hierarchical structuring of networks, structure decomposition and adaptive inference algorithms. This paper is thus integrative in nature, combining literature review, algorithm formalization and future perspectives.
1 Introduction
This is the second paper in our review series on information fusion and causal probabilistic networks. While the first paper [Pan and McMichael, 1997] focused on the information fusion infrastructure and probabilistic knowledge representation, this second paper concentrates on formalizing kernel inference algorithms and discussing various possibilities for optimizing them. We also describe an experimental shell, or mother system, called Probanet, which has provided an environment for us to develop and test various algorithms. Probanet has also served as a bearer of all the formal algorithms we have abstracted from mathematics-oriented literature and developed from our own research and practice.
This paper is organized as follows. Section 2 reviews the historical development of inference algorithms upon causal probabilistic networks (CPN). Section 3 formalizes the basic notion and notation of CPNs, and the primary inference method - marginalization of the full joint probability of all the variables. Section 4 formalizes the causal tree algorithm for propagating evidence through a directed tree - a special structure of general graphs. Section 5 formalizes the junction tree algorithm for propagating evidence through a decomposable graph. A key component of the junction tree algorithm is the transformation of a graph into a junction tree, which involves moralization, triangulation, junction graph formation and junction tree derivation. We show that the last three steps can be combined into one direct transformation from a moralized graph to a junction tree. Section 6 then describes the Probanet system, including the system objectives, design aspects, the graph editor for probabilistic knowledge representation through constructing a general network, adaptive inference algorithms, and some open directions for further development. This paper is therefore partly a review, partly an exercise in algorithm formalization and optimization, and partly a report on Probanet system development. It is advantageous that, through these three threads of effort, we have been able to identify and discover a number of unsolved and new problems, as well as new research directions open to the future.
2 A Review on Inference Algorithms
A causal probabilistic network (CPN) is a directed acyclic graph representing the joint probability distribution over a set of variables which totally defines a problem domain. If all the variables were fully dependent on each other, any inference on a particular variable or a subset of variables given some evidence or hypotheses would require calculating the full joint probability of all the variables. In general, however, the variables are neither fully dependent on each other nor completely isolated from each other, so the graph tends to be sparse. A smart inference algorithm necessarily exploits the sparse topology of the graph and makes efficient inferences without resorting to calculating the full joint probability (the full joint, in short, hereafter). The technical area of causal probabilistic modeling and inference is the culmination of converging endeavours into graphical probabilistic methods for automated inference, or reasoning under uncertainty, from several of the most prestigious academic disciplines, including probability and statistics, artificial intelligence and decision analysis.
Early Initiatives in Statistics and Artificial Intelligence: First of all, the relevant early work in statistics goes back to correlation and causation [Galton, 1888] and path analysis [Wright, 1921, 1934]. Similar ideas re-emerged in causal econometric and social models [Wold, 1954; Blalock, 1971; Joreskog, 1973]. Artificial intelligence [Barr and Feigenbaum, 1981] has approached the same area from symbolic knowledge-based expert systems. Apparently, CPNs can also represent 'if-then' production rules with or without certainty factors. In fact, expert systems such as PROSPECTOR [Duda et al., 1976], MYCIN [Buchanan and Shortliffe, 1984], INTERNIST [Miller et al., 1982], etc. attempted to handle uncertainty in reasoning with probability-like certainty factors, quasi-probabilistic calculi, and so on. Historically, there has been a long and furious debate on whether probability theory can handle all the issues of uncertainty, or whether we should create new tools such as fuzzy set theory, Dempster-Shafer belief theory, etc.
Subjective Probability and Causal Networks: Since the 1980s, researchers from many different disciplines have realized that probability, which started with Bernoulli, Bayes, Laplace and Pascal in the 1700s, is still the fundamental notion, and that probability theory still provides the unique coherent calculus for handling uncertainty in inference and reasoning. In particular, subjective probability, or Bayesian probability [Savage, 1972; de Finetti, 1974; Lindley, 1982], provides an appropriate tool for modeling causal probabilistic phenomena. Similar ideas of graphical representation of causal relations have re-emerged in various forms and under various terms, such as influence diagrams in decision analysis [Howard and Matheson, 1981; Shachter, 1986; Smith, 1989], recursive graphical models in contingency table analysis [Wermuth and Lauritzen, 1983], Bayes belief networks [Pearl, 1988], causal probabilistic networks [Andreassen et al., 1987], causal networks [Lauritzen and Spiegelhalter, 1988], and probabilistic causal networks [Cooper, 1984]. Today, the two most popular terms are causal probabilistic network (CPN) and Bayesian network. We have chosen the term CPN in this paper because it is plain English and characterizes the two most significant features of the approach: causation and probability. The term Bayesian network may be preferred by other professionals, but to the public it seems a bit demanding on what 'Bayesian' really means.
Computational Complexity: Not surprisingly, Cooper (1987) has shown that exact probabilistic inference using CPNs is NP-hard, meaning that the worst-case time complexity of probabilistic inference is an exponential function of the size of the network for all exact inference algorithms. This result precludes efficient inference algorithms for unconstrained general graphs. However, it does not prevent us from developing efficient algorithms when the graph is generally sparse. All smart algorithms developed later have relied on exploiting the special topological structures of the graph from different aspects.
Special Structures: Tree and Decomposable Graph: In fact, there are mainly two classes of network structure: trees and graphs. Trees are a subset of graphs, but a distinctive subset compared with general graphs, where cycles may appear if the direction of links is dropped. The advantage of trees is that inference can be done directly on the original structure with no need for structural transformation. For a general graph containing cycles, however, exact inference cannot be carried out unless the graph is transformed into a junction tree - a sort of hyper-tree. Therefore, the central idea underlying all smart inference algorithms is that of avoiding cyclic propagation of information throughout the network. Apparently, the best way to achieve this is to transform a graph into a hyper-tree - still a tree structure.
Inference Upon Trees: Trees in CPNs are distinguished between causal trees and polytrees. Causal trees are a basic type of network in which each node has at most one parent, and consequently no cycles exist. Causal polytrees are an extension of causal trees where arbitrary arrow orientation is allowed, so a node may have multiple parents, but no more than one path exists between any two nodes. The term polytree was suggested by George Rebane [Pearl, 1988, page 232] and is equivalent to singly connected network or generalized Chow tree. Inference on trees was first considered by Kelly and Barclay (1973). Pearl (1982) developed a method for probability updating on causal trees based on message passing. This is probably the very first complete exact inference algorithm ever known, though limited to causal trees. The method was extended to polytrees by Kim [Kim and Pearl, 1983] and was used in a decision-aiding system called CONVINCE [Kim, 1983]. This extended method is generally acknowledged as the polytree algorithm for inference on a special network structure - polytrees. While this algorithm is still widely applicable if a structured network happens to be a polytree, it cannot be generalized to general networks.
Inference Upon Decomposable Graphs:
The ideas of decomposable graphs trace back to graph triangulation and hypergraphs in graph theory. The notions of triangulated graphs and junction trees have been discovered and re-discovered in different areas, such as dynamic programming [Bertele and Brioschi, 1972] and database management [Beeri et al., 1983]. Two well-known graph triangulation algorithms are lexicographic search [Rose et al., 1976] and maximum cardinality search [Tarjan and Yannakakis, 1984]. Good references on decomposable graphs and graph triangulation are [Golumbic, 1980; Lauritzen et al., 1984; Leimer, 1985]. A complete inference scheme upon triangulated graphs was developed by Lauritzen and Spiegelhalter (1988), which is considered a milestone in exact inference on general decomposable graphs. However, the idea is similar to the method of 'joint-peeling' developed by Cannings et al. (1976, 1978) for the exact calculation of probability functions on arbitrarily complex pedigrees. Jensen et al. (1990) then modified the Lauritzen and Spiegelhalter inference scheme and proposed a better scheme based on message-passing through junction trees, now called the junction tree algorithm, or HUGIN propagation, where HUGIN is an integrated system developed by Jensen and his colleagues. A different message-passing scheme for junction trees was proposed by Shafer and Shenoy (1990). Dawid (1992) described applications of the junction tree algorithm for probabilistic expert systems.
Arc Reversal:
The decomposability of a CPN graph is equivalent to the conditional independence of a node given its parent nodes. While the class of junction-tree algorithms must transform the original graph into a hyper-tree, another smart algorithm, called arc reversal, was developed by [Howard and Matheson, 1981; Olmsted, 1983; Shachter, 1986, 1990], which exploits conditional independence without transforming the original graph structure. The algorithm only reverses arcs in the original structure until the answer to the given query can be read directly from the graph. Indeed, each arc reversal corresponds to an application of Bayes' theorem. Apparently, this method is query-oriented.
Symbolic Manipulation Algebra:
D'Ambrosio (1989, 1991), Shachter et al. (1990), and Li and D'Ambrosio (1994) developed an algebraic view of the inference problem. Consider that the full joint probability of a problem domain defined by n variables is the product of the conditional probability of each variable given its parents, and is uniquely defined given the network. Any conditional, marginal or conjunctive query in the network can be calculated from this full joint probability. Thus, efficient probabilistic inference in the network can also be viewed as a problem of finding an optimal factorization given the complete set of n conditional probability distributions. This sort of optimal factorization may be done through pure symbolic manipulation algebra without resorting to the graph structure.
Variable/Bucket Elimination: Related to the symbolic manipulation algebra, a more query-oriented solution has been developed under various names, such as variable elimination [Zhang and Poole, 1994, 1996] and bucket elimination [Dechter, 1996]. Rather than finding the posterior probability for each variable, this approach only processes that part of the network relevant to the query given the evidence and only does the work necessary to answer that query. The approach is called variable elimination by Zhang and Poole (1994) because it sums out non-queried variables from a list of factors one by one. The algorithm is able to make use of finer-grain factorizations by exploiting various causal independences.
Approximate Inference: Shachter et al. (1991) have shown that all exact inference methods in CPNs have complexities which are fundamentally identical, since they are all based on performing similar operations on a similar underlying undirected graph - a triangulated graph. However, when the network is very large, an exact inference algorithm may require more computing resources than are available. Under this circumstance, approximate inference methods can help. A major approximate inference method is stochastic simulation. The idea behind it is that the causal probabilistic model represented by the network can be used to simulate the flow of impact through random sampling in the state space of each variable and the configuration space of each set of related variables. Methods of this sort include logic sampling [Henrion, 1988], likelihood weighting [Fung and Chang, 1989; Shachter and Peot, 1989], backward simulation [Fung and Del Favero, 1994] and mean field theory [Saul et al., 1996]. Good references on Gibbs sampling in CPNs are [Gilks et al., 1994; Jensen et al., 1995]. Dagum and Luby (1993) show that approximate inference in CPNs is also NP-hard.
From the next section on, we formalize some kernel inference algorithms that have been implemented in Probanet. A range of possibilities for algorithm optimization are also discussed. We start with the primary brute-force method - full joint marginalization. Propagation algorithms on causal trees and junction trees will follow.
3 Causal Probabilistic Modeling and Full Joint Marginalization

A CPN $N$ is a directed acyclic graph (DAG) $G$ representing the joint probability distribution $P(\mathbf{V})$ over a set of variables $\mathbf{V}$ for a problem domain,

$$N = (G, \mathbf{P}) = (\mathbf{V}, \mathbf{L}, \mathbf{P}) \qquad (1)$$

where $G$ is a DAG defined by a set of nodes $\mathbf{V}$ and a set of directed links $\mathbf{L}$ over $\mathbf{V}$, i.e.

$$G = (\mathbf{V}, \mathbf{L}) \qquad (2)$$
$$\mathbf{L} \subseteq \mathbf{V} \times \mathbf{V} \qquad (3)$$

and $\mathbf{P}$ is a set of conditional probability distributions associated with each node $V \in \mathbf{V}$ given $V$'s parent nodes. Let us have a closer look at the notion and notation in more detail.

Notation: Let $\mathbf{V}$ denote a set of variables that characterize a problem domain. The variables are also called nodes of the graph $G$; we shall use the terms variable and node interchangeably. We use bold-face upper-case letters such as $\mathbf{V}$ to denote sets of variables. The number of elements in a set $\mathbf{X}$ is denoted by $|\mathbf{X}|$. We use upper-case letters such as $V$ or $V_i$ to denote variables that are elements of the set $\mathbf{V}$, and lower-case letters such as $v, v_i, x, y, z$ to denote instantiations of the corresponding variables to an actual value or state. The set of all possible values/states for a variable $V$ is called the frame of $V$, denoted $\Omega_V$. A configuration for a set of variables $\mathbf{V}$ refers to a combination of simultaneous instantiations of each variable $V \in \mathbf{V}$, and is denoted by $\mathbf{v}$. The set of all possible configurations for a given set of variables $\mathbf{V}$ is called the configuration space or the frame of the set $\mathbf{V}$, denoted $\Omega_{\mathbf{V}}$. Apparently,

$$\Omega_{\mathbf{V}} = \times_{V \in \mathbf{V}}\, \Omega_V \qquad (4)$$

For a variable $V \in \mathbf{V}$, if there is a directed link $L \in \mathbf{L}$ stemming from another variable $X \in \mathbf{V}$ and pointing to $V$, then $X$ is called a parent of $V$, and $V$ a child of $X$. Let $\Gamma^+_V$ and $\Gamma^-_V$ denote the set of parents and the set of children of $V$ respectively; their configurations are denoted by $\mathbf{v}^+$ and $\mathbf{v}^-$ respectively. A node $X$ is a neighbor of a node $V$ if there is a link, either directed or undirected, between $X$ and $V$. The set of neighbors of a node $V$ is denoted by $\Gamma_V$. For a DAG $G$, we have

$$\Gamma_V = \Gamma^+_V \cup \Gamma^-_V \qquad (5)$$

$\mathbf{P}$ denotes the set of conditional probability functions/tables associated with each variable given its parents:

$$\mathbf{P} = \{ P(V \mid \Gamma^+_V) \} \qquad (6)$$

The Full Joint Probability Distribution: The full joint probability distribution, or briefly the full joint, $P(\mathbf{V})$ of a set of variables $\mathbf{V} = (V_1, V_2, \ldots, V_n)$ is defined as

$$P(\mathbf{V}) = P(V_1, V_2, \ldots, V_n) \qquad (7)$$

Recall Bayes' rule for two dependent variables $A$ and $B$,

$$P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B) = 0 \rightarrow P(A \mid B) = 0 \qquad (8)$$

Using this rule, the full joint $P(\mathbf{V})$ can be factorized to

$$P(\mathbf{V}) = \prod_{V \in \mathbf{V}} P(V \mid \Gamma^+_V) \qquad (9)$$

This factorization is fundamental, as all existing inference methods were actually derived by smart manipulations of this equation.

Marginalization of the Full Joint Probability: An evidence $E(V)$ on a variable $V$ can be represented as a likelihood function

$$E(V) = (e(v_1), e(v_2), \ldots, e(v_{|\Omega_V|})) \qquad (10)$$

and

$$\sum_{i=1}^{|\Omega_V|} e(v_i) = 1 \qquad (11)$$

Usually, the evidence takes binary values 0 or 1. In that case, only one state is evidenced to 1 and all other states to 0; evidence in this form is called a finding. The application of an evidence to the node $V$ results in a posterior conditional probability of $V$ given its parents and this evidence,

$$P_e(V \mid \Gamma_V) = P(V, E(V) \mid \Gamma_V) = P(V \mid \Gamma_V)\, E(V) \qquad (12)$$

Note that if there is no evidence on a node $V$, we simply define

$$P_e(V \mid \Gamma_V) = P(V \mid \Gamma_V, E(V) = \emptyset) = P(V \mid \Gamma_V) \qquad (13)$$

Let $\mathbf{E}$ denote the whole set of evidence on the variable set $\mathbf{V}$. The posterior full joint probability of $\mathbf{V}$ given $\mathbf{E}$ is calculated by

$$P(\mathbf{V} \mid \mathbf{E}) = \frac{P(\mathbf{V}, \mathbf{E})}{P(\mathbf{E})} = \alpha\, P(\mathbf{V}, \mathbf{E}) = \alpha \prod_{V \in \mathbf{V},\, E(V) \in \mathbf{E}} P(V, E(V) \mid \Gamma_V) = \alpha \prod_{V \in \mathbf{V}} P_e(V \mid \Gamma_V) \qquad (14)$$

where

$$\alpha = \frac{1}{P(\mathbf{E})} \qquad (15)$$

Since $P(\mathbf{E})$ is a constant with regard to the variable set $\mathbf{V}$, we can normalize the factor $\alpha$ out. The very basic inference method is the marginalization of the full joint for each variable $V$. Let

$$\mathbf{U} = \mathbf{V} \setminus \{V\} \qquad (16)$$

and let $\mathbf{u}$ denote a configuration of $\mathbf{U}$. Then the belief $B(V)$ on $V$ given the evidence $\mathbf{E}$ can be calculated by

$$B(V) = P(V \mid \mathbf{E}) = \sum_{\mathbf{u} \in \Omega_{\mathbf{U}}} P(\mathbf{V} \mid \mathbf{E}) = \sum_{\mathbf{u} \in \Omega_{\mathbf{U}}} P(\{V\}, \mathbf{U} \mid \mathbf{E}) \qquad (17)$$

We can put this calculation in an algorithm:

Algorithm: Full Joint Marginalization (FJM). For $\mathbf{V} = (V_1, V_2, \ldots, V_{i-1}, V_i, V_{i+1}, \ldots, V_n)$, let $\Omega_i$ denote the frame of $V_i$. Then:

Step 1: calculate the posterior full joint

$$P(\mathbf{V} \mid \mathbf{E}) = P(V_1, V_2, \ldots, V_n \mid \mathbf{E}) = \alpha\, P_e(V_1 \mid \Gamma_{V_1}) P_e(V_2 \mid \Gamma_{V_2}) \cdots P_e(V_n \mid \Gamma_{V_n}) \qquad (18)$$

Step 2: marginalize for the variable $V_i$ being queried

$$B(V_i) = P(V_i \mid \mathbf{E}) = \sum_{v_1 \in \Omega_1} \sum_{v_2 \in \Omega_2} \cdots \sum_{v_{i-1} \in \Omega_{i-1}} \sum_{v_{i+1} \in \Omega_{i+1}} \cdots \sum_{v_n \in \Omega_n} P(v_1, v_2, \ldots, v_{i-1}, V_i, v_{i+1}, \ldots, v_n \mid \mathbf{E}) \qquad (19)$$

We shall refer to this algorithm as the FJM Algorithm.
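To make the FJM algorithm concrete, the following is a minimal sketch in Python, assuming a CPN given as a dictionary of conditional probability tables and findings given as state indices. The function and variable names (fjm_query, frames, cpts) are ours for illustration; this is not Probanet code.

```python
# Illustrative sketch of the FJM algorithm (equations 14-19).
# A CPN is given as cpts[V] = (parents, table), where the table axes are
# ordered (parents..., V) over the discrete frames listed in `frames`.
import itertools
import numpy as np

def fjm_query(cpts, frames, evidence, query):
    """Brute-force posterior P(query | evidence) by full joint marginalization."""
    order = list(frames)                      # fixed variable ordering V1..Vn
    belief = np.zeros(len(frames[query]))
    # enumerate every configuration of all variables (exponential, as expected)
    for config in itertools.product(*(range(len(frames[V])) for V in order)):
        state = dict(zip(order, config))
        p = 1.0
        for V, (parents, table) in cpts.items():
            idx = tuple(state[U] for U in parents) + (state[V],)
            p *= table[idx]                   # product of P(V | parents), eq (9)
        for V, observed in evidence.items():  # apply findings, eq (12)
            if state[V] != observed:
                p = 0.0
        belief[state[query]] += p             # marginalize, eq (19)
    return belief / belief.sum()              # normalize alpha out, eq (15)

# tiny two-node example with hypothetical numbers: Rain -> WetGrass
frames = {"Rain": ["no", "yes"], "WetGrass": ["no", "yes"]}
cpts = {
    "Rain": ([], np.array([0.8, 0.2])),
    "WetGrass": (["Rain"], np.array([[0.9, 0.1], [0.2, 0.8]])),
}
print(fjm_query(cpts, frames, {"WetGrass": 1}, "Rain"))  # -> [0.333..., 0.666...]
```

As the enumeration over all configurations makes obvious, the cost is exponential in the number of variables, which motivates the structure-exploiting algorithms of the following sections.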
4 CAUSAL TREE ALGORITHM

A causal tree $CT$ is a special type of CPN,

$$CT = (\mathbf{V}, \mathbf{L}, \mathbf{P}) \qquad (20)$$

where each variable $V \in \mathbf{V}$ may have multiple children, but no more than one parent, i.e.

$$|\Gamma^-_V| \geq 0, \qquad |\Gamma^+_V| \leq 1 \qquad (21)$$

Note that $\Gamma^+_V$ may be empty or a single variable, so we shall denote the set $\Gamma^+_V$ by the variable notation $\Gamma^+_V$ just to mean the parent variable of $V$, if any.

Figure 1: The partition of evidence on node V in a causal tree.

Consider a variable $V$ and its surroundings. The evidence $\mathbf{E}$ to the causal tree $CT$ can always be split into two distinct sets relative to variable $V$: evidence $\mathbf{E}^-_V$ from the sub-tree rooted at $V$, and evidence $\mathbf{E}^+_V$ from the rest of the tree. Written in equations, we have

$$\mathbf{E} = \mathbf{E}^-_V \cup \mathbf{E}^+_V \qquad (22)$$
Note that $\mathbf{E}^-_V$ naturally refers to the evidence from $V$ and/or some descendants of $V$, while $\mathbf{E}^+_V$ may or may not necessarily come from the ancestors of $V$; but $\mathbf{E}^+_V$ can only affect the belief of $V$ through its parent $\Gamma^+_V$. Fig. 1 shows this relative partition of evidence at node $V$. The belief induced on $V$ by the evidence $\mathbf{E}$ as partitioned in (22) can be expressed, using Bayes' rule, as

$$B(V) = P(V \mid \mathbf{E}^-_V, \mathbf{E}^+_V) = \frac{P(V, \mathbf{E}^-_V, \mathbf{E}^+_V)}{P(\mathbf{E}^-_V, \mathbf{E}^+_V)} = \frac{P(\mathbf{E}^-_V \mid V, \mathbf{E}^+_V)\, P(V \mid \mathbf{E}^+_V)\, P(\mathbf{E}^+_V)}{P(\mathbf{E}^-_V \mid \mathbf{E}^+_V)\, P(\mathbf{E}^+_V)} = \frac{P(\mathbf{E}^-_V \mid V)\, P(V \mid \mathbf{E}^+_V)}{P(\mathbf{E}^-_V \mid \mathbf{E}^+_V)} \qquad (23)$$

Let

$$\lambda(V) = P(\mathbf{E}^-_V \mid V) \qquad (24)$$
$$\pi(V) = P(V \mid \mathbf{E}^+_V) \qquad (25)$$
$$\alpha = \frac{1}{P(\mathbf{E}^-_V \mid \mathbf{E}^+_V)} \qquad (26)$$

Then equation (23) can be abbreviated to

$$B(V) = \alpha\, \lambda(V)\, \pi(V) \qquad (27)$$

Note that $\lambda(V)$ and $\pi(V)$ are two vectors indexed by the values $v$ of variable $V$. $\lambda(V)$ is the influence of the evidence $\mathbf{E}^-_V$ from $V$'s descendants and $V$ itself, while $\pi(V)$ is the influence of the evidence $\mathbf{E}^+_V$ mediated from $V$'s ancestors. Therefore, $\lambda(V)$ represents the diagnostic or retrospective support attributed to the assertion $V = v$ by $V$'s descendants, and $\pi(V)$ represents the predictive or causal support for that assertion by all non-descendants of $V$, mediated by $V$'s parent. Equation (27) is the basic formula for belief updating on each variable $V$ by fusing evidence coming from $V$'s descendants and $V$ itself with evidence from the rest of the tree, mediated by $V$'s parent. The $\lambda$'s are called bottom-up messages and the $\pi$'s top-down messages passing through each variable. To generalize this formula to the whole causal tree, we only need to elaborate $\lambda(V)$ and $\pi(V)$ into recursive formulas.

The Bottom-Up Messages $\lambda(V)$: The evidence $\mathbf{E}^-_V$ coming from $V$ itself and $V$'s children $\Gamma^-_V$ can be recursively partitioned between $V$ itself and the subtrees rooted at each of its children,

$$\mathbf{E}^-_V = \mathbf{E}^o_V \cup \bigcup_{C \in \Gamma^-_V} \mathbf{E}^-_C \qquad (28)$$

where $\mathbf{E}^o_V$ denotes the evidence on $V$ itself, $\mathbf{E}^-_C$ denotes the evidence from the subtree rooted at a child $C$ of $V$, and $C$ enumerates $V$'s children. Since $V$ separates its children, we have

$$\lambda(V) = P(\mathbf{E}^-_V \mid V) = P(\mathbf{E}^o_V \cup_{C \in \Gamma^-_V} \mathbf{E}^-_C \mid V) = P(\mathbf{E}^o_V \mid V) \prod_{C \in \Gamma^-_V} P(\mathbf{E}^-_C \mid V) \qquad (29)$$

Let

$$\lambda_o(V) = P(\mathbf{E}^o_V \mid V) \qquad (30)$$
$$\lambda_C(V) = P(\mathbf{E}^-_C \mid V), \qquad \text{for } C \in \Gamma^-_V \qquad (31)$$

so that equation (29) becomes

$$\lambda(V) = \lambda_o(V) \prod_{C \in \Gamma^-_V} \lambda_C(V) \qquad (32)$$

$\lambda_o(V)$ is quite simple,

$$\lambda_o(v) = 1 \ \ \forall v, \ \ \text{if } \mathbf{E}^o_V = \emptyset; \qquad \lambda_o(v) = \delta(v - v_o) = \begin{cases} 1 & \text{if } v = v_o \\ 0 & \text{otherwise} \end{cases}, \ \ \text{else if } \mathbf{E}^o_V = (V = v_o) \qquad (33)$$

Pearl (1988) put the evidence $\mathbf{E}^o_V$ on $V$ itself as a dummy child of $V$ if $\mathbf{E}^o_V$ exists. Here we consider it more natural to simply define an evidence item for $V$ itself.

Equation (32) is not a recursive formula yet. We now need to work out the micro-structure of $\lambda_C(V)$, which can be considered as a message sent from a child $C$ of $V$ to $V$. Since the evidence $\mathbf{E}^-_C$ from a subtree rooted at a child $C$ of $V$ still has to be mediated to $V$ via $C$, we have for each element of $\lambda_C(V)$

$$\lambda_C(v) = \sum_{c \in \Omega_C} p(\mathbf{E}^-_C \mid v, c)\, p(c \mid v) = \sum_{c \in \Omega_C} p(\mathbf{E}^-_C \mid c)\, p(c \mid v) = \sum_{c \in \Omega_C} \lambda(c)\, p(c \mid v) \qquad (34)$$

Thus, the vector $\lambda_C(V)$ can be written as the product of a matrix $P(C \mid V)$ with a vector $\lambda(C)$,

$$\lambda_C(V) = P(C \mid V) \cdot \lambda(C) \qquad (35)$$

where $P(C \mid V)$ is arranged with $C$ as column index and $V$ as row index, and $\lambda(C)$ is a column vector. Note that $P(C \mid V)$ is the conditional probability table associated with variable $C$ given its parent $V$, so $P(C \mid V)$ is readily readable from long-term static memory; and $\lambda(C)$ can be calculated using the same formula from $C$'s own children. Thus, substituting equation (35) into equation (32), we obtain a recursive formula for calculating the $\lambda$ messages:

$$\lambda(V) = \lambda_o(V) \prod_{C \in \Gamma^-_V} P(C \mid V) \cdot \lambda(C) \qquad (36)$$

The Top-Down Messages $\pi(V)$: To simplify the notation for the following discussion, let $U$ denote the parent of $V$, and $\Gamma_{\sim V}$ denote the siblings of $V$, i.e.

$$U = \Gamma^+_V \qquad (37)$$
$$\Gamma_{\sim V} = \Gamma^-_U \setminus \{V\} \qquad (38)$$

Since the evidence $\mathbf{E}^+_V$ from the rest of the tree is mediated via the parent $U$ of $V$, we have for each element of $\pi(V)$

$$\pi(v) = p(v \mid \mathbf{E}^+_V) = \sum_{u \in \Omega_U} p(v \mid \mathbf{E}^+_V, u)\, p(u \mid \mathbf{E}^+_V) = \sum_{u \in \Omega_U} p(v \mid u)\, p(u \mid \mathbf{E}^+_V) \qquad (39)$$

Thus, the vector $\pi(V)$ can be written as the product of a matrix $P(V \mid U)$ with a vector $P(U \mid \mathbf{E}^+_V)$,

$$\pi(V) = P(V \mid U) \cdot P(U \mid \mathbf{E}^+_V) = P(V \mid U) \cdot \pi_V(U) \qquad (40)$$

where

$$\pi_V(U) = P(U \mid \mathbf{E}^+_V) \qquad (41)$$

which may be considered as a message sent to $V$ by its parent $U$. For calculating $\pi_V(U)$, we need to partition the evidence $\mathbf{E}^+_V$ as

$$\mathbf{E}^+_V = \bigcup_{S \in \Gamma_{\sim V}} \mathbf{E}^-_S \cup \mathbf{E}^o_U \cup \mathbf{E}^+_U \qquad (42)$$

where $S$ enumerates $V$'s siblings $\Gamma_{\sim V}$. With this partition of evidence, we obtain

$$\pi_V(U) = P(U \mid \mathbf{E}^+_V) = P\big(U \mid \cup_{S} \mathbf{E}^-_S \cup \mathbf{E}^o_U \cup \mathbf{E}^+_U\big) = \frac{P(\cup_{S} \mathbf{E}^-_S \cup \mathbf{E}^o_U \mid U, \mathbf{E}^+_U)\, P(U \mid \mathbf{E}^+_U)\, P(\mathbf{E}^+_U)}{P(\cup_{S} \mathbf{E}^-_S, \mathbf{E}^o_U, \mathbf{E}^+_U)} = \frac{P(\cup_{S} \mathbf{E}^-_S \cup \mathbf{E}^o_U \mid U)\, P(U \mid \mathbf{E}^+_U)}{P(\cup_{S} \mathbf{E}^-_S, \mathbf{E}^o_U \mid \mathbf{E}^+_U)} \qquad (43)$$

Considering the notation adopted in equations (25) and (34)-(35), we have

$$\pi_V(U) = \alpha \left( \prod_{S \in \Gamma_{\sim V}} \lambda_S(U) \right) \lambda_o(U)\, \pi(U) \qquad (44)$$

where

$$\lambda_S(U) = P(\mathbf{E}^-_S \mid U), \quad \text{for } S \in \Gamma_{\sim V} \qquad (45)$$
$$\lambda_o(U) = P(\mathbf{E}^o_U \mid U) \qquad (46)$$
$$\pi(U) = P(U \mid \mathbf{E}^+_U) \qquad (47)$$

Note that $\alpha$ denotes a constant, which can be normalized out later on; wherever $\alpha$ is used to denote a constant, it may refer to different constants. Considering equation (40), equation (44) can be written in a recursive form

$$\pi_V(U) = \alpha \left( \prod_{S \in \Gamma_{\sim V}} \lambda_S(U) \right) \lambda_o(U)\, \big( P(U \mid W) \cdot \pi_U(W) \big) \qquad (48)$$

where $W = \Gamma^+_U$ denotes the parent of $U$, i.e. the grandparent of $V$. From equations (27), (32) and (44), we obtain

$$\pi_V(U) = \alpha\, \frac{\lambda(U)\, \pi(U)}{\lambda_V(U)} = \alpha\, \frac{B(U)}{\lambda_V(U)} \qquad (49)$$

Substituting this equation into (40), we obtain a recursive formula for calculating $\pi(V)$:

$$\pi(V) = P(V \mid U) \cdot \pi_V(U) = \alpha\, P(V \mid U) \cdot \frac{B(U)}{\lambda_V(U)} \qquad (50)$$
The above derivation has shown only the equations for calculating the belief $P(V \mid \mathbf{E})$ of a variable $V$ given the evidence $\mathbf{E}$. The trick underlying this approach is the careful partition of the evidence $\mathbf{E}$ between the subtree rooted at $V$ and the rest of the tree. This partition leads to a bottom-up contribution and a top-down contribution to the belief $P(V \mid \mathbf{E})$.

Aspects of Optimal Implementation: However, there is still some way to go from these equations to a complete computable algorithm. Pearl (1988) proposed an object-oriented scheme for belief updating using message passing, where $\lambda_C(V)$ and $\pi_V(U)$ are treated as messages. Considering equations (49)-(50), it is advantageous for a variable/node $V$, instead of sending each child $C$ a different message $\pi_C(V)$, to send all its children the value of its current belief $B(V)$, and to let each child $C$ recover its respective $\pi_C(V)$ message by dividing $B(V)$ by the value of the last $\lambda$ message that $C$ sent to $V$. From the recursive equations (27), (32), (35), (50) and (49), we see that calculating the $\lambda(V)$'s is totally independent of calculating the $\pi(V)$'s, but calculating the $\pi(V)$'s requires the $\lambda_C(V)$'s. Therefore, we need to save the $\lambda_C(V)$'s for each variable $V$ as intermediate variables internal to $V$. The whole procedure can be put in a recursive algorithm as follows.

Algorithm: Recursive Causal Tree Propagation. The whole causal tree propagation consists of two major steps.

Step 1 - Bottom-up propagation: This step aims at calculating all the $\lambda(V)$'s from the leaf nodes up to the root node using equations (32) and (35). It starts with distributing the evidence $\mathbf{E}$ to each related node.

Step 2 - Top-down propagation: After the first step, $\lambda_C(V)$ for each node $C$ with parent $V$ is available, so $\pi(V)$ can be calculated with equation (50). This is a top-down procedure starting from the root node and moving downwards through the whole tree.

After these two steps, the belief on each node $V$ is immediately computable using equation (27). In general, we shall refer to this algorithm as the causal tree (CT) algorithm.

Non-Recursive Bottom-Up Propagation: With the recursive causal tree propagation algorithm, whenever we finish calculating $\lambda(V)$ for a node $V$, we need to check whether $\lambda(U)$ for $V$'s parent $U$ can be calculated. In fact, $\lambda(U)$ can only be calculated after the $\lambda$'s of all of $U$'s children have been calculated. Therefore, for a parent node $U$ having $m$ children, we have to make $m$ checks before calculating $\lambda(U)$. To avoid all these checks, we can transform the recursive algorithm into a non-recursive one by using the level of each node $V$ - the distance from the root node to $V$. We start from the nodes at the bottom level, and proceed level by level upwards until the root node is reached.
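To make the two-pass procedure concrete, the following is a minimal sketch of the recursive CT algorithm, assuming a causal tree stored as parent pointers, CPT matrices P(V | parent) indexed [parent_state, child_state], and findings as 0/1 likelihood vectors. All names (propagate, lam_msg, etc.) are ours, not Probanet's, and strictly positive lambda messages are assumed for the division in equation (49).

```python
# Hedged sketch of recursive causal tree propagation (eq. 27, 32, 35, 40, 49, 50).
import numpy as np

def propagate(nodes):
    """Return normalized beliefs for every node of a causal tree (eq 27).
    nodes[name] = {"parent": name|None, "cpt": array, "evidence": vec|None};
    cpt is P(V|parent) with shape (|parent|, |V|); for the root it is the prior P(V)."""
    children = {n: [] for n in nodes}
    for n, d in nodes.items():
        if d["parent"] is not None:
            children[d["parent"]].append(n)

    lam, lam_msg = {}, {}                 # lambda(V) and the message lambda_V(parent)

    def bottom_up(V):                     # eq (32), (35), (36)
        n_states = nodes[V]["cpt"].shape[-1]
        l = np.ones(n_states) if nodes[V]["evidence"] is None \
            else np.asarray(nodes[V]["evidence"], float)
        for C in children[V]:
            bottom_up(C)
            lam_msg[C] = nodes[C]["cpt"] @ lam[C]    # lambda_C(V) = P(C|V) . lambda(C)
            l = l * lam_msg[C]
        lam[V] = l

    belief = {}

    def top_down(V, pi_v):                # eq (27), (40), (49), (50)
        b = lam[V] * pi_v
        belief[V] = b / b.sum()           # normalize alpha out
        for C in children[V]:
            # pi_C(V) is recovered from B(V) by dividing out C's own lambda message
            pi_msg = belief[V] / lam_msg[C]          # assumes positive entries
            top_down(C, nodes[C]["cpt"].T @ pi_msg)  # pi(C) = P(C|V)^T . pi_C(V)

    root = next(n for n, d in nodes.items() if d["parent"] is None)
    bottom_up(root)
    top_down(root, nodes[root]["cpt"])    # pi(root) is its prior
    return belief
```

The non-recursive variant described above would simply sort the nodes by their level once and process the levels bottom-up and then top-down, avoiding the per-parent readiness checks.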
5 JUNCTION TREE ALGORITHM

Causal trees are a simple special type of general CPNs. The method for belief updating on causal trees presented in the last section was extended to polytrees by Kim and Pearl (1983). These methods for causal trees and polytrees are elegant, as they work directly with the original directed graph structures, though only for limited structures. However, they cannot be generalized to more general decomposable graphs where each node may have multiple parents. In this section, we explicate a dominant inference method, the Junction Tree Algorithm, which was first proposed by Lauritzen and Spiegelhalter (1988) and later improved by Jensen, Olesen and Andersen (1990). Similar methods were used even earlier for pedigree analysis by Cannings, Thompson and Skolnick (1976, 1978). The key idea of this method is to first transform a general DAG into a hyper-tree structure, so that a message-passing scheme in a style similar to the causal tree algorithm can be applied for belief updating. This particular tree structure is called a junction tree. A different message-passing method for junction trees was proposed by Shafer and Shenoy (1990).
5.1 Equivalent Tree Structures of CPN's
From equation (9), we know that there is at least one basic approach to calculating the full joint $P(\mathbf{V})$: multiplying all the conditional probabilities (including marginals). Using Bayes' rule (8), we have

$$P(\mathbf{V}) = \prod_{V \in \mathbf{V}} P(V \mid \Gamma^+_V) = \prod_{V \in \mathbf{V}} \frac{P(V, \Gamma^+_V)}{P(\Gamma^+_V)} \qquad (51)$$

Notice the normalization condition for conditional probabilities,

$$\sum_{v \in \Omega_V} P(v \mid \Gamma^+_V) = \sum_{v \in \Omega_V} \frac{P(v, \Gamma^+_V)}{P(\Gamma^+_V)} = \frac{\sum_{v \in \Omega_V} P(v, \Gamma^+_V)}{P(\Gamma^+_V)} = 1 \qquad (52)$$

Let

$$\alpha_V = P(\Gamma^+_V) = \sum_{v \in \Omega_V} P(v, \Gamma^+_V) \qquad (53)$$

Equation (51) can be rewritten as

$$P(\mathbf{V}) = \prod_{V \in \mathbf{V}} \frac{P(V, \Gamma^+_V)}{\alpha_V} = \frac{\prod_{V \in \mathbf{V}} P(V, \Gamma^+_V)}{\prod_{V \in \mathbf{V}} \alpha_V} \qquad (54)$$

We can see that the full joint $P(\mathbf{V})$ is proportional to the product of the local joints $P(V, \Gamma^+_V)$ associated with each variable $V$, and that each local conditional probability can be easily obtained by normalizing the local joint probability. It has been recognised by Bayesian statisticians for many years [DeGroot, 1970] that a full calculation of conditional probability distributions in a multivariable context is often unnecessary; one can work with the proportionality of the conditional distributions to the local joint distributions and not calculate normalizing factors such as $\alpha_V$ unless strictly necessary. In order to exploit this proportionality and not be restricted by the definition of probability, it is wise [Lauritzen and Spiegelhalter, 1988] to introduce the notion of a potential for a variable $V$ or a group of variables. The full joint is then proportional to the product of potentials,

$$P(\mathbf{V}) \propto \phi(\mathbf{V}) = \prod_{V \in \mathbf{V}} \phi(V, \Gamma^+_V) \qquad (55)$$
Similar to the notation $P(\cdot)$ and $p(\cdot)$ for a probability distribution and a point-wise probability, we also use $\Phi(\cdot)$ and $\phi(\cdot)$ to denote a potential distribution and a point-wise potential. The key difference between a potential and a probability is that we do not require potentials to sum to 1. Potentials are non-negative, non-vanishing real functions depending only on the states of the variables.

From Directed Graph to Undirected Graph: Apparently, a potential provides a more flexible expression as it only requires proportionality. To start with, we may initialize the potentials to the conditionals, such as

$$\phi(V, \Gamma^+_V) = P(V \mid \Gamma^+_V) \qquad (56)$$

However, considering the similarity of potentials to local joints, it becomes convenient to work with undirected graphs. It is also worth noticing from (55) that a potential on a node $V$ only involves $V$ itself and its parents $\Gamma^+_V$. Therefore, it is natural to introduce the notion of cliques. In the simplest way, we may define a clique $C_V$ for each variable as

$$C_V = \{V\} \cup \Gamma^+_V \qquad (57)$$
Based on the notion of cliques, the full potential $\phi(\mathbf{V})$ defined in (55) can be written as

$$\phi(\mathbf{V}) = \prod_{V \in \mathbf{V}} \phi(C_V) \qquad (58)$$
This transition of unit from nodes to cliques corresponds to clustering nodes into relatively self-contained groups, which we call cliques. In order to pursue this line of exploration, we need to consider structural manipulations and transformations of the graph.

Hypergraph, Cliques and Separators: In order to deal with cliques of nodes, we need to generalize the notion of cliques to subsets of nodes. This naturally leads to the notion of hypergraphs. A hypergraph $H$ for an undirected graph $G = (\mathbf{V}, \mathbf{L})$ contains the set $\mathbf{V}$ of nodes and a set $\mathbf{C}$ of subsets - cliques - of $\mathbf{V}$, i.e.

$$H = (\mathbf{V}, \mathbf{L}, \mathbf{C}) = (G, \mathbf{C}) \qquad (59)$$

A hypergraph is tractable if it is acyclic. A hypergraph $H$ is acyclic if:
1. no element of $\mathbf{C}$ is a subset of another element of $\mathbf{C}$ (which means that cliques - elements of $\mathbf{C}$ - may overlap, but there should be no complete containment); and
2. the elements of $\mathbf{C}$ can be ordered $(C_1, C_2, \ldots, C_m)$ so as to have the running intersection property:

$$\forall j \geq 2, \ \exists i < j : \ C_i \supseteq C_j \cap (C_1 \cup \ldots \cup C_{j-1}) \qquad (60)$$

This running intersection property is fundamental to probabilistic updating on a CPN. It implies that if we can transform a graph into a hypergraph consisting of cliques which satisfy the running intersection property, calculating the full joint may be achieved through a sequence of local calculations around each clique and the intersections of that clique with neighboring cliques. Note that the running intersection property may, at first sight, look like transforming a graph into a linear sequence of cliques. In fact, it is a transformation of a graph into a tree structure which may be traversed through one chain of cliques.
To further explore this possibility, let us define two notions, the separator $S_j$ and the residual $R_j$:

$$S_j = C_j \cap (C_1 \cup \ldots \cup C_{j-1}), \qquad S_1 = \emptyset \qquad (61)$$
$$R_j = C_j \setminus S_j, \qquad R_1 = C_1 \qquad (62)$$

The set $S_j$ is called a separator because it separates the residual $R_j$ from $(C_1 \cup \ldots \cup C_{j-1}) \setminus S_j$. Here it is worth noting that $C_1, \ldots, C_{j-1}$ are the precedents of $C_j$ in the given ordering of cliques. We shall call any clique $C_i$ containing $S_j$ with $i < j$ a possible parent clique of $C_j$.

Triangulation → Running Intersection Property: It has been shown by Beeri et al. (1981, 1983), Lauritzen et al. (1984), Tarjan and Yannakakis (1984), and Leimer (1985, 1988) that:
1. A hypergraph $H$ is acyclic iff its set of subsets $\mathbf{C}$ can be considered to be the set of cliques of a triangulated graph;
2. If a hypergraph $H$ is acyclic and the nodes $\mathbf{V}$ of the corresponding triangulated graph are ordered in a perfect ordering (e.g. by maximum cardinality search [Tarjan and Yannakakis, 1984], or by lexicographic search [Rose et al., 1976]), the cliques $\mathbf{C}$ can then be ordered according to some natural comparable property of each clique;
3. The ordering so obtained will have the running intersection property.
The Full Joint = Chain Products of Clique Marginals: Let us assume that a graph $G = (\mathbf{V}, \mathbf{L})$ is transformed (e.g. via a triangulation algorithm) into an ordered chain of sets/cliques $(C_1, C_2, \ldots, C_m)$, which corresponds to an ordering of the set of cliques $\mathbf{C}$. We shall only consider connected graphs, i.e. every node in the graph is connected to every other node in the graph via a path, however long it is. We shall also assume that the set $\mathbf{C}$ is a complete set of cliques, meaning that every node in $G$ belongs to at least one clique in $\mathbf{C}$. Therefore, we have

$$\bigcup_{j=1}^{m} C_j = \mathbf{V} \qquad (63)$$

Under the assumption that $(C_1, C_2, \ldots, C_m)$ satisfies the running intersection property (60), we obtain the equation

$$P(\mathbf{V}) = P(\mathbf{C}) = \prod_{j=1}^{m} P(R_j \mid S_j) = \prod_{j=1}^{m} \frac{P(R_j \cup S_j)}{P(S_j)} = \prod_{j=1}^{m} \frac{P(C_j)}{P(S_j)} \qquad (64)$$

where $P(C_j)$ and $P(S_j)$ are the marginal probabilities of clique $C_j$ and separator $S_j$, both of which are sets of nodes. Therefore, equation (64) is an elegant factorization of the full joint probability $P(\mathbf{V})$ into a simultaneous chain product of node set marginals. This leads to a practical and still exact way of calculating the full joint using the marginal probabilities of much smaller cliques. Historically, this kind of set chain representation, as illustrated in (60)-(64), was implicitly used by Goodman (1970) and Darroch et al. (1980). Vorob'ev (1963) and Kellerer (1964) showed that a joint probability with given set marginals exists when $(\mathbf{V}, \mathbf{C})$ forms an acyclic hypergraph, and that this joint probability is unique (as defined in (64)) if we further assume the joint probability to be Markovian or to have maximal entropy. The explicit equation (64) is attributed to Lauritzen and Spiegelhalter (1988). However, the running intersection property (RIP) used by Lauritzen and Spiegelhalter (1988) was a linear structure. Jensen et al. (1990) extended the linear RIP to a tree, called the junction tree, which forms a basis for more flexible and efficient methods for global propagation of probabilities throughout the whole network.

Cluster Tree: Given a CPN $N = (\mathbf{V}, \mathbf{L}, \mathbf{P}) = (G, \mathbf{P})$, a complete ordered set of cliques $\mathbf{C} = (C_1, C_2, \ldots, C_m)$, transformed from $G$ and satisfying the running intersection property (60) and the completeness condition (63), defines a hyper-tree $T$,
$$T = (\mathbf{C}, \mathbf{S}) \qquad (65)$$

where the elements of $\mathbf{C}$ are the nodes of the tree and $\mathbf{S}$ is a set of separators among the $C_j$'s, which serve as links between the nodes of $T$. Each element of $\mathbf{C}$ is now called a cluster, and each element of $\mathbf{S}$ is called a separator. The tree $T$ is called a cluster tree. Note that in fact both clusters and separators are cliques. As shown in (64), the full joint probability $P(\mathbf{C})$ of all the nodes in $T$ equals the full joint probability $P(\mathbf{V})$ of all the nodes of $G$. Therefore, we only need to work with the cluster tree $T$ for all the inference problems. Similar to the potentials introduced on the nodes of $G$, we may also extend the probabilities of clusters to potentials,

$$\phi(\mathbf{C}) = \prod_{j=1}^{m} \frac{\phi(C_j)}{\phi(S_j)} \qquad (66)$$

The full joint probability $P(\mathbf{C})$ is proportional to the full joint potential $\phi(\mathbf{C})$,

$$P(\mathbf{C}) \propto \phi(\mathbf{C}) \qquad (67)$$
$$P(\mathbf{C}) = \alpha\, \phi(\mathbf{C}) \qquad (68)$$

where $\alpha$ is a normalizing factor. A potential $\phi(X)$ is said to be normalized if

$$\sum_{x \in \Omega_X} \phi(x) = 1 \qquad (69)$$

A cluster tree $T$ is said to be normalized if all the potentials defined on $T$ are normalized. For a normalized cluster tree, its full joint potential $\phi(\mathbf{C})$ equals its full joint probability $P(\mathbf{C})$, which in turn equals the full joint probability $P(\mathbf{V})$ of the original graph $G$.
Marginalization

The family $F_V$ of each node $V \in \mathbf{V}$ is defined as

$$F_V = \{V\} \cup \Gamma^+_V \qquad (70)$$

Originally, the potential for each family is given as the conditional probability table of each node given its parents. The potential for each cluster $C \in \mathbf{C}$ or each separator $S \in \mathbf{S}$ has to be calculated from the related family potentials. The transformation of large potentials into smaller ones by eliminating a subset of variables is called marginalization. Consider a set of variables $X$ and a subset $Y \subseteq X$, i.e.

$$X = (X \setminus Y) \cup Y \qquad (71)$$

The marginal potential $\phi(Y)$ of $Y$ can be calculated by eliminating the variables in $X \setminus Y$,

$$\phi(Y) = \sum_{X \setminus Y} \phi(X) \qquad (72)$$

which can be written in the analytical form

$$\phi(y) = \sum_{z \in \Omega_{X \setminus Y}} \phi(z, y) \qquad (73)$$

where $y$ and $z$ enumerate the configurations of $Y$ and $X \setminus Y$ respectively.
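As an illustration of equations (72)-(73), a potential over a set of discrete variables can be stored as a multi-dimensional array and marginalized by summing over the axes of the eliminated variables. The following is a minimal sketch with names of our own (Potential, marginalize), not Probanet code.

```python
# Hedged sketch of potential marginalization, eq (72)-(73).
import numpy as np

class Potential:
    """A potential phi(X): 'variables' is an ordered tuple of names,
    'table' the array over their frames, one axis per variable."""
    def __init__(self, variables, table):
        self.variables = tuple(variables)
        self.table = np.asarray(table, dtype=float)

def marginalize(phi, keep):
    """Return phi(Y) by summing out the variables X \\ Y, eq (72)."""
    drop_axes = tuple(i for i, v in enumerate(phi.variables) if v not in keep)
    new_vars = tuple(v for v in phi.variables if v in keep)
    return Potential(new_vars, phi.table.sum(axis=drop_axes))

# example: phi(A, B) over two binary variables, marginalized to phi(A)
phi_ab = Potential(("A", "B"), [[0.3, 0.1], [0.2, 0.4]])
phi_a = marginalize(phi_ab, {"A"})
print(phi_a.variables, phi_a.table)   # ('A',) [0.4 0.6]
```

The same representation is reused in the sketches that follow for cluster and separator tables.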
Local and Global Consistency of Potentials

Because, in general, potentials are defined only up to proportionality and may not always be normalized, and considering the asynchronous updating of local joint potentials of clusters involving the same variable, consistency between neighboring clusters may be a problem. A cluster tree $T$ is said to be locally consistent if for any two clusters $A, B \in \mathbf{C}$ that are direct neighbors with separator $S$,

$$\sum_{A \setminus S} \phi(A) \propto \phi(S) \propto \sum_{B \setminus S} \phi(B) \qquad (74)$$

Probabilistic inferences, in general, involve clusters that may not necessarily be direct neighbors. A locally consistent cluster tree is said to be globally consistent if for any two clusters $A$ and $B$ with intersection $I$, we have

$$\sum_{A \setminus I} \phi(A) \propto \phi(I) \propto \sum_{B \setminus I} \phi(B) \qquad (75)$$

With global consistency so defined, we can now define a canonical cluster tree for a given graph.

Junction Tree: A globally consistent cluster tree is called a junction tree. In order to ensure global consistency for a cluster tree, a condition is that for any two clusters $A, B \in \mathbf{C}$, all clusters on the path between $A$ and $B$ contain the intersection $A \cap B$. This condition is also called the junction tree property. In fact, this property guarantees that local consistency is propagated everywhere, maintaining global consistency. The junction tree property is needed for maintaining the probabilistic equilibrium on the entire cluster tree as evidence arrives dynamically. To summarize, a junction tree $T$ of a CPN graph $G$ is constrained by three properties:
1. Family Property: for each node $V$ of $G$, there is some cluster $C$ of $T$ which contains the family $F_V$ of $V$.
2. Tree Property: there is one and only one path between any two clusters in $T$; in other words, $T$ is a tree.
3. Junction Tree Property: for any two clusters $A$ and $B$ of $T$ and every cluster $C$ on the path between $A$ and $B$,

$$A \cap B \subseteq C \qquad (76)$$

Furthermore, the link between any two directly neighboring clusters $A$ and $B$ of $T$ is characterized by a separator $S$,

$$S = A \cap B \qquad (77)$$

In fact, separators are also cliques, but they are not treated as nodes of the junction tree because the operations to be defined on separators are different from those on nodes in the inference algorithms.
5.2 Transformation of CPN to Junction Tree
In fact, transforming a general CPN - a graph $G$ - into a junction tree involves various operations upon an intermediate structure called the junction graph. In principle, a set of cliques $\mathbf{C}$ satisfying the family property and the running intersection property defines a graph where the cliques are the nodes and the separators are the links. This graph is also an equivalent structure of the original CPN $G$, because the joint probability of this graph is proportional to that of $G$. The key procedure in this transformation is graph triangulation. Graph triangulation goes back to [Fujisawa and Orino, 1974; Rose et al., 1976; Tarjan and Yannakakis, 1984]. For more details on graph triangulation and junction tree construction, see [Jensen, 1988; Kjaerulff, 1993].

Construction of a Junction Tree for a CPN: Let $N = (G, \mathbf{P}) = (\mathbf{V}, \mathbf{L}, \mathbf{P})$ be a CPN, where $G$ denotes the structuring graph of $N$. A junction tree $T$ can be constructed in the following steps:
1. Moralization: drop the direction of all the links in $\mathbf{L}$, so as to make $G$ an undirected graph, and create an undirected link for each pair of the parents $\Gamma^+_V$ of every node $V \in \mathbf{V}$ in $G$.
2. Triangulation: add links until no chordless cycle of length greater than three remains.
3. Constructing the Junction Graph: form the node clusters of the triangulated graph and establish junctions among the clusters, so as to obtain the junction graph $JG$.
4. Forming the Tree: remove redundant links in the junction graph $JG$ such that every pair of nodes in $JG$ is singly connected; a junction tree $T$ is thereby created.
5. Creating Clique Potentials: for each node and separator in the junction tree $T$, calculate its potential table from the potential tables of the related variables in the original graph $G$.

The notion of triangulated graphs plays a key role in the construction of junction trees. In fact, this notion has also been applied in widely different areas such as the solution of sparse symmetric systems of linear equations and pedigree analysis. The kernel technique applied to graph triangulation is called node elimination. Let us first introduce some basic notions associated with node elimination. A node $V \in \mathbf{V}$ is said to be eliminated if $\Gamma_V$ is made a complete subgraph of $G$ and $V$ and its incident links $\mathbf{L}_V$ are removed from $G$. Historically, the two most well-known elimination ordering algorithms are lexicographic search by Rose et al. (1976) and maximum cardinality search by Tarjan and Yannakakis (1984). Lexicographic search is guaranteed to produce minimal triangulations in $O(|\mathbf{V}||\mathbf{L}|)$ time, whereas maximum cardinality search is not; but the latter has the property of not adding links to an already triangulated graph, and it runs in $O(|\mathbf{V}| + |\mathbf{L}|)$ time. Fujisawa and Orino (1974) developed an extended elimination technique which produces minimal triangulations no matter how the nodes are ordered; their algorithm runs in $O(|\mathbf{V}|(|\mathbf{L}| + |\mathbf{T}|))$ time. Therefore, a good solution would be to use the extended elimination algorithm of Fujisawa and Orino together with the minimal weight ordering. The weight of a clique refers to the size of the configuration space of that clique. Another heuristic algorithm [Jensen, 1996] which has proven to give fairly good results is: repeatedly eliminate a node not requiring extra links, and if this is not possible, eliminate a node yielding a clique of minimal weight. On the other hand, even if a triangulation produced by an algorithm is not minimal, it can be minimized by other algorithms [Kjaerulff, 1993]. After the triangulation by eliminating nodes one by one, the cliques are formed. We simply take these cliques as the clusters of the junction graph. The separators among the clusters can be found by intersecting neighboring clusters. This yields a junction graph. A junction tree $JT$ of a junction graph $JG$ corresponds to a spanning tree of $JG$ which satisfies the junction tree property.

The Optimized Algorithm in Probanet: In fact, we can integrate the three steps - triangulation, junction graph construction and junction tree derivation - into a one-pass algorithm. Considering the running intersection property (60), each cluster $C_i$ only has to be connected to one cluster $C_j$ with a later index $j > i$ for the running intersection property to be satisfied, and the resulting structure is a tree. Therefore, in a sequence of node eliminations for graph triangulation, before a node $V_i \in \mathbf{V}$ is eliminated, $V_i$ and its currently non-eliminated neighbors form a cluster $C_i$. After $V_i$ is eliminated, its currently non-eliminated neighbors themselves form a separator $S_i$. Therefore, with each node elimination, a cluster-separator pair $(C_i, S_i)$ is formed. Here each cluster-separator pair is indexed by the node $V_i$ of the highest elimination order $i$ that is in the cluster $C_i$ but not in the separator $S_i$. Now, if we attach each separator $S_i$ to a cluster $C_j \supseteq S_i$ of the latest index $j > i$, the result will be a junction tree. This is the optimized algorithm currently used by Probanet (see section 6).

Creating the Potential Table for Each Cluster: After a junction tree $JT$ has been derived from a given CPN $N = (G, \mathbf{P})$, the potential table for each cluster and each separator can be created as follows:
1. set the potential table of each separator to 1;
2. set the potential table of each cluster to 1;
3. for each node $V$, find a cluster $C$ which contains $V$'s family, and multiply $P(V \mid \Gamma^+_V)$ into $\phi(C)$.

It can easily be seen from equation (64), and it has also been proved by Jensen (1996), that the junction tree with its cluster and separator tables created in this way is a representation of the joint potential $\phi(\mathbf{V})$ of all the original variables $\mathbf{V}$ of the original CPN, simply because the product of all the cluster potential tables divided by the product of all the separator potential tables is the joint potential table of the total variable set $\mathbf{V}$. The junction tree so created does provide an equivalent structure for the original CPN, because we can now start moving information around the junction tree while the junction tree remains a representation of $\phi(\mathbf{V})$. The probability equilibrium on each link is yet to be reached. This can be done through an operation called absorption (see equations (79)-(80)) over each link. The equilibrium of the whole junction tree can be reached and maintained through message passing in both directions of each link.
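Below is a minimal sketch of this initialization step. The representation (clusters and separators as ordered tuples of variable names with numpy tables) and the function name init_junction_tree_potentials are ours, not Probanet's; the simple configuration loop trades speed for clarity.

```python
# Hedged sketch of clique potential initialization: all cluster/separator tables
# start at 1, then each CPT P(V | parents) is multiplied into one cluster that
# contains the family {V} + parents (the family property guarantees one exists).
import numpy as np

def init_junction_tree_potentials(clusters, separators, frames, cpts):
    """clusters/separators: dict name -> tuple of variable names.
    frames: dict variable -> number of states.
    cpts: dict variable -> (parents tuple, table with axes (parents..., variable))."""
    pot = {c: np.ones([frames[v] for v in vars_]) for c, vars_ in clusters.items()}
    pot.update({s: np.ones([frames[v] for v in vars_]) for s, vars_ in separators.items()})

    for V, (parents, table) in cpts.items():
        family = set(parents) | {V}
        cname = next(name for name, vars_ in clusters.items() if family <= set(vars_))
        cvars = clusters[cname]
        src = tuple(parents) + (V,)
        out = pot[cname]
        for idx in np.ndindex(out.shape):            # every cluster configuration
            state = dict(zip(cvars, idx))
            out[idx] *= table[tuple(state[v] for v in src)]
    return pot
```

After this initialization, the product of all cluster tables divided by the product of all separator tables reproduces the joint potential, as stated above; the absorption passes described next bring the tables into equilibrium.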
5.3 Probability Updating upon Junction Trees

Probability Equilibrium: Let $A$ and $B$ be two neighboring clusters in a junction tree $JT$, and let $S$ be their separator. The link between $A$ and $B$ through $S$ is said to be in a probability equilibrium state, or in equilibrium for short, if

$$\sum_{A \setminus S} \phi(A) = \phi(S) = \sum_{B \setminus S} \phi(B) \qquad (78)$$

If this is true, we also say that this link is consistent. If all links in the junction tree are in equilibrium, or consistent, we say that the junction tree is in equilibrium, or consistent. If a link is not in equilibrium, or is inconsistent, we can operate on the potential tables of $S$ and one of $A$ and $B$ to make the link consistent. Suppose cluster $A$ is in its current state, which is not to be updated at this moment; the potential tables of $S$ and $B$ can then be recalculated to make this link consistent as follows:

$$\phi^*(S) = \sum_{A \setminus S} \phi(A) \qquad (79)$$
$$\phi^*(B) = \phi(B)\, \frac{\phi^*(S)}{\phi(S)} \qquad (80)$$

We then take $\phi^*(S)$ and $\phi^*(B)$ as the new potential tables for $S$ and $B$ respectively. We say that $B$ has absorbed the information from $A$, or simply that $B$ has absorbed from, or calibrated to, $A$. It can easily be proved that after $B$ has absorbed from $A$, their link becomes consistent:

$$\sum_{B \setminus S} \phi^*(B) = \sum_{B \setminus S} \phi(B)\, \frac{\phi^*(S)}{\phi(S)} = \frac{\phi^*(S)}{\phi(S)} \sum_{B \setminus S} \phi(B) = \phi^*(S) = \sum_{A \setminus S} \phi(A) \qquad (81)$$

As division is involved in the absorption (80), we can only absorb from $A$ through $S$ if $\phi(B)$ has zeros in the entries corresponding to the zero entries of $\phi(S)$ (while the entries of $\phi(A)$ need not be restricted). If a link allows absorption in both directions, we say this link is supportive. A junction tree is said to be supportive if all its links are supportive. The following facts hold for probability equilibrium and absorption:
1. An inconsistent link is made consistent through one absorption in either direction of this link, and absorption has no effect upon a link which is already in equilibrium, i.e. consistent.
2. Supportiveness is preserved under absorption.
3. The joint potential of the junction tree, i.e. the product of all the cluster potential tables divided by the product of all the separator potential tables, is invariant under absorption.

The first fact provides the basis for a message passing mechanism in a junction tree which brings the tree from a non-equilibrium state to an equilibrium state. The absorption operation through a link forms the kernel of a message passing mechanism for reaching an equilibrium state for the whole junction tree.
Message Passing: For two neighboring clusters A and B through a separator S , we say cluster A sends a message to cluster B when B absorbs from A. Message Passing Scheme: A cluster A can send exactly one message to a neighbor B and this message may only be sent when A has received a message from each of its other neighbors. It can be shown that when message passing has been done in both directions of each link of a junction tree JT , then JT is in equilibrium, i.e. becomes consistent.
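The following is a small sketch of the absorption operation (79)-(80) for two neighboring clusters sharing a separator, using the same array-and-variable-name convention as the earlier sketches; the helper names (marginal, absorb) are our own, and zero separator entries are handled by setting the ratio to zero as suggested by the supportiveness condition.

```python
# Hedged sketch of absorption (eq 79-80): cluster B absorbs from cluster A
# through separator S. Tables are numpy arrays; *_vars are ordered variable names.
import numpy as np

def marginal(table, table_vars, target_vars):
    """Sum a potential down to target_vars (eq 79), axes ordered as target_vars."""
    out = np.zeros([table.shape[table_vars.index(v)] for v in target_vars])
    for idx in np.ndindex(table.shape):
        state = dict(zip(table_vars, idx))
        out[tuple(state[v] for v in target_vars)] += table[idx]
    return out

def absorb(phi_a, a_vars, phi_s, s_vars, phi_b, b_vars):
    """B absorbs from A: returns the new separator and cluster tables (eq 79-80)."""
    phi_s_new = marginal(phi_a, a_vars, s_vars)
    phi_b_new = np.array(phi_b, dtype=float)
    for idx in np.ndindex(phi_b.shape):
        state = dict(zip(b_vars, idx))
        s_idx = tuple(state[v] for v in s_vars)
        old = phi_s[s_idx]
        phi_b_new[idx] *= phi_s_new[s_idx] / old if old > 0 else 0.0
    return phi_s_new, phi_b_new
```

A full propagation would call absorb once in each direction of every link, following the message passing scheme described above (a collect pass towards a chosen root, then a distribute pass away from it).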
5.4 Evidence Propagation

In order to enter an evidence $E(V)$ on a variable $V$, we need to find a cluster $C$ which contains $V$, and multiply its table with $E(V)$,

$$\phi_e(C) = \phi(C)\, E(V) \qquad (82)$$

where $\phi_e(C)$ denotes the new potential table of $C$ after this evidentiation. For a consistent junction tree $JT$, after an evidence $E(V)$ is entered into a containing cluster $C$, the junctions associated with $C$ become inconsistent, so the tree $JT$ becomes inconsistent. Message passing is then required to maintain the equilibrium of the junction tree.

5.5 Finding Belief On A Single Variable

For a single variable $V$ of the original CPN, let $C$ be a cluster containing $V$. After the complete evidence propagation through message passing, the joint probability of $C$ with the evidence $\mathbf{E}$ can be read from $C$'s potential table. Then the probability of $V$ can be calculated by marginalization,

$$P(V, \mathbf{E}) = \sum_{C \setminus V} P(C, \mathbf{E}) \qquad (83)$$

and the belief on $V$ is determined by normalization,

$$P(V \mid \mathbf{E}) = \alpha\, P(V, \mathbf{E}) \qquad (84)$$
6 Probanet - A Probabilistic Network Shell

Probanet is a generic shell, or mother system, for developing specialized probabilistic expert systems through constructing CPNs. It has been completely written in Java - a new object-oriented computer language. Hence, Probanet is widely portable to almost any computer platform due to the platform-neutrality of Java.
6.1 Major Functional Components of Probanet
Probanet has been designed with the following components:
1. Construction
2. Inference
3. Training and Learning
4. Query and Explanation

Figure 2: The Probanet system and its graph editor.
Fig. 2 shows the outlook of Probanet. The medical knowledge used as an example in the figure is taken from [Jacobson, 1994].

Construction: The first step in developing a probabilistic expert system using the CPN approach is to construct a CPN. This requires a complete toolkit. Probanet has a graph editor for creating nodes and links and manipulating them, a node editor for defining the frame of each node and all other attributes, and a conditional probability editor for inputting and editing the conditional probability table for each node given its parents. The graph editor itself is also used for displaying information relevant to inference, training, learning, query and explanation.

Inference: The overall strategy for probabilistic inference in Probanet may be called an adaptive inference algorithm. We found that each of the existing inference algorithms has its advantages and weaknesses. In particular, we see that the tree algorithms (causal tree and polytree) are the most efficient if the network happens to be a tree. However, they cannot be generalized to general graphs where undirected cycles exist. The junction tree algorithm is applicable to general graphs, but it creates extra structures for trees. On the other hand, a general network may be decomposed into a mixture of trees and graphs. Section 6.2 explicates more of Probanet's adaptive inference algorithm.

Training and Learning: Training refers to estimating the conditional probability table for each node given its parents for a pre-structured network. Learning is more general, including learning the causal dependence structure of the problem domain, defining the frame of each variable, and estimating the conditional probability table for each node given its parents. Probanet is still at an early stage of investigation and development on training and learning.

Query and Explanation: In fact, the inference component already targets the basic query on the belief - the posterior probability of each variable given the total evidence. The query component aims at more complicated or sophisticated queries combining multiple variables. The explanation component aims at translating the probabilistic inference into plain natural-language script. In general, explanation should be specifically designed for specialized systems. The next section describes the adaptive inference strategy of Probanet and shows examples of graph triangulation and junction tree formation by Probanet's algorithms.
6.2 Adaptive Inference Algorithms
The probabilistic inference strategy of Probanet is adaptive to the actual structure of the network, as defined in the following algorithm.

Probanet Adaptive Inference Algorithm:
if the network is a causal tree, apply the causal tree propagation algorithm as described in section 4;
else if the network is a polytree, apply the polytree propagation algorithm as described in [Kim and Pearl, 1983];
else if the network can be decomposed into subtrees and subgraphs, apply the tree-graph adaptive algorithm (to be explained later);
else, apply the junction tree propagation algorithm as described in section 5.

The causal tree algorithm in Probanet is a non-recursive one, which trades space for speed. For the junction tree algorithm in Probanet, the transformation from a graph to a junction tree is much optimized, with the following characteristics:
1. Unlike the conventional node-by-node elimination for graph triangulation, we eliminate nodes group by group. For example, in Figs. 3-5, when node A is eliminated, nodes B and C are eliminated together with node A.
2. After graph moralization, i.e. adding an extra link between each pair of parents of every node, usually three steps are required before reaching a junction tree: (a) triangulation by node elimination, (b) forming the junction graph, and (c) deriving the junction tree from the junction graph. Probanet combines the three steps in one pass through the moralized graph: the triangulation produces the junction tree directly.

For node elimination ordering, Probanet uses the minimum weight criterion, yielding the smallest clusters for the junction tree. Figs. 3-5 show an example graph, its moralization and triangulation, and the corresponding junction tree. Figs. 6-8 show the performance of Probanet on an example used in [Jensen, 1996, page 84, Figure 4.17]. A sketch of the structure tests behind this dispatch is given below.
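The following small sketch illustrates how the dispatch can choose between the algorithms from the network's structure; the representation (a parent-list dictionary) and the function names are ours, the tree-graph decomposition branch is omitted since it is still under investigation, and the propagation routines themselves are the algorithms of sections 4 and 5.

```python
# Hedged sketch of the structure tests behind the adaptive inference dispatch.
def is_causal_tree(parents):
    """Causal tree: every node has at most one parent (eq 21)."""
    return all(len(p) <= 1 for p in parents.values())

def is_polytree(parents):
    """Polytree: the underlying undirected graph has no cycles,
    i.e. |edges| == |nodes| - number_of_connected_components."""
    nodes = set(parents)
    edges = {frozenset((child, par)) for child, ps in parents.items() for par in ps}
    adj = {n: set() for n in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b); adj[b].add(a)
    seen, components = set(), 0
    for n in nodes:
        if n not in seen:
            components += 1
            stack = [n]
            while stack:
                m = stack.pop()
                if m not in seen:
                    seen.add(m)
                    stack.extend(adj[m] - seen)
    return len(edges) == len(nodes) - components

def choose_algorithm(parents):
    if is_causal_tree(parents):
        return "causal tree propagation (section 4)"
    if is_polytree(parents):
        return "polytree propagation (Kim and Pearl, 1983)"
    return "junction tree propagation (section 5)"

print(choose_algorithm({"A": [], "B": ["A"], "C": ["A", "B"]}))  # -> junction tree
```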
Figure 3: A directed acyclic graph.
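As a companion to the elimination-ordering discussion above, the following sketch illustrates conventional greedy node elimination under the minimum weight criterion, where the weight of eliminating a node is the product of the frame sizes of the node and its current neighbours in the moralized, undirected graph. It returns the elimination clusters, from which a junction tree can then be assembled, for instance by a maximum-weight spanning tree over shared-variable counts. This is a plain illustration of the criterion, not Probanet's combined one-pass transformation, and the example graph at the end is hypothetical.

```python
def eliminate_min_weight(adjacency, frame_size):
    """Greedy triangulation by node elimination with the minimum-weight criterion.

    adjacency : dict mapping each node to the set of its neighbours in the
                moralized (undirected) graph; a private copy is modified.
    frame_size: dict mapping each node to the number of states in its frame.
    Returns the maximal elimination clusters (candidate junction-tree nodes)."""
    adj = {v: set(nb) for v, nb in adjacency.items()}
    clusters = []
    while adj:
        # Weight of eliminating v = product of frame sizes over {v} and its neighbours.
        def weight(v):
            w = frame_size[v]
            for u in adj[v]:
                w *= frame_size[u]
            return w
        v = min(adj, key=weight)
        cluster = {v} | adj[v]
        # Record the cluster only if it is not contained in one already found,
        # and drop any previously found cluster it strictly contains.
        if not any(cluster <= c for c in clusters):
            clusters = [c for c in clusters if not c < cluster]
            clusters.append(cluster)
        # Fill in: connect all remaining neighbours of v pairwise, then remove v.
        for u in adj[v]:
            adj[u] |= adj[v] - {u}
            adj[u].discard(v)
        del adj[v]
    return clusters

# Hypothetical 4-node moralized graph with binary variables, only to illustrate the call.
moral = {'A': {'B', 'C'}, 'B': {'A', 'C', 'D'}, 'C': {'A', 'B', 'D'}, 'D': {'B', 'C'}}
states = {'A': 2, 'B': 2, 'C': 2, 'D': 2}
print(eliminate_min_weight(moral, states))   # e.g. [{'A', 'B', 'C'}, {'B', 'C', 'D'}]
```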
Tree-Graph Adaptive Algorithm:
In case a given network can be decomposed into subtrees and subgraphs, it would be more efficient to apply the tree algorithms to the subtrees and the junction tree algorithm to the subgraphs, and then properly link the updated subtrees and subgraphs. This is a new inference algorithm currently under our investigation. Fig. 9 exemplifies the idea: the original network is decomposed into two subtrees - a causal tree (part III) and a polytree (part II) - and a subgraph (part I).
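The decomposition step itself is not fully specified here. One simple way to separate tree-like fringes from the cyclic core of the network's undirected skeleton - offered only as an illustrative assumption, not as Probanet's method - is repeated pruning of degree-one nodes: whatever survives lies on or between undirected cycles (the subgraph part), while the pruned nodes hang off it as subtrees.

```python
def split_tree_fringe(neighbours):
    """Split an undirected skeleton into its cyclic core and tree-like fringe.

    neighbours: dict mapping each node to the set of its neighbours in the
                undirected skeleton of the CPN (a private copy is modified).
    Returns (core, fringe): the nodes lying on or between undirected cycles,
    and the pruned nodes that hang off the core as trees."""
    adj = {v: set(nb) for v, nb in neighbours.items()}
    fringe = []
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) <= 1:          # a leaf (or isolated node) of the skeleton
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                fringe.append(v)
                changed = True
    return set(adj), fringe
```

Under this assumption, the surviving core would be handled by the junction tree algorithm and each pruned fringe by a tree algorithm, with the results linked at the cut nodes, as suggested above.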
Acknowledgement
The first author has enjoyed helpful discussions with Finn V. Jensen and Uffe Kjaerulff from Aalborg University, Denmark, and Bruce D'Ambrosio from Oregon State University, USA, and thanks them also for providing many references and pointers.
Figure 4: Moralization (dotted links) for the graph of Fig. 3. No more links are added for triangulation.
Figure 5: The junction tree for the graph of Fig. 3, with clusters ABCDE, FDE, GDE and HDE joined through separators DE. Rectangles denote clusters, rounded boxes show separators.
Figure 6: A directed acyclic graph from [Jensen, 1996, Figure 4.17].
Figure 7: Moralization (dotted links) and triangulation (dashed links) for the graph of Fig. 6.
Figure 8: The junction tree for the graph of Fig. 6, with clusters AFG, EFA, BAE, FGI, DAG, CAF and HEF.
Figure 9: A network decomposed into two subtrees and one subgraph.
REFERENCES
Abbreviations: AI = Artificial Intelligence (Journal); UAI'xx = Proceedings of the Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, CA, 19xx.
Andreassen S., Woldbye M., Falck B. and Andersen S.K. (1987): MUNIN: A causal probabilistic network for interpretation of electromyographic findings. Proc. 10th International Joint Conference on AI (IJCAI-87), pp. 366-372, Milan.
Barr A. and Feigenbaum E.A. (1981): Handbook of Artificial Intelligence. Los Altos: Kaufmann.
Beeri C., Fagin R., Maier D. and Yannakakis M. (1983): On the desirability of acyclic database schemes. Journal of the Association for Computing Machinery 30(3):479-513.
Bertele U. and Brioschi F. (1972): Nonserial Dynamic Programming. New York: Academic Press.
Blalock H.M. (1971): Causal Models in the Social Sciences. London: Macmillan.
Buchanan B.G. and Shortliffe E.H. (1984): Rule-based Expert Systems: the MYCIN Experiment of the Stanford Heuristic Programming Project. Reading: Addison-Wesley.
Cannings C., Thompson E.A. and Skolnick M.H. (1976): Recursive derivation of likelihoods on pedigrees of arbitrary complexity. Advanced Applied Probability 8:622-625.
Cannings C., Thompson E.A. and Skolnick M.H. (1978): Probability functions on complex pedigrees. Advanced Applied Probability 10:26-61.
Cooper G.F. (1984): NESTOR: A computer-based medical diagnostic aid that integrates causal and probabilistic knowledge. PhD dissertation, Department of Computer Science, Stanford University.
Cooper G.F. (1987): Probabilistic inference using belief networks is NP-hard. Technical Report KSL-87-27, Medical Computer Science Group, Stanford University.
D'Ambrosio B. (1989): Symbolic probabilistic inference. Technical Report, Oregon State University.
D'Ambrosio B. (1991): Local expression languages for probabilistic dependence. UAI'91.
Dagum P. and Luby M. (1993): Approximate probabilistic inference in Bayesian belief networks is NP-hard. AI 60(1):141-153.
Darroch J.N., Lauritzen S.L. and Speed T.P. (1980): Markov fields and log-linear models for contingency tables. Annals of Statistics 8:522-539.
Dawid A.P. (1992): Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing 2:25-36.
Dechter R. (1996a): Bucket elimination: a unifying framework for probabilistic inference. UAI'96, pp. 211-219.
DeGroot M.H. (1970): Optimal Statistical Decisions. New York: McGraw-Hill.
Duda R.O., Hart P.E. and Nilsson N.J. (1976): Subjective Bayesian methods for rule-based inference systems. Proc. Natl. Comput. Conf. (AFIPS) 45:1075-1082.
de Finetti B. (1974): Theory of Probability, Vol. 1 and 2. New York: Wiley.
Fujisawa T. and Orino H. (1974): An efficient algorithm of finding a minimal triangulation of a graph. Proc. IEEE International Symposium on Circuits and Systems, San Francisco, pp. 172-175.
Fung R.M. and Chang K.C. (1989): Weighting and integrating evidence for stochastic simulation in Bayesian networks. UAI'89.
Fung R.M. and Del Favero B. (1994): Backward simulation in Bayesian networks. UAI'94.
Galton F. (1888): Co-relations and their measurement, chiefly from anthropological data. Proc. Royal Society of London 45:135-145.
Gilks W., Thomas A. and Spiegelhalter D. (1994): A language and a program for complex Bayesian modeling. The Statistician 43:169-178.
Golumbic M.C. (1980): Algorithmic Graph Theory and Perfect Graphs. London: Academic Press.
Goodman L.A. (1970): The multivariate analysis of qualitative data: interaction among multiple classifications. Journal of the American Statistical Association 65:226-256.
Henrion M. (1988): Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In: Lemmer and Kanal (eds), Uncertainty in Artificial Intelligence 2, Amsterdam: Elsevier Science.
Howard R.A. and Matheson J.E. (1981): Influence diagrams. In: Principles and Applications of Decision Analysis, vol. 2 (1984), Menlo Park, California: Strategic Decisions Group.
Jacobson J. (1994): A Bayesian Network-based Tutoring Shell for Diagnostic Procedure Selection. MSc thesis, Department of Computer Science, University of Wisconsin-Milwaukee.
Jensen F.V. (1988): Junction trees and decomposable hypergraphs. Research report, Judex Datasystemer A/S, Aalborg, Denmark.
Jensen F.V. (1996): An Introduction to Bayesian Networks. UCL Press.
Jensen A.L. and Jensen F.V. (1996): MIDAS - an influence diagram for management of mildew in winter wheat. UAI'96, pp. 349-356.
Jensen C.S., Kong A. and Kjaerulff U. (1995): Blocking Gibbs sampling in very large probabilistic expert systems. Journal of Human-Computer Studies 42:647-666.
Jensen F.V., Lauritzen S.L. and Olesen K.G. (1990): Bayesian updating in causal probabilistic networks by local computations. Computational Statistics Quarterly 4:269-292.
Joreskog K.G. (1973): Analysis of covariance structures. In: Krishnaiah (ed.), Proc. 3rd Symposium on Multivariate Analysis, pp. 263-283. New York: Academic Press.
Kellerer H.G. (1964): Masstheoretische Marginalprobleme. Math. Ann. 153:168-198.
Kelly C.W. and Barclay S. (1973): A general Bayesian model for hierarchical inference. Organizational Behavior and Human Performance 10:388-403.
Kim J.H. (1983): CONVINCE: A conversational inference consolidation engine. PhD dissertation, Department of Computer Science, University of California, Los Angeles.
Kim J.H. and Pearl J. (1983): A computational model for combined causal and diagnostic reasoning in inference systems. Proc. 8th Intl. Joint Conf. on AI (IJCAI-83), Karlsruhe.
Kjaerulff U. (1993): Aspects of Efficiency Improvement in Bayesian Networks. PhD dissertation, Department of Mathematics and Computer Science, Aalborg University, Denmark.
Kjaerulff U. (1995): HUGS: Combining exact inference and Gibbs sampling in junction trees. UAI'95, pp. 368-375.
Lauritzen S.L., Speed T.P. and Vijayan K. (1984): Decomposable graphs and hypergraphs. Journal of the Australian Mathematical Society A 36:12-29.
Lauritzen S.L. and Spiegelhalter D.J. (1988): Local computations with probabilities on graphical structures and their applications to expert systems. Journal of the Royal Statistical Society B 50(2):157-224.
Leimer H.-G. (1985): Strongly Decomposable Graphs and Hypergraphs. Thesis, Ber. Stochast. Verw. Geb. 85-1, University of Mainz.
Li Z. and D'Ambrosio B. (1994): Efficient inference in Bayes nets as a combinatorial optimization problem. International Journal of Approximate Reasoning 11(1):55-81.
Lindley D.V. (1982): Scoring rules and the inevitability of probability. International Statistical Review 50:1-26.
Miller R.A., Pople H.E. and Myers J.P. (1982): INTERNIST-1: An experimental computer-based diagnostic consultant for general internal medicine. New England Journal of Medicine 307(8):468-470.
Olmsted S. (1983): On Representing and Solving Decision Problems. PhD thesis, Department of Engineering-Economic Systems, Stanford University.
Pan H.P. and McMichael D.W. (1997): Information fusion, causal probabilistic network and Probanet I: Information fusion infrastructure and probabilistic knowledge representation. Proc. 1st Intl. Workshop on Image Analysis and Information Fusion, Adelaide, Australia.
Pearl J. (1982): Reverend Bayes on inference engines: A distributed hierarchical approach. Proc. National Conference on AI, Pittsburgh, pp. 133-136.
Pearl J. (1988): Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Rose D.J., Tarjan R.E. and Lueker G.S. (1976): Algorithmic aspects of vertex elimination on graphs. SIAM Journal on Computing 5:266-283.
Saul L.K., Jaakkola T. and Jordan M.I. (1996): Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research 4:61-76.
Savage L.J. (1972): The Foundations of Statistics, 2nd Edition. New York: Dover Publications.
Shachter R.D. (1986): Evaluating influence diagrams. Operations Research 33(6):871-882.
Shachter R.D. (1990): Evidence absorption and propagation through evidence reversals. In: M. Henrion et al. (eds), Uncertainty in Artificial Intelligence 5, pp. 173-190. Elsevier Science, Amsterdam.
Shachter R.D., Andersen S.K. and Szolovits P. (1991): The equivalence of exact methods for probabilistic inference on belief networks. Technical report, Department of Engineering-Economic Systems, Stanford University.
Shachter R.D., D'Ambrosio B. and del Favero B.A. (1990): Symbolic probabilistic inference in belief networks. Proc. 8th Nat. Conf. on AI, pp. 126-131. MIT Press.
Shachter R.D. and Peot M.A. (1989): Simulation approaches to general probabilistic inference on belief networks. UAI'89.
Shafer G. and Shenoy P. (1990): Probability propagation. Annals of Mathematics and Artificial Intelligence 2:327-352.
Smith J.Q. (1989): Influence diagrams for statistical modeling. The Annals of Statistics 17(2):654-672.
Tarjan R.E. and Yannakakis M. (1984): Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Computing 13:566-579.
Vorob'ev N.N. (1963): Markov measures and Markov extensions. Theory of Probability and Applications 8:420-429.
Wermuth N. and Lauritzen S.L. (1983): Graphical and recursive models for contingency tables. Biometrika 70:537-552.
Wold H.D.A. (1954): Causality and econometrics. Econometrica 28:443-463.
Wright S. (1921): Correlation and causation. Journal of Agricultural Research 20:557-585.
Wright S. (1934): The method of path coefficients. Annals of Math. Statistics 5:161-215.
Zhang N.L. and Poole D. (1994): A simple approach to Bayesian network computations. Proc. 10th Canadian Conf. on AI, pp. 171-178.
Zhang N.L. and Poole D. (1996): Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research 5:301-328.