Mini-buckets: a general scheme for bounded inference
Rina Dechter
Department of Information and Computer Science, University of California, Irvine
[email protected]
and
Irina Rish
IBM T.J. Watson Research Center
[email protected]
Fig. 4. (a) A belief network. The dotted arc $(B, C)$ is added by the moral graph. (b) The induced graph along $o = (A, E, D, C, B)$, and (c) the induced graph along $o = (A, B, C, D, E)$.
corresponding to the highest-index variable in the function's scope (clearly, this is one of the "lower" buckets). If $X_p$ is observed (e.g., $X_p = a$), then $X_p$ is assigned the value $a$ in each of the bucket's functions, and each resulting function is placed in its highest-variable bucket. This simplifies computation, and graphically corresponds to removing the evidence node from the moral graph. Note that constants (the results of eliminating a variable that is the only argument of a function) are placed into the first (lowest) bucket. Finally, the algorithm processes the lowest bucket, $bucket_1$. The algorithm returns the normalized product of functions in $bucket_1$, which yields the updated belief $P(X_1 \mid e)$. Note that only multiplication (no summation over $X_1$) needs to be performed in this bucket. The following example illustrates elim-bel on the network in Figure 4a.
EXAMPLE 2. Given the belief network in Figure 4a, the ordering $o = (A, E, D, C, B)$,
ACM Journal Name, Vol. V, No. N, Month 20YY.
Fig. 5. An example of elim-bel's execution. The buckets are processed in the order B, C, D, E, A:
bucket B: $P(e|b,c)$, $P(d|a,b)$, $P(b|a)$
bucket C: $P(c|a)$, $h^B(a,d,c,e)$
bucket D: $h^C(a,d,e)$
bucket E: $E = 0$, $h^D(a,e)$
bucket A: $P(a)$, $h^E(a)$
Result: $P(a \mid E = 0)$
and evidence $e = \{E = 0\}$, $Bel(a) = P(a \mid E = 0)$ is computed by bucket elimination as follows. First, all the CPTs are partitioned into the ordered buckets as shown in Figure 5. At first only the CPT functions (shown in non-bold style in Figure 5) are included. (Note that upper-case letters denote nodes, and lower-case letters denote their values.) Carrying out the computation from right to left along the ordering using the bucket data-structure, and denoting by $h^X$ the function computed in bucket $X$, we get new functions as demonstrated in Figure 5. The computation performed in each bucket is given by:
1. bucket B: $h^B(a,d,c,e) = \sum_b P(e|b,c) P(d|a,b) P(b|a)$
2. bucket C: $h^C(a,d,e) = \sum_c P(c|a) h^B(a,d,c,e)$
3. bucket D: $h^D(a,e) = \sum_d h^C(a,d,e)$
4. bucket E: $h^E(a) = h^D(a, E = 0)$
5. bucket A: $Bel(a) = P(a \mid E = 0) = \alpha P(a) h^E(a)$, where $\alpha$ is a normalizing constant.
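The five bucket computations of this example can be traced in a few lines of code. The following is a minimal sketch, not taken from the paper: it assumes binary variables and randomly generated CPTs (the helper `cpt` and all table names are hypothetical), runs the buckets in the order B, C, D, E, A, and compares the result with a brute-force summation of the joint distribution.

```python
import itertools
import random

rng = random.Random(0)
V = (0, 1)  # assumption: every variable is binary

def cpt(n_parents):
    """Random table t[(child, *parents)] = P(child | parents)."""
    t = {}
    for pa in itertools.product(V, repeat=n_parents):
        p = rng.random()
        t[(0,) + pa], t[(1,) + pa] = p, 1.0 - p
    return t

P_a = cpt(0)  # P(a),     keyed by (a,)
P_b = cpt(1)  # P(b|a),   keyed by (b, a)
P_c = cpt(1)  # P(c|a),   keyed by (c, a)
P_d = cpt(2)  # P(d|a,b), keyed by (d, a, b)
P_e = cpt(2)  # P(e|b,c), keyed by (e, b, c)

# bucket B: sum over b of all functions mentioning B
hB = {(a, d, c, e): sum(P_e[e, b, c] * P_d[d, a, b] * P_b[b, a] for b in V)
      for a, d, c, e in itertools.product(V, repeat=4)}
# bucket C: sum over c
hC = {(a, d, e): sum(P_c[c, a] * hB[a, d, c, e] for c in V)
      for a, d, e in itertools.product(V, repeat=3)}
# bucket D: sum over d
hD = {(a, e): sum(hC[a, d, e] for d in V) for a in V for e in V}
# bucket E: E is observed, so E = 0 is simply assigned (no summation)
hE = {a: hD[a, 0] for a in V}
# bucket A: multiply and normalize
unnormalized = {a: P_a[(a,)] * hE[a] for a in V}
alpha = 1.0 / sum(unnormalized.values())
bel = {a: alpha * unnormalized[a] for a in V}

# Brute-force check: sum the joint over b, c, d with E = 0, then normalize.
def joint(a, b, c, d, e):
    return P_a[(a,)] * P_b[b, a] * P_c[c, a] * P_d[d, a, b] * P_e[e, b, c]

brute = {a: sum(joint(a, b, c, d, 0) for b, c, d in itertools.product(V, repeat=3))
         for a in V}
z = sum(brute.values())
brute = {a: p / z for a, p in brute.items()}
```

The largest table recorded, `hB`, has four arguments, which is what makes this ordering expensive for the exact algorithm.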
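Similar bucket elimination algorithms were derived for the tasks of finding MPE, MAP, and for finding the maximum expected utility (MEU) [Dechter 1996; 1999]. Given a belief network $(G, P)$, a variable ordering $o = (X_1, \ldots, X_n)$, and evidence $e$, the MPE task is to find the most-probable assignment to the variables, namely, to find
$$x^{MPE} = (x_1^{MPE}, \ldots, x_n^{MPE}) = \arg\max_{x_1, \ldots, x_n} P(x_1, \ldots, x_n \mid e). \qquad (5)$$
Given functions $h_1, \ldots, h_j$ defined over scopes $S_1, \ldots, S_j$, the product function $\prod_j h_j$ and the sum function $\sum_j h_j$ are defined over $U = \cup_j S_j$. For every assignment $u$ to $U$, $(\prod_j h_j)(u) = \prod_j h_j(u_{S_j})$ and $(\sum_j h_j)(u) = \sum_j h_j(u_{S_j})$, where $u_{S_j}$ denotes the restriction of $u$ to $S_j$.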
An important property of bucket elimination algorithms is that their complexity can be predicted using a graph parameter called induced width [Dechter and Pearl 1987] (also known as tree-width [Arnborg 1985]), which describes the largest clique created in the graph by bucket elimination, and which corresponds to the largest scope of a function recorded by the algorithm.
DEFINITION 4. [induced width] Given an undirected graph $G$, the width of $X_i$ along ordering $o$ is the number of $X_i$'s neighbors preceding $X_i$ in $o$. The width of the graph
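Fig. 7. The idea of mini-bucket approximation. The bucket of $X$, $bucket(X) = \{h_1, \ldots, h_r, h_{r+1}, \ldots, h_n\}$, for which the exact computation is $h^X = \max_X \prod_{i=1}^{n} h_i$, is split into the mini-buckets $\{h_1, \ldots, h_r\}$ and $\{h_{r+1}, \ldots, h_n\}$, which yield $g^X = (\max_X \prod_{i=1}^{r} h_i) \cdot (\max_X \prod_{i=r+1}^{n} h_i)$, with $g^X \ge h^X$.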
along $o$, denoted $w_o$, is the maximum width over all variables along $o$. The induced graph of $G$ along $o$ is obtained by recursively connecting the preceding neighbors of each $X_i$, going from $i = n$ to $i = 1$. The induced width along $o$, denoted $w^*_o$, is the width of the induced graph along $o$, while the induced width $w^*$ is the minimum induced width along any ordering.
EXAMPLE 3. Figures 4b and 4c depict the induced graphs (induced edges are shown as dashed lines) of the moral graph in Figure 4a along the orderings $o = (A, E, D, C, B)$ and $o' = (A, B, C, D, E)$, respectively. Clearly, $w^*_o = 4$ and $w^*_{o'} = 2$.
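The definition of induced width translates directly into a short procedure. The sketch below is illustrative only: the function name `induced_width` is hypothetical, the edge list is the moral graph implied by the CPTs $P(b|a)$, $P(c|a)$, $P(d|a,b)$, $P(e|b,c)$ (including the moral arc between $B$ and $C$), and the two orderings are assumptions chosen for illustration, the first matching the bucket order B, C, D, E, A of Figure 5.

```python
def induced_width(edges, ordering):
    """Induced width of an undirected graph along `ordering` (first to last)."""
    pos = {v: k for k, v in enumerate(ordering)}
    nbrs = {v: set() for v in ordering}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    width = 0
    # Process variables from last to first, as bucket elimination does.
    for v in reversed(ordering):
        earlier = [u for u in nbrs[v] if pos[u] < pos[v]]
        width = max(width, len(earlier))
        # Connect all preceding neighbors of v (these are the induced edges).
        for x in earlier:
            for y in earlier:
                if x != y:
                    nbrs[x].add(y)
    return width

# Moral graph of the network: family edges plus the moral arc B-C.
moral_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"),
               ("B", "E"), ("C", "E"), ("B", "C")]

w_first = induced_width(moral_edges, ["A", "E", "D", "C", "B"])
w_second = induced_width(moral_edges, ["A", "B", "C", "D", "E"])
```

Under these assumptions the first ordering gives width 4 (processing $B$ first connects $A$, $C$, $D$, $E$), while the second gives width 2, which shows how strongly the ordering affects the cost of exact elimination.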
It can be shown that
THEOREM 1. [Dechter 1999] The time and space complexity of bucket elimination algorithms is $O(n \cdot d^{w^*_o + 1})$, where $n$ is the number of variables, $d$ bounds the variables' domain size and $w^*_o$ is the induced width of the moral graph along ordering $o$, after all evidence nodes and their adjacent edges are removed.
The induced width will vary depending on the variable ordering. Although finding a minimum-$w^*$ ordering is NP-hard [Arnborg 1985], heuristic algorithms are investigated in [Bertele and Brioschi 1972; Dechter 1992; Kjaerulff 1990; 1992; Robertson and Seymour 1995; Bodlaender 1997; Bodlaender et al. 2001]. For more details on bucket elimination and induced width see [Dechter 1999].
3. MINI-BUCKET APPROXIMATION FOR MPE
We will introduce the idea of mini-bucket approximation using the combinatorial optimization task of finding the most probable explanation, MPE. Since the MPE task is NP-hard and since complete algorithms (such as the cycle cutset technique, join-tree-clustering [Pearl 1988] and bucket elimination [Dechter 1996]) work well only on relatively sparse networks, approximation methods are necessary. Researchers investigated several approaches for finding MPE. The suitability of Stochastic Local Search (SLS) algorithms for MPE was studied in the context of medical diagnosis applications [Peng and Reggia 1989] and, more recently, in [Kask and Dechter 1999b]. Best-First
search algorithms were proposed [Shimony and Charniack 1991] as well as algorithms based on linear programming [Santos 1991]. In this paper, we propose approximation algorithms based on bucket elimination.
Consider the bucket-elimination algorithm elim-mpe. Since the complexity of processing a bucket depends on the number of arguments (arity) of the functions being recorded, we propose to approximate these functions by a collection of smaller-arity functions. Let $h_1, \ldots, h_t$ be the functions in the bucket of $X_p$, and let $S_1, \ldots, S_t$ be their scopes. When elim-mpe processes bucket($X_p$), the function $h^p = \max_{X_p} \prod_{i=1}^{t} h_i$ is computed. A simple approximation idea is to compute an upper bound on $h^p$ by "migrating" the maximization inside the multiplication. Since, in general, for any two non-negative functions $Z(x)$ and $Y(x)$, $\max_x Z(x) \cdot Y(x) \le \max_x Z(x) \cdot \max_x Y(x)$, this approximation will compute an upper bound on $h^p$. For example, we can compute a new function $g^p = \prod_{i=1}^{t} \max_{X_p} h_i$ that is an upper bound on $h^p$. Procedurally it means that maximization is applied separately to each function, requiring less computation.
The idea is demonstrated in Figure 7, where the bucket of variable $X$ having $n$ functions is split into two mini-buckets of size $r$ and $(n - r)$, $r \le n$, and it can be generalized to any partitioning of a set of functions $h_1, \ldots, h_t$ into subsets called mini-buckets. Let $Q = \{Q_1, \ldots, Q_r\}$ be a partitioning into mini-buckets of the functions $h_1, \ldots, h_t$ in $X_p$'s bucket, where the mini-bucket $Q_l$ contains the functions $h_{l_1}, \ldots, h_{l_r}$. The complete algorithm elim-mpe computes $h^p = \max_{X_p} \prod_{i=1}^{t} h_i$, which can be rewritten as $h^p = \max_{X_p} \prod_{l=1}^{r} \prod_{l_i} h_{l_i}$. By migrating maximization into each mini-bucket we can compute: $g^p_Q = \prod_{l=1}^{r} \max_{X_p} \prod_{l_i} h_{l_i}$. The new functions $\max_{X_p} \prod_{l_i} h_{l_i}$ are placed separately into the bucket of the highest variable in their scope and the algorithm proceeds with the next variable. Functions without arguments (i.e., constants) are placed in the lowest bucket. The maximized product generated in the first bucket is an upper bound on the MPE probability. A lower bound can also be computed as the probability of a (suboptimal) assignment found in the forward step of the algorithm. Clearly, as the mini-buckets get smaller, both complexity and accuracy decrease.
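The inequality behind the approximation can be checked numerically on a toy bucket. The sketch below is an illustration only, under assumed binary domains and random non-negative tables; the names `h1`, `h2`, `g1`, `g2` are hypothetical. The bucket of $X$ holds $h_1(x, y)$ and $h_2(x, z)$: exact processing records one function of arity 2, while the mini-bucket version records two functions of arity 1 whose product bounds it from above.

```python
import random

rng = random.Random(1)
V = (0, 1)  # assumption: binary domains

# Two non-negative functions in the bucket of X.
h1 = {(x, y): rng.random() for x in V for y in V}
h2 = {(x, z): rng.random() for x in V for z in V}

# Exact processing: h(y, z) = max_x h1(x, y) * h2(x, z), one table of arity 2.
h = {(y, z): max(h1[x, y] * h2[x, z] for x in V) for y in V for z in V}

# Mini-bucket processing: maximize each function separately (arity 1 each).
g1 = {y: max(h1[x, y] for x in V) for y in V}
g2 = {z: max(h2[x, z] for x in V) for z in V}

# g1(y) * g2(z) is an upper bound on h(y, z) for every (y, z).
is_upper_bound = all(h[y, z] <= g1[y] * g2[z] + 1e-15 for y in V for z in V)
```

The bound is loose exactly when the two functions attain their maxima at different values of $X$, since the mini-buckets are then allowed to "choose" inconsistent values.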
DEFINITION 5. Given two partitionings $Q'$ and $Q''$ over the same set of elements, $Q'$ is a refinement of $Q''$ if and only if for every set $A \in Q'$ there exists a set $B \in Q''$ such that $A \subseteq B$.
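PROPOSITION 1. If $Q''$ is a refinement of $Q'$ in $bucket_p$, then $h^p \le g^p_{Q'} \le g^p_{Q''}$.
PROOF. Based on the above discussion it is easy to see that for any partitioning $Q$ (be it $Q'$ or $Q''$) we have $h^p \le g^p_Q$. By definition, given a refinement $Q'' = \{Q''_1, \ldots, Q''_k\}$ of a partitioning $Q' = \{Q'_1, \ldots, Q'_r\}$, each mini-bucket $Q''_i$ of $Q''$ belongs to some mini-bucket $Q'_j$ of $Q'$. In other words, each mini-bucket $Q'_j$ of $Q'$ is further partitioned into the corresponding mini-buckets of $Q''$. Therefore,
$$g^p_{Q''} = \prod_{i=1}^{k} \max_{X_p} \prod_{h \in Q''_i} h = \prod_{j=1}^{r} \prod_{Q''_i \subseteq Q'_j} \max_{X_p} \prod_{h \in Q''_i} h \ge \prod_{j=1}^{r} \max_{X_p} \prod_{h \in Q'_j} h = g^p_{Q'}.$$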
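The ordering of the bounds under refinement can be observed on a small bucket. This is a sketch under assumptions that are not in the paper: three random non-negative functions $h_k(x, y_k)$ over binary variables, and a hypothetical helper `g` that evaluates the mini-bucket bound for a partitioning given as lists of function indices.

```python
import itertools
import random

rng = random.Random(2)
V = (0, 1)  # assumption: binary domains

# Hypothetical bucket of X: three functions h[k](x, y_k), each with a private argument.
h = [{(x, y): rng.random() for x in V for y in V} for _ in range(3)]

def g(partition, ys):
    """Product over mini-buckets of max_x of the product inside each mini-bucket."""
    result = 1.0
    for mini in partition:
        best = 0.0
        for x in V:
            p = 1.0
            for k in mini:
                p *= h[k][x, ys[k]]
            best = max(best, p)
        result *= best
    return result

exact = [[0, 1, 2]]       # a single mini-bucket: the exact function h^p
coarse = [[0, 1], [2]]    # a partitioning Q'
fine = [[0], [1], [2]]    # a refinement Q'' of Q'

ordered = all(
    g(exact, ys) <= g(coarse, ys) + 1e-12 and g(coarse, ys) <= g(fine, ys) + 1e-12
    for ys in itertools.product(V, repeat=3)
)
```

Note that the guarantee only relates a partitioning to its refinements; two unrelated partitionings, such as `[[0, 1], [2]]` and `[[0], [1, 2]]`, need not be ordered.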
The mini-bucket elimination (mbe) algorithm for finding MPE, mbe-mpe(i,m), is described in Figure 8. It has two input parameters that control the mini-bucket partitioning.

Algorithm mbe-mpe(i,m)
Input: A belief network $(G, P)$, an ordering $o$, evidence $e$.
Output: An upper bound $U$ and a lower bound $L$ on the $MPE = \max_x P(x, e)$, and a suboptimal solution $x^a$ that provides $L = P(x^a)$.
1. Initialize: Partition $P = \{P_1, \ldots, P_n\}$ into buckets $bucket_1, \ldots, bucket_n$, where $bucket_p$ contains all CPTs $h_1, \ldots, h_t$ whose highest-index variable is $X_p$.
2. Backward: for $p = n$ to $2$ do
   If $X_p$ is observed ($X_p = a$), assign $X_p = a$ in each $h_j$ and put the result in its highest-variable bucket (put constants in $bucket_1$).
   Else for $h_1, \ldots, h_t$ in $bucket_p$ do
      Generate an $(i,m)$-mini-bucket-partitioning, $Q' = \{Q_1, \ldots, Q_r\}$.
      for each $Q_l \in Q'$ containing $h_{l_1}, \ldots, h_{l_t}$, do
         compute $h^l = \max_{X_p} \prod_{j=1}^{t} h_{l_j}$ and place it in the bucket of the highest-index variable in $U_l = \cup_j S_{l_j} - \{X_p\}$, where $S_{l_j}$ is the scope of $h_{l_j}$ (put constants in $bucket_1$).
3. Forward: for $p = 1$ to $n$, do
   given $x^a_1, \ldots, x^a_{p-1}$, assign a value $x^a_p$ to $X_p$ that maximizes the product of all functions in $bucket_p$.
4. Return the assignment $x^a = (x^a_1, \ldots, x^a_n)$, a lower bound $L = P(x^a)$, and an upper bound $U = \max_{x_1} \prod_{h_j \in bucket_1} h_j$ on the $MPE = \max_x P(x, e)$.

Fig. 8. Algorithm mbe-mpe(i,m).
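The backward phase of the algorithm can be prototyped compactly. The following sketch is not the paper's implementation and simplifies it in ways that should be read as assumptions: variables are binary, there is no evidence, every bucket (including the first) is processed by maximization so that the bound ends up as a constant, and the partitioning is a greedy first-fit that only enforces the bound $i$ on the number of variables (it ignores $m$ and does not try to be refinement-maximal). All function names are hypothetical.

```python
import itertools
import random

V = (0, 1)  # assumption: every variable is binary

def cpt(child, parents, rng):
    """Random CPT P(child | parents) stored as (scope, table)."""
    scope = (child,) + tuple(parents)
    table = {}
    for pa in itertools.product(V, repeat=len(parents)):
        p = rng.random()
        table[(0,) + pa], table[(1,) + pa] = p, 1.0 - p
    return scope, table

def product_at(factors, assignment):
    """Product of all factors evaluated at a full assignment of their scopes."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(assignment[v] for v in scope)]
    return p

def max_out(var, factors):
    """max over `var` of the product of `factors`, as a new (scope, table)."""
    scope = tuple(sorted({v for s, _ in factors for v in s} - {var}))
    table = {}
    for vals in itertools.product(V, repeat=len(scope)):
        a = dict(zip(scope, vals))
        table[vals] = max(product_at(factors, {**a, var: x}) for x in V)
    return scope, table

def partition(factors, i):
    """Greedy first-fit: each mini-bucket mentions at most i variables."""
    minis = []
    for f in sorted(factors, key=lambda f: -len(f[0])):
        for mini in minis:
            if len({v for s, _ in mini + [f] for v in s}) <= i:
                mini.append(f)
                break
        else:
            minis.append([f])
    return minis

def mbe_mpe_upper(factors, ordering, i):
    """Upper bound on max_x prod(factors) by mini-bucket elimination."""
    pos = {v: k for k, v in enumerate(ordering)}
    buckets = {v: [] for v in ordering}
    constants = []

    def place(f):
        if f[0]:  # put f in the bucket of its highest-index variable
            buckets[max(f[0], key=pos.get)].append(f)
        else:     # a function without arguments is a constant
            constants.append(f[1][()])

    for f in factors:
        place(f)
    for var in reversed(ordering):
        for mini in partition(buckets[var], i):
            place(max_out(var, mini))
    bound = 1.0
    for c in constants:
        bound *= c
    return bound

def brute_force_mpe(factors, variables):
    return max(product_at(factors, dict(zip(variables, vals)))
               for vals in itertools.product(V, repeat=len(variables)))

rng = random.Random(3)
net = [cpt("A", "", rng), cpt("B", "A", rng), cpt("C", "A", rng),
       cpt("D", "AB", rng), cpt("E", "BC", rng)]
order = ["A", "E", "D", "C", "B"]
exact = brute_force_mpe(net, order)
upper_i3 = mbe_mpe_upper(net, order, 3)  # approximate: bucket B is split
upper_i5 = mbe_mpe_upper(net, order, 5)  # no split needed: exact elimination
```

With the bound set to 5 no bucket of this network is split, so the procedure reduces to exact max-product elimination; with the bound set to 3, the bucket of $B$ is split into two mini-buckets and the result is only an upper bound.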
DEFINITION 6. [(i,m)-partitioning] Let $H$ be a collection of functions $h_1, \ldots, h_t$ defined on scopes $S_1, \ldots, S_t$, respectively. We say that a function $f$ is subsumed by a function $h$ if any argument of $f$ is also an argument of $h$. A partitioning of $h_1, \ldots, h_t$ is canonical if any function $f$ subsumed by another function is placed into the mini-bucket of one of those subsuming functions. A partitioning $Q$ into mini-buckets is an $(i,m)$-partitioning if and only if (1) it is canonical, (2) at most $m$ non-subsumed functions are included in each mini-bucket, (3) the total number of variables in a mini-bucket does not exceed $i$, and (4) the partitioning is refinement-maximal, namely, there is no other $(i,m)$-partitioning that it refines.
The parameters $i$ (number of variables) and $m$ (number of functions allowed per mini-bucket) are not independent, and some combinations of $i$ and $m$ do not allow an $(i,m)$-partitioning. However,
PROPOSITION 2. If the bound $i$ on the number of variables in a mini-bucket is not smaller than the maximum family size, then, for any value of $m > 0$, there exists an $(i,m)$-partitioning of each bucket.
PROOF. For $m = 1$, each mini-bucket contains one family. The arity of the recorded functions will only decrease and thus in each bucket an $(i,1)$-partitioning always exists. The set of $(i,m)$-partitionings that satisfy conditions 1-3 (but not necessarily condition 4) always includes all $(i,1)$-partitionings satisfying conditions 1-3. Therefore, the set of $(i,m)$-partitionings satisfying conditions 1-3 is never empty, and there exists an $(i,m)$-partitioning satisfying conditions 1-4.
Although the two parameters $i$ and $m$ are not independent, they do allow a flexible control of the mini-bucket scheme. The properties of the mini-bucket algorithms are summarized
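Fig. 9. Comparison between (a) elim-mpe and (b) mbe-mpe(3,2).
(a) A trace of elim-mpe. Complexity: $O(\exp(5))$.
bucket B: $P(e|b,c)$, $P(d|a,b)$, $P(b|a)$
bucket C: $P(c|a)$, $h^B(a,d,c,e)$
bucket D: $h^C(a,d,e)$
bucket E: $E = 0$, $h^D(a,e)$
bucket A: $P(a)$, $h^E(a)$
Result: MPE
(b) A trace of mbe-mpe(3,2). Complexity: $O(\exp(3))$. The number in parentheses is the maximum number of variables in a mini-bucket.
bucket B: mini-buckets $\{P(e|b,c)\}$ and $\{P(d|a,b), P(b|a)\}$, each processed by $\max_B$ (3)
bucket C: $P(c|a)$, $h^B(e,c)$ (3)
bucket D: $h^B(d,a)$ (2)
bucket E: $E = 0$, $h^C(e,a)$ (2)
bucket A: $P(a)$, $h^E(a)$, $h^D(a)$ (1)
Result: $U$ = upper bound (MPE)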
in the following theorem.
THEOREM 2. Algorithm mbe-mpe(i,m) computes an upper bound on the MPE. Its time and space complexity is $O(n \cdot \exp(i))$ where $i \le n$.
We will prove the theorem later (Section 7) in a more general setting, common to all mini-bucket elimination algorithms. In general, as $m$ and $i$ increase, we get more accurate approximations. Note, however, that a monotonic increase in accuracy as a function of $i$ can be guaranteed only for refinements of a given partitioning.
EXAMPLE 4. Figure 9 compares algorithms elim-mpe and mbe-mpe(i,m) where $i = 3$ and $m = 2$ over the network in Figure 4a along the ordering $o = (A, E, D, C, B)$. The exact algorithm elim-mpe sequentially records the new functions (shown in boldface) $h^B(a,d,c,e)$, $h^C(a,d,e)$, $h^D(a,e)$, and $h^E(a)$. Then, in the bucket of $A$, it computes $M = \max_a P(a) h^E(a)$. Subsequently, an MPE assignment $(A = a', B = b', C = c', D = d', E = e')$, where $e' = 0$ is the evidence, is computed along $o$ by selecting a value that maximizes the product of functions in the corresponding buckets conditioned on the previously assigned values. Namely, $a' = \arg\max_a P(a) h^E(a)$, $e' = 0$, $d' = \arg\max_d h^C(a', d, e = 0)$, and so on.
On the other hand, since bucket($B$) includes five variables, mbe-mpe(3,2) splits it into two mini-buckets $\{P(e|b,c)\}$ and $\{P(d|a,b), P(b|a)\}$, each containing no more than 3 variables, as shown in Figure 9b (the (3,2)-partitioning is selected arbitrarily). The new functions $h^B(e,c)$ and $h^B(d,a)$ are generated in different mini-buckets and are placed independently in lower buckets. In each of the remaining lower buckets that still need to be processed, the number of variables is not larger than 3 and therefore no further partitioning occurs. An upper bound on the MPE value is computed by maximizing over $A$ the product of functions in $A$'s bucket: $U = \max_a P(a) h^E(a) h^D(a)$. Once all the buckets are processed, a suboptimal MPE tuple is computed by assigning a value to each variable that maximizes the product of functions in the corresponding bucket. By design, mbe-mpe(3,2) does not produce functions on more than 2 variables, while the exact algorithm elim-mpe records a function on 4 variables.
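The trace of mbe-mpe(3,2) on this example, including the forward phase that yields the lower bound, can be reproduced with hard-coded buckets. The code below is a sketch under assumptions not stated in the paper: binary variables, randomly generated CPTs via a hypothetical helper `cpt`, and evidence $E = 0$. It uses the partitioning of bucket $B$ into $\{P(e|b,c)\}$ and $\{P(d|a,b), P(b|a)\}$ and checks that the exact MPE value lies between the two bounds.

```python
import itertools
import random

rng = random.Random(4)
V = (0, 1)  # assumption: every variable is binary

def cpt(n_parents):
    """Random table t[(child, *parents)] = P(child | parents)."""
    t = {}
    for pa in itertools.product(V, repeat=n_parents):
        p = rng.random()
        t[(0,) + pa], t[(1,) + pa] = p, 1.0 - p
    return t

# Keys: P_a[(a,)], P_b[b, a], P_c[c, a], P_d[d, a, b], P_e[e, b, c]
P_a, P_b, P_c, P_d, P_e = cpt(0), cpt(1), cpt(1), cpt(2), cpt(2)

# Backward phase. Bucket B is split into two mini-buckets, each maximized over b.
hB_ec = {(e, c): max(P_e[e, b, c] for b in V) for e in V for c in V}
hB_da = {(d, a): max(P_d[d, a, b] * P_b[b, a] for b in V) for d in V for a in V}
# Bucket C: P(c|a) and hB(e,c); bucket D: hB(d,a); bucket E: evidence E = 0.
hC = {(e, a): max(P_c[c, a] * hB_ec[e, c] for c in V) for e in V for a in V}
hD = {a: max(hB_da[d, a] for d in V) for a in V}
hE = {a: hC[0, a] for a in V}
# Bucket A: the maximized product is the upper bound U.
U = max(P_a[(a,)] * hE[a] * hD[a] for a in V)

# Forward phase along o = (A, E, D, C, B): maximize each bucket's product
# given the values already assigned.
a_s = max(V, key=lambda a: P_a[(a,)] * hE[a] * hD[a])
e_s = 0
d_s = max(V, key=lambda d: hB_da[d, a_s])
c_s = max(V, key=lambda c: P_c[c, a_s] * hB_ec[e_s, c])
b_s = max(V, key=lambda b: P_e[e_s, b, c_s] * P_d[d_s, a_s, b] * P_b[b, a_s])

def joint(a, b, c, d, e):
    return P_a[(a,)] * P_b[b, a] * P_c[c, a] * P_d[d, a, b] * P_e[e, b, c]

L = joint(a_s, b_s, c_s, d_s, e_s)  # probability of the returned assignment
mpe = max(joint(a, b, c, d, 0) for a, b, c, d in itertools.product(V, repeat=4))
```

No table recorded in the backward phase has more than two arguments, in contrast with the four-argument function recorded by the exact algorithm.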
In summary, algorithm mbe-mpe(i,m) computes an interval $[L, U]$ containing the MPE value where $U$ is the upper bound computed by the backward phase and $L$ is the probability of the returned assignment.
Remember that mbe-mpe computes the bounds on $MPE = \max_x P(x, e)$, rather than on $M = \max_x P(x \mid e) = MPE / P(e)$. Thus
$$\frac{L}{P(e)} \le M \le \frac{U}{P(e)}.$$
Clearly the bounds $U$ and $L$ for $MPE$ are very close to zero when the evidence is unlikely; however, the ratio between the upper and the lower bound does not depend on $P(e)$. As we will see next, approximating conditional probabilities using bounds on joint probabilities is more problematic for belief updating.
4. MINI-BUCKET APPROXIMATION FOR BELIEF UPDATING
As shown in Section 2, the bucket elimination algorithm elim-bel for belief assessment is similar to elim-mpe except that maximization is replaced by summation and no value assignment is generated. Algorithm elim-bel finds $P(x_1, e)$ and then computes $P(x_1 \mid e) = \alpha P(x_1, e)$ where $\alpha$ is the normalization constant (see Figure 3).
The mini-bucket idea used for approximating MPE can be applied to belief updating in a similar way. Let $Q' = \{Q_1, \ldots, Q_r\}$ be a partitioning of the functions $h_1, \ldots, h_t$ (defined over scopes $S_1, \ldots, S_t$, respectively) in $X_p$'s bucket. Algorithm elim-bel computes $h^p : U_p \to R$, where $h^p = \sum_{X_p} \prod_{i=1}^{t} h_i$, and $U_p = \cup_i S_i - \{X_p\}$. Note that $h^p = \sum_{X_p} \prod_{i=1}^{t} h_i$ can be rewritten as $h^p = \sum_{X_p} \prod_{l=1}^{r} \prod_{l_i} h_{l_i}$. If we follow the MPE approximation precisely and migrate the summation operator into each mini-bucket, we will compute $f^p_{Q'} = \prod_{l=1}^{r} \sum_{X_p} \prod_{l_i} h_{l_i}$. This, however, is an unnecessarily large upper bound of $h^p$ in which each $\prod_{l_i} h_{l_i}$ is bounded by $\sum_{X_p} \prod_{l_i} h_{l_i}$. Instead, we rewrite $h^p = \sum_{X_p} (\prod_{1_i} h_{1_i}) \cdot (\prod_{l=2}^{r} \prod_{l_i} h_{l_i})$. Subsequently, instead of bounding a function of $X_p$ by its sum over $X_p$, we can bound it, for $l > 1$, by its maximum over $X_p$, which yields $g^p_{Q'} = (\sum_{X_p} \prod_{1_i} h_{1_i}) \cdot \prod_{l=2}^{r} \max_{X_p} \prod_{l_i} h_{l_i}$. In summary, an upper bound $g^p$ of $h^p$ can be obtained by processing one of $X_p$'s mini-buckets by summation and the rest by maximization. Clearly,
PROPOSITION 3. For every partitioning $Q$, $h^p \le g^p_Q \le f^p_Q$. Also, if $Q''$ is a refinement partitioning of $Q'$, then $h^p \le g^p_{Q'} \le g^p_{Q''}$.
A lower bound on the belief, or its mean value, can be obtained in a similar way. Algorithm mbe-bel-max(i,m) that uses the $\max$ elimination operator is described in Figure 10. Algorithms mbe-bel-min and mbe-bel-mean can be obtained by replacing the operator $\max$ by $\min$ and by $mean$, respectively.
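The relation between the exact summation and the sum/max, sum/sum, and sum/min variants can be checked on a two-function bucket. The sketch below is illustrative and rests on assumptions not in the paper: binary domains, random non-negative tables, and hypothetical names (`f1`, `f2`, `g_max`, `f_sum`, `g_min`). The first mini-bucket $\{f_1\}$ is processed by summation, the second $\{f_2\}$ by maximization, summation, or minimization.

```python
import random

rng = random.Random(5)
V = (0, 1)  # assumption: binary domains

# Bucket of X with two non-negative functions, split into {f1} and {f2}.
f1 = {(x, y): rng.random() for x in V for y in V}
f2 = {(x, z): rng.random() for x in V for z in V}
pairs = [(y, z) for y in V for z in V]

# Exact elimination: sum over x of the full product.
exact = {(y, z): sum(f1[x, y] * f2[x, z] for x in V) for y, z in pairs}

def bound(op):
    """First mini-bucket by summation, second by the operator `op`."""
    return {(y, z): sum(f1[x, y] for x in V) * op(f2[x, z] for x in V)
            for y, z in pairs}

g_max = bound(max)  # upper bound: sum in one mini-bucket, max in the other
f_sum = bound(sum)  # looser upper bound: sum in every mini-bucket
g_min = bound(min)  # lower bound, in the spirit of mbe-bel-min

tol = 1e-12
chain_holds = all(
    g_min[k] <= exact[k] + tol and exact[k] <= g_max[k] + tol
    and g_max[k] <= f_sum[k] + tol
    for k in pairs
)
```

The chain `g_min <= exact <= g_max <= f_sum` holds because each term $f_2(x, z)$ lies between its minimum and its maximum over $x$, and the maximum of non-negative values never exceeds their sum.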

Algorithm mbe-bel-max(i,m)
Input: A belief network $(G, P)$, an ordering $o$, and evidence $e$.
Output: an upper bound on $P(x_1, e)$.
1. Initialize: Partition $P = \{P_1, \ldots, P_n\}$ into buckets $bucket_1, \ldots, bucket_n$, where $bucket_p$ contains all CPTs $h_1, \ldots, h_t$ whose highest-index variable is $X_p$.
2. Backward: for $p = n$ to $2$ do
   If $X_p$ is observed ($X_p = a$), assign $X_p = a$ in each $h_j$ and put the result in the highest-variable bucket of its scope (put constants in $bucket_1$).
   Else for $h_1, \ldots, h_t$ in $bucket_p$ do
      Generate an $(i,m)$-mini-bucket-partitioning, $Q' = \{Q_1, \ldots, Q_r\}$.
      For each $Q_l \in Q'$, containing $h_{l_1}, \ldots, h_{l_t}$, do
         If $l = 1$ compute $h^l = \sum_{X_p} \prod_{j=1}^{t} h_{l_j}$
         Else compute $h^l = \max_{X_p} \prod_{j=1}^{t} h_{l_j}$
         Add $h^l$ to the bucket of the highest-index variable in $U_l = \cup_j S_{l_j} - \{X_p\}$ (put constant functions in $bucket_1$).
3. Return the product of functions in the bucket of $X_1$, which is an upper bound on $P(x_1, e)$ (denoted $g(x_1)$).

Fig. 10. Algorithm mbe-bel-max(i,m).
4.1 Normalization
Note that mbe-bel-max computes an upper bound on $P(x_1, e)$ but not on $P(x_1 \mid e)$. If an exact value of $P(e)$ is not available, deriving a bound on $P(x_1 \mid e)$ from a bound