Using Mutual Information to determine Relevance in Bayesian Networks

A. E. Nicholson and N. Jitnah
School of Computer Science and Software Engineering, Monash University, Clayton, VIC 3168, Australia
fannn,[email protected]

Abstract. The control of Bayesian network (BN) evaluation is important in the development of real-time decision making systems. Techniques which focus attention by considering the relevance of variables in a BN allow more efficient use of computational resources. The statistical concept of mutual information (MI) between two related random variables can be used to measure relevance. We extend this idea to present a new measure of arc weights in a BN, and show how these can be combined to give a measure of the weight of a region of connected nodes. A heuristic path weight of a node or region relative to a specific query is also given. We present results from experiments which show that the MI weights are better than another measure based on the Bhattacharyya distance.

1 Introduction

Belief (or Bayesian) networks (BNs) [13] have become a popular representation for reasoning under uncertainty, as they integrate a graphical representation of causal relationships with a sound Bayesian foundation. Belief network evaluation involves the computation of posterior probability distributions, or beliefs, of query nodes, given evidence about other nodes. An introduction to BNs and an example are given in Sect. 2. The scheduling of BN evaluation is important in the development of real-time decision making systems. Techniques which focus attention by considering the relevance of variables in a BN allow more efficient use of computational resources [13, pp.321-323][14]. The concept of a connection strength between two adjacent nodes in a binary BN was introduced by Boerlage in [3]; this measure was proposed for use when graphically displaying BNs, with a thicker arc indicating higher connection strength. In [10] we proposed new measures for the strength of an arc linking multi-state nodes, which we called the arc weight, and showed how it can be used to design an anytime BN belief update algorithm. We used a measure, wBh, which makes use of the Bhattacharyya distance [2], a statistical measure of the difference between two probability distributions. In this paper we develop a new measure for the arc weight based on the well-known Mutual Information between two related random variables (Sect. 3), extending Pearl's [13, pp.321-323] outline of the possible use of mutual information for "relevance-based control". In Sect. 4 we derive expressions for combining

weights over a region consisting of connected nodes. The concept of the path weight of a node or of a region relative to a specific query is also described, with appropriate formulas to calculate a heuristic estimate (Sect. 5). An experiment was conducted to compare the accuracy of the arc weight measure based on MI with the Bhattacharyya measure (Sect. 6); this experiment concluded that MI weights are more accurate. In Sect. 7 we discuss how the measure may be used to efficiently control BN evaluation, and place it in the context of related research. We conclude with directions for future work in Sect. 8.

2 Belief Network Example

Belief networks (BNs) [13] are directed acyclic graphs where nodes correspond to random variables. The absence of an arc between two nodes indicates an independence assumption. This framework allows a compact representation of the joint probability distribution over all the variables. Each node has a conditional probability distribution (CPD) which gives the probability of each state of the node, for all state combinations of its parents. If a node has no parents, its prior distribution is stored. Evidence is entered by assigning fixed values to nodes. We evaluate a BN by computing posterior probability distributions for some query nodes. These posteriors represent beliefs about the states of query nodes. Evaluation can be done with or without evidence in the network.

Fig. 1 shows an example network taken from standard BN literature [11]. The structure of the network and the CPDs for each node are shown. We use this BN for numerical examples of weight calculations in Sect. 3, 4 and 5. This network is a simple medical diagnosis model. The nodes have the following meanings. A: patient visited Asia recently; T: patient has tuberculosis; S: patient is a smoker; C: patient suffers from lung cancer; O: tuberculosis or lung cancer is detected; B: patient suffers from bronchitis; D: dyspnea is present; X: x-ray results are normal. Each node can take two values, true or false.

For polytrees, BN evaluation is computationally straightforward, using a message-passing algorithm [13]. Beliefs can be updated in time linear in the number of nodes by this method; however, most real-world BN models have undirected loops, in which case message-passing is not appropriate. The jointree algorithm [11] is the fastest existing method for exactly evaluating a BN containing undirected loops. This algorithm uses clustering to transform the BN into a tree structure, i.e. one with no loops.
Unfortunately this algorithm has expensive memory requirements and is impractical on large networks. In general, both exact and approximate BN evaluation are NP-hard problems [6, 7]. One approach to reducing the complexity of evaluation is to focus on the most relevant part of the network, rather than using the complete model. Arc weights, as described in this paper, allow us to identify the most relevant nodes, given a specific query.
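To make the running example concrete, the Fig. 1 network can be encoded as parent lists plus CPD tables. This is a minimal sketch of our own (the dict-based encoding and the row ordering convention, parent state combinations with true first, are assumptions, not from the paper):

```python
# Asia network of Fig. 1: each CPD row corresponds to one parent state
# combination (ordered true-first); columns are (p(true), p(false)).
parents = {
    "A": [], "S": [], "T": ["A"], "C": ["S"], "B": ["S"],
    "O": ["T", "C"], "X": ["O"], "D": ["O", "B"],
}
cpds = {
    "A": [[0.01, 0.99]],
    "S": [[0.5, 0.5]],
    "T": [[0.05, 0.95], [0.01, 0.99]],
    "C": [[0.1, 0.9], [0.01, 0.99]],
    "B": [[0.6, 0.4], [0.3, 0.7]],
    "O": [[0.99, 0.01], [0.99, 0.01], [0.99, 0.01], [0.01, 0.99]],
    "X": [[0.98, 0.02], [0.05, 0.95]],
    "D": [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2], [0.1, 0.9]],
}

# Sanity check: one CPD row per parent state combination, each a
# valid probability distribution.
for name, cpd in cpds.items():
    assert len(cpd) == 2 ** len(parents[name])
    assert all(abs(sum(row) - 1.0) < 1e-9 for row in cpd)
print("network OK")
```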

3 Arc Weight Based on MI

Mutual Information (MI) [15, 13] is a measure of the dependence between two random variables. It is the reduction in uncertainty of X due to knowing Y, and vice-versa. The MI between two variables X and Y is given by:

    I(X;Y) = \sum_{x,y} p(X,Y) \log \frac{p(X,Y)}{p(X)\,p(Y)}    (1)

Since p(X,Y) = p(X)p(Y|X), this can be written as:

    I(X;Y) = \sum_{x} p(X) \sum_{y} p(Y \mid X) \log \frac{p(Y \mid X)}{p(Y)}    (2)

MI is symmetric, i.e. I(X;Y) = I(Y;X). It is a non-negative quantity and is zero if and only if X and Y are mutually independent.

[Fig. 1. Example BN and CPDs for medical diagnosis. Structure: A -> T, S -> C, S -> B, (T,C) -> O, O -> X, (O,B) -> D. Each distribution lists (p(true), p(false)); conditional tables give one row per parent state combination, true states first:
p(A) = (.01, .99)        p(S) = (.5, .5)
p(T|A) = ((.05, .95), (.01, .99))
p(C|S) = ((.1, .9), (.01, .99))
p(B|S) = ((.6, .4), (.3, .7))
p(X|O) = ((.98, .02), (.05, .95))
p(O|T,C) = ((.99, .01), (.99, .01), (.99, .01), (.01, .99))
p(D|O,B) = ((.9, .1), (.7, .3), (.8, .2), (.1, .9))]

3.1 Single Parent

Given a node Y with single parent X in a BN, the mutual information between X and Y describes the influence of X on Y and vice-versa. The arc weight of X -> Y is computed as the mutual information between X and Y:

    w(X,Y) = \sum_{i \in \Omega(X)} p_{pr}(X=i) \sum_{j \in \Omega(Y)} p(Y=j \mid X=i) \log \frac{p(Y=j \mid X=i)}{p_{pr}(Y=j)}    (3)

where \Omega(X) denotes the state space of X and p_{pr}(X=i) denotes the prior probability of X being in state i. If X is a root node, its priors are stored in the BN. Otherwise p_{pr}(X=i) is approximated by averaging the conditional probabilities of X over all parent state combinations. Similarly, because p_{pr}(Y=j) is not directly available in the BN, it is approximated by averaging the conditional probabilities of Y over all states of X. In the network of Fig. 1, the averaged probability distribution of T is p_{pr}(T) = (.03, .97). Using Eq. 3, w(A,T) = w(T,A) = 0.009.
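As a numerical check of Eq. 3, the following sketch (our own code; the function names are ours, and natural logarithms are assumed since they reproduce the paper's figures) computes w(A,T) for the Fig. 1 network:

```python
import math

def avg_prior(cpd):
    """Approximate a node's prior by averaging its conditional
    distribution over all parent state combinations."""
    n = len(cpd[0])
    return [sum(row[j] for row in cpd) / len(cpd) for j in range(n)]

def arc_weight(p_x, cpd_y_given_x, p_y):
    """Eq. 3: MI-based weight of arc X -> Y."""
    w = 0.0
    for i, pi in enumerate(p_x):
        for j, pj in enumerate(p_y):
            p_cond = cpd_y_given_x[i][j]
            if p_cond > 0:
                w += pi * p_cond * math.log(p_cond / pj)
    return w

# Fig. 1 fragment: A -> T
p_A = [0.01, 0.99]
cpd_T = [[0.05, 0.95], [0.01, 0.99]]  # rows: A = true, A = false
p_T = avg_prior(cpd_T)                # (.03, .97), as in the text
print(round(arc_weight(p_A, cpd_T, p_T), 3))  # -> 0.009
```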

3.2 Multiple Parents

Given a node Y with parent X and a set of other parents \mathbf{Z} = \{Z_0, \ldots, Z_n\}, we define the weight of arc X -> Y as:

    w(X,Y) = \sum_{i \in \Omega(X)} p_{pr}(X=i) \sum_{k \in \Omega(\mathbf{Z})} p_{pr}(\mathbf{Z}=k) \sum_{j \in \Omega(Y)} p(Y=j \mid X=i, \mathbf{Z}=k) \log \frac{p(Y=j \mid X=i, \mathbf{Z}=k)}{p_{pr}(Y=j \mid \mathbf{Z}=k)}    (4)

p_{pr}(X=i) is obtained as for Eq. 3. For each state of Y, p_{pr}(Y=j \mid \mathbf{Z}=k) is approximated by averaging the conditional probabilities of Y over all state combinations of the parents in \mathbf{Z}. To estimate p_{pr}(\mathbf{Z}=k), we first calculate p_{pr}(Z) for every Z \in \mathbf{Z}, then multiply out the joint probabilities for each combination of states, i.e. p_{pr}(Z_0=l_0, \ldots, Z_n=l_n) = \prod_{m=0 \ldots n} p_{pr}(Z_m=l_m). Using Eq. 4, w(T,O) = 0.602 in the network of Fig. 1.
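Eq. 4 can be checked numerically in the same way. This sketch (our own code; natural logarithms and the true-first CPD row ordering of Fig. 1 are assumptions) reproduces w(T,O):

```python
import math

# Fig. 1 fragment: O has parents T and C; CPD rows ordered by (T,C) =
# (t,t), (t,f), (f,t), (f,f); columns are (O=true, O=false).
cpd_O = [[0.99, 0.01], [0.99, 0.01], [0.99, 0.01], [0.01, 0.99]]
p_T = [0.03, 0.97]    # averaged prior of T (Sect. 3.1)
p_C = [0.055, 0.945]  # averaged prior of C from p(C|S)

# p_pr(O | C=k): average the CPD over the other parent T
p_O_given_C = [[(cpd_O[k][j] + cpd_O[2 + k][j]) / 2 for j in range(2)]
               for k in range(2)]

# Eq. 4: weight of arc T -> O with co-parent C
w = 0.0
for i in range(2):          # states of T
    for k in range(2):      # states of C
        for j in range(2):  # states of O
            p = cpd_O[2 * i + k][j]
            w += p_T[i] * p_C[k] * p * math.log(p / p_O_given_C[k][j])
print(round(w, 3))  # -> 0.602
```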

4 Weight of a Connected Region

The arc weight measure based on MI described in Sect. 3, like the Bhattacharyya arc weight wBh, is local; that is, it defines the mutual influence between neighbours. Sometimes we want to know the informational content of a set of nodes forming a connected region. This section deals with how arc weights are combined to give a total weight for a BN region. We consider regions consisting of converging arcs, chains, subtrees and undirected loops. The method for combining arc weights is based on the additive property of MI. Given the node configuration X -> Y -> Z, the joint probability distribution p(X,Y,Z) is obtained by applying the Chain Rule:

    p(X,Y,Z) = p(Z \mid Y)\,p(Y \mid X)\,p(X)    (5)

If the arcs are ignored, then the three nodes would be considered independent and the joint distribution is:

    p'(X,Y,Z) = p(X)\,p(Y)\,p(Z)    (6)

The informational content of the three variables is the cross-entropy between Eq. 5 and Eq. 6. It is straightforward to show that

    I(X,Y,Z) = \sum_{x,y,z} p(X,Y,Z) \log \frac{p(X,Y,Z)}{p'(X,Y,Z)} = I(X,Y) + I(Y,Z)    (7)

This means that the informational content of this group of nodes is simply the sum of the MI between the pairs of neighbours. Because MI is symmetric, the same result holds if the chain arcs are reversed, i.e. X <- Y <- Z.

Notation: We use W(.) to denote the combined weight of all arcs in a connected region. For example, if the region R consists of the set of nodes {X, Y, Z}, then W(R) = W(XYZ).
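The additivity in Eq. 7 follows by substituting the chain-rule factorisation (Eq. 5) into the cross-entropy; a short derivation (our own expansion of the step the text calls straightforward):

```latex
\begin{aligned}
I(X,Y,Z) &= \sum_{x,y,z} p(x,y,z)\,\log\frac{p(z\mid y)\,p(y\mid x)\,p(x)}{p(x)\,p(y)\,p(z)} \\
         &= \sum_{x,y,z} p(x,y,z)\,\log\frac{p(y\mid x)}{p(y)}
          + \sum_{x,y,z} p(x,y,z)\,\log\frac{p(z\mid y)}{p(z)} \\
         &= \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
          + \sum_{y,z} p(y,z)\,\log\frac{p(y,z)}{p(y)\,p(z)}
          = I(X,Y) + I(Y,Z)
\end{aligned}
```

using p(y|x)/p(y) = p(x,y)/(p(x)p(y)), p(z|y)/p(z) = p(y,z)/(p(y)p(z)) (the Markov property of the chain X -> Y -> Z), and marginalising out the variable absent from each log term.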

4.1 Weight of Converging Arcs

This section addresses the problem of combining arc weights for a node with multiple converging arcs, i.e. one with many parents. For such a node, we want to estimate the combined effect of all parents. Consider a node Y with parents X and Z: X -> Y <- Z. In this configuration, the weight of the 3-node region is the combined weight of the converging arcs to child node Y from its parents X and Z:

    W(XYZ) = \sum_{i \in \Omega(X)} p_{pr}(X=i) \sum_{k \in \Omega(Z)} p_{pr}(Z=k) \sum_{j \in \Omega(Y)} p(Y=j \mid X=i, Z=k) \log \frac{p(Y=j \mid X=i, Z=k)}{p_{pr}(Y=j)}    (8)

p_{pr}(X), p_{pr}(Z) and p_{pr}(Y) are obtained using the averaging procedure explained in Sect. 3.1. In general, given node Y with a set of parents \mathbf{Z}, the formula is:

    W(\mathbf{Z}Y) = \sum_{k \in \Omega(\mathbf{Z})} p_{pr}(\mathbf{Z}=k) \sum_{j \in \Omega(Y)} p(Y=j \mid \mathbf{Z}=k) \log \frac{p(Y=j \mid \mathbf{Z}=k)}{p_{pr}(Y=j)}    (9)

Note that the combined weight of the converging arcs is bounded by the sum of the individual arc weights as calculated previously by Eq. 4:

    W(Z_0 \ldots Z_n Y) \le w(Z_0, Y) + \ldots + w(Z_n, Y)    (10)

In Fig. 1, the weight of the region consisting of nodes T, O and C is W(TOC) = 1.212.
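The region weight W(TOC) can be verified directly from Eq. 8. A sketch under the same assumptions as before (our own code, natural logarithms, true-first CPD row ordering):

```python
import math

# Fig. 1: O has parents T and C; row order (T,C) = tt, tf, ft, ff.
cpd_O = [[0.99, 0.01], [0.99, 0.01], [0.99, 0.01], [0.01, 0.99]]
p_T = [0.03, 0.97]    # averaged priors (Sect. 3.1)
p_C = [0.055, 0.945]

# p_pr(O): average the CPD over all parent state combinations
p_O = [sum(row[j] for row in cpd_O) / 4 for j in range(2)]  # (.745, .255)

# Eq. 8: combined weight of the converging arcs T -> O <- C
W = sum(p_T[i] * p_C[k] * cpd_O[2 * i + k][j]
        * math.log(cpd_O[2 * i + k][j] / p_O[j])
        for i in range(2) for k in range(2) for j in range(2))
print(round(W, 3))  # -> 1.212
```

Note that 1.212 is indeed below the Eq. 10 bound w(T,O) + w(C,O) = .602 + .618 = 1.220.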

4.2 Weight of a Chain

Since mutual information can be added over a chain of nodes, we can add arc weights over chains. This section presents formulas for the combined weight of arcs along chains of nodes.

Simple Directed Chain and Chain with a Diverging Node. For chains of length n, Z_0 -> ... -> Z_n and Z_0 <- ... <- Z_m -> ... -> Z_n, the weight of the chain from Z_0 to Z_n is:

    W(Z_0 \ldots Z_n) = w(Z_0, Z_1) + \ldots + w(Z_{n-1}, Z_n)    (11)

Summing over all arc weights on the chain between Z_0 and Z_n gives a measure of the combined effect of the intervening nodes. In Fig. 1, the weight of the simple directed chain S -> C -> O -> X is W(SCOX) = .022 + .618 + .555 = 1.195. The weight of the chain with a diverging node, O <- C <- S -> B, is W(OCSB) = .618 + .022 + .046 = .686.

Head-to-head Node on a Chain. Consider the chain containing a head-to-head node: Z_0 -> ... -> Z_{m-1} -> Z_m <- Z_{m+1} <- ... <- Z_n. If all parents of the head-to-head node Z_m are on the chain, then Eq. 8 for converging arcs is used to calculate W(Z_{m-1} Z_m Z_{m+1}). The combined weight of the chain is then:

    W(Z_0 \ldots Z_n) = W(Z_0 \ldots Z_{m-1}) + W(Z_{m-1} Z_m Z_{m+1}) + W(Z_{m+1} \ldots Z_n)    (12)

For example, consider in Fig. 1 the chain A -> T -> O <- C <- S. O is a head-to-head node and all its parents are on the chain. The chain weight is W(ATOCS) = .009 + 1.212 + .022 = 1.243. However, if Z_m has parents which are not on the chain, then Eq. 4 should be used to calculate separately the weights of the arcs into Z_m. Therefore,

    W(Z_0 \ldots Z_n) = W(Z_0 \ldots Z_{m-1}) + w(Z_{m-1}, Z_m) + w(Z_m, Z_{m+1}) + W(Z_{m+1} \ldots Z_n)    (13)
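The individual arc weights quoted for the chains above can be reproduced with Eq. 3. This sketch (our own code; natural logarithms assumed, and w(C,O) = .618 taken from the Eq. 4 value quoted in the text rather than recomputed) checks w(S,C), w(S,B), w(O,X) and the chain weight W(SCOX):

```python
import math

def avg(cpd):
    """Average a CPD over parent state combinations (Sect. 3.1)."""
    return [sum(r[j] for r in cpd) / len(cpd) for j in range(len(cpd[0]))]

def w(p_parent, cpd):
    """Eq. 3: MI-based arc weight for a single-parent node."""
    p_child = avg(cpd)
    return sum(pi * row[j] * math.log(row[j] / p_child[j])
               for pi, row in zip(p_parent, cpd) for j in range(len(row)))

p_S = [0.5, 0.5]
w_SC = w(p_S, [[0.1, 0.9], [0.01, 0.99]])    # arc S -> C
w_SB = w(p_S, [[0.6, 0.4], [0.3, 0.7]])      # arc S -> B
p_O = [0.745, 0.255]                          # averaged prior of O
w_OX = w(p_O, [[0.98, 0.02], [0.05, 0.95]])  # arc O -> X
print(round(w_SC, 3), round(w_SB, 3), round(w_OX, 3))  # -> 0.022 0.046 0.555

# Eq. 11 for the chain S -> C -> O -> X, with w(C,O) = .618 from the text:
W_SCOX = round(w_SC, 3) + 0.618 + round(w_OX, 3)
print(round(W_SCOX, 3))  # -> 1.195
```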

4.3 Weight of a Subtree

[Fig. 2. (a) Subtree S in a BN, with arcs A -> C, B -> D, (C,D) -> E and E -> F. (b) Clusters in the jointree of S and the associated arc weights: AB -> CD with weight w(AB,CD), CD -> E with weight w(CD,E), E -> F with weight w(E,F).]

Consider a subtree S of a BN as shown in Fig. 2(a). To calculate the combined weight of the subtree, we transform the subtree into a jointree of clusters as shown in Fig. 2(b); this jointree has a chain structure. Applying Eq. 11 to the jointree, the subtree weight is:

    W(S) = w(AB, CD) + w(CD, E) + w(E, F)    (14)

where w(CD,E) is calculated as the combined weight W(CDE) of the converging arcs into E, as shown in Eq. 8, and w(E,F) is computed as in Eq. 3. The weight of arc AB -> CD can be regarded as the cross-entropy between the joint distributions p(A,B,C,D) = p(A)p(C|A)p(B)p(D|B) and p'(A,B,C,D) = p(A)p(C)p(B)p(D). Therefore,

    w(AB, CD) = \sum_{a,b,c,d} p(A)\,p(C \mid A)\,p(B)\,p(D \mid B) \log \frac{p(C \mid A)\,p(D \mid B)}{p(C)\,p(D)}    (15)

which simplifies to w(AB,CD) = w(A,C) + w(B,D). The method used to calculate the subtree weight of this simple example can be generalised in a straightforward manner to any subtree of a BN. In Fig. 1, the weight of the subtree given by the nodes A, T, O, C, S and X is W(ATOCSX) = w(A,T) + w(S,C) + W(TOC) + w(O,X) = .009 + .022 + 1.212 + .555 = 1.798.
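The simplification of Eq. 15 is immediate once the logarithm is split; a brief derivation (our own expansion):

```latex
\begin{aligned}
w(AB,CD) &= \sum_{a,b,c,d} p(a)\,p(c\mid a)\,p(b)\,p(d\mid b)
            \left[\log\frac{p(c\mid a)}{p(c)} + \log\frac{p(d\mid b)}{p(d)}\right] \\
         &= \sum_{a,c} p(a)\,p(c\mid a)\,\log\frac{p(c\mid a)}{p(c)}
          + \sum_{b,d} p(b)\,p(d\mid b)\,\log\frac{p(d\mid b)}{p(d)}
          = w(A,C) + w(B,D)
\end{aligned}
```

since \sum_{b,d} p(b)\,p(d\mid b) = 1 factors out of the first term, and \sum_{a,c} p(a)\,p(c\mid a) = 1 out of the second.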

4.4 Weight of a Loop

In this section we derive a formula for the combined weight of arcs in a BN region consisting of an undirected loop. A simple example of such a loop is presented, then we propose a general formula for arbitrary loops. Fig. 3(a) shows a simple example of a group of nodes that form an undirected loop L1 in a BN.

[Fig. 3. (a) Loop L1 in a BN: A -> B, A -> C, (B,C) -> D. (b) General loop L in a BN: V -> X_0, ..., V -> X_n, where each X_i ... Y_i is a chain and (Y_0, ..., Y_n) -> Z.]

The weight of loop L1 is the cross-entropy between the joint distributions p(A,B,C,D) and p'(A,B,C,D):

    W(L1) = \sum_{a,b,c,d} p(A,B,C,D) \log \frac{p(A,B,C,D)}{p'(A,B,C,D)}
          = \sum_{a,b,c,d} p(A)\,p(B \mid A)\,p(C \mid A)\,p(D \mid B,C) \log \frac{p(A)\,p(B \mid A)\,p(C \mid A)\,p(D \mid B,C)}{p(A)\,p(B)\,p(C)\,p(D)}
          = \sum_{a,b} p(A)\,p(B \mid A) \log \frac{p(B \mid A)}{p(B)}
          + \sum_{a,c} p(A)\,p(C \mid A) \log \frac{p(C \mid A)}{p(C)}
          + \sum_{b,c,d} p(B)\,p(C)\,p(D \mid B,C) \log \frac{p(D \mid B,C)}{p(D)}
          = w(A,B) + w(A,C) + W(BC,D)    (16)

In the simplest general case of a loop L as shown in Fig. 3(b), where X_i ... Y_i is a chain, the weight can be calculated as follows. The combined weight of the arcs diverging from V is:

    W(V, X_0 \ldots X_n) = \sum_{v, x_0, \ldots, x_n} p(V)\,p(X_0 \mid V) \cdots p(X_n \mid V) \log \frac{p(X_0 \mid V) \cdots p(X_n \mid V)}{p(X_0) \cdots p(X_n)} = w(V, X_0) + \cdots + w(V, X_n)    (17)

Each chain weight W(X_i, Y_i) can be calculated as shown in Sect. 4.2. Finally, the combined weight W(Y_0 \ldots Y_n, Z) of the converging arcs into Z is found using Eq. 8. Therefore, the loop weight is:

    W(L) = w(V, X_0) + \cdots + w(V, X_n) + W(X_0, Y_0) + \cdots + W(X_n, Y_n) + W(Y_0 \ldots Y_n, Z)    (18)

In Fig. 1, the weight of the loop given by the nodes S, C, O, D and B is W(SCODB) = w(S,C) + w(S,B) + w(C,O) + W(OBD) = .022 + .046 + .618 + .164 = .85.
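The loop example can be checked numerically: W(OBD) follows from Eq. 8 and the rest of Eq. 18 sums arc weights quoted earlier in the text. A sketch under the usual assumptions (our own code, natural logarithms, true-first CPD row ordering):

```python
import math

# Fig. 1: D has parents O and B; row order (O,B) = tt, tf, ft, ff.
cpd_D = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2], [0.1, 0.9]]
p_O, p_B = [0.745, 0.255], [0.45, 0.55]  # averaged priors
p_D = [sum(r[j] for r in cpd_D) / 4 for j in range(2)]  # (.625, .375)

# Eq. 8: W(OBD), the converging-arc weight into D
W_OBD = sum(p_O[i] * p_B[k] * cpd_D[2 * i + k][j]
            * math.log(cpd_D[2 * i + k][j] / p_D[j])
            for i in range(2) for k in range(2) for j in range(2))
print(round(W_OBD, 3))  # -> 0.164

# Eq. 18: loop weight W(SCODB), with w(S,C), w(S,B), w(C,O) from the text
W_loop = 0.022 + 0.046 + 0.618 + W_OBD
print(round(W_loop, 2))  # -> 0.85
```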

5 Path Weight Relative to a Query Node

In BN evaluation, it is sometimes desirable to assess the relative impact of selected nodes on the posterior belief of a query node Q. When performing approximate or anytime evaluation, we should select the most informative nodes to include in the computation. This section introduces the concept of the path weight of a node X, relative to a query node Q, written W_Q(X). It is an estimate of the impact of X on Q, where the two nodes are not necessarily neighbours. Exact computation of path weight is an NP-hard problem. When X and Q are widely separated and possibly connected via multiple paths, assessing the exact effect of X on Q is equivalent to evaluating the BN, with all possible values of X entered as evidence. Such an exhaustive procedure is impossible in practice. It is preferable to use a heuristic measure that can be evaluated quickly.

Path Weight of a Node relative to Q. Consider the node configuration X -> Y -> Q. Boerlage's [3] bound on the connection strength (CS) of such a Markov chain is given by the inequality: \tanh(\tfrac{1}{4} CS(X,Q))