International Journal of Approximate Reasoning 54 (2013) 1000–1012


Bayesian knowledge base tuning

Eugene Santos Jr. a, Qi Gu a,∗, Eunice E. Santos b

a Thayer School of Engineering, Dartmouth College, Hanover, NH 03755, USA
b Institute of Defense and Security, University of Texas at El Paso, El Paso, TX 79968, USA

ARTICLE INFO

Article history: Available online 16 April 2013

ABSTRACT

For a knowledge-based system that fails to provide the correct answer, it is important to be able to tune the system while minimizing overall change in the knowledge-base. There are a variety of reasons why the answer may be incorrect, ranging from incorrect knowledge to information vagueness to incompleteness. Still, in all these situations, most of the knowledge in the system is likely to be correct as specified by the expert(s) and/or knowledge engineer(s). In this paper, we propose a method to identify possible changes by understanding the contribution of parameters to the outputs of concern. Our approach is based on Bayesian Knowledge Bases for modeling uncertainties. We start with single-parameter changes and then extend to multiple parameters. In order to identify the optimal solution that minimizes the change to the model as specified by the domain experts, we define and evaluate the sensitivity values of the results with respect to the parameters. We discuss the computational complexity of determining the solution and show that the problem of multiple-parameter changes can be transformed into Linear Programming problems and is thus efficiently solvable. Our work can also be applied towards validating the knowledge base such that the updated model satisfies all test-cases collected from the domain experts. © 2013 Elsevier Inc. All rights reserved.

Keywords: Bayesian knowledge-base; Probability; Tuning

1. Introduction

How do you tune your probabilistic knowledge base? Let us say that your favorite probabilistic diagnostic system is not generating the answer that you or the expert expected. First, we would need to ask "why" we are getting that result (i.e., "why is this the most probable answer?"). The "why" could come in many forms, ranging from determining which events have contributed most to the probabilistic mass to determining which rules or probabilistic relationships (e.g., direct conditional dependencies) were used and/or how frequently. In essence, this provides an explanation of the answer, which can help in determining the cause of the discrepancy. Second, what would we need to change in the probabilistic knowledge-base in order to correct the problem? If we believe that most of our knowledge is correct, should we modify as "little" as possible? Of course, what is "little"? Depending on your probabilistic representation, even a small change can have semantic rippling effects throughout the knowledge-base. Related to this is the effect of indirect relationships. Even though we may determine the primary contributors in the "why" above, the correction may involve a change to a single indirect relationship potentially far-removed from the contributor.

Bayesian Networks (BNs) [1] have proven to be an effective knowledge representation and inference engine in artificial intelligence and expert systems research [2,3]. However, the quantification of a BN, which often requires the specification of a huge number of conditional probabilities, is a daunting task. In reality, when the knowledge available for assessing certain conditional probabilities is vague or incomplete, such a requirement turns out to be problematic. Moreover, the underlying causal relationship may even be cyclic, which is not permitted in a BN.

∗ Corresponding author. E-mail address: [email protected] (Q. Gu).
http://dx.doi.org/10.1016/j.ijar.2013.04.005

To avoid these limitations, we represent our knowledge using Bayesian Knowledge-Bases (BKBs) [4,5]. BKBs are a rule-based probabilistic model that represents possible world states and their (causal) relationships using a directed graph. BKBs are an alternative to BNs: they specify dependence at the instantiation/outcome level (versus BNs, which specify it only at the random-variable level); they allow for cycles between variables; and they loosen the requirements on specifying probability distributions (in particular, allowing for incompleteness). Apart from their capability to represent uncertain and incomplete information, BKBs also support a mathematically sound reasoning system that calculates likely outcomes from given evidence.

To better understand the system and explain why it is generating a conflict with regard to users' expectations, we look into the contributing factors leading to the conflict and show how changing them can assist in correcting the system. On the other hand, identifying just the primary contributing factors may not be enough. Thus, our second goal is to tune our knowledge system while minimizing the changes made. This is crucial to modelers who want to maintain system credibility by protecting certain knowledge from being changed. Previous work [6] proposed an algorithm for modifying a BKB such that the test-cases entered by the user can be satisfied. However, to validate a query, the given BKB often ends up with a large portion of its rules/parameters being changed. In contrast, we start by changing only one rule (probability). Single-rule change can never be achieved in BNs, because once a rule is changed, all the other rules in the same conditional probability table (CPT) column must be changed accordingly so that the probabilities still sum to 1. We will show that the properties of BKBs allow for such singular changes in different situations.
Furthermore, we present a procedure to identify the candidate changes by using the fact that any joint probability can be expressed as a linear function of a rule, where the coefficients can be efficiently calculated using importance sampling instead of reasoning across the entire BKB. Single-rule change is good for maintaining the original probability distribution specified by the experts, but it still may not guarantee a solution in every situation. We discuss such situations in terms of rule contribution and provide an intuitive explanation of when and why a single-rule change is (in-)sufficient. In the second half of this paper, we extend our work to allow multiple rules (probabilities) to change; in particular, a set of rules that is analogous to a conditional probability table (CPT) in a BN but still allows for cyclic relationships between random variables. Multiple-parameter changes can be more intuitive and meaningful, since the rules involved in the same conditional statement X → Y are always correlated in practice. Another contribution of this work is that we show how to formulate the problem of multiple rule changes while maintaining the same computational cost as single rule changes. In fact, it can be solved efficiently through Linear Programming. Though the rules with direct contributions are more likely to be critical to the results, the optimal solution may contain rules whose influence is underestimated due to their low weights but which can cause a "butterfly effect" as the weights change. In other words, not every parameter requires the same level of accuracy to arrive at satisfactory behavior of the network; some probabilities have more impact on the network's output than others. This observation motivates us to incorporate sensitivity analysis so that we can gain detailed insight into how changes to a set of rules can help correct the system and guide further knowledge elicitation efforts [7,8].
Our underlying philosophy is that we want to change the rules that are most significant to the query, such that the amount of change can be minimized or controlled to some extent (as specified by the users if so desired). Sensitivity analysis of probabilistic networks has been applied to various problems, such as selective parameter updating [9], prior knowledge analysis [10] and imprecise probability study [11–13]. In related work,¹ Chan and Darwiche (2004) [2] applied sensitivity analysis in a BN-based knowledge system to identify possible parameter values such that the system output satisfies a given constraint, e.g., being greater than a predefined value. However, the restriction of forcing the probabilities of all possible states to sum to 1 may cause semantic conflicts in complex scenarios where knowledge pieces are collected incrementally over time. For example, at each time step, to bear a certain level of incompleteness, they must have a complementary state to represent all other unknown possibilities. Then, if a new state is discovered later on, the semantics of such complementary states become inconsistent with all the previous models. Our tuning approach, in contrast, leverages the flexibility of BKBs by allowing single-rule changes, and is thus better suited to such complex scenarios. Moreover, there is no need for proportional scaling when a variable is multi-valued. In the next sections, we start with a detailed description of BKBs, followed by our tuning process from single to multiple parameter changes. The applications of our results to knowledge fusion and validation are introduced in the last part.

2. Bayesian knowledge-base

BKBs subsume BNs. Instead of specifying the causal structure using conditional probability tables (CPTs) as in BNs, a BKB collects conditional probability rules (CPRs) in an "if–then" style. Fig. 1 shows the graph structure of a BKB fragment, in which A is a random variable with possible instantiations (or "states" in BN terminology) {a1, a2}. Each instantiation of a random variable is represented by an I-node, or "instantiation node", e.g., A = a1, and the rule specifying the conditional probability of an I-node is encoded in an S-node, or "support node", with a certain weight. For example, q7 corresponds to a CPR which can be interpreted as: if A = a2 and C = c1, then B = b2 with probability 0.5.

¹ For problems that either have cycles at the variable level or an incomplete distribution, BN sensitivity analysis is not applicable.


Fig. 1. Example of a BKB fragment.

One benefit of this approach is that it allows independence to be specified at the instantiation level instead of the random-variable level. Also, it does not require the full table representation of CPTs and allows for reasoning even with incompleteness. As the BKB in Fig. 1 shows, the dependency relationship at the variable level implies that variable B depends on both A and C. However, given the evidence A = a1, B becomes independent of C. This can happen in the real world when the role of a critical variable dominates some local dependency relationships between variables. Other work, such as context-specific independence (CSI) [14], has also studied the fact that certain direct dependencies among variables can become irrelevant under some variable assignments. In a BN, in order to represent the probability distribution of B dependent on A and C, all CPT entries of P(B|A, C) must be filled in, which can grow exponentially when the number of parent nodes is large. BKBs, in contrast, only capture the knowledge that is available and do not require a complete distribution specification. The other feature of BKBs is that they also allow cyclic relationships among random variables. Imagine if the direction of some causal mechanism also depends on specific states of the variables. Santos et al. [15] give an example of BKB modeling of a political election, in which the type of "race" may flip the causal direction between the belief of a piece of evidence and the voting action. The formal definition of the graphical representation of a BKB from Santos and Dinh [6] is given below:

Definition 1. A correlation-graph over a set I of I-nodes is a directed graph G = (I ∪ S, E), in which S is a set of S-nodes disjoint from I, E ⊂ {I × S} ∪ {S × I}, and ∀q ∈ S, there exists a unique α ∈ I such that (q, α) ∈ E. If there is a link from q ∈ S to α ∈ I, we say that q supports α.

For each S-node q in a correlation graph G, we denote by PredG(q) the set of I-nodes pointing to q, i.e., PredG(q) = {α ∈ I | α → q ∈ E}, and by DescG(q) the I-node supported by q, i.e., DescG(q) is the I-node α where q → α ∈ E.

Definition 2 (Santos et al. [4]). Two I-nodes α1 and α2 are said to be mutually exclusive if they are instantiations R = v1 and R = v2 of the same random variable R with v1 ≠ v2.

Similarly, two sets of I-nodes I1 and I2 are mutually exclusive if there exist two I-nodes α1 ∈ I1 and α2 ∈ I2 such that α1 and α2 are mutually exclusive. For example, the sets of I-nodes {A = a1, B = b2} and {A = a2, B = b2, C = c1} are mutually exclusive. A state is a set of I-nodes that contains no more than one instantiation of each random variable.

Definition 3. A set of S-nodes R is said to be complementary if for all q1, q2 ∈ R, DescG(q1) and DescG(q2) are mutually exclusive, but PredG(q1) and PredG(q2) are not mutually exclusive. Variable v is said to be the consequent variable of R if, for any S-node q ∈ R, DescG(q) is an instantiation of v.

We denote by ρv the set that contains all possible complementary sets of S-nodes w.r.t. variable v, such that for any complementary set r ∈ ρv, v is the consequent variable of r. We also introduce Ωv to denote the subset of ρv that removes all complementary sets in ρv that are a subset of some other complementary set, i.e., Ωv = {r | r ∈ ρv ∧ ∄r′ ∈ ρv, r ⊊ r′}. For example, in Fig. 1, {q5, q6} is a complementary set w.r.t. variable B, and ΩB = {{q5, q6, q12}, {q7}, {q8}}. Note that for a certain variable v in a Bayesian Network, the size of Ωv grows exponentially with the number of its parent variables. However, since BKBs only capture context-specific dependence rules, the size of Ωv can be considerably smaller in real-world cases.

Definition 4. A set of S-nodes B is said to be a causal rule set (CRS) for the random variable v if B contains all S-nodes pointing to the instantiations of v. As a note, each S-node belongs to only one CRS. For each complementary set r ∈ Ωv, r is a subset of v's CRS. For example, the CRS for variable B in Fig. 1 is {q5, q6, q7, q8, q12}.
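The mutual-exclusivity and complementarity checks above are mechanical to implement. A minimal Python sketch follows; q7 mirrors the rule quoted for Fig. 1, while q6, q8, and all weights are hypothetical additions for illustration:

```python
from itertools import combinations

# Bookkeeping for Definitions 2 and 3: I-nodes as (variable, value) pairs,
# S-nodes as (antecedent I-nodes, supported I-node, weight). q7 follows the
# rule quoted for Fig. 1; q6 and q8 are hypothetical.
SNODES = {
    "q6": (frozenset({("A", "a1")}), ("B", "b1"), 0.3),
    "q7": (frozenset({("A", "a2"), ("C", "c1")}), ("B", "b2"), 0.5),
    "q8": (frozenset({("A", "a2")}), ("B", "b1"), 0.4),
}

def mutually_exclusive(i_nodes1, i_nodes2):
    # Definition 2: some shared variable is instantiated to different values.
    vals1 = dict(i_nodes1)
    return any(var in vals1 and vals1[var] != val for var, val in i_nodes2)

def complementary(names):
    # Definition 3: consequents pairwise mutually exclusive,
    # antecedents pairwise NOT mutually exclusive.
    for n1, n2 in combinations(names, 2):
        p1, d1, _ = SNODES[n1]
        p2, d2, _ = SNODES[n2]
        if not mutually_exclusive({d1}, {d2}) or mutually_exclusive(p1, p2):
            return False
    return True

print(complementary(["q7", "q8"]))  # True: B=b2 vs B=b1, compatible antecedents
print(complementary(["q6", "q7"]))  # False: antecedents A=a1 vs A=a2 conflict
```

Note that {q7, q8} is complementary because their consequents instantiate B differently while their antecedents can hold simultaneously; {q6, q7} is not, since A = a1 and A = a2 cannot co-occur.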

Definition 5. A Bayesian knowledge base (BKB) is a tuple K = (G, w), where G = (I ∪ S, E) is a correlation graph and w : S → [0, 1] is a weight function such that:

1. ∀q ∈ S, PredG(q) contains at most one instantiation of each random variable.
2. For distinct S-nodes q1, q2 ∈ S that support the same I-node, PredG(q1) and PredG(q2) are mutually exclusive.
3. Any complementary set of S-nodes R ⊆ S is normalized: Σq∈R w(q) ≤ 1, where the weight w(q) represents the conditional probability P(DescG(q) | PredG(q)).

The intuition behind these three conditions is that each S-node can only support one I-node; two rules supporting the same I-node cannot be satisfied at the same time; and, to ensure normalization of the probability distribution, every complementary set of S-nodes must be normalized.

Definition 6. The CRS B w.r.t. variable v is called normalized if every complementary set r ∈ ρv is normalized.

Theorem 1. Let B be the CRS of variable v. If every complementary set r ∈ Ωv is normalized, then B is also normalized.

Theorem 1 follows easily from the definition of Ωv, since any complementary set in ρv is a subset of some complementary set in Ωv.

As in BNs, reasoning with a BKB is based on the calculation of joint probabilities over the possible worlds. Here, a world is a subgraph of a BKB including at most one I-node of each random variable and the associated S-nodes (a world is referred to as an inference in [4]). For example, in Fig. 1, the dotted rectangle circles a world. As proved in Santos and Santos [4], the joint probability of a world τ is simply the product of the weights of all S-nodes in τ, i.e., P(τ) = ∏q∈τ w(q). The idea of a world plays an important role in two forms of reasoning with BKBs: belief revision (also called maximum a posteriori or MAP) [4,16] and belief updating. In belief updating, the goal is to calculate the probability of a state of a random variable given some evidence, e.g., P(Y = y | X1 = x1, X2 = x2, . . . , Xn = xn). As shown in Rosen et al. [5], the joint probability of a state θ = {X1 = x1, X2 = x2, . . . , Xn = xn} is the sum of the probabilities of all possible worlds in which θ is true, i.e., P(θ) = Στ∈Iθ P(τ), where Iθ represents the set of worlds containing θ.

The worst-case complexity of reasoning with BKBs is NP-hard. Approximation algorithms, e.g., stochastic sampling methods, have been introduced to make the reasoning process more efficient [5]. However, given that BKBs only capture the knowledge available, exact methods have been shown to be computationally feasible for real-world problems [17–19], at least for models of moderate size. As discussed earlier, calculating the contributing factors is important for explaining the phenomenon. An algorithm to compute the contribution of a node q to the probability of a state θ, denoted cθ(q), was presented in [17]; the contribution is the sum of the probabilities of the worlds including both θ and q.

Definition 7 (Santos et al. [17]). cθ(q) = Στ∈Iθ ∧ q∈τ P(τ).

Here, q is not restricted to S-nodes. If q is an I-node, then the contribution measures how much influence an event can have on a state.
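The product and summation above can be sketched directly. A minimal Python example follows, with a hypothetical set of pre-enumerated worlds and weights (not the actual Fig. 1 numbers):

```python
# A world's probability is the product of its S-node weights, a state's
# probability sums over the worlds containing it, and a contribution
# c_theta(q) sums only the worlds also containing q (Definition 7).
# Weights and worlds below are hypothetical, not the Fig. 1 numbers.
weights = {"q1": 0.3, "q2": 0.7, "q3": 0.6, "q4": 0.4}

# Worlds supporting a state theta, each given as the set of S-nodes it uses;
# in a real BKB these would be enumerated from the correlation graph.
worlds_theta = [{"q1", "q3"}, {"q1", "q4"}]

def world_prob(world):
    # P(tau) = product of the weights of all S-nodes in tau.
    p = 1.0
    for q in world:
        p *= weights[q]
    return p

def state_prob(worlds):
    # P(theta) = sum of P(tau) over all worlds containing theta.
    return sum(world_prob(t) for t in worlds)

def contribution(q, worlds):
    # c_theta(q): mass of the worlds of theta that also contain S-node q.
    return sum(world_prob(t) for t in worlds if q in t)

print(state_prob(worlds_theta))          # 0.3*0.6 + 0.3*0.4 ≈ 0.30
print(contribution("q3", worlds_theta))  # 0.3*0.6 ≈ 0.18
```

With these two worlds, q1 has full contribution (it appears in every world of θ), while q3 contributes only part of the mass; this distinction matters for the tuning results in Section 3.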

3. Tuning a BKB

The main goal of this work is to tune the BKB by understanding the contributing factors leading to an answer that is inconsistent with the expert's judgment. More specifically, three problems will be addressed in this section. First, before we actually tune a set of rules (weights of S-nodes), we want to know how they contribute to the original answer. Second, given a BKB K = (G, w) and some evidence e entered by the user, we need to determine how to identify the necessary changes to a set of S-nodes such that the tuned BKB satisfies P′(v = v1|e) > P′(v = v2|e), where P′ is the distribution after we apply the changes to the rules (as compared to the original distribution P). Third, we want to select a set of S-nodes that are most sensitive to correcting the answer and focus our efforts on refining their weights. In other words, we want to keep the change as small as possible.


We begin by focusing on the goal of changing just a single S-node. Unlike BNs, which require the sum of the parameters in the same CPT column to remain 1.0, BKBs allow for incompleteness, which makes it possible to minimize the change to a single S-node. Let θ1 and θ2 be two states representing {v = v1, e} and {v = v2, e}. Following Bayes' rule, the original result P(v = v1|e) < P(v = v2|e) can be transformed into P(v = v1, e) < P(v = v2, e), or P(θ1) < P(θ2). As defined in the previous section, given an S-node q, we can express P(θ1) in terms of cθ1(q) as:

P(θ1) = Στ∈Iθ1 P(τ) = Στ∈Iθ1 ∧ q∈τ P(τ) + Συ∈Iθ1 ∧ q∉υ P(υ) = cθ1(q) + a.  (1)

Similarly,

P(θ2) = Στ∈Iθ2 P(τ) = cθ2(q) + b,  (2)

where a and b are two constants independent of w(q).

An S-node with a high contribution value can be considered a critical parameter for the probability of a state. In Fig. 1, currently, P(θ1) = P(A = a1, D = d1) = 0.096 and P(θ2) = P(A = a2, D = d1) = 0.106, which indicates that the most probable state for variable A is a2 given evidence D = d1. After calculating the contributions of each S-node, we determine that q9 participates in all possible worlds including θ2, and thus has the largest contribution to P(θ2), i.e., cθ2(q9) = 0.106. Therefore, it seems reasonable to decrease P(θ2) by lowering the weight of q9. However, q9 is also the largest contributor to P(θ1), i.e., cθ1(q9) = 0.096, which means that lowering the weight of q9 will proportionally decrease P(θ1) and P(θ2) at the same time. As a result, q9 is not an appropriate choice to correct the system.

So, which kind of S-node can help correct the system? Since we only adjust the weights of S-nodes without changing the structure of the BKB, the worlds corresponding to a state stay the same. Let q be the S-node that we want to change, and let Δw be the amount of change that we apply to q. In order to achieve P′(θ1) > P′(θ2), from Eqs. (1) and (2), the following condition must hold:

c′θ1(q) + a > c′θ2(q) + b,  (3)

where

c′θi(q) = cθi(q)(1 + Δw/w(q)), i = 1, 2.

In addition, to ensure the validity of the tuned system, all complementary sets of S-nodes that involve q must remain normalized. Let B be the CRS for variable v such that q ∈ B; then DescG(q) is an instantiation of v. From the definition of CRSs, every complementary set R with q ∈ R is a subset of B. Therefore, as long as B is normalized, the updated BKB is still valid. From Theorem 1, we can simplify the normalization conditions as follows:

∀R ∈ Ωv(q), Σq′∈R w(q′) + Δw ≤ 1,

where Ωv(q) collects only the complementary sets in Ωv that contain q. Combining this with inequality (3), the conditions to correct the system for S-node q can be integrated as follows:

∀R ∈ Ωv(q), Σq′∈R w(q′) + Δw ≤ 1,
w(q) + Δw > 0,
(αq − βq)Δw > b − a + cθ2(q) − cθ1(q),  (4)

where

αq = cθ1(q)/w(q) and βq = cθ2(q)/w(q).

Since Δw is the only unknown variable, the solution space can easily be found by solving for the equality condition. The size of Ωv(q) can be larger than 1 due to incompleteness, but it remains small in practice. The time complexity of computing all contributions cθ(q) in terms of q is worst-case NP-hard, the same as belief updating. However, with the aid of the importance sampling method introduced in Rosen et al. [5], we can approximate all coefficients αq and βq efficiently, using the fact that P(θ1) and P(θ2) can be written as linear functions of w(q):

P(θ1) = w(q)αq + a,
P(θ2) = w(q)βq + b.  (5)

Let x1, x2 be two different weights assigned to q. Then the value of αq can easily be determined from:

p¹θ1 = x1 αq + a,
p²θ1 = x2 αq + a,

where p¹θ1 and p²θ1 denote the corresponding probabilities of P(θ1), respectively. The time complexity of computing the probability P(θ1) using the sampling method is O(X²), where X is the number of random variables. βq is computed similarly.

Fig. 2. Example of a BKB with no single S-node solution to make B = b1 the most probable instantiation of variable B given evidence C = c2.

Unfortunately, a single S-node solution does not always exist. We use the BKB in Fig. 2 to describe situations in which we are unable to find a solution to inequalities (4). Assume that our goal is to ensure P(B = b1 | C = c2) > P(B = b2 | C = c2), or P(B = b1, C = c2) > P(B = b2, C = c2), by tuning just one S-node. Intuitively, a single S-node solution does not exist if there is no room to increase P(v = v1, e) due to the normalization constraints:

w(q1) + w(q3) ≤ 1,
w(q2) + w(q4) ≤ 1,
w(q6) + w(q7) ≤ 1.
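Conditions (4) reduce to intersecting interval bounds on Δw. A minimal sketch follows; all numbers are hypothetical, and in practice αq, βq, a, b would come from the importance sampling described above:

```python
# Solving conditions (4) for a single S-node change dw reduces to
# intersecting interval bounds. All numbers are hypothetical; alpha_q,
# beta_q, a, b would come from importance sampling.

def feasible_interval(w_q, alpha, beta, a, b, comp_set_sums):
    """Return interval bounds (lo, hi) on dw, or None if empty.

    comp_set_sums: current weight sum of each complementary set
    containing q (including w_q itself).
    """
    lo, hi = -w_q, float("inf")            # w(q) + dw > 0
    for s in comp_set_sums:                # normalization: s + dw <= 1
        hi = min(hi, 1.0 - s)
    # Flip condition: (alpha_q - beta_q) dw > b - a + c_theta2(q) - c_theta1(q),
    # with the contributions expressed through the coefficients: c = coeff * w(q).
    rhs = (b - a) + (beta - alpha) * w_q
    diff = alpha - beta
    if diff > 0:
        lo = max(lo, rhs / diff)
    elif diff < 0:
        hi = min(hi, rhs / diff)
    elif rhs >= 0:                         # 0 * dw > rhs is unsatisfiable
        return None
    return (lo, hi) if lo < hi else None

# q contributes more to theta2 than theta1, so lowering its weight helps:
print(feasible_interval(w_q=0.5, alpha=0.1, beta=0.4, a=0.2, b=0.0,
                        comp_set_sums=[0.9]))
```

For these numbers the feasible changes are roughly −0.5 < Δw < 0.1; the strict/non-strict distinction between the bounds is glossed over in this sketch.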

Instead, a second method may be to decrease the probability P(v = v2, e), such that {v = v1} becomes the most probable state given evidence e. In the example shown in Fig. 2, we want to select an S-node that contributes to P(B = b2, C = c2) and reduce its weight, e.g., q1, q3, q4, q7, or q8. Table 1 shows the probabilities P(B = b1, C = c2) and P(B = b2, C = c2) after reducing the weight of each of these five S-nodes to 0, respectively. No matter which S-node we pick to reduce to 0, the joint probability P(B = b2, C = c2) is still greater than or equal to P(B = b1, C = c2). In other words, no single S-node contributes enough for its removal to make P(B = b1, C = c2) exceed P(B = b2, C = c2). Note that such situations are not rare in BKB modeling, which makes it more challenging to find a generic tuning algorithm. The underlying reason is that BKBs only capture available semantics and therefore allow for incomplete probability specification. Reconsidering the previous case, the conditional probability P(B = b1 | A = a2) is missing due to the lack of information. However, if we were to use a BN as an alternative model, P(B = b1 | A = a2) would be forced to 0.5 (since P(B = b2 | A = a2) is known to be 0.5). Such a change can make a huge difference to the results, as a single S-node solution suddenly becomes available, i.e., we can lower the probability of S-node q7 by 0.28. Nevertheless, if there exists a hidden state for variable B, e.g., B = b3, then making the strong assumption that P(B = b1 | A = a2) = 0.5 may compromise the soundness of the knowledge-base system.

Table 1. Updated probabilities of states {B = b1, C = c2} and {B = b2, C = c2} with the weight of one S-node set to 0.

w(qi) = 0            q1      q3      q4      q7      q8
P(B = b2, C = c2)    0.252   0.140   0       0.252   0.140
P(B = b1, C = c2)    0       0.076   0       0.076   0.076

Another interesting situation is that for some S-nodes in a BKB we can always find a proper Δw that satisfies inequalities (4); i.e., the existence of a solution is independent of the parameter setting.

Lemma 1. There exists a solution that enables P(v = v1|e) > P(v = v2|e) by changing only S-node q̃ if (1) ∀τ = (I′ ∪ S′, E′) ∈ Iθ2, q̃ ∈ S′, and (2) ∃τ = (I′ ∪ S′, E′) ∈ Iθ1, q̃ ∉ S′.

Proof. From Eqs. (5):

P(θ2) − P(θ1) = w(q̃)βq̃ − w(q̃)αq̃ + b − a.

If q̃ belongs to all of the worlds in Iθ2 but not all of those in Iθ1, then b = 0 and a > 0. So:

b − a + cθ2(q̃) − cθ1(q̃) = P(θ2) − P(θ1) < −w(q̃)(αq̃ − βq̃),

which guarantees a solution −w(q̃) < Δw < 0 to inequalities (4). Actually, Δw can easily be solved from Eqs. (5):

Δw = P(θ1)w(q̃)/P(θ2) − w(q̃).

For example, in Fig. 2, q8 is an S-node that appears in all the worlds in I_{A=a2,D=d1}, but not in all of the worlds in I_{A=a1,D=d1}. Therefore, as long as we decrease w(q8) by a certain amount, e.g., Δw = −0.301, the probability P(A = a1, D = d1) becomes larger than P(A = a2, D = d1). Intuitively, q8 has a full contribution to P(A = a2, D = d1): by removing it from the correlation-graph of the BKB, P(A = a2, D = d1) becomes 0. Lemma 1 provides a quick single S-node solution without having to solve the set of inequalities (4). The idea is to find an S-node that contributes fully to P(θ2) but only partially to P(θ1). □

3.1. Sensitivity of a single S-node

Now we address the third problem: selecting the optimal S-node that can correct the answer with minimal change. Though we may determine the primary contributors as discussed above, an S-node with a small direct contribution to the results may actually assist in correcting the system indirectly. For example, changing its weight may impact another high-contributing S-node, and thus flip the final answer. Therefore, the contribution to the original results alone is insufficient for correcting the system efficiently. To overcome this, we incorporate sensitivity analysis. Sensitivity analysis has been used to evaluate the change in the posterior probability of a target query caused by parameter variation. Laskey [1] proposed to measure sensitivity by using partial derivatives of the probability with respect to the parameter. In our work, in order to achieve P(v = v1|e) − P(v = v2|e) > 0 while minimizing change, we measure the sensitivity with respect to the ratio r(v|e) = P(v = v1|e)/P(v = v2|e) as a function of the S-node weight to be changed. The idea is to make r(v|e) greater than 1.0 with a small |Δw|.

Definition 8. Given a BKB K = (G, w), let x be the weight of S-node q. The sensitivity of r(v|e) to x is the partial derivative:

S(x|v, e) = ∂r(v|e)/∂x.

The ratio r(v|e) can be expressed in terms of x as:

r(v|e) = P(v = v1|e)/P(v = v2|e) = P(θ1)/P(θ2) = (αq x + a)/(βq x + b),

which is a ratio of two linear functions of x. The partial derivative of r(v|e) with respect to x is then:

∂r(v|e)/∂x = (αq b − a βq)/(βq x + b)².  (6)

S-nodes with higher |S(x|v, e)| are more sensitive to the results, and are thus more likely to correct the system with a small change. Note that an S-node with S(x|v, e) < 0 implies a negative Δw, which means that there is no need to check the normalization condition in inequalities (4).
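Eq. (6) is straightforward to evaluate. A small sketch follows, with hypothetical coefficients standing in for sampled values:

```python
# Evaluating Eq. (6): the sensitivity of r(v|e) = (alpha*x + a)/(beta*x + b)
# to a single S-node weight x. Coefficients are hypothetical stand-ins.
def sensitivity(alpha, beta, a, b, x):
    # d/dx [(alpha*x + a) / (beta*x + b)] = (alpha*b - a*beta) / (beta*x + b)**2
    return (alpha * b - a * beta) / (beta * x + b) ** 2

# Negative sensitivity: this S-node mostly supports theta2 (beta > alpha),
# so decreasing its weight increases the ratio r(v|e).
print(sensitivity(alpha=0.1, beta=0.4, a=0.2, b=0.05, x=0.5))
```

For these numbers the sensitivity is negative (about −1.2), so the candidate change Δw is negative and the normalization check in (4) can be skipped, as noted above.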


3.2. Tuning a variable CRS Tuning a single S-node is simple to apply. However, as the example shown in Fig. 3 above demonstrates, there may not exist a single S-node solution. In order to overcome this limitation, we instead consider changing the S-nodes in the CRS of a random variable. There are several advantages of applying single CRS tuning instead of single S-node tuning. First, allowing CRS tuning enlarges the range of its possible changes since the S-nodes in the same complementary set can be changed simultaneously. Therefore it becomes more likely to find a solution to correct the system. Back to the example in Fig. 2, if we apply a positive change of 0.3 to q6 and a negative change of -0.3 to q7 , which still satisfies normalization, then the updated BKB will have the probability of P (v = v1 |e) > P (v = v2 |e). Second, in contrast to single S-node tuning, single CRS tuning may be more intuitive and meaningful. For example, if the chance of “Rain = True” gets higher, it is reasonable to believe that the probability of “Rain = False” should become lower. Furthermore, if we want to test how a sensor influences the outcome in a mechanical system, we may want to change both the false-positive and false-negative rates at the same time. Let B be the CRS set of S-nodes w.r.t. variable v, w (q) be the change applied on S-node q ∈ v respectively. w (q) = 0 if the weight of q stays the same. Considering the fact that two S-nodes q1 and q2 can not coexist in the same world since either PredG (q1 ) and PredG (q2 ) are mutually exclusive or DescG (q1 ) and DescG (q2 ) are mutually exclusive, equalities (1) and (2) in this case will be replaced with: P (θ1 )

=

P (θ2 )

=

 τ ∈θ1



τ ∈θ2

P (τ )

=

P (τ )

=

 q∈B

cθ1 (q) + a,



q∈B

cθ2 (q) + b.

Combining these with the normalization constraints of v, the parameter changes over the CRS that satisfy P(v = v1|e) − P(v = v2|e) > 0 must meet the following conditions:

  for every complementary set R of v:  Σ_{q∈R} w(q) + Σ_{q∈R} Δw(q) ≤ 1,
  for every q ∈ B:  w(q) + Δw(q) > 0,                                        (7)
  Σ_{i=1}^{n} (αi − βi) Δw(qi) > b − a + Σ_{i=1}^{n} [c_{θ2}(qi) − c_{θ1}(qi)] = C0,

where n is the number of S-nodes in B, αi = c_{θ1}(qi)/w(qi) and βi = c_{θ2}(qi)/w(qi).
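In practice, these conditions can be handed to an off-the-shelf LP solver once the absolute values in the objective are linearized via auxiliary variables t_i ≥ |Δw(qi)| (the absolute-value transformation [21]). The following sketch assumes SciPy is available, and all coefficients (the αi, βi, weights and C0) are invented for illustration; the strict inequalities are approximated with a small margin:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical coefficients for a CRS with two S-nodes.
alpha = np.array([6.0, 0.5])   # alpha_i = c_theta1(q_i) / w(q_i)
beta  = np.array([0.5, 1.0])   # beta_i  = c_theta2(q_i) / w(q_i)
w     = np.array([0.1, 0.9])   # current weights (already sum to 1)
C0, margin = 0.04, 1e-4        # strict ">" handled as ">= C0 + margin"

n = 2
# Variables: [dw_1, dw_2, t_1, t_2] with t_i >= |dw_i|.
c = np.concatenate([np.zeros(n), np.ones(n)])       # minimize t_1 + t_2
A_ub, b_ub = [], []
for i in range(n):                                  # t_i >= |dw_i|
    row = np.zeros(2 * n); row[i], row[n + i] = 1, -1
    A_ub.append(row.copy()); b_ub.append(0.0)       # dw_i - t_i <= 0
    row[i] = -1
    A_ub.append(row); b_ub.append(0.0)              # -dw_i - t_i <= 0
A_ub.append(np.concatenate([np.ones(n), np.zeros(n)]))
b_ub.append(1.0 - w.sum())                          # normalization
for i in range(n):                                  # keep weights positive
    row = np.zeros(2 * n); row[i] = -1
    A_ub.append(row); b_ub.append(w[i])
A_ub.append(np.concatenate([-(alpha - beta), np.zeros(n)]))
b_ub.append(-(C0 + margin))                         # query-flip condition

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)] * n)
dw = res.x[:n]
print(res.fun, dw)
```

For this data the optimum raises the first S-node's weight and lowers its complement by the same amount, analogous to the q6/q7 adjustment discussed above.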
The solution can be treated as a Linear Programming (LP) problem [20] with objective function Min Σ_{i=1}^{n} |Δw(qi)| and the constraints from inequalities (7). This LP can be solved with the simplex method; the objective with absolute values can be transformed into a linear objective by adding linear constraints and variables [21]. Since the worst-case complexity of the simplex algorithm is exponential, the complexity of this method is O(exp(m + n)), where m is the size of the CRS and n is the number of complementary sets of variable v. Note that the time complexity of computing all coefficients αi, βi and searching for v is the same as for the single S-node solution. In practice, the simplex method can be very efficient, with polynomial average-case complexity. Kelner and Spielman [22] introduced a randomized simplex algorithm whose expected running time is polynomial for any input. Also, since the coefficient matrix A in the linear program is always sparse, various efficient revised simplex methods can be used [23,24].

3.3. Sensitivity of a variable CRS

In this section, we analyze how the ratio r(v|e) = P(v = v1|e)/P(v = v2|e) changes when we change values in a CRS B.

Definition 9. Given a BKB K = (G, w), let x = [x1, x2, . . . , xn] be the vector of weights of the S-nodes in CRS B = {q1, q2, . . . , qn}. The sensitivity of r(v|e) to x is the partial derivative:

S(x|v, e) = ∂r(v|e)/∂x.

E. Santos Jr. et al. / International Journal of Approximate Reasoning 54 (2013) 1000–1012
Similar to the way we derived r(v|e) as a function of the weight of a single S-node, r(v|e) as a function of x is given by:

r(v|e) = P(θ1)/P(θ2) = (α1 x1 + α2 x2 + · · · + αn xn + a) / (β1 x1 + β2 x2 + · · · + βn xn + b).

Then the sensitivity of r(v|e) to x is the partial derivative vector:

∂r(v|e)/∂x = [∂r(v|e)/∂x1, ∂r(v|e)/∂x2, . . . , ∂r(v|e)/∂xn].    (8)
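Since r is a ratio of two linear functions of x, its gradient follows directly from the quotient rule: ∂r/∂xi = (αi P(θ2) − βi P(θ1))/P(θ2)². A minimal sketch, with hypothetical coefficients:

```python
import numpy as np

def sensitivity(alpha, beta, a, b, x):
    """Gradient of r(x) = (alpha·x + a) / (beta·x + b), i.e. Eq. (8),
    computed with the quotient rule."""
    p1 = alpha @ x + a        # P(theta1)
    p2 = beta @ x + b         # P(theta2)
    return (alpha * p2 - beta * p1) / p2 ** 2

# Hypothetical coefficients for a two-S-node CRS.
alpha = np.array([0.6, 0.2]); beta = np.array([0.1, 0.4])
a, b = 0.01, 0.02
x = np.array([0.1, 0.9])      # current CRS weights
grad = sensitivity(alpha, beta, a, b, x)
norm = np.linalg.norm(grad)   # Euclidean norm = the CRS sensitivity value
print(grad, norm)
```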

The Euclidean norm of the partial derivative vector can be used as the sensitivity value; a CRS with a higher sensitivity value causes larger changes to the results.

3.4. Compatibility with BN-based systems

As a reminder, every BN can be transformed into a BKB by representing the entries of its CPTs as "if–then" rules in the BKB. Therefore, variable CRS tuning can also be applied to a BN-based system by simply replacing the first inequality in (7) with the equality constraint Σ_{q∈R} [w(q) + Δw(q)] = 1. Clearly, the new problem is still LP solvable. However, we will show that the amount of change required for tuning a BKB is never larger than for the equivalent BN with complete CPTs. Let q1, q2 be two S-nodes in the same CPT and let x and y denote the amounts of change applied to q1 and q2, respectively. We can simplify inequalities (7) into:

  αx − βy > C0,
  x + y = 0    (if BN),
  x + y ≤ 0    (if BKB),

where α, β and C0 are positive coefficients as defined in inequalities (7). To make this more intuitive, we set up an experiment in which the ratio α/β varies from 0.1 to 10 and the ratio (C0 + ε)/α varies from 0.1 to 0.9, and we solve the optimization problem above for the BKB and the BN accordingly. We plot the amount of change at the optimum, |x| + |y|, for the different ratios in Fig. 3. As the figure shows, BKB tuning always requires a smaller change regardless of the coefficient setting. In fact, if the original system models incomplete information, the requirement of a complete CPT, as in BNs, may force the system to lose important latent information or limit future knowledge elicitation. A verification mechanism that checks the completability of a knowledge-based system can be found in one of our previous works [25].

3.5. Examples

In what follows, we present concrete examples to illustrate how the tuning method can be applied.

3.5.1. Aquatic ecosystem

We first introduce a simple aquatic ecosystem to show how to correct the system output with minimal parameter changes. The graphical representation of this BKB is depicted in Fig. 4(a). Unlike BNs, which do not support cyclicity, BKBs allow cyclic dependencies among random variables, as shown in Fig. 4(b). This BKB is also an example of an incomplete conditional probability specification; e.g., P(Reproduce Ability = High | pH Value > 7.5) is unknown. Such incompleteness is problematic in BNs but allowed in BKBs. Notice that currently the probability of {Acid Level = Low} is higher than that of {Acid Level = High} given low Reproduce Ability, i.e., P(Acid Level = Low, Reproduce Ability = Low) = 0.1053 > P(Acid Level = High, Reproduce Ability = Low) = 0.0639. Assume this reasoning result contradicts our natural expectation. Then, in order to flip the most probable explanation for variable "Acid Level" given low "Reproduce Ability", we apply the tuning approach so that the knowledge-based system is corrected with minimal changes.
We start by finding a single S-node solution. We compute the sensitivity of each S-node on P(Acid Level = Low, Reproduce Ability = Low)/P(Acid Level = High, Reproduce Ability = Low) using equation (5). The sensitivity values corresponding to each random variable are shown in Table 2. We find that q1 has the largest sensitivity value, 6.06838, which suggests that we may satisfy the query by applying a small change to q1. We calculate the coefficients and plug them into the third condition of inequalities (4): 0.639 · (0.1 + Δw) > 0.1053. The solution space is Δw > 0.065. Next, we try changing the weights of the S-nodes in a CRS, first evaluating the sensitivity of the CRSs of the variables "Acid Level", "Over Crowd", "pH Value" and "Reproduce Ability". The sensitivity values of these four CRSs computed from Eq. (8) are: S([Acid Level]) = 6.10572 > S([Over Crowd]) = 2.42735 > S([pH Value]) = 1.25723 > S([Reproduce Ability]) = 0.143616. We extract the S-nodes from the CRS of "Acid Level" and form the objective as minimizing |Δw(q1)| + |Δw(q2)| while

Fig. 3. Comparison of total change between the BKB and equivalent BN across different coefficient settings.

maintaining the constraints in inequalities (7). To transform the objective function with absolute values into a linear objective function, we introduce two nonnegative variables t1 and t2, where

  t1 = |Δw(q1)|,  −t1 ≤ Δw(q1) ≤ t1,
  t2 = |Δw(q2)|,  −t2 ≤ Δw(q2) ≤ t2.
So the objective function becomes Min(t1 + t2). We use the simplex method to solve this problem; the minimum is attained at Δw(q1) = 0.0549, Δw(q2) = −0.0549, i.e., the new prior probabilities for Acid Level = High and Acid Level = Low are tuned to 0.1549 and 0.8451. As we can see, both solutions require only a small change in the prior distributions of the original BKB while keeping the internal conditional probabilities collected from the experts unchanged. In comparison, our previous work [6] proposed an algorithm for modifying a BKB so that a query specified by the user is satisfied. The idea is to reduce the weights of S-nodes so that the joint probability of the preferred inference of a query becomes greater than that of any alternative. For example, the preferred inference in our case is {Acid Level = High, pH Value < 6.5, Reproduce Ability = Low}. We apply this algorithm to correct the system and compare it with the tuning method. The amounts of change required for each S-node are shown in Table 3. Clearly, tuning requires a much smaller change than the previous work, since we target only the most sensitive factors contributing to the query. In addition, the tuning method maintains completability, whereas the previous method may lose the original semantics.

3.5.2. Medical diagnostic system

In the next example, we show that in addition to updating an existing system, the tuning method can help control the states of key variables and answer questions such as: what factors are important in keeping a patient's health condition from worsening? Fig. 5 shows a medical diagnostic system that determines what potential care is best for a patient, considering his personal situation and preferences. Under a normal situation after surgery (represented in Fig. 5), the most probable potential care should be "home care" rather than hospitalization, unless the patient prefers to stay in the

Fig. 4. (a) Example of correlation-graph of a BKB modeling an aquatic ecosystem (left); (b) underlying random variable dependency relationship for the BKB on the left (right).

Table 2
Sensitivity of each S-node on ratio r(Acid Level | Reproduce Ability = Low).

S-node    Sensitivity value
q1        6.0684
q2        0.6743
q3        2.4274
q4        0
q5        0.0950
q6        0.7407
q7        0.1436
q8        0
q9        0.1425
q10       0
q11       0.0182
q12       0
q13       0

Table 3
Change applied to each S-node by the two methods.

S-node    BKB tuning    Previous work
q1         0.0549        0
q2        −0.0549       −0.3667
q3         0            −0.1018
q4         0            −0.0204
q5         0            −0.0611
q6         0             0
q7         0            −0.2442
q8         0            −0.3667
q9         0            −0.0407
q10        0            −0.0895
q11        0             0
q12        0            −0.2035
q13        0            −0.3053


Fig. 5. Example of correlation-graph of a BKB modeling a medical diagnostic system.

hospital, i.e., P((B)Patient preference = hospitalization) > P((B)Patient preference = home care). However, if the patient's situation worsens later, hospitalization may become necessary. The next question is then under what circumstances, and how likely it is, that the "potential care" decision will change. We apply the tuning method as in the previous example to find the solution of P(Potential Care = home) < P(Potential Care = hospitalization). The results show that "Wound" is the most sensitive/critical variable for the target decision. Without changing any internal conditional probabilities, the patient will have to be hospitalized if the probability of "Wound = open" increases by 0.138162. In other words, to keep the health condition from worsening, the patient should pay extra attention to wound care. Medicines that may slow wound healing or cause wound dehiscence should be taken very carefully.
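Both examples ultimately reduce to the same computation: the smallest change in one weight that flips a joint-probability comparison. A minimal sketch in plain Python, using the numbers from the aquatic example in Section 3.5.1 (coefficient 0.639, current weight 0.1, rival mass 0.1053):

```python
def flip_threshold(coeff, weight, rival_mass):
    """Smallest dw such that coeff * (weight + dw) > rival_mass."""
    return rival_mass / coeff - weight

# Aquatic example: 0.639 * (0.1 + dw) > 0.1053  =>  dw > ~0.065.
print(round(flip_threshold(0.639, 0.1, 0.1053), 3))  # → 0.065
```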

4. Conclusions

This paper proposed a method to tune a BKB. Specifically, we described how to explain and fix conflicts between the users' expectations and the system's answers by making a small change to the parameters. The optimal set of parameters is determined by evaluating their sensitivity, which minimizes the amount of change. We started by tuning a single parameter and then extended to multiple parameters. Another contribution of this work is the demonstration that tuning a system via multiple parameter changes has the same computational complexity as single parameter changes: the problem of multiple parameter changes can be transformed into a Linear Programming (LP) problem and efficiently solved.

Besides the applications introduced in the last section, the tuning method can also be used to control the reliabilities or weights of knowledge bases from different sources in knowledge fusion. In fact, knowledge fusion happens frequently in decision-making scenarios when multiple perspectives/experts are involved. For example, in a medical diagnosis system, two experts may hold different opinions about the causal relationship between two variables of interest, so simply combining the two knowledge bases could violate probabilistic soundness. An algorithm [15] was proposed to encode and aggregate a set of knowledge fragments from different sources into one unified BKB. An important property of this approach is its ability to support transparency in analysis; in other words, all perspectives are preserved in the fused BKB without losing any information. The idea behind the algorithm is to take the union of all input fragments while incorporating additional variables, called source nodes, indicating the source and reliability of each fragment. Therefore, without changing each fragment's internal parameter settings, an easy way to control the system output is simply to tune the prior probabilities/reliabilities of the source nodes.
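As a rough illustration of that last idea (a sketch only; the fragment answers and reliabilities below are invented, not taken from [15]): the fused output mixes each fragment's answer by its source-node prior, so tuning only those priors re-weights whole fragments without touching their internal parameters.

```python
# Two hypothetical knowledge fragments that disagree on P(X = true).
fragment_answer = {"expert_A": 0.8, "expert_B": 0.3}  # P(X=true | source)

def fused(prior):
    """P(X = true) in the fused model: marginalize over source nodes."""
    return sum(prior[s] * fragment_answer[s] for s in prior)

print(fused({"expert_A": 0.5, "expert_B": 0.5}))  # equal reliability
print(fused({"expert_A": 0.2, "expert_B": 0.8}))  # trust expert_B more
```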

Another application of system tuning with multiple parameters in a BKB is automatic knowledge validation, in which the goal is to tune the given BKB with the minimum necessary changes such that the updated BKB passes all test cases. The complexity of solving multiple queries together is the same as for a single query, since the problem is still an LP with only one additional inequality needed per query. One way to measure the sensitivity value w.r.t. multiple queries is simply to take the average over all the queries.

Acknowledgement

This work was supported in part by grants from AFOSR, DHS, and DTRA. A preliminary version of this paper can be found in [26].

References

[1] K. Laskey, Sensitivity analysis for probability assessments in Bayesian networks, IEEE Transactions on Systems, Man and Cybernetics (SMC) 25 (1995) 901–909.
[2] H. Chan, A. Darwiche, Sensitivity analysis in Bayesian networks: from single to multiple parameters, in: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004, pp. 67–75.
[3] R.G. Cowell, P. Dawid, S.L. Lauritzen, D.J. Spiegelhalter, Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks, Springer, 2007.
[4] E. Santos Jr., S.E. Santos, A framework for building knowledge-bases under uncertainty, Journal of Experimental & Theoretical Artificial Intelligence 11 (1999) 265–286.
[5] T. Rosen, S. Shimony, E. Santos, Reasoning with BKBs – algorithms and complexity, Annals of Mathematics and Artificial Intelligence 40 (2004) 403–425.
[6] E. Santos Jr., H. Dinh, On automatic knowledge validation for Bayesian knowledge bases, Data & Knowledge Engineering 64 (2008) 218–241.
[7] V. Coupé, L. van der Gaag, J. Habbema, Sensitivity analysis: an aid for probability elicitation, Relation 269 (2000) 8889.
[8] L. van der Gaag, S. Renooij, V. Coupé, Sensitivity analysis of probabilistic networks, in: Advances in Probabilistic Graphical Models, 2007, pp. 103–124.
[9] H. Wang, I. Rish, S. Ma, Using sensitivity analysis for selective parameter update in Bayesian network learning, Association for the Advancement of Artificial Intelligence (AAAI), 2002.
[10] I. Cloete, S. Snyders, D. Yeung, X. Wang, Sensitivity analysis of prior knowledge in knowledge-based neurocomputing, in: Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, vol. 7, 2004, pp. 4174–4180.
[11] B. O'Neill, Importance sampling for Bayesian sensitivity analysis, International Journal of Approximate Reasoning 50 (2009) 270–278.
[12] E. Kriegler, H. Held, Utilizing belief functions for the estimation of future climate change, International Journal of Approximate Reasoning 39 (2005) 185–209.
[13] M. Oberguggenberger, J. King, B. Schmelzer, Classical and imprecise probability methods for sensitivity analysis in engineering: a case study, International Journal of Approximate Reasoning 50 (2009) 680–693.
[14] C. Boutilier, N. Friedman, M. Goldszmidt, D. Koller, Context-specific independence in Bayesian networks, in: Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, 1996, pp. 115–123.
[15] E. Santos, J. Wilkinson, E. Santos, Fusing multiple Bayesian knowledge sources, International Journal of Approximate Reasoning 52 (2011) 935–947.
[16] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[17] E. Santos, E. Santos, J. Wilkinson, H. Xia, On a framework for the prediction and explanation of changing opinions, in: IEEE International Conference on Systems, Man and Cybernetics (SMC), 2009, pp. 1446–1452.
[18] N. Pioch, J. Melhuish, A. Seidel, E. Santos Jr., D. Li, M. Gorniak, Adversarial intent modeling using embedded simulation and temporal Bayesian knowledge bases, Technical Report, DTIC Document, 2009.
[19] E. Santos Jr., K. Kim, F. Yu, D. Li, E. Jacob, L. Katona, J. Rosen, A method to improve team performance in the OR through intent inferencing, in: Proceedings of the EUROSIM Congress on Modelling and Simulation, vol. 2, 2010.
[20] K. Murty, Linear Programming, Wiley, New York, 1983.
[21] D. Shanno, R. Weil, Linear programming with absolute-value functionals, Operations Research 19 (1971) 120–124.
[22] J. Kelner, D. Spielman, A randomized polynomial-time simplex algorithm for linear programming, in: Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, 2006, pp. 51–60.
[23] R. McBride, J. Mamer, Solving multicommodity flow problems with a primal embedded network simplex algorithm, INFORMS Journal on Computing 9 (1997) 154–163.
[24] J. Mamer, R. McBride, A decomposition-based pricing procedure for large-scale linear programs: an application to the linear multicommodity flow problem, Management Science 46 (2000) 693–709.
[25] E. Santos, Q. Gu, E. Santos, Incomplete information and Bayesian knowledge-bases, in: IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2011, pp. 2989–2995.
[26] E. Santos Jr., Q. Gu, E. Santos, Tuning a Bayesian knowledge base, in: Proceedings of the 24th International FLAIRS Conference, AAAI Press, 2011, pp. 638–643.