Causal Discovery with Prior Information

R.T. O'Donnell, A.E. Nicholson, B. Han, K.B. Korb, M.J. Alam, and L.R. Hope
Faculty of Information Technology, Monash University
email: {rodo,annn,bhan,korb,alam,lhope}@csse.monash.edu.au

Abstract. Bayesian networks (BNs) are rapidly becoming a leading tool in applied Artificial Intelligence (AI). BNs may be built by eliciting expert knowledge or learned via causal discovery programs. A hybrid approach is to incorporate prior information elicited from experts into the causal discovery process. We present several ways of using expert information as prior probabilities in the CaMML causal discovery program.

1 Introduction

Bayesian networks (BNs) are popular graphical models for probabilistic reasoning. BNs can be hand-crafted, using expert-elicited domain knowledge, or machine learned. Expert elicitation is time-consuming, expensive and heavily dependent upon human expertise. Automated learning methods are needed to break through this "knowledge bottleneck", but they too have limitations: structural learning usually needs large, clean datasets, and because the learning algorithms lack common-sense domain knowledge, the learned networks often contain rather simple errors.

A hybrid approach is to introduce expert-elicited knowledge into the learning process in the form of constraints or prior probabilities. This both reduces the search space and biases the search, potentially improving learning efficiency. In this study, we present several ways in which experts may specify structural information about the domain, along with a confidence in that information, and show how this can be incorporated as soft priors into the CaMML (Causal discovery via MML) program [1, 2]. Elsewhere [3] we compare CaMML to a variety of BN learners and show (a) that CaMML achieves comparable results without prior information and superior performance with it, and (b) that CaMML is well calibrated to variations in the expert's skill and confidence.

2 BN structural learners

A Bayesian network is a directed acyclic graph (DAG) whose nodes represent random variables and arcs represent direct dependencies (e.g., causal relationships). Each node has a conditional probability table (CPT), quantifying the relationship between connected variables. An important concept for causal learning is the statistical equivalence class (SEC) [4]. Two BNs in the same equivalence class can be parameterised to give an identical joint probability distribution. There is no way to distinguish between the two using only observational data, although they may be distinguished given experimental data. BN structural learning algorithms can be classified into constraint-based (e.g. PC [5], implemented in Tetrad [6]) and metric-based. Metric-based methods such as K2 [7] and CaMML [1] search for a BN to minimise or maximise a metric;

many different metrics have been used (see [8, Ch. 8]). These learners also vary in the search method used and in what is returned from the search: some (e.g., K2) return a DAG, others (e.g., GES [9]) learn only the SEC. Here, we introduce expert prior information about the BN structure into CaMML.

CaMML attempts to learn the best causal structure to account for the data, using a minimum message length (MML) metric with an MCMC search over the model space. MML [10] provides a Bayesian information-theoretic metric, trading off prior probability (model complexity) against goodness of fit. The MML code for BNs describes: (1) the network structure, (2) the parameters given this structure (the CPTs), and (3) the data given these parameters. We have modified the encoding of the BN structure to incorporate expert priors. CaMML's structure code requires a probability for the existence of an arc (in either direction), called P_arc, which by default it estimates from the data.

In contrast to other metric learners, which use a uniform prior over DAGs or SECs for their search, CaMML uses a uniform prior over Totally Ordered Models (TOMs). A TOM consists of (1) a DAG and (2) a full temporal ordering for that DAG. A single DAG may include several TOMs, just as a single SEC may include several DAGs. Figures 1(a) and (b) show two DAGs that belong to the same SEC. However, the DAG in Figure 1(a) represents a single TOM, with total ordering ABC, while the DAG in Figure 1(b) represents two TOMs, with orderings BAC and BCA. So under CaMML's default prior, the DAG in Figure 1(b) is twice as likely, reflecting the larger number of ways this DAG can be realised (see [8] for further discussion).

CaMML also differs from other learners in using Metropolis sampling to estimate a distribution over the model space. For a fair comparison with other learners, here we restrict CaMML to its single "best" BN: the DAG which best represents all DAGs in the highest-posterior MML equivalence class.

[Figure 1 near here: two DAGs from the same SEC and their associated TOMs; (a) admits the single total ordering ABC, (b) admits the orderings BAC and BCA.]

Fig. 1. These DAGs belong to the same SEC, but represent different TOMs: (a) one TOM with total ordering ABC; (b) two TOMs, with total orderings BAC and BCA.
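The TOM-counting argument above can be checked by brute force: a TOM pairs a DAG with a total ordering in which every arc runs from an earlier to a later variable, so counting a DAG's TOMs amounts to counting its linear extensions. A minimal sketch (illustrative only, not CaMML's implementation; the function name is ours):

```python
from itertools import permutations

def linear_extensions(nodes, arcs):
    """Count the total orderings (TOMs) consistent with a DAG:
    orderings in which every arc points from earlier to later."""
    count = 0
    for order in permutations(nodes):
        pos = {v: i for i, v in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in arcs):
            count += 1
    return count

# Figure 1(a): chain A -> B -> C admits only the ordering ABC.
print(linear_extensions("ABC", [("A", "B"), ("B", "C")]))   # 1
# Figure 1(b): common cause B -> A, B -> C admits BAC and BCA.
print(linear_extensions("ABC", [("B", "A"), ("B", "C")]))   # 2
```

Under a uniform prior over TOMs, DAG (b) thus receives twice the prior mass of DAG (a).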

Finally, some existing BN learners already support the use of structural knowledge: K2 [7] requires a total ordering; Heckerman and Geiger [11] proposed the use of a Minimum Weighted Spanning Tree (MWST) algorithm to learn a tree-like BN structure, which can then be used to initialise K2. Tetrad's implementations of Greedy Equivalence Search (GES) and PC allow the specification of 'temporal tiers', a partial ordering of variables; earlier versions of CaMML (e.g., [1]) allowed the specification of temporal orderings. However, these are all hard constraints. Heckerman and Geiger [12] proposed soft constraints, computing a prior distribution from the edit distance between an expert-specified structure and the candidate structure, while Castelo and Siebes [13] use a prior over directed arcs. Neither method is supported in any BN structural learning package.

3 Structural Priors in CaMML

When eliciting structural information from experts, we want to give them a number of ways to describe relationships between variables. While specific, accurate information is ideal, we generally prefer accuracy over specificity: if the information we require is too specific, we may fail to get anything useful. Hence, we introduce several levels of structural information, each of which can be accompanied by a confidence level. The levels are presented below in (arguably) most specific to most general order.

Full structure. An expert may supply a fully specified network, indicating the direct causal connections between variables. This requires a high level of knowledge of the causal process relating the variables.

Direct relation. It may be known that two variables are directly related, but the direction of causality is unknown.

Causal dependency. This allows an expert to indicate that one variable is an ancestor of the other, while the mechanism between them remains unknown. For example, it is generally accepted that smoking causes lung cancer, although little is known about the detailed process.

Temporal order. In many domains it is clear that some variables come before others; we allow this to be indicated independently of other information.

Correlation. The most general sort of information we use is correlation, which implies that there is some connection between the nodes. It may be a causal dependency in either direction, or via a common ancestor.

During the elicitation process, the expert may respond with either a full structure, or any combination of the remaining prior types above. The system described below synthesizes the information into a coherent whole.

3.1 Pairwise Relationship priors

CaMML allows an expert to specify priors on a combination of five types of pairwise relationships. Direct causal connections, direct relations and temporal order are considered "local", as only the variables involved are required to determine the status of the relationship. Indirect causal relations and correlation are "global", as the full network may be required to determine them.

Local priors are converted into four distinct relationships: A → B, B → A, A ↛ B and B ↛ A, where A ↛ B represents (A ≺ B and not A → B). The distinction between A ↛ B and B ↛ A is required because the TOM prior treats them as distinct states. Expert-specified "local" priors are mapped into this space as follows:

– Fix P(A → B) and P(B → A) if they are specified by the expert.
– Fix all other implied values. For example, if P(A → B) = 0.2 and P(A ≺ B) = 0.3 are specified, then P(A ↛ B) = 0.1 is implied.
– Set remaining values proportional to P(A ≺ B) × P(A − B), where P(A ≺ B) and P(A − B) are expert priors if specified, or default priors otherwise.
– If the generated prior does not match the expert priors, reject the expert priors as inconsistent.

The message length of a TOM t, based on local priors, is

$$C_{\text{local}}(t) = \sum_{i,j} \begin{cases} -\log P_{i \to j} & \text{if } i \to j \\ -\log P_{j \to i} & \text{if } j \to i \\ -\log P_{i \not\to j} & \text{if } i \not\to j \\ -\log P_{j \not\to i} & \text{if } j \not\to i \end{cases}$$

where the sum is over all unique pairs (i, j).
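The per-pair lookup behind C_local can be sketched directly from the four-way classification above (illustrative Python, not CaMML's code; representing priors as a dict keyed by ordered pairs is our assumption):

```python
import math

def local_cost(order, arcs, priors):
    """C_local(t): sum of -log P over all pairs with local priors.

    order  -- the TOM's total ordering, e.g. "ACB"
    arcs   -- set of directed arcs present in the TOM
    priors -- maps pair (i, j) to (P(i->j), P(j->i), P(i-/->j), P(j-/->i))
    """
    pos = {v: n for n, v in enumerate(order)}
    cost = 0.0
    for (i, j), (p_ij, p_ji, p_i_not_j, p_j_not_i) in priors.items():
        if (i, j) in arcs:
            p = p_ij                  # arc i -> j present
        elif (j, i) in arcs:
            p = p_ji                  # arc j -> i present
        elif pos[i] < pos[j]:
            p = p_i_not_j             # i precedes j, no arc
        else:
            p = p_j_not_i             # j precedes i, no arc
        cost += -math.log(p)
    return cost
```

The four branches mirror CaMML's distinct states i → j, j → i, i ↛ j and j ↛ i; with the priors of Figure 2(b), the worked example of Section 3.1 can be reproduced.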

Global priors are mapped from expert-specified priors on indirect causal and correlation relationships. Experts can specify A ⇒ B (A is an ancestor of B) and A ∼ B (A and B are correlated). Internally these are transformed into A ⇒ B, B ⇒ A, A ⇔ B and A ≁ B, where A ⇔ B represents "A and B have a common cause (direct or indirect), but neither is an ancestor of the other", and A ≁ B represents "A and B are uncorrelated": there is no causal chain between them and they do not share a common ancestor. We translate these global priors into CaMML's internal format in a similar way to the local priors. The default prior is calculated by sampling TOMs using our default value for P_arc and k, the number of nodes in the network.

$$C_{\text{global}}(t) = \sum_{(i,j) \in g} \begin{cases} -\log P_{i \Rightarrow j} & \text{if } i \Rightarrow j \\ -\log P_{j \Rightarrow i} & \text{if } j \Rightarrow i \\ -\log P_{i \Leftrightarrow j} & \text{if } i \Leftrightarrow j \\ -\log P_{i \not\sim j} & \text{if } i \not\sim j \end{cases}$$

where g is the set of (i, j) pairs which have an expert-specified prior on i ⇒ j, j ⇒ i or i ∼ j.
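The default global prior described above (sampling TOMs under the default P_arc for a k-node network) can be sketched as a Monte Carlo estimate. This is illustrative only, not CaMML's sampler, and all names are ours: we draw a uniformly random ordering, include each forward arc independently with probability P_arc, and classify a fixed pair by the four global relations.

```python
import itertools
import random

def ancestors(k, arcs):
    """anc[v]: all ancestors of v under the arc set (transitive closure)."""
    anc = {v: set() for v in range(k)}
    changed = True
    while changed:
        changed = False
        for a, b in arcs:
            new = ({a} | anc[a]) - anc[b]
            if new:
                anc[b] |= new
                changed = True
    return anc

def default_global_prior(k, p_arc, n_samples=2000, seed=0):
    """Estimate default probabilities of the global relations between a
    pair of variables by sampling TOMs: a uniformly random total ordering
    with each forward arc present independently with probability p_arc."""
    rng = random.Random(seed)
    counts = {"i=>j": 0, "j=>i": 0, "i<=>j": 0, "uncorrelated": 0}
    i, j = 0, 1
    for _ in range(n_samples):
        order = list(range(k))
        rng.shuffle(order)
        pos = {v: n for n, v in enumerate(order)}
        arcs = [(a, b) for a, b in itertools.permutations(range(k), 2)
                if pos[a] < pos[b] and rng.random() < p_arc]
        anc = ancestors(k, arcs)
        if i in anc[j]:
            counts["i=>j"] += 1           # i is an ancestor of j
        elif j in anc[i]:
            counts["j=>i"] += 1           # j is an ancestor of i
        elif anc[i] & anc[j]:
            counts["i<=>j"] += 1          # common cause, neither an ancestor
        else:
            counts["uncorrelated"] += 1   # no causal connection at all
    return {r: c / n_samples for r, c in counts.items()}
```

With p_arc = 0 every pair is uncorrelated, and with p_arc = 1 every pair is directly connected, which gives quick sanity checks on the estimator.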

Encoding the TOM structure. Summing these partial costs gives the total cost of the TOM's structure: C(t) = C_local(t) + C_global(t) + γ. The constant γ normalises the probability distribution; i.e., γ enforces the efficiency requirement that $\sum_{t \in \text{TOMs}} e^{-C(t)} = 1$. In practice, γ need never be calculated, as all operations in the CaMML search use relative MML costs, so we omit it from further discussion.

Example. Consider the example of Figure 2. An expert is 70% sure that A causes B, 20% sure that B causes A, and 60% sure of a link between A and C, with P_arc = 0.5. CaMML converts these priors to the table shown in Figure 2(b). During the sampling process, suppose the TOM of Figure 2(c) is sampled. Using Figure 2(b), the cost of Figure 2(c) would be C(t) = −ln(0.05) − ln(0.3) − ln(0.25).

[Figure 2 near here: (a) the expert-specified network, with P(A → B) = 0.7, P(B → A) = 0.2 and P(A − C) = 0.6; (c) a candidate TOM with ordering ACB and arc A → C. Panel (b) is the prior table:]

i  j  P(i→j)  P(j→i)  P(i↛j)  P(j↛i)
A  B   0.70    0.20    0.05    0.05
A  C   0.30    0.30    0.20    0.20
B  C   0.25    0.25    0.25    0.25

Fig. 2. (a) Expert specified network with local priors. (b) CaMML’s interpretation of priors from (a). (c) Candidate TOM with ordering ACB.
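The arithmetic of the worked example can be checked directly from Figure 2(b): pair (A, B) contributes P(A ↛ B) = 0.05 (A precedes B with no arc), pair (A, C) contributes P(A → C) = 0.30, and pair (B, C) contributes P(C ↛ B) = 0.25.

```python
import math

# C(t) for the TOM of Fig. 2(c), ordering ACB, read off the Fig. 2(b) table:
cost = -math.log(0.05) - math.log(0.30) - math.log(0.25)
print(round(cost, 3))  # 5.586
```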

3.2 Using an expert-specified BN

Prior information can also be encoded as a specific network structure, say DAG d, if that network (or subnetwork) is known to be near the truth. Priors over other networks can then be computed via their edit distance from the given network. We also allow the expert to specify a confidence, expressed as a probability P_d, which we apply as the probability of each arc in d. We have defined two edit distance functions: ED_d(d_1, d_2), the edit distance between two DAGs d_1 and d_2; and ED_t(d, t), the edit distance from a DAG d to the TOM t itself. See [3] for details.

Message length. Given these edit distances, we can generate priors over TOM space. The partial costs for a candidate TOM t, whose DAG is d_t, are

$$C_d(t) = ED_d(d, d_t)\,(\log P_d - \log(1 - P_d))$$
$$C_t(t) = ED_t(d, t)\,(\log P_d - \log(1 - P_d))$$

so that, when the expert's confidence P_d exceeds 0.5, each edit away from the specified network adds log(P_d / (1 − P_d)) to the cost.
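Assuming the edit distance simply counts arc differences (the precise ED_d and ED_t definitions are in the technical report [3]), the edit-distance prior reduces to a per-edit penalty. A sketch with illustrative names:

```python
import math

def edit_distance(arcs1, arcs2):
    """Arc symmetric difference between two DAGs: a simple stand-in
    for the paper's ED_d (exact definition in the technical report)."""
    return len(set(arcs1) ^ set(arcs2))

def prior_cost(expert_arcs, candidate_arcs, p_d):
    """Extra message length for deviating from the expert's DAG d:
    each edit costs log(p_d) - log(1 - p_d), positive when p_d > 0.5."""
    per_edit = math.log(p_d) - math.log(1.0 - p_d)
    return edit_distance(expert_arcs, candidate_arcs) * per_edit

# Expert DAG A -> B -> C held with 90% confidence; candidate drops B -> C:
print(round(prior_cost({("A", "B"), ("B", "C")}, {("A", "B")}, 0.9), 3))  # 2.197
```

At P_d = 0.5 the penalty vanishes, so an indifferent expert leaves the search prior unchanged.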

4 Conclusions

The aim of this research was to improve the performance of CaMML by incorporating prior structural knowledge – that is, knowledge about the relationships between the domain variables under consideration – elicited from experts. Our method can handle numerous types of structural information, providing much greater flexibility in the elicitation process. It also incorporates the expert's confidence, allowing both hard constraints (as supported by other BN learners) and soft constraints, so that experts are not forced into over- or under-confidence. In [3] we present experimental results showing that, without prior information, CaMML achieves results comparable to competitive algorithms, and superior performance with prior information. Furthermore, we show CaMML is well calibrated to variations in expert skill and confidence.

Acknowledgements. This research was supported by ARC Discovery Grant DP0450096.

References

1. Wallace, C., Korb, K.: Learning linear causal models by MML sampling. In Gammerman, ed.: Causal Models and Intelligent Data Management. Springer (1999)
2. O'Donnell, R.T., Allison, L., Korb, K.B.: Learning hybrid Bayesian networks by MML. In: Proc. 19th Australian Joint Conf. on AI, LNAI (2006)
3. O'Donnell, R.T., Nicholson, A.E., Han, B., Korb, K.B., Alam, M.J., Hope, L.R.: Incorporating expert elicited structural information in the CaMML causal discovery program. Technical Report TR 2006/194, Clayton School of IT, Monash University (2006)
4. Chickering, D.M.: A transformational characterization of equivalent Bayesian network structures. In: 11th UAI, San Francisco (1995) 87–98
5. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction and Search. Second edn. MIT Press (2000)
6. Hayduk, L.A.: Equivalent Models: TETRAD and Model Modification. Johns Hopkins University Press, Baltimore (1996) 121–154
7. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9 (1992) 309–347
8. Korb, K.B., Nicholson, A.E.: Bayesian Artificial Intelligence. Chapman & Hall/CRC, Boca Raton (2004)
9. Chickering, D.M.: Optimal structure identification with greedy search. Journal of Machine Learning Research 3 (2003) 507–554
10. Wallace, C.S.: Statistical and Inductive Inference by Minimum Message Length. Springer, Berlin (2005)
11. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20 (1995) 197–243
12. Heckerman, D., Geiger, D.: Learning Bayesian networks: A unification for discrete and Gaussian domains. In: 11th UAI, San Francisco, Morgan Kaufmann (1995) 274–284
13. Castelo, R., Siebes, A.: Priors on network structures. Int. Jrn. of Approximate Reasoning 24(1) (2000) 39–57