Learning Bayesian Networks under the Control of Mutual Information

Gernot D. Kleiter
University of Salzburg, Austria
E-mail: [email protected]

Radim Jirousek
University of Economics, Prague, Czech Republic
E-mail: [email protected]

Abstract

The extraction of the structure of a Bayesian network from data is conceived as a stochastic process of learning a random acyclic directed graph. The nodes of the graph are taken as fixed, but its arcs are randomly switched on or off. The probabilities for the arcs being in the on or off state are controlled by the amount and the precision of the mutual information the arcs transmit to the neighboring nodes. A $\chi^2$ approximation for the second-order probability density function of mutual information in contingency tables with Dirichlet-distributed first-order probabilities is proposed. EM is used for partially observed data. The neighborhood of each node is defined by its Markov shield. The density function is employed to select only those dependencies for which the probability that the mutual information exceeds a selected minimum mutual information (MMI) criterion is large. The criterion is combined with Gibbs sampling to learn the structure of a Bayesian network from complete or incomplete data. The method allows one to find the posterior probability distribution of Bayesian network structures.

1 INTRODUCTION

The extraction of the structure of Bayesian networks from data has attracted much attention recently [1, 5]. Intuitively, the learning is controlled by the strength of multivariate dependencies between sets of variables. Statistical tests are employed to protect the learning process against `chance' effects. The tests suffer from several difficulties, though. They easily lead to overfitting the model, that is, to the extraction of too many dependencies. Further problems arise from missing data. The data may, for example, be an amalgamation of several partially overlapping samples. In the literature on Bayesian networks, relative frequencies are sometimes interpreted as probabilities. In the presence of partially observed data this can easily lead to `inconsistent' probabilities. Such `inconsistencies' are avoided if a clear distinction between frequencies and probabilities is made.

Expectation maximization (EM) algorithms are employed for incomplete data [7]. EM is an iterative procedure. Applying it globally to a whole network may lead to difficulties [2]. First, it is very slow. Second, in a large network there are many local maxima, and the decision whether a global or a local maximum has been reached is difficult. Hypothesis testing is complicated by the fact that the asymptotic distribution of the likelihood ratio test statistic is not $\chi^2$.

A criterion often used to measure the strength of dependencies in Bayesian networks is mutual information [4]. The criterion plays, e.g., an important role in extracting Bayesian networks under the control of minimum description length (MDL). If the probabilities follow a Dirichlet distribution, then mutual information, being a function of these probabilities, also follows a probability distribution. The distribution tells us how sure we can be about the strength of a dependence. It allows us to obtain confidence intervals or to determine the probability that the strength of a dependence is larger than a given value. We propose a distribution that can be applied to incomplete data. In the following section we first propose a probability density function of mutual information in a two-dimensional contingency table.
2 DISTRIBUTION OF MUTUAL INFORMATION

Mutual information is a well-known measure of deviance from independence. We will use it as a criterion in our learning process. The motivation why we are interested in the probability density function of mutual information is that we want to know the precision of the criterion. One and the same point value of mutual information may be highly reliable and precise in a large sample and quite unreliable and imprecise in a small sample. We want to extract informative and reliable structures. Similar arguments were put forward for making inferences (propagating probabilities) in an already given structure [6].

Often we are interested in functions of probabilities in contingency tables. Mutual information is a typical example. For a two-way contingency table we have

$$\iota = \sum_i \sum_j \pi_{ij} \log \frac{\pi_{ij}}{\pi_{i+}\pi_{+j}} \, . \qquad (1)$$

If the probabilities on the right-hand side are not perfectly known, the measure on the left-hand side cannot be known perfectly either. We assume that the $\pi_{ij}$ follow a Dirichlet distribution. Extensive investigations by Monte Carlo methods with random probabilities generated from Dirichlet distributions led to the following approximation:
Proposition 1 (Density of mutual information) If $X_1$ and $X_2$ are two discrete random variables and if the distribution of their joint probabilities is Dirichlet, $(\pi_{11}, \pi_{12}, \pi_{21}) \sim Di(n_{11}, n_{12}, n_{21}, n_{22})$, then the distribution of mutual information is approximately $\chi^2(I+1, 1/n)$ distributed, where

$$I = n \log(n) + \sum_i \sum_j n_{ij} \log(n_{ij}) - \sum_i n_{i+} \log(n_{i+}) - \sum_j n_{+j} \log(n_{+j}) \, . \qquad (2)$$
A random variable $X$ is $\chi^2$ distributed with $\nu$ degrees of freedom and scaling factor $\omega$, in shorthand notation $X \sim \chi^2(\nu, \omega)$, if its probability density function is

$$p(x \mid \nu, \omega) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)\,\omega^{\nu/2}} \, x^{\nu/2-1} \exp\!\left(-\frac{x}{2\omega}\right), \qquad 0 \le x < \infty, \; \nu > 0, \; \omega > 0 \, . \qquad (3)$$

$I$ is of course well known from the $2I$ quantity used in testing goodness of fit. The mean and the variance of the distribution are

$$\mu = \nu\omega = \frac{I+1}{n} \quad \text{and} \quad \sigma^2 = 2\nu\omega^2 = \frac{2(I+1)}{n^2} \, . \qquad (4)$$

If $X_1$ and $X_2$ are independent, $I$ is zero and the posterior distribution is $\chi^2$ with one degree of freedom and variance approaching zero as the sample size $n \to \infty$. Usually, the degrees of freedom in $\chi^2$ applications in contingency tables are related to the number of cells.
Note that in the present context they are related to the strength of the dependence. The scaling factor $1/n$ expresses the accuracy of the dependence. This implies that for large $n$ the mean of the distribution is close to the mutual information as estimated directly from the relative frequencies. In the following section we propose a procedure for the case of incomplete data. We will replace the actual sample size $n$ by a virtual sample size $\hat{n}$ that expresses the number of complete data plus the worth of the incomplete data.
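As a concrete illustration, equations (2) and (4) can be computed directly from a table of counts. The following sketch (the function name and table layout are our own, not from the paper) returns the parameters of the approximating $\chi^2(I+1, 1/n)$ distribution:

```python
import math

def mi_chi2_params(table):
    """Given a table of cell counts, return (I, df, scale, mean, var) for the
    chi-square approximation of Proposition 1: I from eq. (2), degrees of
    freedom I+1, scaling factor 1/n, and mean/variance from eq. (4)."""
    n = sum(sum(row) for row in table)                     # total sample size
    row_sums = [sum(row) for row in table]                 # n_{i+}
    col_sums = [sum(col) for col in zip(*table)]           # n_{+j}
    I = (n * math.log(n)
         + sum(nij * math.log(nij) for row in table for nij in row if nij > 0)
         - sum(ni * math.log(ni) for ni in row_sums if ni > 0)
         - sum(nj * math.log(nj) for nj in col_sums if nj > 0))
    df, scale = I + 1.0, 1.0 / n
    return I, df, scale, df * scale, 2.0 * df * scale ** 2
```

For a perfectly independent table such as [[10, 10], [10, 10]], $I$ is zero and the mean $(I+1)/n$ shrinks with the sample size, reflecting the variance behavior described above.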
3 INCOMPLETE DATA

It often happens that more is known about the dependencies between some variables than between others. We may, for example, know more about a conditional symptom probability than about the marginal `epidemic base rates' of a related disease. We would like to be able to express that, for example, the variance of the second-order probability density function of the conditional symptom probabilities is less than that of the corresponding marginals. Such situations typically arise in the presence of partially observed data. A general method to estimate probabilities from incomplete data is expectation maximization (EM) [7]. EM is an iterative method. To explain the procedure we consider the most elementary case, a $2 \times 2$ table (Table 1). The total number of cases is $N = \sum_{ij} n_{ij} + \sum_i n_{i?} + \sum_j n_{?j}$. The number of complete observations in the core is $n = \sum_{ij} n_{ij}$.

            X2 = 1    X2 = 2    ?
  X1 = 1    n11       n12       n1?
  X1 = 2    n21       n22       n2?
  ?         n?1       n?2

Table 1: Frequency table for two binary variables with partially observed data
Proposition 2 (EM algorithm) Let $n_{ij}$ denote the cell counts of fully observed data, $n_{i?}$ the number of partially observed cases in row $i$, $n_{?j}$ the number of partially observed cases in column $j$, $f_{ij}^{(v)}$ the estimate of the frequency in iteration $v$, $f_{i+}^{(v)} = \sum_j f_{ij}^{(v)}$, and $f_{+j}^{(v)} = \sum_i f_{ij}^{(v)}$. The cell frequencies are estimated iteratively by

$$f_{ij}^{(v)} = n_{ij} + n_{i?}\,\frac{f_{ij}^{(v-1)}}{f_{i+}^{(v-1)}} + n_{?j}\,\frac{f_{ij}^{(v-1)}}{f_{+j}^{(v-1)}} \, , \qquad (5)$$

until convergence to a desired accuracy is attained. For the first iteration the observed frequencies $f_{ij}^{(0)} = n_{ij}$ are used as estimates.
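The iteration of equation (5) is short enough to sketch in code. The function below (names and stopping rule are our own choices; the paper only requires "convergence to a desired accuracy") fills out a two-way table from complete counts plus partially observed margins:

```python
def em_two_way(n, row_mis, col_mis, tol=1e-9, max_iter=500):
    """EM iteration of eq. (5) for a two-way table.
    n[i][j]: fully observed cell counts; row_mis[i]: cases with only the row
    variable observed (n_{i?}); col_mis[j]: cases with only the column
    variable observed (n_{?j}). Returns the estimated cell frequencies."""
    f = [[float(x) for x in row] for row in n]             # f^{(0)} = n_{ij}
    for _ in range(max_iter):
        ri = [sum(row) for row in f]                       # f_{i+}
        cj = [sum(col) for col in zip(*f)]                 # f_{+j}
        new = [[n[i][j]
                + row_mis[i] * f[i][j] / ri[i]
                + col_mis[j] * f[i][j] / cj[j]
                for j in range(len(f[0]))] for i in range(len(f))]
        if max(abs(new[i][j] - f[i][j])
               for i in range(len(f)) for j in range(len(f[0]))) < tol:
            return new
        f = new
    return f
```

With counts [[6, 4], [3, 7]] and 10 cases observed only in row 1, the 10 incomplete cases are split 6:4 across that row, giving estimates 12 and 8; the estimates then remain at this fixed point.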
The counts of the incomplete cases $n_{i?}$ and $n_{?j}$ are proportionally distributed to the cells. The proportion is determined by the current conditional probability estimates $f_{ij}^{(v)}/f_{i+}^{(v)}$ and $f_{ij}^{(v)}/f_{+j}^{(v)}$, respectively. The procedure can, of course, be extended to more than two dimensions. The variance (precision) of the estimates is approximately ([7], equation 9.5, p. 180, modified for two variables)

$$\mathrm{var}(\hat{\pi}_{ij}) \approx \frac{\hat{\pi}_{ij}(1-\hat{\pi}_{ij})}{n} \left(1 - \frac{\hat{\pi}_{j|i}-\hat{\pi}_{ij}}{1-\hat{\pi}_{ij}}\,\frac{N_{i+}}{N}\right) \left(1 - \frac{\hat{\pi}_{i|j}-\hat{\pi}_{ij}}{1-\hat{\pi}_{ij}}\,\frac{N_{+j}}{N}\right) , \qquad (6)$$

where $N$ is the total number of complete and incomplete data, $N_{i+}$ the total number of data in row $i$, and $N_{+j}$ the total number of data in column $j$; $n$ is the number of complete data. The probability estimates are obtained from the corresponding final iterative frequency estimates, $\hat{\pi}_{ij} = f_{ij}/N$ and $\hat{\pi}_{i|j} = f_{ij}/(f_{1j} + f_{2j})$. If there are no missing data, then the variance is equivalent to the first factor. The second factor reduces the variance according to the additional observations in row $i$, and the third factor reduces the variance according to the additional observations in column $j$. We calculate the number of `virtual' observations

$$\hat{n}_{ij} = \frac{\hat{\pi}_{ij}(1-\hat{\pi}_{ij})}{\mathrm{var}(\hat{\pi}_{ij})} \, . \qquad (7)$$

Taking the average over all cells of the table gives the average number of virtual observations

$$\hat{n} = \frac{1}{IJ} \sum_i^I \sum_j^J \hat{n}_{ij} \, . \qquad (8)$$

From the average number of virtual observations we get a rough estimate of the number of hypothetical complete observations the incomplete observations are worth. We use this number to estimate the variance of the mutual information in the table. To find the probability density function for incomplete data, we replace in Proposition 1 the actual sample size $n$ by the virtual sample size $\hat{n}$. In the following section we use the probability density function of mutual information to control a stochastic simulation process. The probability that the mutual information provided by the nodes in the neighborhood of a focus node exceeds a given effect size becomes a scoring function for the extraction process.
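Equations (6)-(8) combine into a single computation of the virtual sample size. The sketch below is our own reading of the formulas; in particular we treat `row_extra[i]` and `col_extra[j]` as the supplementary (partially observed) counts $N_{i+}$ and $N_{+j}$ of equation (6), which makes the second and third factors equal to 1 when no data are missing:

```python
def virtual_sample_size(f, n, N, row_extra, col_extra):
    """Average virtual sample size n-hat from eqs. (6)-(8).
    f[i][j]: final EM frequency estimates (summing to N); n: number of
    complete cases; N: total cases; row_extra[i], col_extra[j]: additional
    partially observed counts (assumption: the N_{i+}, N_{+j} of eq. 6)."""
    I, J = len(f), len(f[0])
    row = [sum(f[i]) for i in range(I)]
    col = [sum(f[i][j] for i in range(I)) for j in range(J)]
    total = 0.0
    for i in range(I):
        for j in range(J):
            p = f[i][j] / N                  # pi-hat_{ij}
            p_ji = f[i][j] / row[i]          # pi-hat_{j|i}
            p_ij = f[i][j] / col[j]          # pi-hat_{i|j}
            var = (p * (1 - p) / n           # eq. (6)
                   * (1 - (p_ji - p) / (1 - p) * row_extra[i] / N)
                   * (1 - (p_ij - p) / (1 - p) * col_extra[j] / N))
            total += p * (1 - p) / var       # eq. (7)
    return total / (I * J)                   # eq. (8)
```

With no incomplete data (all extras zero, $N = n$) each $\hat{n}_{ij}$ reduces to $n$, so $\hat{n} = n$, as the text requires.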
4 STOCHASTIC SIMULATION

We conceive a Bayesian network as a random graph, more specifically, as a random directed acyclic graph (DAG). While the nodes of the graph are deterministically fixed, the edges are randomly switched on and off, although always respecting the properties of an acyclic graph. Gibbs sampling controlled by the distribution of mutual information is used to find the posterior probabilities of graph structures. For each pair of vertices $(Y, X)$ an edge is placed between $Y$ and $X$ with probability $P_{X|Y}$. Intuitively, we want the probability to be proportional to the amount (effect size) and precision of mutual information that $Y$ contributes to $X$ within the structure of the Bayesian network. The amount and precision of mutual information are expressed by the distribution of $\iota$.

In a Bayesian network the conditional probabilities of the states of node $X$ depend locally on the states of its neighbors and its neighbors only. The neighbors build the Markov shield of the focus node $X$. The Markov shield of $X$ consists of the set of parents $pa(X)$, children $ch(X)$, and parents of the children $pa(ch(X))$ of $X$. A variable occurs only once in the Markov shield, though it may simultaneously be a parent of $X$ and a parent of a child of $X$, etc. Moreover, the focus node is not itself a member of the Markov shield (though it clearly is a parent of its children). We compute the mutual information at $X$ given its neighbors. For each node $X$ in the network we iteratively set the probability that an edge $(Y, X)$ is placed between $Y$ and $X$ equal to the probability that $Y$ is a member of a Markov blanket of $X$ that contributes substantial and reliable information to $X$. In each iteration we first generate a random DAG. In the first iteration the prior probabilities that edges are switched on are provided by the user. It is reasonable to relate them to the number of variables and to the degree of connectedness one is expecting to hold for a given domain, and for a given effect size.
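The step "generate a random DAG with given arc probabilities" can be sketched as follows. Sampling arcs consistent with a random node ordering is our own simplification for guaranteeing acyclicity; the paper only requires that the sampled graphs respect the properties of an acyclic graph:

```python
import random

def sample_dag(p, rng):
    """Sample a random DAG as an adjacency matrix. p[i][j] is the current
    probability that the arc i -> j is switched on. Acyclicity is enforced by
    admitting only arcs that run forward in a random node order (a sketch,
    not the paper's exact mechanism)."""
    m = len(p)
    order = list(range(m))
    rng.shuffle(order)
    rank = {v: k for k, v in enumerate(order)}
    return [[1 if rank[i] < rank[j] and rng.random() < p[i][j] else 0
             for j in range(m)] for i in range(m)]
```

Every graph produced this way is acyclic by construction, so the Gibbs iteration can switch arcs on and off freely while the structure remains a DAG.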
There is the option of including `causal zeroes', that is, arcs for which the direction is known perfectly. High blood pressure, for example, may be caused by family history, but not vice versa. Afterwards the probabilities are adaptively updated in each iteration in the light of the data available in the current Markov shields. Updating is done as follows: For each node in the DAG a two-dimensional table is constructed. The rows correspond to the states of the focus node (the node currently analyzed), and the columns to the possible states of the neighbor nodes. The number of columns is thus exponential in the number of variables contained in the Markov shield. The frequency counts as obtained from the data file are put into the table. We usually add a count of 1 to each cell. This corresponds to a non-informative prior Dirichlet distribution and helps to avoid problems with zero entries when taking logarithms later on. In the case of incomplete data, EM is applied at this stage. Note that because of the two dimensions the two simple formulas (5) are sufficient. No iterations in higher dimensions are necessary. A few iterations will usually do the job. Mutual information is determined according to equation (2). Next, we compute the probability $P_X$ that the conditional mutual information of $X$ given its current Markov shield is larger than a minimum mutual information (MMI) $c$:

$$P_X = p(\iota > c \mid \text{Markov shield}) \, . \qquad (9)$$

We update the selection probabilities of each arc in exactly the neighborhood of $X$ that took part in the calculation of $P_X$. The selection probabilities are set equal to $P_X$. It is convenient to store the probabilities in a matrix analogous to an adjacency matrix of a graph. For small networks we repeat this procedure for, say, one thousand iterations to obtain convergence, and repeat it another one thousand times to count the frequencies with which various structures appear. The frequencies are standardized to obtain posterior probabilities for the various Bayesian network structures. For problems with many variables a listing of all possible Bayesian networks is infeasible. To select a single Bayesian network we should make intensive use of prior causal knowledge and hypotheses. Such knowledge is often available. We found it helpful to make use of the following heuristic: We build a `reward matrix'. Every time an arc is selected in the process with probability $P_{i,j}$ we add this value to a matrix analogous to the adjacency matrix of the graph. At the end of the iteration process the entries in the reward matrix are divided by the number of iterations. Because in each iteration an edge may be selected in several Markov shields, an average reward can be larger than 1.
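Evaluating equation (9) requires the tail probability of the scaled $\chi^2$ distribution of Proposition 1. A simple Monte Carlo sketch (the helper is our own; it uses the standard fact that $\chi^2(\nu, \omega)$ is a Gamma distribution with shape $\nu/2$ and scale $2\omega$):

```python
import random

def p_mmi_exceeded(I, n, c, draws=20000, seed=0):
    """Monte Carlo estimate of P(iota > c) under the chi^2(I+1, 1/n)
    approximation of Proposition 1. A scaled chi-square chi^2(nu, omega)
    equals a Gamma(shape=nu/2, scale=2*omega) distribution."""
    rng = random.Random(seed)
    df, scale = I + 1.0, 1.0 / n
    hits = sum(rng.gammavariate(df / 2.0, 2.0 * scale) > c
               for _ in range(draws))
    return hits / draws
```

For independent variables ($I = 0$) and $n = 100$, the mean of the approximating distribution is only 0.01, so the probability of exceeding an MMI of 1 is essentially zero; the corresponding arc would almost never be switched on.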
The arcs with high rewards are helpful for building further hypotheses about the structure of a `final' Bayesian network. The reward matrices are usually highly symmetrical with respect to their main diagonals and thus leave the direction of an arc undetermined. It is possible, though, to do further interactive experimental analyses with different structures. This can easily be done by prohibiting one direction of an arc, that is, by setting its prior probability to zero and thereby introducing a causal zero. The calculation of the posterior distribution of a set of Bayesian networks by Gibbs sampling is a self-organizing and data-driven process. But there is not just one optimal Bayesian network that fits the given data best. The operations of arc removal and arc reversal, e.g., can `legally' be applied to a Bayesian network. They change the structure but are probabilistically equivalent (this is true for complete data, but not for incomplete data). It is plausible that interactive control during the extraction process by human knowledge helps to select a final model, if that is what we want.
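The reward-matrix heuristic amounts to simple bookkeeping. The sketch below uses our own names; `selected` maps each arc chosen in one iteration to the probability $P_{i,j}$ with which it was selected:

```python
def update_rewards(reward, selected):
    """Add the selection probability of every arc chosen in one iteration to
    the reward matrix (the heuristic described in the text)."""
    for (i, j), p in selected.items():
        reward[i][j] += p
    return reward

def average_rewards(reward, iterations):
    """Divide accumulated rewards by the number of iterations. Because an
    edge may be selected in several Markov shields within one iteration,
    an average reward can exceed 1."""
    return [[r / iterations for r in row] for row in reward]
```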
5 EXAMPLES

Recently, Edwards ([2], p. 9) discussed the Florida murder data 1976-1977 originally published by Radelet (Table 2). There are three binary variables: the color of the victims (black, white), the color of the murderers (black, white), and the sentences (death or other). Note that (perhaps contrary to prejudice) in the marginal cross-classification murderer x sentence the death-sentence rate for white murderers (12.5%) is slightly higher than that for black murderers (11.4%).

                 A                ¬A
             B       ¬B       B       ¬B
  C          6       0        11      19
  ¬C         97      9        52      132

Table 2: Florida murder data 1976-1977; A = black victim, ¬A = white victim, B = black murderer, ¬B = white murderer, C = death sentence, ¬C = other sentence.

We analyzed the data at various MMI levels. For each analysis 2000 iterations were run. Probability estimates were obtained from the relative frequencies of the random graphs between iterations 1000 and 2000 (Table 3). Graphs with very small probabilities are not listed. For MMI → 0 the process oscillates between the six possible cliques. Each clique has approximately the same frequency. The graphs are maximally connected and have equal probabilities. Intuitively, this is a consequence of uncritically admitting arcs that in the limit contribute zero mutual information. For large MMIs the random graphs are unconnected with probability 1. No arcs are selected at all because there are no frequency tables in the data resulting in probabilities higher than the critical MMI. The interesting structures are found between these two extremes. For relatively small MMIs there are practically just two graphs, obtaining probabilities of about .5 each. They differ only by the direction of the arcs between A and B. They are probabilistically equivalent, and the random graphs oscillate between both structures. For relatively large MMIs the unconnected graph obtains a considerable probability, about .55 for MMI = .22. For minimum effect sizes around MMI = .17 the limiting structure oscillates between two one-arc graphs. In both, the sentence is independent of the color of the murderer and the victim. There is a strong relationship between the color of the murderers and victims. There is no graph with two arcs that obtains a reasonable posterior probability under any of the minimum effect sizes.

                             Minimum Mutual Information
  Arcs                  .02     .07     .12     .17     .22     .27
  Unconnected graph     .000    .000    .000    .042    .549    .910
  B→A, C→A, C→B         .427    .479    .493    .009    .000    .000
  A→B, C→A, C→B         .422    .514    .495    .003    .000    .000
  A→B, A→C              .017    .000    .003    .000    .000    .000
  B→A                   .000    .000    .000    .383    .196    .037
  A→B                   .000    .000    .000    .363    .202    .027
  A→B, A→C              .000    .000    .000    .067    .005    .000
  B→A, C→B              .000    .000    .000    .069    .002    .000

Table 3: Posterior probabilities for various graph structures and sizes of minimum mutual information for the Florida murder data 1976-1977; A = victim, B = murderer, C = sentence.

For each graph the posterior probability may be plotted as a function of the effect size. The resulting curve can show steep rises and falls. The clique (B → A, C → A, C → B), for example, has the probabilities .167 at MMI → 0, .321 at MMI = .01, .427 at MMI = .02, .479 at MMI = .07, .493 at MMI = .12, .212 at MMI = .13, .068 at MMI = .14, and .018 at MMI = .15.

We next analyze data from an investigation on coronary heart disease that has also recently been discussed by Edwards ([2], p. 24). The investigation included six binary variables: Smoker (A, yes/no), Mental Work (B, strenuous yes/no), Physical Work (C, strenuous yes/no), Systolic Blood Pressure (D, less than 140 yes/no), Lipoprotein Ratio (E, ratio of beta to alpha lipoproteins less than 3), and Family History (F, of coronary heart disease). The sample consisted of 1841 cases. The $2^6 = 64$ cell counts are

44, 40, 112, 67, 129, 145, 12, 23, 35, 12, 80, 33, 109, 67, 7, 9, 23, 32, 70, 66, 50, 80, 7, 13, 24, 25, 73, 57, 51, 63, 7, 16, 5, 7, 21, 9, 9, 17, 1, 4, 4, 3, 11, 8, 14, 17, 5, 2, 7, 3, 14, 14, 9, 16, 2, 3, 4, 0, 13, 11, 5, 14, 4, 4.

The last index of the list ABCDEF changes fastest. We add 1 to each cell. Edwards [2] selected the decomposable structure shown in Figure 1. He used the significance level .05 for edge removal in a loglinear model.
Variables: A = Smoker, B = Mental Work, C = Physical Work, D = Systolic BP, E = Lipoprotein Ratio, F = Family History.

[Table 4: 6 x 6 adjacency matrix over A-F containing the `causal zero' entries; the individual cells are not recoverable from the source.]

Table 4: `Causal zeroes' in the adjacency matrix of the coronary heart disease example; the direction of the arcs is from columns to rows; thus, high systolic blood pressure (D) does not lead to a certain family history (F), but the family history may lead to high or low blood pressure.

In this example we have considerable prior knowledge about the causal direction of edges. Smoking does not cause mental or physical work, and it has no influence on the family history of coronary diseases. The coronary variables have no influence on smoking behavior. We put the causal zeroes into the adjacency matrix shown in Table 4. For small effect sizes we obtain many highly connected networks, but all with small posterior probabilities. At MMI = .075 the two nine-arc structures shown in Figure 2 have the probabilities .056 and .078, but there are many other structures with probabilities around .01. As MMI increases, the two structures become more and more probable. At MMI = .13 they have about equal probabilities of .5 each. No other structure occurs within iterations 1000 to 2000. They function as an attractor. At about MMI = .1526 two slightly different structures appear containing five arcs only. They are extremely unstable, though, and from MMI = .157 on only structures with the two arcs E → D or D → E arise. As MMI further increases, the unconnected graph becomes more and more probable. For example, at MMI = .2 its probability is .667 and the probabilities of the two other structures are .169 and .164, respectively. For effect sizes larger than MMI = .24 the structure converges to a completely unconnected network.
Figure 1: Loglinear model of the coronary disease data, decomposable model ADE, ACE, BF, ABC (Edwards [2], p. 24).
Figure 2: Low effect size MMI = .075 generating two highly connected Bayesian network structures with posterior probability .056 (left) and .078 (right); only the arcs D → E and E → D are oscillating. At MMI = .13 both structures obtain the probabilities .528 and .472, respectively.
Figure 3: Two highly unstable structures at effect size MMI = .1526 with posterior probability close to .5 each.
Figure 4: Effect sizes between MMI = .16 and MMI = .19 generating structures with one arc only. For MMI > .2 the posterior probability for an unconnected graph is larger than .7.

6 CONCLUSION

We proposed a method for learning Bayesian networks from data under the control of the probability density function of mutual information. Usually there is not just one unique best graph that can be extracted from the data. We had better think in terms of a three-dimensional space: (i) the discrete set of possible graphs, listed in some order on the first axis; (ii) the minimum amount of mutual information MMI transmitted by the arcs, plotted along the second axis; and, finally, (iii) the posterior probability of the graphs at the various MMI values on the third axis. We obtain a landscape of posterior probabilities over the set of graphs and effect sizes. In this landscape there are various `basins' containing stable structures. We have, though, seen very steep and sudden changes in this landscape. Our approach may give a wider perspective for the analysis of learning networks in this problem space. Our procedure demonstrates a relation well known from inferential statistics: highly multivariate structures can only be extracted from data having enough observations. If $m$ is the number of (binary) variables and $N$ the number of observations, then we should have $N/2^m > 3$, or better $N/2^m > 5$. Systems working on the basis of point probabilities only, not watching the limited precision of the estimates, are chronically overoptimistic with respect to the complexity of the structures that can be extracted from data.
Acknowledgements

The authors were supported by the Aktion Austrian-Czech Exchange program. The first author was also supported by the Fonds zur Förderung der wissenschaftlichen Forschung, Grant P-09800-PHY.
References

[1] Buntine, W. (in press). A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering.
[2] Edwards, D. (1995). Introduction to Graphical Modelling. Springer, New York.
[3] Heckerman, D. (1996). A tutorial on learning Bayesian networks. Microsoft Corporation, [email protected].
[4] Hajek, P., Havranek, T., and Jirousek, R. (1992). Uncertain Information Processing in Expert Systems. CRC Press, Boca Raton, Florida.
[5] Heckerman, D., Geiger, D., and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20: 197-243.
[6] Kleiter, G. D. (in press). Propagating imprecise probabilities in Bayesian networks. Artificial Intelligence. http://www.edvz.sbg.ac.at/psy/people/kleiter/kleiter.htm
[7] Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.