Improved MDL Score for Learning of Bayesian Networks

Zheng Yun
BIRC, School of Comp. Eng.
Nanyang Technological University
Singapore 639798, +65-67906613
[email protected]

Kwoh Chee Keong
School of Comp. Eng.
Nanyang Technological University
Singapore 639798, +65-67906057
[email protected]

Abstract

In this paper, we propose two modifications to the original Minimum Description Length (MDL) score for learning Bayesian networks. The first modification is that the description of the network structure is shown to be unnecessary and can be omitted from the total MDL score. The second modification reduces the description length of the conditional probability table (CPT). In particular, if a variable is fully deterministic given its parents, i.e., the variable takes a certain value with probability one for some configurations of its parents, we show that only the configurations with probability one need to be retained in the CPT of the variable in the MDL score during the learning of Bayesian networks. We name the MDL score with these two modifications the Improved MDL score, or IMDL score for short. Experimental results on classic Bayesian networks, such as ALARM [2] and ASIA [17], show that the same search algorithm with the IMDL score identifies more reasonable and accurate models than those obtained with the original MDL score.

1. Introduction

Bayesian networks provide a concise and efficient representation of the joint probability distribution over a set of random variables. In recent years, there have been many efforts to learn Bayesian networks from data [3, 11, 14, 20, 22, 26]. Methods for learning Bayesian networks from data usually consist of two components. The first is a score that evaluates how well a learned model fits the data. The second is a learning algorithm that identifies one or more network structures with high scores by searching the space of possible network structures.

Several scores have been proposed for learning Bayesian networks, including the AIC score [1], the BIC score [25], the K2 score [4], the BDe score [13], the GU score [19] and the MDL score [16, 24, 27]. In particular, the BIC score is minus the MDL score [12]. Many learning algorithms have also been proposed. Since the number of possible network structures grows exponentially with the number of variables [8], most algorithms for learning Bayesian networks are heuristic search algorithms. Examples include the K2 algorithm [4], the Structural EM algorithm [5], the Sparse Candidate algorithm [9], the GES algorithm [18], the Hill-Climbing algorithm and the Simulated Annealing algorithm [15].

The main focus of this paper is to introduce two modifications to the MDL score. The first modification states that the description of the structure is not necessary and can be omitted from the total MDL score. It is based on the fact that the network structure G of a Bayesian network B = (G, Θ) is implicitly contained in Θ; in other words, if Θ is known, it is straightforward to obtain G. The second modification reduces the description length of the conditional probability table (CPT). In particular, if a variable is fully deterministic given its parents, i.e., the variable takes a certain value with probability one for all configurations of its parents, then only the configurations with probability one need to be retained in the CPT of the variable. That is to say, for deterministic relations, the only information that needs to be maintained is the set of deterministic configurations. The description length of Θ is thus reduced further whenever deterministic relations exist in B. We call the MDL score with the above two modifications the Improved MDL (IMDL) score. The experimental results show that the same search algorithm with the IMDL score identifies more reasonable and accurate models than those obtained with the original MDL score. We also compare the results with related work in which the MDL score is employed in the learning process.
2. Background

We introduce the notation first. We use capital letters to represent random variables, such as X and Y; lower case letters to represent instances of random variables, such as x and y; bold capital letters, like X, to represent a set of variables; and lower case bold letters, like x, to represent a configuration of X. The cardinality of a variable (or set) X is represented by |X|. For simplicity, we write P(X = x) as p(x), P(Y = y) as p(y), and so on.

Consider a finite set V = {X_1, ..., X_n} of discrete random variables, where each variable X_i may take values from a finite domain. A Bayesian network for V is a tuple B = (G, Θ). G is a directed acyclic graph (DAG) whose nodes are in one-to-one correspondence with the variables in V and whose edges encode the conditional dependences between the variables. In particular, X_i is independent of its non-descendants given its parents Π_i in G [21]. The second component Θ is the set of parameters that quantify the network. In particular, Θ = ∪_i Θ_i, where Θ_i is the conditional probability distribution (CPD) of node X_i. Since every X_i ∈ V is discrete, each CPD is a table, so a CPD is also called a conditional probability table (CPT). Θ_i consists of the conditional probabilities of X_i given every configuration of its parents. Let r_i be the number of states of X_i, let q_i = ∏_{X_l ∈ Π_i} r_l be the number of configurations of Π_i, and use the integer j to index the configurations of Π_i; then P(X_i = k | Π_i = j), written θ_ijk for short, denotes the probability that X_i takes its k-th value given the j-th configuration of Π_i. Therefore, Θ_i = ∪_j ∪_k θ_ijk. The Bayesian network B encodes the joint probability over V by

p_B(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \Pi_i, \Theta_i),    (1)

where Π_i is the set of parents of node X_i in G.
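To make Equation 1 concrete, here is a minimal Python sketch (our own illustration, not code from the paper or from BNJ) of a CPT-based network representation and of evaluating the joint probability of a complete assignment; the `BayesNet` layout and the toy node and state values are hypothetical.

```python
# A minimal sketch of Equation 1: each node stores its parent list and a CPT
# that maps a parent configuration to a distribution over the node's states.
from typing import Dict, List, Tuple

# Hypothetical layout: node -> (parent names, {parent config tuple: {state: prob}}).
BayesNet = Dict[str, Tuple[List[str], Dict[tuple, Dict[int, float]]]]

def joint_probability(bn: BayesNet, assignment: Dict[str, int]) -> float:
    """p_B(x_1, ..., x_n) = prod_i p(x_i | Pi_i) for a complete assignment."""
    p = 1.0
    for node, (parents, cpt) in bn.items():
        parent_config = tuple(assignment[pa] for pa in parents)
        p *= cpt[parent_config][assignment[node]]
    return p

# Toy two-node network A -> B, with two states for A and three for B.
bn: BayesNet = {
    "A": ([], {(): {0: 0.4, 1: 0.6}}),
    "B": (["A"], {(0,): {0: 1.0, 1: 0.0, 2: 0.0},
                  (1,): {0: 0.0, 1: 1.0, 2: 0.0}}),
}
print(joint_probability(bn, {"A": 1, "B": 1}))  # 0.6 * 1.0 = 0.6
```

Note that the parent lists stored alongside the CPTs already determine G, which is the observation behind the first modification discussed in section 3.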
The MDL principle [23] is based on the idea that the best model of a collection of data items is the model that minimizes the sum of (1) the encoding length of the model and (2) the encoding length of the data given the model. A model learned with the MDL principle achieves an optimal trade-off between accuracy and complexity, thus avoiding overfitting the data. When learning Bayesian networks from a set of learning instances D = {v_1, ..., v_N}, the model is the Bayesian network B = (G, Θ). A network B learned with the MDL score fits the probabilities of the instances in D well and thus yields an optimal encoding length of D. Formally, the MDL score of a candidate network B = (G, Θ) given D is derived as follows.

First, consider the encoding length of B. To describe B, three components need to be described: V, G and Θ. In the next sections, we will show that it is actually not necessary to encode G in the total description length of B.

V is a set of variables with individual cardinalities, so the information that needs to be maintained is \log n + \sum_{i=1}^{n} \log r_i, where \log n is for the number of variables and \sum_{i=1}^{n} \log r_i is for the cardinalities.

To describe G, an adjacency list for each variable is required. To describe such an adjacency list, the number of parents and the list of individual parents of each variable are sufficient. Since each of these items can be encoded with \log n bits, G can be described with \sum_{i=1}^{n} (1 + |\Pi_i|) \log n bits.

To describe Θ, all the θ_ijk's need to be encoded. To encode one θ_ijk, the common choice of description length in the literature is \frac{1}{2}\log N [12, 7, 6, 10]. For each variable X_i, there are q_i × r_i θ_ijk's; but since \sum_{k=1}^{r_i} \theta_{ijk} = 1 for every configuration j, only q_i × (r_i − 1) parameters must be maintained. Therefore, the description length of Θ is \frac{1}{2}\log N \sum_{i} q_i (r_i - 1).

By combining the description lengths of the above three components, the description length of B is

DL(B) = \log n + \sum_{i=1}^{n} \left( \log r_i + (1 + |\Pi_i|) \log n \right) + \frac{1}{2}\log N \sum_{i=1}^{n} q_i (r_i - 1).
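As an illustration of the formula above, the following short sketch (ours, not the paper's implementation; log is taken base 2 so that lengths are in bits, matching the units used in section 4, and Python 3.8+ is assumed for `math.prod`) computes DL(B) from the variable cardinalities and parent sets.

```python
import math
from typing import Dict, List

def dl_model(cardinality: Dict[str, int], parents: Dict[str, List[str]],
             n_samples: int) -> float:
    """DL(B) = log n + sum_i (log r_i + (1 + |Pi_i|) log n)
               + 1/2 log N * sum_i q_i (r_i - 1)."""
    n = len(cardinality)
    dl = math.log2(n)                                    # number of variables
    for x, r_i in cardinality.items():
        dl += math.log2(r_i)                             # cardinality of X_i
        dl += (1 + len(parents[x])) * math.log2(n)       # adjacency list of X_i
        q_i = math.prod(cardinality[pa] for pa in parents[x])  # parent configs
        dl += 0.5 * math.log2(n_samples) * q_i * (r_i - 1)     # CPT parameters
    return dl

# Example: the network A -> B with r_A = 2, r_B = 3 and N = 1000 samples.
print(dl_model({"A": 2, "B": 3}, {"A": [], "B": ["A"]}, n_samples=1000))
```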
Second, consider the description length of D given B. By encoding each case v_l with the probability p_B(v_l) defined by B, the total description length of D is optimal. Each case v_l is encoded with -\log p_B(v_l) bits, so the total description length of D is DL(D|B) = -\sum_{l=1}^{N} \log p_B(v_l) bits. Let \hat{p}_D(x_i, \Pi_i) denote the empirical frequency of the configuration (x_i, \Pi_i) in D, and \hat{p}_D(x_i \mid \Pi_i) the corresponding conditional frequency. From Equation 1, this measure can be rewritten as

-\sum_{l=1}^{N} \log p_B(v_l) = -N \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \hat{p}_D(x_i = k, \Pi_i = j) \log \theta_{ijk}.
By a standard argument [7], this measure is minimized by setting \theta_{ijk} = \hat{p}_D(x_i = k \mid \Pi_i = j). Hence, the measure DL(D|B) is

DL(D|B) = N \sum_{i=1}^{n} \hat{H}(X_i \mid \Pi_i),

where \hat{H}(X_i \mid \Pi_i) = -\sum_{j}\sum_{k} \hat{p}_D(x_i = k, \Pi_i = j) \log \hat{p}_D(x_i = k \mid \Pi_i = j) is the empirical conditional entropy of X_i given Π_i. Since H(X|Y) = H(X) - I(X; Y), DL(D|B) can be converted to

DL(D|B) = N \sum_{i} \left( H(X_i) - I(X_i; \Pi_i) \right).
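The data term can be computed directly from counts. The sketch below (ours; complete discrete data and log base 2 are assumed) evaluates N ∑_i Ĥ(X_i | Π_i) for a candidate parent assignment.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

def dl_data(data: Sequence[Dict[str, int]], parents: Dict[str, List[str]]) -> float:
    """Sum over nodes of N times the empirical conditional entropy of X_i given Pi_i."""
    total = 0.0
    for x, pa in parents.items():
        joint = Counter()      # counts of (parent configuration, value of x)
        marginal = Counter()   # counts of parent configuration
        for case in data:
            cfg = tuple(case[p] for p in pa)
            joint[(cfg, case[x])] += 1
            marginal[cfg] += 1
        # N * H_hat(X | Pi) = -sum_{j,k} N(j, k) * log( N(j, k) / N(j) )
        for (cfg, _), n_jk in joint.items():
            total += -n_jk * math.log2(n_jk / marginal[cfg])
    return total

# Toy dataset over the network A -> B.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 2}]
print(dl_data(data, {"A": [], "B": ["A"]}))  # 4 bits for A plus 2 bits for B|A
```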
Combining the results above, the final MDL score of a candidate network B given D is

Score_{MDL}(B, D) = \log n + \sum_{i=1}^{n} \left( \log r_i + (1 + |\Pi_i|) \log n \right) + \frac{1}{2}\log N \sum_{i=1}^{n} q_i (r_i - 1) - N \sum_{i=1}^{n} \left( H(X_i) - I(X_i; \Pi_i) \right).    (2)

Note that the terms \log n, \sum_i \log r_i and N \sum_i H(X_i) do not depend on the structure G and can therefore be eliminated from the score in implementations. To learn Bayesian networks with the MDL score, common methods use search algorithms such as those described in section 1 to find a structure that minimizes the MDL score defined in Equation 2.

3. The Improvements to the MDL score

The first improvement is to eliminate the description length of G from the MDL score. The underlying idea is that the network structure G is implicitly contained in the CPTs Θ: if Θ is known, it is straightforward to obtain G. Thus, there is no need to encode G when describing the candidate model B.

The second improvement is to reduce the description length of Θ in the MDL score. In particular, if a variable is fully deterministic given its parents, i.e., the variable takes a certain value with probability one for some configurations of its parents, we show that only the configurations with probability one need to be retained in the CPT of the variable in the MDL score during the learning of Bayesian networks. Accordingly, the original description length of Θ,

DL(\Theta) = \frac{1}{2}\log N \sum_{i=1}^{n} q_i (r_i - 1),

is reduced to

DL(\Theta) = \frac{1}{2}\log N \sum_{\theta_{ijk} \neq 1} q_i (r_i - 1) + \frac{1}{2}\log N \sum_{\theta_{ijk} = 1} q_i,

where the second term covers the parent configurations for which some θ_ijk equals one: for each such configuration, only the state taken with probability one needs to be recorded, at a cost of \frac{1}{2}\log N, while the first term covers the remaining, non-deterministic configurations.

Consider an example to elucidate the reduction of the description of Θ. Suppose node B in a Bayesian network has one parent, A. The CPT of B is given in Table 1.

A    B    p(b|a)
0    0    1
0    1    0
0    2    0
1    0    0
1    1    1
1    2    0

Table 1. The CPT of node B.

In the original MDL score, \frac{1}{2}\log N \cdot 2 \cdot (3 - 1) = 2 \log N bits would be needed to encode θ(B). In the IMDL score, only \log N bits are needed to describe θ(B).

From the results in section 2, we obtain

DL(B) = \log n + \sum_{i=1}^{n} \log r_i + \frac{1}{2}\log N \sum_{\theta_{ijk} \neq 1} q_i (r_i - 1) + \frac{1}{2}\log N \sum_{\theta_{ijk} = 1} q_i.

Therefore, the IMDL score is proposed as follows:

Score_{IMDL}(B, D) = \log n + \sum_{i=1}^{n} \log r_i + \frac{1}{2}\log N \sum_{\theta_{ijk} \neq 1} q_i (r_i - 1) + \frac{1}{2}\log N \sum_{\theta_{ijk} = 1} q_i - N \sum_{i=1}^{n} \left( H(X_i) - I(X_i; \Pi_i) \right).    (3)

As with the MDL score, the terms \log n, \sum_i \log r_i and N \sum_i H(X_i) do not depend on the structure G and can be eliminated from the score in implementations.

Friedman and Goldszmidt [7] pointed out that the BIC score is the negative of the MDL score when the description of G is ignored in the MDL score. In this sense, the IMDL score can also be regarded as an improvement to the BIC score, through the second improvement introduced in this paper. Since the original MDL score is not asymptotically equal to the posterior probability of the data given the model, there would not be a significant improvement in the convergence of learning procedures with the MDL score as the sample size increases. However, since the IMDL score is asymptotically equal to the posterior probability, the learning procedure will converge as the sample size increases.
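To illustrate the reduced parameter term of Equation 3, the following sketch (our reading of the formula, using the hypothetical CPT layout from the section 2 sketch and log base 2) charges (1/2) log N for each deterministic parent configuration and (r_i − 1) · (1/2) log N otherwise, reproducing the Table 1 example.

```python
import math
from typing import Dict

def dl_theta_imdl(cpts: Dict[str, Dict[tuple, Dict[int, float]]],
                  cardinality: Dict[str, int], n_samples: int) -> float:
    """Reduced description length of Theta: a deterministic parent
    configuration costs 1/2 log N; a general one costs (r_i - 1)/2 log N."""
    half_log_n = 0.5 * math.log2(n_samples)
    dl = 0.0
    for x, cpt in cpts.items():
        r_i = cardinality[x]
        for dist in cpt.values():                 # one entry per parent configuration
            if any(abs(p - 1.0) < 1e-12 for p in dist.values()):
                dl += half_log_n                  # deterministic: record the single state
            else:
                dl += (r_i - 1) * half_log_n      # general: r_i - 1 free parameters
    return dl

# Table 1: B has 3 states and is deterministic for both configurations of A.
cpt_b = {(0,): {0: 1.0, 1: 0.0, 2: 0.0}, (1,): {0: 0.0, 1: 1.0, 2: 0.0}}
print(dl_theta_imdl({"B": cpt_b}, {"B": 3}, n_samples=1000))  # ~9.97 bits = log2(1000)
# The original MDL term would be 2 * (3 - 1) * 0.5 * log2(1000), i.e. about 19.93 bits.
```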
4. Experiments and Results

The local search algorithm discussed in Heckerman et al. [13] is used as the graph search method, although other search methods are also applicable. The local search method begins with the empty graph. Then, in each step of the search, it applies the modification that yields the maximum improvement to the score of the model, until it reaches a locally optimal model, which is returned as the learned model. A schematic sketch of this procedure is given below.
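The sketch below is our own schematic version of such a greedy local search (not the BNJ implementation): it considers single-edge additions and deletions (a reversal arises as a deletion followed by an addition) and assumes a `score` callback that returns a value to be maximized, e.g. the negated MDL or IMDL score of a candidate parent assignment.

```python
from itertools import permutations
from typing import Callable, Dict, List, Set

def creates_cycle(parents: Dict[str, Set[str]], child: str, parent: str) -> bool:
    """True if adding the edge parent -> child would create a directed cycle."""
    stack, seen = [parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def local_search(variables: List[str],
                 score: Callable[[Dict[str, Set[str]]], float]) -> Dict[str, Set[str]]:
    """Greedy hill climbing over parent sets, starting from the empty graph."""
    parents: Dict[str, Set[str]] = {v: set() for v in variables}
    current = score(parents)
    improved = True
    while improved:
        improved, best_move, best_score = False, None, current
        for a, b in permutations(variables, 2):
            candidate = {v: set(ps) for v, ps in parents.items()}
            if a in candidate[b]:
                candidate[b].discard(a)               # try deleting a -> b
            elif not creates_cycle(parents, b, a):
                candidate[b].add(a)                   # try adding a -> b
            else:
                continue
            s = score(candidate)
            if s > best_score:
                best_move, best_score, improved = candidate, s, True
        if improved:
            parents, current = best_move, best_score
    return parents
```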
We use the API provided by the BNJ software (written in Java and released under the GNU General Public License; available at http://bnj.sourceforge.net/) to implement the IMDL score and the search method. The datasets for the ALARM [2] and ASIA [17] networks are used to validate the IMDL score in our experiments. The results are compared in two ways: the structure difference and the cross entropy (also called KL divergence) between the learned model and the original model.

The symmetric difference δ [13] between the learned model B and the original model P is defined as

\delta = \sum_{i=1}^{n} \delta_i,    (4)

where \delta_i = (\Pi_i(B) \cup \Pi_i(P)) \setminus (\Pi_i(B) \cap \Pi_i(P)) is the symmetric difference between the parent sets of X_i in the two models.
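A tiny helper (ours; parent sets are represented as Python sets and, as assumed here, the per-node symmetric differences are summed by cardinality) makes the measure concrete:

```python
from typing import Dict, Set

def structure_difference(learned: Dict[str, Set[str]],
                         original: Dict[str, Set[str]]) -> int:
    """delta = sum_i |(Pi_i(B) union Pi_i(P)) minus (Pi_i(B) intersect Pi_i(P))|."""
    return sum(len(learned[x] ^ original[x]) for x in original)

# Example: node "C" has one extra parent and one missing parent in the learned model.
original = {"A": set(), "B": {"A"}, "C": {"B"}}
learned  = {"A": set(), "B": {"A"}, "C": {"A"}}
print(structure_difference(learned, original))  # 2
```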
In detail, the numbers of extra edges (Addition), missing edges (Deletion) and reversed edges (Reversion) are also compared in our experiments.

The cross entropy measures the distance between two distributions. As discussed in section 2, Bayesian networks represent complex joint distributions, so the cross entropy can be used to measure the distance between the learned model and the original model. Let p(v) denote the joint distribution encoded by the original model and q(v) the joint distribution encoded by the learned model. Formally, the cross entropy is defined in Equation 5:

H(p, q) = \sum_{v} p(v) \log \frac{p(v)}{q(v)}.    (5)

The cross entropy between the joint distributions encoded by the two Bayesian networks is further derived by Heckerman et al. [13] as

H(p, q) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} p(x_i = k, \Pi_i = j) \log \frac{p(x_i = k \mid \Pi_i = j)}{q(x_i = k \mid \Pi_i = j)}.    (6)
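For small networks such as ASIA, Equation 6 can be evaluated by brute force. The sketch below (ours; it reuses the hypothetical CPT layout from the section 2 sketch and assumes log base 2) enumerates all complete assignments under the original model p.

```python
import math
from itertools import product
from typing import Dict, List, Tuple

# Same hypothetical CPT layout as the sketch in section 2:
# node -> (parent list, {parent configuration tuple: {state: probability}}).
BayesNet = Dict[str, Tuple[List[str], Dict[tuple, Dict[int, float]]]]

def cross_entropy(p_net: BayesNet, q_net: BayesNet,
                  cardinality: Dict[str, int]) -> float:
    """Brute-force H(p, q) over all complete assignments (Equations 5 and 6)."""
    names = list(cardinality)
    h = 0.0
    for values in product(*(range(cardinality[v]) for v in names)):
        case = dict(zip(names, values))
        # Conditional probability of each node under both models.
        p_conds, q_conds = [], []
        for x in names:
            p_parents, p_cpt = p_net[x]
            q_parents, q_cpt = q_net[x]
            p_conds.append(p_cpt[tuple(case[a] for a in p_parents)][case[x]])
            q_conds.append(q_cpt[tuple(case[a] for a in q_parents)][case[x]])
        p_v = math.prod(p_conds)                # p(v) via Equation 1
        if p_v == 0.0:
            continue
        if 0.0 in q_conds:
            return math.inf                     # q rules out a case that p allows
        # sum_i log p(x_i|Pi_i)/q(x_i|Pi_i) equals log p(v)/q(v), weighted by p(v).
        h += p_v * sum(math.log2(pc / qc) for pc, qc in zip(p_conds, q_conds))
    return h
```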
Model        ASIA       ALARM
MDL          -948.7     -170769.1
IMDL         -1067.7    -171617.2
Difference   119        848.1

Table 2. The MDL score and the IMDL score of the learned models for the ASIA network and the ALARM network. The unit is bit.

Model         C.E.   A.   D.   R.   δ
ASIA-MDL      9.5    2    1    1    5
ASIA-IMDL     4.1    0    1    0    1
ALARM-MDL     45.3   7    4    14   39
ALARM-IMDL    32.1   7    4    9    29

Table 3. The structure difference and cross entropy of the learned models for the ASIA network and the ALARM network.
We use the logic sampling algorithm provided in the BNJ software to generate samples from the original ASIA and ALARM networks. Then, we learn models from the obtained datasets. One dataset with 1000 samples for the ASIA network and one dataset with 10000 samples for the ALARM network are used in our experiments.

The encoding lengths of the ASIA network and the ALARM network learned with the MDL and IMDL scores are shown in Table 2. The scores in Table 2 are negative, since the \log n, \sum_i \log r_i and N \sum_i H(X_i) terms are omitted in our implementation. From Table 2, it can be seen that the IMDL score is better than the MDL score, since it encodes the learned model with a smaller description length. The more concise description of the model makes the learned model more accurate than the model learned with the MDL score, as shown in Table 3 and Figure 1.

The cross entropy and structure difference for the learned ASIA and ALARM networks are shown in Table 3, where C.E., A., D. and R. denote the cross entropy, the number of additions, the number of deletions and the number of reversions, respectively. From Table 3, it can be seen that for both the ASIA and the ALARM networks, the models learned with the IMDL score are more accurate than those learned with the MDL score in terms of both cross entropy and structure difference.

We discuss the results of the ASIA network further. The original ASIA network and the best learned ASIA networks in 100 models obtained with the MDL and IMDL scores are shown in Figure 1. From Figure 1 (b), it can be seen that two wrong edges, from node "Cancer" to node "Tuberculosis" and from node "XRay" to node "VisitAsia", are introduced in the model learned with the MDL score. Also, the edge from node "Tuberculosis" to node "TbOrca" is reversed in the model learned with the MDL score, and the edge from node "VisitAsia" to node "Tuberculosis" is missing in the ASIA model learned with the MDL score.

In Figure 1 (c), it can be seen that only the edge from node "VisitAsia" to node "Tuberculosis" is missing in the ASIA network obtained with the IMDL score. We also apply the K2 algorithm [4] to the same dataset of the ASIA network, but with an additional input of the topological order of the variables. The K2 algorithm also obtains the same model as shown in Figure 1 (c).
Figure 1. The results for the ASIA network. (a) The original ASIA network. (b) The best learned ASIA network in 100 models obtained with the MDL score. (c) The best learned ASIA network in 100 models obtained with the IMDL score.

5. Conclusion

In this paper, we propose two improvements to the MDL score for learning Bayesian networks. When applying the MDL score in learning Bayesian networks, the optimal model achieves the most concise encoding of the model and of the dataset given the model. The IMDL score further reduces the encoding length of the learned model without losing completeness in describing the model. Therefore, the models learned with the IMDL score are more accurate than those obtained with the MDL score, as shown in section 4.

In this paper, we use the local search algorithm to find optimal models. More search methods can be used in combination with the IMDL score to validate the usefulness of the IMDL score in the future.

References
[1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[2] I. Beinlich, H. Suermoudt, R. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pages 247–256, 1989.
[3] W. Buntine. A guide to the literature on learning graphical models. IEEE Transactions on Knowledge and Data Engineering, 8:195–210, 1996.
[4] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309, 1992.
[5] N. Friedman. The Bayesian structural EM algorithm. In Uncertainty in Artificial Intelligence: Proceedings of the Fourteenth Conference (UAI-1998), pages 129–138, San Francisco, CA, 1998. Morgan Kaufmann Publishers.
[6] N. Friedman and M. Goldszmidt. Discretizing continuous attributes while learning Bayesian networks. In International Conference on Machine Learning, pages 157–165, 1996.
[7] N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In Proceedings of the 12th Annual Conference on Uncertainty in Artificial Intelligence (UAI-96), pages 252–262, San Francisco, CA, 1996. Morgan Kaufmann Publishers.
[8] N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95–125, 2003.
[9] N. Friedman, I. Nachman, and D. Pe'er. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 206–215, San Francisco, CA, 1999. Morgan Kaufmann Publishers.
[10] N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In Proceedings of the 12th Annual Conference on Uncertainty in Artificial Intelligence (UAI-96), pages 274–282, San Francisco, CA, 1996. Morgan Kaufmann Publishers.
[11] C. Glymour and G. F. Cooper, editors. Computation, Causation, and Discovery. AAAI Press/The MIT Press, Menlo Park, CA and Cambridge, MA, 1999.
[12] D. Heckerman. A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Mar. 1995.
[13] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
[14] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, Massachusetts, 1998.
[15] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.
[16] W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10:269–293, 1994.
[17] S. Lauritzen and D. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157–224, 1988.
[18] C. Meek. Graphical Models: Selecting Causal and Statistical Models. PhD thesis, Carnegie Mellon University, 1997.
[19] K. Mehmet and C. Gregory. A Bayesian network scoring metric that is based on globally uniform parameter priors. In Proceedings of the 18th Annual Conference on Uncertainty in Artificial Intelligence (UAI-02), pages 251–258, San Francisco, CA, 2002. Morgan Kaufmann Publishers.
[20] R. E. Neapolitan. Learning Bayesian Networks. Prentice Hall, Upper Saddle River, NJ, 2003.
[21] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
[22] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK, 2000.
[23] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[24] J. Rissanen. Stochastic complexity and modeling. The Annals of Statistics, 14(3):1080–1100, 1986.
[25] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
[26] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.
[27] J. Suzuki. A construction of Bayesian networks from databases based on an MDL principle. In Proceedings of the 9th Annual Conference on Uncertainty in Artificial Intelligence (UAI-93), San Francisco, CA, 1993. Morgan Kaufmann Publishers.