Efficient Algorithms for Learning Simple Belief Networks

Luis M. de Campos, Juan F. Huete
Dpto. de Ciencias de la Computación e Inteligencia Artificial
E.T.S.I. Informática, Universidad de Granada, 18071 - Granada
[email protected], [email protected]

Abstract
Belief networks are graphical structures able to represent dependence and independence relationships among the variables in a given domain of knowledge. We focus on the problem of automatically learning these structures from data, and restrict our study to a specific type of belief networks: simple graphs. We develop an efficient algorithm that recovers simple graphs using only zero and first order conditional independence relationships, which overcomes some of the practical difficulties of the existing algorithms.
Keywords: Belief networks, independence relationships, simple graphs, learning algorithms.
1 Introduction.
Belief networks (also called Bayesian networks, causal networks or influence diagrams) allow us to represent our knowledge about a given domain by means of graphical structures, namely directed acyclic graphs, where nodes represent variables and the absence of some arcs represents the existence of certain conditional independence relationships between variables (if a causal interpretation is given, then the arcs signify the existence of direct causal influences between the linked variables). In this sense, belief networks are graphical representations of the so-called dependency models [5]. Once a complete belief network has been built, it constitutes an efficient device to perform inferences [4, 5]. However, there remains the previous problem of building such a network, i.e., of providing the graph structure and the numerical parameters necessary for characterising the network. An interesting task, then, is to develop methods able to learn the network directly from raw data, as an alternative or a complement to the (difficult and time-consuming) method of eliciting opinions from experts.
This work has been supported by the DGICYT under Project no. PB92-0939.
In this paper, we consider the particular case in which our knowledge about a given domain can be represented through a simple graph [1], i.e., a graph where every pair of nodes with a common direct child have no common ancestor, nor is either an ancestor of the other. Our purpose is to develop an efficient learning algorithm for simple graphs based on conditional independence tests. In the next section, we briefly introduce several preliminary concepts. In Section 3, we focus on simple graphs and analyze a learning algorithm for this kind of structure, proposed in [1]. Our algorithm for learning simple graphs is described in Section 4. Finally, Section 5 contains our concluding remarks and some proposals for future work.
2 Preliminaries.
A dependency model [5] is a pair $M = (U, I)$, where $U$ is a finite set of elements or variables, and $I(\cdot, \cdot \mid \cdot)$ is a rule that assigns truth values to a three-place predicate whose arguments are disjoint subsets of $U$. The intended interpretation of $I(X, Y \mid Z)$ (read: $X$ is independent of $Y$ given $Z$) is that, having observed $Z$, no additional information about $X$ could be obtained by also observing $Y$. For example, any probability distribution $P$ is a dependency model, where the predicate $I$ is defined through the concept of stochastic independence ($I(X, Y \mid Z) \Leftrightarrow P(x \mid z, y) = P(x \mid z)$ whenever $P(z, y) > 0$, for every instantiation $x, y, z$ of the subsets of variables $X$, $Y$, $Z$, respectively). However, dependency models are applicable to many situations far beyond probabilistic models.

Making use of information about dependencies and independencies is important for designing efficient reasoning systems: independence allows us to modularise the knowledge in such a way that we only need to consult the information relevant to the specific question we are interested in, instead of having to explore a complete knowledge base; thus, using independence is important for performing reasoning tasks efficiently in extensive and/or complex domains of knowledge. The question, therefore, is how to represent this kind of information. Graphs constitute an intuitive way of structuring the knowledge: the links in a graph (and the absence of links) allow us to represent dependence assertions (and independence ones, respectively) explicitly. So, it makes sense to use graphs to incorporate structural information about dependencies and independencies in a reasoning system.

The connection between belief networks and dependency models is established by means of the well-known d-separation criterion [5], which allows us to consider a belief network as a dependency model, i.e., it is a graphical definition of conditional independence:

d-separation: Given a directed acyclic graph (dag) $G$, a trail $C$ (a trail in a directed graph is a sequence of adjacent nodes, where the direction of the arrows does not matter) from node $x$ to node $y$ is said to be blocked by the set of nodes $Z$ if there is a vertex $z \in C$ such that either $z \in Z$ and the arrows of $C$ do not meet head-to-head at $z$, or $z \notin Z$, $z$ has no descendants in $Z$, and the arrows of $C$ do meet head-to-head at $z$. A trail that is not blocked by $Z$ is said to be active. Two subsets of nodes, $X$ and $Y$, are said to be d-separated by $Z$, denoted $I(X, Y \mid Z)_G$, if all trails between the
nodes in $X$ and the nodes in $Y$ are blocked by $Z$.

We can say that the graph structure represents the dependency model qualitatively. However, if we want to use it for decision or inference purposes, we must also give quantitative information about the strength of the dependencies displayed by the graph. So, quantitatively, the model is represented by means of a joint probability distribution for the variables, which can be factorized in the following way:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \pi(x_i))$$

where $x_i$ represents any assignment of values to the variable $x_i$, and $\pi(x_i)$ denotes any assignment of values to the variables in the set $\pi(x_i)$, which is the parent set of variable $x_i$ in the graph. In that case, the concept of probabilistic conditional independence between variables allows us to represent a joint probability distribution efficiently, by decomposing it as a product of conditional probability distributions, which quantify the strengths of the influences. It is therefore natural to expect that the concept of conditional independence can be used to develop algorithms for learning belief networks [6, 7].
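The d-separation criterion is purely graph-theoretic, so it can be checked mechanically. The following is a minimal Python sketch of such a test (ours, not taken from the references); it uses the well-known equivalent formulation via the moral graph of the ancestral set, and the {node: list-of-parents} encoding of the dag is an assumption of this sketch.

    from collections import deque
    from itertools import combinations

    def ancestors(parents, nodes):
        # All ancestors of `nodes`, including the nodes themselves.
        seen, stack = set(nodes), list(nodes)
        while stack:
            for p in parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    def d_separated(parents, xs, ys, zs):
        # True iff X and Y are d-separated by Z in the dag given as a
        # {node: list-of-parents} map. Uses the classical equivalence:
        # X and Y are d-separated by Z iff Z separates them in the
        # moral graph of the ancestral set of X, Y and Z.
        xs, ys, zs = set(xs), set(ys), set(zs)
        anc = ancestors(parents, xs | ys | zs)
        adj = {v: set() for v in anc}
        for v in anc:
            ps = [p for p in parents.get(v, ()) if p in anc]
            for p in ps:                      # keep parent-child links
                adj[v].add(p); adj[p].add(v)
            for p, q in combinations(ps, 2):  # "marry" co-parents
                adj[p].add(q); adj[q].add(p)
        frontier, seen = deque(xs), set(xs)   # BFS from X, avoiding Z
        while frontier:
            for w in adj[frontier.popleft()] - zs:
                if w in ys:
                    return False
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
        return True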
3 Simple Graphs.
Heckerman [2] introduced simple graphs as a model to represent the dependence relationships between a set of diseases $(e_1, e_2, \ldots, e_n)$ and a set of symptoms or tests $(p_1, p_2, \ldots, p_m)$, as Figure 1 shows. This represents (through the d-separation criterion) a relationship of marginal independence between diseases, and a conditional independence relationship between symptoms or tests given that we know which diseases are present and which are absent.
Figure 1: Simple graph representing diseases and symptoms.

Formally, a simple graph is a dag where every pair of nodes with a common direct child have no common ancestor, nor is one an ancestor of the other. Note that, by definition, no dag can contain directed cycles, but it may contain undirected ones (if we eliminate the directionality of the arrows). There exists another kind of belief network, called singly
connected networks (or polytrees), where even undirected cycles are forbidden. Simple graphs include polytrees as particular cases, but their representation capabilities are much greater.

Given a dag $G$, a subgraph of $G$ of the form $x \rightarrow z \leftarrow y$ is called a head-to-head pattern, and the node $z$ is then a head-to-head node. A simple graph can be graphically characterized in a very simple way, as a dag whose undirected cycles (if any) are all of a special form: they contain at least two head-to-head nodes. Observe that in any dag, every cycle must contain at least one head-to-head node. So, a simple graph may contain (undirected) cycles, but it cannot contain cycles having only one head-to-head node. Figure 2 shows examples of non-simple and simple graphs.
Figure 2: Examples of non-simple and simple graphs.

In any case, in our opinion, the name simple graph (coined by Geiger et al. in [1]) is misleading, since it suggests a straightforward structure. However, simple graphs may be quite complex, such as the one displayed in Figure 5. Moreover, simple graphs can represent conditional independence relationships of any order.

We now describe an algorithm for learning simple graphs, proposed in [1]. In the general case, the problem of learning belief networks is quite difficult, and the existing algorithms (particularly those based on detecting independence relationships between variables) need an amount of time which is exponential in the number of variables. The constrained structure of simple graphs permitted Geiger et al. to develop a more efficient algorithm, depicted in Figure 3; proofs of its correctness can be found in [1]. If the underlying belief network is simple, the algorithm outputs the correct graph (or a graph isomorphic to it); otherwise, the algorithm gives out an error code as the output. Isomorphism [6] represents a theoretical limitation on the ability to identify the directionality of some links using information about independencies; for example, the following three graphs reflect the fact that $x$ and $y$ are marginally dependent, but become conditionally independent given $z$: $x \leftarrow z \leftarrow y$; $x \rightarrow z \rightarrow y$; $x \leftarrow z \rightarrow y$.

Theoretically, this algorithm is easy to understand, and its complexity is quadratic, because it makes at most two independence tests for each pair of variables (steps 2 and 3). However, from a more pragmatic point of view, this algorithm has a problem: for each pair of variables, it needs a conditional independence test of order $n - 2$, where $n$ is the number of variables in the model. So, when we have to learn the graph from statistical data, testing conditional independence statements of order $n - 2$ requires an amount of time which is exponential in $n$. Therefore, although the algorithm is polynomial (of order $O(n^2)$) in the number of independence tests, the overall complexity is still exponential. Moreover, reliably testing high order conditional independence statements requires an enormous amount of data. Hence, this algorithm would be appropriate if the conditional independence tests were inexpensive; this is the case, for example, if we can obtain the answers to the tests by asking an expert.
Geiger, Paz and Pearl's Algorithm
1. Start with a complete undirected graph $G$.
2. Remove every edge $x$-$y$ for which $I(x, y \mid U \setminus \{x, y\})$ holds.
3. Remove every edge $x$-$y$ for which $I(x, y \mid \emptyset)$ holds.
4. Orient every pair of edges $x$-$y$ and $y$-$z$ towards $y$ whenever $x$-$y$-$z$ is in the graph and $I(x, z \mid \emptyset)$ holds.
5. Orient the remaining links without introducing new head-to-head patterns and such that the resultant dag is simple. If no such orientation is feasible, then return `fail'; otherwise, output the resultant network.

Figure 3: Algorithm for learning simple graphs.
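For concreteness, here is a minimal Python sketch of steps 1-4, parameterised by an independence oracle; the oracle interface indep(x, y, z), the frozenset convention for conditioning sets, and the omission of the orientation step 5 are all choices of this sketch rather than part of [1].

    from itertools import combinations

    def gpp_skeleton(variables, indep):
        # Steps 1-4 of Geiger, Paz and Pearl's algorithm. `indep(x, y, z)`
        # is an assumed independence oracle returning the truth value of
        # I(x, y | z), with z a frozenset of conditioning variables.
        u = set(variables)
        edges = {frozenset(p) for p in combinations(u, 2)}         # step 1
        edges = {e for e in edges                                  # step 2
                 if not indep(*sorted(e), frozenset(u - e))}
        edges = {e for e in edges                                  # step 3
                 if not indep(*sorted(e), frozenset())}
        arcs = set()                                               # step 4
        for x, z in combinations(sorted(u), 2):
            if frozenset({x, z}) in edges:
                continue
            for y in u - {x, z}:
                if (frozenset({x, y}) in edges and
                        frozenset({z, y}) in edges and
                        indep(x, z, frozenset())):
                    arcs.add((x, y))
                    arcs.add((z, y))
        return edges, arcs  # step 5 (final orientation) is omitted here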
4 A New Algorithm for Learning Simple Graphs.
In this section, we propose an algorithm that reduces to the minimum (which may be zero) the number of necessary conditional independence tests of order greater than one, and is still polynomial in the number of tests. When the underlying belief network is simple, our algorithm recovers the graph (up to isomorphism) using only zero and first order conditional independence tests (in this case, each test can be made in polynomial time). Otherwise, the algorithm gives out an error code as the output (and in this case some conditional independence tests of order greater than one may be necessary).

The development of this algorithm is motivated by the following question: does an independence relationship exist between the nodes connected directly to a node $x$ that identifies whether a dag is simple or not? The following result [3] gives a positive answer: let $G$ be any dag; then, $G$ is simple if and only if for every node $x \in U$ the relationship $I(p_x, c_x \mid x)$ holds for all $p_x \in Parents_x$ and $c_x \in Children_x$, where $Parents_x$ and $Children_x$ are the sets of parents and children of node $x$ in $G$.

From this result, we posed the following question: given a hidden simple graph $G$, could we recover this graph using as input a list of marginal and first-order conditional independence relationships obtained from $G$ (through the d-separation criterion)? Again, the answer is positive. The algorithm given in Figure 4 achieves this objective. The idea is to search, for each variable $x$, for the set of nodes connected directly to $x$, and then fuse these elements to build the graph. To do so, first, for each node $x$, we select, as the subset of nodes adjacent to $x$, those variables that have neither zero nor first order conditional independence relationships with $x$ (steps 1.1 and 1.2); second, we eliminate those variables that have a conditional independence relationship with $x$ of order greater than one (step 1.3).
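The graphical side of this characterization is also mechanically checkable, which is useful for validating outputs. Below is a minimal sketch (our own illustration, again using the {node: list-of-parents} convention) that tests the defining property of simple graphs directly; the independence characterization above is its oracle-based counterpart.

    from itertools import combinations

    def strict_ancestors(parents, v):
        # All strict ancestors of v in a dag given as {node: list-of-parents}.
        seen, stack = set(), list(parents.get(v, ()))
        while stack:
            a = stack.pop()
            if a not in seen:
                seen.add(a)
                stack.extend(parents.get(a, ()))
        return seen

    def is_simple(parents):
        # Direct test of the graphical definition: every two parents of a
        # common child must have no common ancestor, and neither may be an
        # ancestor of the other. Assumes every node appears as a key.
        anc = {v: strict_ancestors(parents, v) for v in parents}
        for child in parents:
            for p, q in combinations(parents[child], 2):
                if p in anc[q] or q in anc[p] or (anc[p] & anc[q]):
                    return False
        return True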
The sets of nodes $\Delta_x$, $\Lambda_x$, $K_x(y)$, $\Gamma_x(y)$ and $\Gamma^*_x(y)$ that appear in the algorithm can be calculated as the following equations indicate:

$\Delta_x$: the set of variables marginally dependent on $x$, i.e.,
$\Delta_x = \{y \in U \mid \neg I(x, y \mid \emptyset)\}$.

$\Lambda_x$: the set of variables having neither zero nor first order conditional independence relationships with $x$, i.e.,
$\Lambda_x = \{y \in \Delta_x \mid \neg I(x, y \mid z) \ \forall z \in U \setminus \{x, y\}\}$.

$K_x(y)$: the initial set of candidate nodes,
$K_x(y) = \{w \in \Lambda_x \setminus \{y\} \mid \neg I(y, w \mid x)\}$.

$\Gamma_x(y)$: the set of candidate nodes,
$\Gamma_x(y) = \{w_i \in K_x(y) \mid \exists w_j \in K_x(y) \text{ satisfying } 1)\ I(w_i, w_j \mid \emptyset) \text{ and } 2)\ \neg I(w_i, w_j \mid y)\}$.

$\Gamma^*_x(y)$: constructed by refining the set of candidate nodes $\Gamma_x(y)$ through
$\Gamma^*_x(y) = \Gamma_x(y) \setminus \{w_i \mid \text{either a) } \exists z \in \Delta_y, z \in \Delta_{w_i}, \text{ such that } I(z, x \mid \emptyset) \text{ and } \neg I(z, x \mid y); \text{ or b) } \exists z \in \Delta_y, z \in \Delta_{w_i}, \text{ such that } \neg I(z, x \mid \emptyset), \ I(z, x \mid y) \text{ and } \neg I(z, w_i \mid y)\}$.
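Steps 1.1-1.3.2 translate almost literally into code. A minimal sketch (ours), again assuming an independence oracle indep(x, y, z) with frozenset conditioning sets; the refinement producing $\Gamma^*_x(y)$ applies conditions a) and b) in the same style and is omitted for brevity.

    def delta(x, variables, indep):
        # Delta_x: variables marginally dependent on x.
        return {y for y in variables
                if y != x and not indep(x, y, frozenset())}

    def lam(x, variables, indep):
        # Lambda_x: no zero or first order independence with x.
        return {y for y in delta(x, variables, indep)
                if not any(indep(x, y, frozenset({z}))
                           for z in variables - {x, y})}

    def k_set(x, y, lam_x, indep):
        # K_x(y): initial candidate separators.
        return {w for w in lam_x - {y} if not indep(y, w, frozenset({x}))}

    def gamma(x, y, lam_x, indep):
        # Gamma_x(y): candidates w_i with a witness w_j in K_x(y) that is
        # marginally independent of w_i but dependent on it given y.
        k = k_set(x, y, lam_x, indep)
        return {wi for wi in k
                if any(indep(wi, wj, frozenset()) and
                       not indep(wi, wj, frozenset({y}))
                       for wj in k - {wi})}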
Let us now explain how and why the algorithm works (for more details, see [3]). First, let us fix some terms: a trail between two nodes $x$ and $y$ in a dag $G$ is said to be simple if no node in the trail is head-to-head; an (undirected) cycle in a dag $G$ is said to be a simple cycle if it contains at least two head-to-head nodes. If the cycle contains exactly two head-to-head nodes, then it is called an active simple cycle; in that case, the two head-to-head nodes are named the closing nodes of the cycle.

First, whenever two nodes $x$ and $y$ are marginally dependent ($x \in \Delta_y$ and $y \in \Delta_x$), there exists a simple trail linking them in the dag. Moreover, the simple trails in a dag can be classified as follows:

$HT(x, y)$: those simple trails linking $x$ and $y$ with a head connection at $x$ and a tail connection at $y$, i.e., directed paths from $y$ to $x$.

$TH(x, y)$: those simple trails linking $x$ and $y$ with a tail connection at $x$ and a head connection at $y$, i.e., directed paths from $x$ to $y$.

$HH(x, y)$: those simple trails linking $x$ and $y$ with a head connection at both $x$ and $y$. In these trails, we can find a node $z$ such that there are directed subpaths from $z$ to $x$ and from $z$ to $y$.
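This endpoint classification is purely mechanical. As an illustration (ours, with a hypothetical (tail, head) arc encoding), a simple trail can be classified as follows; note that a tail connection at both ends is impossible for a simple trail, since the arrow directions would then have to reverse at some interior head-to-head node.

    def classify_trail(trail, arcs):
        # Classify a simple trail [v0, ..., vk] by its connections at the
        # two endpoints. `arcs` is a set of directed (tail, head) pairs.
        head_at_x = (trail[1], trail[0]) in arcs    # arrow points into v0
        head_at_y = (trail[-2], trail[-1]) in arcs  # arrow points into vk
        if head_at_x and head_at_y:
            return 'HH'
        if head_at_x:
            return 'HT'
        if head_at_y:
            return 'TH'
        raise ValueError('tail-tail: not a simple trail in a dag')

    # e.g. classify_trail(['x', 'z', 'y'], {('z', 'x'), ('z', 'y')}) -> 'HH'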
CH Algorithm
1. For each $x$ in $U$ do:
 1.1. Calculate $\Delta_x$.
 1.2. Calculate $\Lambda_x$.
 1.3. For each $y \in \Lambda_x$ do:
  1.3.1. Calculate $K_x(y)$. If $K_x(y) = \emptyset$, go to step 1.3.
  1.3.2. Calculate $\Gamma_x(y)$. If $\Gamma_x(y) = \emptyset$, go to step 1.3.
  1.3.3. Calculate $\Gamma^*_x(y)$. If $\Gamma^*_x(y) \neq \emptyset$, then exclude $y$ from $\Lambda_x$.
 1.4. For each pair $y, z \in \Lambda_x$: if $I(y, z \mid \emptyset)$ holds, put the nodes $y, z$ as parents of $x$.
2. Fuse every $\Lambda_x$ to obtain $G$.
3. Direct the remaining links without introducing new head-to-head connections.

Figure 4: New algorithm for recovering simple graphs.

It can be seen [3] that in a simple graph $G$, whenever there is more than one simple trail linking two nodes $x$ and $y$, all these trails are necessarily of type $HH(x, y)$. Note that each pair of these trails forms a simple cycle, in fact an active simple cycle. When we are considering a simple graph, if there exists an active simple cycle with closing variables $x$ and $y$, then there are neither zero nor first order conditional independence relationships between these variables. Obviously, these relationships do not hold either if a direct connection between $x$ and $y$ exists. Moreover, these two situations (an active simple cycle with closing variables $x$ and $y$, and a direct connection between $x$ and $y$) are the only cases where conditional independence relationships of order zero or one do not exist. This fact permits us to ensure that, after executing step 1.2 of the algorithm, the set $\Lambda_x$ will include, in addition to the parents and children of $x$, only those nodes closing an active simple cycle with $x$.

Moreover, given two variables $x$ and $y$ in a simple graph $G$ such that there are at least two $HH(x, y)$ simple trails, we have $I(x, y \mid S)$, where $S$ is the set of nodes which are the parents of $x$ in any trail $HH(x, y)$. In this case, the node $y$ can be excluded from the set of nodes connected directly to $x$. So, to find the correct set of nodes adjacent to $x$, it is necessary to have a method to detect those variables $y \in \Lambda_x$ such that there are at least two simple trails of type $HH(x, y)$. We successively refine an initial set of candidate nodes, $K_x(y)$ (obtained from $\Lambda_x$), to find a set of nodes that separates $x$ and $y$ (steps 1.3.1, 1.3.2 and 1.3.3 in the algorithm). If one of these sets of nodes, $K_x(y)$, $\Gamma_x(y)$ or $\Gamma^*_x(y)$, becomes empty, this means that there is no set of nodes capable of separating $y$ from $x$. However, if at the end of this refining process the final set
$\Gamma^*_x(y)$ is not empty, this means that we can find a set of nodes separating $y$ from $x$, and hence the link between $y$ and $x$ is eliminated. This explains step 1.3 of the algorithm. Finally, in step 1.4, the nodes in $\Lambda_x$ which are parents of $x$ can be distinguished from the children by testing marginal independence relationships. Under the assumption that the graph is simple, the algorithm gives as output a simple graph with the same skeleton and the same head-to-head connections as the original, i.e., a simple graph isomorphic to the original (for a proof, see [3]).

Let us now discuss how efficient the recovery algorithm is:
- The recovery algorithm can be executed in part locally, independently for each variable: only the transition from $\Gamma_x(y)$ to $\Gamma^*_x(y)$ cannot be made in a purely local way. This makes the use of distributed computation possible.
- The algorithm needs a polynomial number of independence tests.
- The independence tests needed are marginal independence tests and conditional independence tests given only one variable. Therefore, we can calculate the results of the tests in polynomial time.
- The algorithm recovers a simple graph in polynomial time.

The following example shows how the recovery algorithm works. We use as hidden model the simple graph in Figure 5, where the set of variables is $U = \{1, \ldots, 18\}$, and we study how the algorithm constructs the set of nodes adjacent to node 12.
Figure 5: A simple graph which is not so `simple'.

We find that $\Delta_{12} = U \setminus \{2, 13, 14, 15\}$ and $\Lambda_{12} = \{4, 5, 6, 9, 10, 11, 16\}$. We obtain the sets $\Gamma_{12}(4) = \{5, 6, 9, 10, 11\}$, $\Gamma_{12}(5) = \{9, 10\}$, $\Gamma_{12}(6) = \{9, 11\}$ and $\Gamma_{12}(9) = \{5, 6\}$, with the other sets $\Gamma_{12}(\cdot)$ being empty. Again, we find that the sets $\Gamma^*_{12}(\cdot)$ are all non-empty, except for $\Gamma^*_{12}(9)$, in which nodes 7 and 8 verify the conditions that exclude the nodes 5 and 6, respectively. Finally, we find that $\Lambda_{12} = \{9, 10, 11, 16\}$.

When we use the algorithm given in Figure 4, it is assumed that the underlying dag is in fact a simple graph. In this case, the algorithm gives out a simple graph as the output (and does so efficiently). The independence tests can be obtained either from empirical data or from expert judgments, or a combination thereof. The problem arises when we do not know whether or not the underlying dag is simple. In this case, the algorithm can be modified in such a way that it gives out an error code as the output if the dag is not simple. There are two possibilities that must be checked:
A.- The output of the algorithm is not a simple graph. If the true dag were simple, then all the head-to-head patterns would have been detected in step 1.4. So, if a new head-to-head pattern is obtained when directing the remaining links in step 3, then we would have a marginal independence relationship that is not true, and we can give out an error code as the output. But it may be possible to direct the remaining links without including new head-to-head patterns. In that case, we must test whether the resultant graph is simple, i.e., whether no two parents of any node $x$ are connected by a simple trail.
B.- The output of the algorithm is a simple graph. If the true dag is not simple, then some link has been incorrectly eliminated by the algorithm. Looking at the algorithm, we see that links are eliminated in steps 1.2 and 1.3. In the first case, the independence relationships are tested explicitly, so the arcs are eliminated correctly. In step 1.3, however, a link is eliminated under the assumption that we are considering a simple graph, and the problem arises from this assumption. So, in step 1.3.3, when there is a set $\Gamma^*_x(y) \neq \emptyset$, we must test the relationship $I(x, y \mid \Gamma^*_x(y))$ before eliminating the arc; if this relationship does not hold, we can give out an error code as the output. In this process we need to use higher order independence tests, but we can delay these tests until the end of the algorithm.

The modified algorithm is shown in Figure 6. Alternative A is checked in steps 3 and 4, whereas alternative B is tested in step 5.

MCH Algorithm
1. For each $x$ in $U$ do:
 1.1. Calculate $\Delta_x$.
 1.2. Calculate $\Lambda_x$.
 1.3. For each $y \in \Lambda_x$ do:
  1.3.1. Calculate $K_x(y)$. If $K_x(y) = \emptyset$, go to step 1.3.
  1.3.2. Calculate $\Gamma_x(y)$. If $\Gamma_x(y) = \emptyset$, go to step 1.3.
  1.3.3. Calculate $\Gamma^*_x(y)$. If $\Gamma^*_x(y) \neq \emptyset$, exclude $y$ from $\Lambda_x$.
 1.4. For each pair $y, z \in \Lambda_x$: if $I(y, z \mid \emptyset)$ holds, put the nodes $y, z$ as parents of $x$.
2. Fuse every $\Lambda_x$ to obtain $G$.
3. Direct the remaining links without introducing new head-to-head connections. If this orientation is not feasible, return `fail'.
4. Test whether the output graph is simple. If it is not simple, return `fail'.
5. For each $\Gamma^*_x(y) \neq \emptyset$: if $I(x, y \mid \Gamma^*_x(y) \cap Parents_x)$ does not hold, return `fail'.

Figure 6: Modified algorithm.
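Finally, to sanity-check an implementation of these procedures, the d-separation sketch from Section 2 can serve as the independence oracle over a known hidden graph. The toy wiring below (ours, reusing the d_separated and lam sketches above) recovers the neighbourhood of a head-to-head node and identifies its parents via step 1.4.

    from itertools import combinations

    # A toy "hidden" simple graph: a -> c <- b, and c -> d.
    hidden = {'a': [], 'b': [], 'c': ['a', 'b'], 'd': ['c']}

    def oracle(x, y, z):
        # d-separation in the hidden dag stands in for statistical tests
        return d_separated(hidden, {x}, {y}, set(z))

    U = set(hidden)
    adj_c = lam('c', U, oracle)                # step 1.2: {'a', 'b', 'd'}

    parents_c = set()                          # step 1.4
    for y, z in combinations(sorted(adj_c), 2):
        if oracle(y, z, frozenset()):          # marginally independent pair
            parents_c |= {y, z}                # -> head-to-head pattern at c
    print(parents_c)                           # {'a', 'b'}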
5 Conclusions and Future Work.
In this paper, we have presented an efficient algorithm for recovering simple graphs. When the hidden graph is in fact a simple graph, the algorithm recovers a graph isomorphic to it, using only zero and first order conditional independence tests. If the hidden graph is not simple, the algorithm detects this fact and gives out an error code as the output. We note that, in any case, the algorithm needs only a polynomial number of independence tests, some of them of order greater than one.

In future work, we plan to study the properties of the simple graphs that can be obtained without carrying out step 5 of the algorithm, and how well these graphs approximate a non-simple dag. Another direction of research is to use the ideas developed here to design
algorithms that would allow us to recover more general graphs, in which some other kinds of cycles might be allowed.
References
[1] D. Geiger, A. Paz, and J. Pearl. Learning simple causal structures. International Journal of Intelligent Systems, 8:231-247, 1993.
[2] D. Heckerman. A tractable inference algorithm for diagnosing multiple diseases. In R.D. Shachter, T.S. Levitt, L.N. Kanal, and J.F. Lemmer, editors, Uncertainty in Artificial Intelligence 5, pages 163-171. Elsevier Science Publishers B.V., North Holland, 1990.
[3] J.F. Huete. Aprendizaje de redes de creencia mediante la detección de independencias: modelos no probabilísticos. Ph.D. Thesis, Universidad de Granada, 1995.
[4] S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society (Series B), 50:157-224, 1988.
[5] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.
[6] J. Pearl and T. Verma. A theory of inferred causation. In J.A. Allen, R. Fikes, and E. Sandewall, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, pages 441-452. Morgan Kaufmann, San Mateo, 1991.
[7] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Lecture Notes in Statistics 81. Springer-Verlag, New York, 1993.