Log-linear modelling for contingency tables by using marginal model structures

Sung-Ho Kim
Division of Applied Mathematics, Korea Advanced Institute of Science and Technology, Daejeon, 305-701, South Korea. e-mail: [email protected]

Abstract

An approach to log-linear modelling for a large contingency table is proposed in this paper. A main idea in this approach is that we group the random variables that are involved in the data into several subsets of variables with corresponding marginal contingency tables, build graphical log-linear models for the marginal tables, and then combine the marginal models using graphs of prime separators (Section 2). When the true log-linear model for the whole table is decomposable, prime separators in a marginal model are also prime separators in a maximal combined model of the marginal models. This property plays a key role in model-combination. The result of the paper is applied to a simulated data set of 40 binary variables for illustration.

Key words: combined model structure, decomposable graph, edge-subgraph, graph-separateness, interaction graph, Markovian subgraph, prime separator.
1 Introduction
Consider a problem of log-linear modelling of a set of categorical variables which is too large for a computing machine to handle as a single model. Since the computing time for modelling increases exponentially in the number of variables, a large number of variables means a heavy burden both in computing time and in computing capacity. We address this issue of large-scale modelling in this paper. Fienberg and Kim (1999) considered a problem of combining conditional graphical log-linear structures and derived a combining rule for them based on the relation between the log-linear model and its conditional version. A main feature of the relation is that conditional log-linear structures appear as parts of their original model structure [see Theorems 3 and 4 therein]. The relationship becomes more explicit when the distribution is multivariate normal. Let X be a normal random vector. The precision matrix of the conditional distribution of a subvector X1 given the remaining part of X is the same as the X1 part of the precision matrix of X [Section 5.7, Whittaker (1990)]. Marginals of a joint probability distribution are not in general represented as parts of the joint distribution. However, there is a way to express explicitly the relationship between joint and marginal distributions under the assumption that the joint (as opposed to marginal) probability model is graphical and decomposable [Kim (2004)]. Suppose that we are given a pair (call it Pair-1) of simple graphical models, where one model is of the random variables X1, X2, X3 with the inter-relationship that X1 is independent of X3 conditional on X2, and the other is of X1, X2, X4 with the inter-relationship that X1 is independent of X4 conditional on X2.
This work was supported by Korea Research Foundation Grant (KRF-2003-015-C00096).
Figure 1: Two marginal models

From this pair, we can imagine a model structure for the four variables X1, · · · , X4. The two inter-relationships are pictured in Figure 1. We will use the notation [·] · · · [·] as used in Fienberg (1980) to represent a model. The left graph is of the model [12][23] and the right one is of the model [12][24]. X1 and X2 are shared by both models, and assuming that none of the four variables is marginally independent of the others, the following models are possible corresponding to Pair-1: [12][24][23], [12][24][34], [12][23][34], [12][234].
(1)
Note that we can obtain the pair in Figure 1 from each of these models and that, among these four models, the first three are submodels of the last one. We consider another pair (call it Pair-2) of simple marginals, [12][23] and [24][25], where only one variable is shared. In this case, we have a longer list of combined models as follows: [12][24][23][25], [124][23][25], [124][23][35], [124][25][35], [124][235], [125][23][34], [125][24][34], [125][234].
(2)
Model structures [124][235] and [125][234] are maximal in the sense of set inclusion among these eight models. It is important to note that some variable(s) are independent of the others conditional on X2 in each of the two pairs of marginals, Pair-1 and Pair-2, and in all the models in (1) and (2). That the conditional independence takes place conditional on the same variable in the marginal models and also in the combined (or joint) models underlies the main theme of the paper. In addressing the issue of combining graphical model structures, we cannot help using independence graphs and related theories to derive the desired results with more clarity and refinement. The conditional independence embedded in a distribution can be expressed to some level of satisfaction by a graph in the form of graph-separateness [see, for example, the separation theorem on p. 67 of Whittaker (1990)]. Kim (2004) showed that graph-separateness is invariant between a set of marginal models and some joint model under the decomposability assumption for the model. This invariance is closely related to the statements that a decomposable graph is triangulated [Darroch, Lauritzen, Speed (1980); Leimer (1989)] and that decomposability is preserved under graph-collapsing [Theorem 4.3 of Kim (2004)]. While we consider a problem of learning decomposable graphical models from a collection of marginal models, there have been remarkable improvements in learning graphical models in the form of a Bayesian network [Pearl (1986, 1988)] from data. This learning, however, mainly relies on heuristic search algorithms since the model search is usually NP-hard [Chickering (1996)]. A good review of structural discovery of Bayesian or causal networks from data is given in Cooper (1999). Since a Bayesian network can be transformed into a decomposable graph [Lauritzen and Spiegelhalter (1988)], the method of model combination which is proposed and applied in this paper would lead to an improvement in learning graphical models from data.
Section 2 introduces the notation and graphical terminology used in the paper, including the notions of prime separator and Markovian subgraph. Section 3 describes stochastic properties concerning the relation between a graph and its Markovian subgraphs. Section 4 introduces basic notions of, and a tool for, model combination and presents some important results that are instrumental for model combination. Section 5 then defines a special type of graph which is called a graph of prime separators, or GOPS for short, and a general description of the modelling procedure that is proposed in this paper is presented in Section 6. The procedure is applied to a simulated data set in Section 7, and Section 8 concludes the paper.
2 Preliminaries
We will consider only undirected graphs in the paper. We denote a graph by G = (V, E), where V is the set of the indexes of the variables involved in G and E is a collection of ordered pairs, each pair representing that the nodes of the pair are connected by an edge. Since G is undirected, (u, v) ∈ E if and only if (v, u) ∈ E. We say that a set of nodes of G forms a complete subgraph of G if every pair of nodes in the set is connected by an edge. A maximal complete subgraph is called a clique of G, where the maximality is in the sense of set-inclusion. We denote by C(G) the set of cliques of G. The graph G can be represented in the same way as a graphical log-linear model is represented in terms of generators [Fienberg (1980)]. If G consists of cliques C1, · · · , Cr, we will write G = [C1] · · · [Cr]. For instance, if G has five nodes and C1 = {1, 2}, C2 = {2, 3}, C3 = {3, 4, 5}, then G = [12][23][345]. In this context, the terms graph and model structure will be used in the same sense. All the notation and graphical terminology used in this paper are summarized in Appendix A. For a subset A ⊆ V, we denote by GA = (A, EA) the subgraph of G = (V, E) confined to A, where

EA = (E ∩ A × A) ∪ {(u, v) ∈ A × A; u and v are not separated by A \ {u, v} in G}.
(3)
In particular, we will call GA the Markovian subgraph of G confined to A. While a Markovian subgraph GA of G is used to represent a submodel for (Xi)i∈A, an induced subgraph is simply a part of G. For A ⊂ V, the induced subgraph of G confined to A is defined as G^ind_A = (A, E ∩ (A × A)). For A ⊂ V, we denote by J_{A^c} the collection of the connectivity components in G^ind_{A^c} and let

β(J_{A^c}) = {bd(B); B ∈ J_{A^c}}.   (4)

Then EA in (3) can be expressed as

EA = [E ∪ {B × B; B ∈ β(J_{A^c})}] ∩ (A × A).

Note that, if we regard {B × B; B ∈ β(J_{A^c})} as a set of edges, this set makes the boundary of each connectivity component of A^c complete.
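To make the definition in (3) concrete, here is a minimal sketch, assuming the graph is stored as a networkx Graph; the helper name markovian_subgraph is ours, not the paper's.

```python
# A sketch of the Markovian subgraph G_A of equation (3): join u, v in A
# whenever A \ {u, v} does not separate them in G (adjacent pairs included).
import itertools
import networkx as nx

def markovian_subgraph(G, A):
    A = set(A)
    H = nx.Graph()
    H.add_nodes_from(A)
    for u, v in itertools.combinations(A, 2):
        rest = G.copy()
        rest.remove_nodes_from(A - {u, v})      # delete the candidate separator
        if nx.has_path(rest, u, v):             # u, v not separated by A \ {u, v}
            H.add_edge(u, v)
    return H

# The model [12][234] of Section 1 restricted to {1, 2, 3} gives [12][23].
G = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4)])
print(sorted(markovian_subgraph(G, {1, 2, 3}).edges()))   # [(1, 2), (2, 3)]
```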
If G = (V, E), G′ = (V, E′), and E′ ⊆ E, then we say that G′ is an edge-subgraph of G and write G′ ⊆e G. A subgraph of G is either a Markovian subgraph, an induced subgraph, or an edge-subgraph of G. If G′ is a subgraph of G, we call G a supergraph of G′. According to the definition of a decomposable graph (see Definition A.2 in Appendix A), we can find a sequence of cliques C1, · · · , Ck of a decomposable graph G which satisfies the following condition [see Proposition 2.17 of Lauritzen (1996)]: with C(j) = C1 ∪ · · · ∪ Cj and Sj = Cj ∩ C(j−1), for all i > 1, there is a j < i such that Si ⊆ Cj.
(5)
By this condition for a sequence of cliques, we can see that Sj is expressed as an intersection of neighboring cliques of G. If we denote the collection of these Sj's by χ(G), we have, for a decomposable graph G, that

χ(G) = {a ∩ b; a, b ∈ C(G), a ≠ b}.   (6)

The cliques are elementary graphical components and the Sj is obtained as an intersection of neighboring cliques. So, we will call the Sj's prime separators of the decomposable graph G. The notion of prime separators in a decomposable graph may be extended to separators of prime graphs in an arbitrary undirected graph, where the prime graphs are defined in Cox and Wermuth (1999) as the maximal subgraphs without a complete separator.
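Under the decomposability assumption, (6) suggests a direct way to read off the prime separators from the maximal cliques; a small sketch (the function name is ours):

```python
# chi(G) as in (6): the non-empty pairwise intersections of the maximal cliques.
import itertools
import networkx as nx

def prime_separators(G):
    cliques = [frozenset(c) for c in nx.find_cliques(G)]   # C(G), the cliques of G
    return {a & b for a, b in itertools.combinations(cliques, 2) if a & b}

G = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4)])              # the model [12][234]
print(prime_separators(G))                                   # {frozenset({2})}
```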
3 Distribution, interaction graph, and Markovian subgraph
Let XA denote a vector of random variables Xi, i ∈ A. We use the symbol · ⊥⊥ · | ·, following Dawid (1979), to represent conditional independence. A distribution P is said to be globally Markov with respect to an undirected graph G if, for a triple (A, B, C) of disjoint subsets A, B, C of V,

XA ⊥⊥ XB | XC
(7)
whenever A is separated from B by C in G. For convenience' sake, we will write (7) simply as A ⊥⊥ B | C. In addition to the global Markov property, we will consider another property for a probability distribution. A distribution P with probability function f is said to be factorized according to G [Lauritzen (1996), section 3.2] if for all c ∈ C(G) there exist non-negative functions ψc that depend on x through xc only such that

f(x) = ∏c∈C(G) ψc(x).
We will denote the collection of the distributions that are globally Markov with respect to G by MG (G), and by MF (G) the collection of the distributions that are factorized according to G. For a probability distribution P of XV , let the logarithm of the density of P be expanded into interaction terms and let the set of the maximal domain sets of these interaction terms be denoted by Γ(P ), where maximality is in the sense of set-inclusion. We will call the set, Γ(P ), the generating class
of P and denote by G(Γ(P )) = (V, E) the interaction graph of P which satisfies, under the hierarchy assumption for probability models, (u, v) ∈ E ⇐⇒ {u, v} ⊆ a for some a ∈ Γ(P ).
(8)
When confusion is not likely, we will use Γ instead of Γ(P). It is well known in the literature [Pearl and Paz (1987)] that if a probability distribution on XV is positive, then the three types of Markov property relative to an undirected graph, the pairwise Markov (PM), locally Markov (LM), and globally Markov (GM) properties, are equivalent. Furthermore, for any probability distribution, it holds that (GM) =⇒ (LM) =⇒ (PM) [see Proposition 3.8 in Lauritzen (1996)]. So, we will write M(G) instead of MG(G), and we will simply say that a distribution P is Markov with respect to G when P ∈ MG(G).

Theorem 3.1. (Corollary 3.4 in Kim (2004)) For a distribution P of XV and A ⊆ V, PA ∈ M(G(Γ(P))A).

If we regard G as an interaction graph of a distribution P, then Theorem 3.1 says that PA ∈ M(GA), which means that PA is Markov with respect to GA. We call GA a Markovian subgraph of G in this context.

Let V be a set of subsets of V. We will define another collection of distributions,

L̃(GA, A ∈ V) = {P; PA ∈ M(GA), A ∈ V}.

L̃(GA, A ∈ V) is the collection of the distributions each of whose marginals is Markov with respect to its corresponding Markovian subgraph of G.

Theorem 3.2. For a collection V of subsets of V and an undirected graph G,

M(G) ⊆ L̃(GA, A ∈ V).

Proof. See the proof of Theorem 3.6 in Kim (2004).

The set M(G) of the probability distributions each of which is Markov with respect to G is contained in the set L̃(GA, A ∈ V) of the distributions each of which has its marginals Markov with respect to their corresponding Markovian subgraphs GA, A ∈ V. This result sheds light on our efforts in searching for M(G), since it can be found as a subset of L̃(GA, A ∈ V). In subsequent sections, we will explore how the GA, A ∈ V, are used in the search for G.
4 Combined model structures
Let G = (V, E) be the graph of a decomposable model and let V1 , V2 , · · · , Vm be subsets of V . The m Markovian subgraphs, GV1 , GV2 , · · · , GVm , may be regarded as the structures of m submodels of the
decomposable model. In this context, we may refer to a Markovian subgraph as a marginal model structure. These terms reflect that our goal is to find the model structure G based on a collection of marginal models. For simplicity, we write Gi = GVi.

Definition 4.1. Suppose there are m Markovian subgraphs, G1, · · · , Gm. Then we say that a graph H of a set of variables V is a combined model structure (CMS) corresponding to G1, · · · , Gm, if the following conditions hold:
(i) V1 ∪ · · · ∪ Vm = V.
(ii) HVi = Gi, for i = 1, · · · , m. That is, the Gi are Markovian subgraphs of H.

We will call H a maximal CMS corresponding to G1, · · · , Gm if adding any edge to H invalidates condition (ii) for at least one i = 1, · · · , m. Since H depends on G1, · · · , Gm, we denote the collection of the maximal CMSs by MXS(G1, · · · , Gm). According to this definition, a CMS is a Markovian supergraph of each Gi, i = 1, · · · , m. There may be many CMSs that are obtained from a collection of Markovian subgraphs, as we saw in (1) and (2) corresponding to Pair-1 and Pair-2 of marginals, respectively.

In the lemma below, CG(A) is the collection of the cliques which include nodes of A in the graph G. The proof is intuitive. The symbol ⟨·|·|·⟩ is explained in Appendix A.

Lemma 4.2. Let G′ = (V′, E′) be a Markovian subgraph of G and suppose that, for three disjoint subsets A, B, C of V′, ⟨A|B|C⟩G′. Then (i) ⟨A|B|C⟩G; (ii) for W ∈ CG(A) and W′ ∈ CG(C), ⟨W|B|W′⟩G.

Recall that if Gi, i = 1, 2, · · · , m, are Markovian subgraphs of G, then G is a CMS. For a given set S of Markovian subgraphs, there may be many maximal CMSs, and they are related to S through prime separators as in the theorem below.

Theorem 4.3. Let there be Markovian subgraphs Gi, i = 1, 2, · · · , m, of a decomposable graph G. Then
(i) χ(G1) ∪ · · · ∪ χ(Gm) ⊆ χ(G);
(ii) for any maximal CMS H, χ(G1) ∪ · · · ∪ χ(Gm) = χ(H).

Proof. See Appendix B.

For a given set of Markovian subgraphs, we can readily obtain the set of prime separators under the decomposability assumption. By (6), we can find χ(G) for any decomposable graph G simply by taking all the intersections of the cliques of the graph. An apparent feature of a maximal CMS in contrast to a CMS is stated in Theorem 4.3.
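As a small illustration of Definition 4.1, the check below tests whether a candidate graph H is a CMS of given marginal structures by recomputing each Markovian subgraph; it reuses the markovian_subgraph sketch from Section 2 and is only an illustrative sketch, not the paper's code.

```python
# Check conditions (i) and (ii) of Definition 4.1 for a candidate graph H.
import networkx as nx

def is_cms(H, marginals):
    """marginals: list of (V_i, G_i) pairs, V_i a node set, G_i a networkx Graph."""
    if set(H.nodes()) != set().union(*(set(Vi) for Vi, _ in marginals)):
        return False                                     # condition (i)
    for Vi, Gi in marginals:
        Hi = markovian_subgraph(H, Vi)                   # condition (ii): H_{V_i} = G_i
        if set(map(frozenset, Hi.edges())) != set(map(frozenset, Gi.edges())):
            return False
    return True

H  = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4)])          # [12][234]
G1 = nx.Graph([(1, 2), (2, 3)]); G2 = nx.Graph([(1, 2), (2, 4)])
print(is_cms(H, [({1, 2, 3}, G1), ({1, 2, 4}, G2)]))     # True: [12][234] is a CMS of Pair-1
```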
For a set M of Markovian subgraphs of a graph G, there can be more than one maximal CMS of M. But there is only one such maximal CMS that contains G as its edge-subgraph.

Theorem 4.4. Suppose there are m Markovian subgraphs G1, · · · , Gm of a decomposable graph G. Then there exists a unique maximal CMS H∗ of the m Markovian subgraphs such that G ⊆e H∗.

Proof. See Appendix B.

The unique existence of a maximal CMS for a given model structure provides us with a useful guideline for searching, based on a set of marginal model structures, for the maximal CMS which contains the actual model structure as an edge-subgraph. Since the maximal CMS is at least as large as the actual model G, we have only to work with the maximal CMS to find G by removing edges as a measure of goodness-of-fit suggests. Let P be a distribution of XV and V be a collection of subsets of V, and let S be the collection of G(Γ(P))A, A ∈ V. Note that the G(Γ)A are Markovian subgraphs of G(Γ). Thus, under the decomposability assumption, there is, by Theorem 4.4, a unique maximal CMS H∗ based on S which contains G(Γ(P)) as an edge-subgraph. Since H∗A = G(Γ(P))A, if we put G = G(Γ(P)) in Theorem 3.2, we end up with the summarizing expression

M(G(Γ(P))) ⊆ M(H∗) ⊆ L̃(G(Γ(P))A, A ∈ V),
(9)
where the first inequality follows since G(Γ(P )) ⊆e H∗ . Since P ∈ M (G(Γ(P ))), expression (9) implies that P is also Markov relative to the maximal CMS, H∗ . In reality, G(Γ(P ))A ’s are determined based on data. Once we obtain H∗ , the model for XV can be determined by removing edges from H∗ as the data suggest.
5 Graph of prime separators
In this section, we will introduce a graph of prime separators, which consists of prime separators and edges connecting them. The graph is of the same kind as the undirected graphs that have been considered so far in this paper, with the nodes replaced by prime separators. For example, there are three prime separators, {3, 4}, {3, 5}, and {4, 8}, in graph G1 in Figure 4. If G1 is an interaction graph, then none of the three prime separators is conditionally independent of any other of them. We represent this phenomenon with the graph at the top-left corner in Figure 5. We will simply call a graph of prime separators a GOPS. We can see conditional independence among the prime separators, {13, 14}, {10, 13}, {10, 19}, and {10, 21}, in graph G3 in Figure 4. This conditional independence is depicted in GOPS3 in Figure 5. As shown in GOPS1 in Figure 5, a GOPS may contain a clique of more than 2 prime separators, but it cannot contain a cycle of length 4 or larger if the prime separators are from a decomposable graph. Let G′ be a Markovian subgraph of G and suppose that, for three prime separators A, B, and C of G′, A \ C and B \ C are separated by C in G′. Then, by Lemma 4.2, the same is true in G. This implies the following theorem.
Theorem 5.1. Let V be a set of subsets of V and let H be a maximal CMS of GA, A ∈ V. Suppose that there is no cycle of prime separators of length 4 or larger in any of the GOPS of GA, A ∈ V. Then, there does not exist a cycle of length 4 or larger in the GOPS of H, either.

Proof. Suppose there is a cycle of length 4 or larger in the GOPS of H and let W be the set of the prime separators that form the cycle. According to the condition of the theorem, there does not exist any set A in V such that GA contains 4 or more prime separators of W. This means that the nodes in the prime separators in W may be completed into a clique, and this clique may replace the cycle in H, which contradicts the maximality of H. Therefore, H cannot contain a cycle of length 4 or larger.

For three sets, A, B, and C, of prime separators of an interaction graph G, if A and B are separated by C, then we have that

(∪a∈A a) ∩ (∪b∈B b) ⊆ (∪c∈C c).   (10)

When A, B, and C are all singletons of prime separators, the set-inclusion is expressed as

A ∩ B ⊆ C.
(11)
This is analogous to the set-inclusion relationship among cliques in a junction tree of a decomposable graph (Lauritzen (1996)). A junction tree is a tree-like graph of cliques and intersections of them, where the intersection of neighboring cliques lies on the path which connects the neighboring cliques. As for a junction tree, the sets in (11) are either cliques or intersections of cliques. In the context of a junction tree, the property expressed in (11) is called the junction property. We will call the property expressed in (10) the PS junction property, where 'PS' is from 'prime separator.' The GOPS and the junction tree are different in the following two senses: first, the basic elements are prime separators in the GOPS while they are cliques in the junction tree; secondly, the GOPS is an undirected graph of prime separators while the junction tree is a tree-like graph of cliques. Some prime separators may form a clique in an undirected graph, as in graphs G1 and G4 in Figure 4. This is why a GOPS may not necessarily be a tree-like graph, and two prime separators may be separated by a set of prime separators. But, since all the prime separators in a decomposable graph G are obtained from the intersections of neighboring cliques in G, the GOPS of G is the same as the junction tree of G with the clique-nodes removed from the junction tree. Whether G is decomposable or not, expression (10) holds in general.
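A tiny sketch of checking the PS junction property (10); the function name is ours, and the numbers in the usage anticipate the configuration ruled out in the worked example of Section 7.3 (graph (a) of Figure 7).

```python
# PS junction property (10): for collections A, B, C of prime separators with
# A and B separated by C, (union of A) ∩ (union of B) must lie inside (union of C).
def ps_junction_holds(A, B, C):
    union = lambda sets: set().union(*sets) if sets else set()
    return union(A) & union(B) <= union(C)

# c(20,28) and c(28,29,30) cannot be separated by c(20,29):
# {20,28} ∩ {28,29,30} = {28} is not contained in {20,29}.
print(ps_junction_holds([{20, 28}], [{28, 29, 30}], [{20, 29}]))   # False
```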
6 Description of modelling procedure
Before applying the results of the paper for modelling, we will describe a procedure of log-linear modelling by using log-linear models that are obtained from marginal contingency tables. Suppose that we have a contingency table of a large number of random variables whose model structure is graphical and that we cannot handle the whole table at once for modelling. In this situation, we propose to develop several marginal log-linear models and use them for building a model for the whole data set.
In selecting subsets of random variables, it is important to have the random variables associated more highly within subsets than between subsets. This way of subset-selection would end up with subsets of random variables in which random variables that are neighbors in the graph of the model structure of the whole data set are more likely to appear in the same marginal model. Once marginal models are obtained, we construct a maximal CMS based on the collection of the GOPS of the marginal models. Suppose there are m marginal models. For operational convenience, it is recommended that we start by combining a pair of marginal models which shares more random variables than any other pair of marginal models. As more random variables are shared between a pair of marginal models, say G1 and G2, the model-combination is likely to get easier, since the shared variables restrict the locations of the other random variables in V1 ∪ V2 in the model-combination. Once the first pair of marginal models is combined, it is desirable to select next a marginal model which shares the most random variables with V1 ∪ V2, and to continue the model-combination until all the marginal models are combined into a model which is a maximal CMS of the m marginal models. There may be more than one maximal CMS corresponding to a pair of marginal models, say G1 and G2. To choose a most appropriate maximal CMS, we need to look at a certain subset of random variables. If a marginal model, say G′, is available for the subset, then we can use this model in addition to G1 and G2 to choose a most appropriate maximal CMS based on the marginal models G1, G2, and G′. Under the assumption that we have data for all the variables that we are interested in, we can find a most appropriate maximal CMS corresponding to a given set of marginal models.
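The combination order suggested above can be sketched as follows; the function name is ours, and the usage applies it to the variable subsets later listed in Table 1, where it reproduces the order G5, G6, G4, G3, G2, G1 used in Section 7.3.

```python
# Greedy combination order: start from the pair of marginal node sets sharing
# the most variables, then repeatedly add the marginal sharing the most
# variables with everything combined so far.
import itertools

def combination_order(node_sets):
    """node_sets: dict label -> set of variable indices."""
    first, second = max(itertools.combinations(node_sets, 2),
                        key=lambda pair: len(node_sets[pair[0]] & node_sets[pair[1]]))
    order, covered = [first, second], node_sets[first] | node_sets[second]
    remaining = set(node_sets) - {first, second}
    while remaining:
        nxt = max(remaining, key=lambda lab: len(node_sets[lab] & covered))
        order.append(nxt)
        covered |= node_sets[nxt]
        remaining.remove(nxt)
    return order

V = {1: {1, 2, 3, 4, 5, 6, 7, 8, 11, 12},
     2: {8, 9, 10, 11, 12, 14, 15, 16, 17, 18},
     3: {10, 13, 14, 15, 19, 20, 21, 22, 23, 24},
     4: {13, 20, 21, 22, 25, 26, 27, 28, 29, 34},
     5: {28, 29, 30, 31, 32, 34, 35, 36, 37, 38},
     6: {30, 31, 32, 33, 35, 36, 37, 38, 39, 40}}
print(combination_order(V))    # [5, 6, 4, 3, 2, 1]
```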
7 Applications
In this section, we use a simulated data set of 40 binary variables which is generated from the graphical log-linear model shown in Figure 2, and demonstrate a modelling procedure with the data. The number of categorical variables that can be handled at once for log-linear modelling and the complexity of a model are limited by the computational capacity of the computing machine. Our computing machine (an IBM PC) could handle up to 10 binary variables at once at a relatively good speed of a few seconds or minutes. For any larger model with more than 10 variables, it would take hours or days with this computing machine. So, we applied the approach described in the preceding section.
7.1 Selection of marginal contingency tables
The log-linear model is a model of association among the variables. The association between a pair of categorical variables is reflected in the coefficient of the logistic regression model of one variable of the pair upon the other (Chapter 3 of Hosmer and Lemeshow (1989)). It is thus reasonable to apply a regression method to select subsets of variables that are more highly associated with each other within subsets than between subsets. Since the logistic regression analysis requires dealing with the whole set of variables, we would run into the same large-scale modelling problem. So, we apply a nonparametric tree regression method as an effective alternative, using the tree command and its related functions in S-plus (Chambers and Hastie, 1992).
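The selection in the paper was done with the S-plus tree command; the sketch below is a rough scikit-learn analogue (not the paper's code): grow a regression tree of Xj on the other variables until roughly 80% of its variation is explained, and keep the variables actually used in the splits.

```python
# A stand-in for the S-plus tree regression: a DecisionTreeRegressor is grown
# until the in-sample R^2 reaches about 0.8, and psi(j) is read off the splits.
from sklearn.tree import DecisionTreeRegressor

def select_regressors(data, j, target_r2=0.8, max_leaves=40):
    """data: (n, p) array of 0/1 observations; returns the column indices psi(j)."""
    y = data[:, j]
    others = [k for k in range(data.shape[1]) if k != j]
    X = data[:, others]
    tree = None
    for leaves in range(2, max_leaves + 1):
        tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0).fit(X, y)
        if tree.score(X, y) >= target_r2:          # proportion of variation explained
            break
    used = {others[f] for f in tree.tree_.feature if f >= 0}   # variables used in splits
    return sorted(used)
```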
Using the simulated data set of size 250,000, we applied the tree regression to each of the 40 variables and selected a set of regressor variables which explains about 80% of the total variation of the response variable. The selected regressor variables are listed in the form of a matrix in Figure 3. The entries in the matrix are either 0 or 1. The regressor variables for the variable Xj are listed as 1's in the jth column. For instance, variables X4, X6, X7, X8 are selected as the regressor variables for X5. We denote by ψ(j) the selected set of the indexes of the regressor variables for Xj. That i ∈ ψ(j) does not necessarily imply that j ∈ ψ(i). For instance, ψ(7) = {3, 5}, while ψ(3) = {2, 4, 5}. This asymmetry in the regressor-response relationship is not unusual. The regression tree we consider is constructed by selecting regressor variables one after another in such a way that the most informative variable for a particular response variable is selected at each step. So, the fact that ψ(3) = {2, 4, 5} means that X7 is less informative for X3 than the variables in ψ(3). This asymmetry can also be due to a fluctuation in the data set. Although we have seen some instances of asymmetry in the regressor-response relationship, symmetry is a prevalent feature in the relationship, as we can see in the matrix. We grouped the 40 variables into 6 subsets of 10 or fewer variables in such a way that the variables within a subset share more regressor variables with each other than with variables in other subsets. The six subsets are listed in Table 1. As noted in the table, subsets i and i + 1, i = 1, 2, · · · , 5, share a nonempty set of variables. In particular, subsets 5 and 6 share as many as 7 variables.
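The paper does not spell out the exact procedure by which the six subsets were formed beyond the criterion just stated; the sketch below is one simple greedy heuristic in that spirit, based on the regressor-response matrix of Figure 3 (it does not reproduce the overlap between neighbouring subsets that the paper's grouping has, which would require allowing shared variables).

```python
# A hypothetical greedy grouping: symmetrize the 0/1 regressor matrix and grow
# each group by adding the unassigned variable with the most ties to the group.
import numpy as np

def greedy_groups(R, max_size=10):
    """R: (p, p) 0/1 matrix with R[i, j] = 1 if X_i is a regressor of X_j."""
    A = ((R + R.T) > 0).astype(int)            # symmetrized association indicator
    unassigned, groups = set(range(A.shape[0])), []
    while unassigned:
        group = {min(unassigned)}
        unassigned -= group
        while len(group) < max_size and unassigned:
            best = max(unassigned, key=lambda v: A[v, sorted(group)].sum())
            if A[best, sorted(group)].sum() == 0:
                break                          # nothing left is tied to this group
            group.add(best)
            unassigned.remove(best)
        groups.append(sorted(group))
    return groups
```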
Figure 2: A true model of the 40 variables that are used for the simulation study
Figure 3: A matrix of regressor-response relationships for the 40 variables as obtained from a tree regression analysis. The 1’s in column j are the indicators of the regressor variables for the response variable Xj .
Table 1: The indexes of the variables in the 6 subsets, V1, · · · , V6.

V1 = {1, 2, 3, 4, 5, 6, 7, 8, 11, 12}
V2 = {8, 9, 10, 11, 12, 14, 15, 16, 17, 18}
V3 = {10, 13, 14, 15, 19, 20, 21, 22, 23, 24}
V4 = {13, 20, 21, 22, 25, 26, 27, 28, 29, 34}
V5 = {28, 29, 30, 31, 32, 34, 35, 36, 37, 38}
V6 = {30, 31, 32, 33, 35, 36, 37, 38, 39, 40}
Table 2: Goodness-of-fit levels of the six marginal models

Marginal model   d.f.   Pearson χ2   p-value
      1          567      547.50      0.714
      2          645      667.41      0.263
      3          601      589.07      0.628
      4          649      679.25      0.199
      5          617      591.89      0.760
      6          604      621.53      0.302

7.2 Marginal log-linear models for the 6 subsets of variables
For every subset of 10 variables, we built a decomposable log-linear model that fits the corresponding marginal set of data well. Figure 4 displays the model structures for the six subsets of variables, and their goodness-of-fit results are listed in Table 2. The p-values of the goodness-of-fit tests are all larger than or equal to 0.199. We denote by Gi the model structure of the variables indexed in Vi and by c(k1, k2, · · · , kr) a prime separator which consists of the variables Xk1, Xk2, · · · , Xkr. From G1, we have χ(G1) = {c(3, 4), c(4, 8), c(3, 5)}. The three prime separators are all associated with each other. If we regard a prime separator as a random variable, we can represent the conditional independence relationship among the prime separators via an independence graph based on the corresponding marginal model Gi. The independence graphs of the prime separators are displayed for each marginal model in Figure 5. We will call a variable that is a member of a prime separator a separator variable and a variable that is not a member of a prime separator a non-separator variable. Since every non-separator variable is separated from the other variables by its neighboring prime separators, we can represent Gi by linking each non-separator variable to its neighboring prime separators. For example, the graph in Figure 6 is obtained by adding the non-separator variables of G3 to GOPS3, the graph of the prime separators of G3.
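A minimal sketch of the goodness-of-fit computation behind Table 2, assuming that fitted (expected) cell counts for the decomposable marginal model are already available (for a decomposable model they can be obtained in closed form from the clique and prime-separator marginal counts); the function name is ours.

```python
# Pearson chi-square statistic and its p-value for a marginal contingency table.
import numpy as np
from scipy.stats import chi2

def pearson_gof(observed, expected, df):
    observed = np.asarray(observed, dtype=float).ravel()
    expected = np.asarray(expected, dtype=float).ravel()
    x2 = np.sum((observed - expected) ** 2 / expected)
    return x2, chi2.sf(x2, df)   # e.g. x2 = 547.50 with df = 567 gives p ≈ 0.714 (Table 2)
```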
7.3 Combination of marginal models
According to Theorem 4.3 (i), all the prime separators that appear in a submodel of G are found in χ(G). This means that we must make sure that all the prime separators of the submodels appear as prime separators in G. This fact is instrumental to constructing the independence graph of the prime separators of the submodels. In section 1, we considered two simple problems of model combination. Variables X1 and X2 are shared by the marginals in Pair-1, and only X2 is shared in Pair-2. Although these examples are very simple, we can see in them that the shared variables are like road signs in constructing a model structure of the variables that are involved in the variable-sharing marginal models. When combining a pair of marginal models, we find a location of each variable involved in either of the marginal models so that the Markov or conditional-independence properties that are embedded in the two models are satisfied. As more variables are shared between a pair of marginal models, the possible locations of each variable
Figure 4: Marginal models for the 6 subsets of variables. Gi is the decomposable log-linear model for subset Vi . Prime separators are represented by thick lines. See Figure 5 for the prime separators of the 6 marginal models.
Figure 5: The independence graphs of the prime separators for the six marginal models. GOPSi denotes the independence graph of the prime separators of Gi.
Figure 6: The graph obtained by linking non-separator variables (bullets) to the separator variables of G3 .
are more limited and thus model construction for the variables that are involved in either of the two marginal models becomes easier. The variable-sharing between Vi and Vi+1, i = 1, 2, · · · , 5, is as follows: |V1 ∩ V2| = 3, |V2 ∩ V3| = 3, |V3 ∩ V4| = 4, |V4 ∩ V5| = 3, |V5 ∩ V6| = 7, and |Vi ∩ Vj| = 0 when |i − j| > 1. So, it is desirable that we begin combining marginal models from the pair of G5 and G6, and then keep combining marginal models in the order of G4, G3, G2, G1. Figure 8 shows a procedure of combining the prime separators of submodels. The top-right corner is the result of combining the prime separators of G5 and G6. Note that all the variables in V5 ∩ V6 form a clique in G6 and {30, 31, 32} ⊥ {35, 36, 37, 38} | {29, 34} in G5. The three prime separators, c(30, 32), c(28, 29, 30), and c(28, 30), form a clique in G5 and so do the prime separators c(34, 36), c(36, 38), and c(37, 38). This analysis ends up with the GOPS, Γ(5, 6), at the top-right corner of Figure 8. Any addition of edges to the GOPS at the top-right corner destroys the Markov property among the prime separators in χ(G5) ∪ χ(G6). In this sense, the GOPS is maximal. For a given collection of the GOPSs of Gi, i ∈ A, there can be more than one maximal independence graph of the prime separators, and we will denote the set of the maximal independence graphs by Γ(A). When |Γ(A)| = 1, the independence graph itself will be represented by Γ(A). The independence graph at the top-left corner of Figure 8 is Γ(4, 5, 6). Since the prime separator c(20, 29) of G4 shares node 29 with the prime separators c(29, 34) and c(28, 29, 30) of G5, the 3 prime separators may possibly form a clique in a GOPS in Γ(4, 5, 6). However, according to G4, c(29, 34) ⊥ c(20, 28) | c(20, 29). So, either c(29, 34), c(20, 29), and c(28, 29, 30) form a clique and there is no edge between c(20, 28) and c(28, 29, 30), as in graph (a) in Figure 7, or c(29, 34), c(20, 29), c(28, 29, 30), and c(20, 28) are located as in Γ(4, 5, 6). But the former situation is in violation of the PS junction property (10) since {20, 28} ∩ {28, 29, 30} ⊈ {20, 29}. This leads to the GOPS Γ(4, 5, 6). In combining G3 with Γ(4, 5, 6), we note that V3 ∩ V4 = {13, 20, 21, 22} and that {13, 20} and {21, 22} are prime separators in GOPS4. In G3, ne(20) = {10, 13, 19} and ne(22) ⊃ {10, 19, 21}. So, c(13, 20) is adjacent to c(10, 13) and c(10, 19), and c(21, 22) is adjacent to c(10, 21) and c(10, 19), while {13, 20} ⊥ {21, 22} | {10, 19} in G3. Thus, we have the GOPS as in panel (b) of Figure 7. Since {20, 28} ⊥ {21, 22} | {13, 20} in GOPS4, only c(13, 20) is adjacent to c(20, 28), as in Γ(3, 4, 5, 6) in Figure 8.
Figure 7: Some GOPS’s considered in the model combining process
Figure 8: A procedure of combining prime separators of submodels Gi . Γ(A) denotes a maximal independence graph of the prime separators of Gi , i ∈ A. The small numbers at the bottom-right of the ovals are the submodel labels of which the ovals are prime separators.
Figure 9: An independence graph of prime separators and non-separator variables. The prime separators are in ovals and the dots are for the non-separator variables, and the small numbers at the bottom-right of the ovals are the submodel labels of which the ovals are prime separators.
We can apply the same argument in combining G1 and G2 with Γ(3, 4, 5, 6), which ends up with the GOPS in Figure 9. In other words, |MXS(G1, · · · , G6)| = 1. The bullets in the figure represent the non-separator variables, which are connected to their neighboring prime separators. Note that those neighboring prime separators are classified as such mostly due to the non-separator variables. A prime separator is itself a complete subgraph and so is a clique of prime separators. So we can easily transform the graph in Figure 9 into the undirected graph in Figure 10. This is the maximal CMS of the six marginal models listed in Figure 4. The true model in Figure 2 is fully recovered in the maximal CMS except for the 5 thick edges appearing in Figure 10. These additional edges were created because X4 was missing in G2. If X4 had been added to G2, then X{4,9} would have separated X11, X12, and X{8,10} from each other, making those additional edges unnecessary. To sum up, in combining marginal models, it is important to make use of the locations of the variables that are shared by the marginal models to be combined. While we use the GOPSs of marginal models to construct another GOPS, the locations of the non-separators that are shared by the marginal models to be combined are as important as the prime separators in the marginal models. The PS junction property (10) is instrumental for locating prime separators in model-combination.
Figure 10: The combined model structure which is obtained from the independence graph in Figure 9. The thick edges are additional to the true model in Figure 2.
8 Concluding remarks
In this paper, we propose an approach to log-linear modelling for a large number of categorical random variables which cannot be handled at once. A main idea behind the approach is that we group the random variables into several subsets such that the association among the variables is higher within subsets than between subsets. The sizes of the subsets of random variables are bounded by the capacity of the computing machine to be used. In combining marginal models, we need to take into consideration the number of variables that are shared by neighboring marginal models. As more variables are shared, the model-combination becomes easier since the shared variables are used as road signs in constructing a maximal CMS of a given set of marginal models. The GOPS of a decomposable graph G is the same as the junction tree of G with the clique-nodes removed. Thus the GOPS is not usually
tree-shaped since more than two prime separators can be contained in a clique of G. The PS junction property (10) is very helpful for locating prime separators in combining marginal models. When more than one maximal CMS is obtained, it is desirable that we collect information about the marginal model on a certain subset of random variables by which we can select a most appropriate maximal CMS. The simulation study strongly suggests the proposed modelling method as a convenient approach for large-scale log-linear modelling.
Appendix A: Graphical terminology

Let G = (V, E) be an undirected graph. If (u, v) ∈ E, we say that u is a neighbor node of v, or vice versa, and write u ∼ v. A path of length n is a sequence of nodes u = v0, · · · , vn = v such that (vi, vi+1) ∈ E, i = 0, 1, · · · , n − 1, and u ≠ v. If u = v, the path is called an n-cycle. If u ≠ v and u and v are connected by a path, we write u ⇌ v. We define the connectivity component of u as [u] = {v ∈ V; v ⇌ u} ∪ {u}. So, we have

v ∈ [u] ⟺ u ⇌ v ⟺ u ∈ [v].

For v ∈ V, we define ne(v) = {u ∈ V; v ∼ u in G} and, for A ⊆ V, bd(A) = ∪v∈A ne(v) \ A. We say that a path, v1, · · · , vn, v1 ≠ vn, is intersected by A if A ∩ {v1, · · · , vn} ≠ ∅ and neither of the end nodes of the path is in A. We say that nodes u and v are separated by A if all the paths from u to v are intersected by A. In the same context, we say that, for three disjoint sets A, B, and C, A is separated from B by C if all the paths from A to B are intersected by C, and we write ⟨A|C|B⟩G. The notation ⟨·|·|·⟩G follows Pearl (1988). A non-empty set B is said to be intersected by A if B is partitioned into three sets B1, B2, and B ∩ A, and B1 and B2 are separated by A in G.

The complement of a set A is denoted by A^c. The cardinality of a set A will be denoted by |A|. For two collections A, B of sets, if, for every a ∈ A, there exists a set b in B such that a ⊆ b, we will write A ⪯ B.

Although decomposable graphs are well known in the literature, we define them here for completeness.

Definition A.1. A triple (A, B, C) of disjoint, nonempty subsets of V is said to form a decomposition of G if V = A ∪ B ∪ C and the two conditions below both hold:
(i) A and B are separated by C;
(ii) the induced subgraph G^ind_C is complete.

By recursively applying the notion of graph decomposition, we can define a decomposable graph.

Definition A.2. G is said to be decomposable if it is complete, or if there exists a decomposition (A, B, C) into decomposable subgraphs G^ind_{A∪C} and G^ind_{B∪C}.
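In practice, Definition A.2 can be checked via the equivalence between decomposability and triangulation noted in Section 1; a small sketch with networkx:

```python
# Decomposable (= chordal/triangulated) graphs can be recognized directly.
import networkx as nx

print(nx.is_chordal(nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4)])))   # True: the model [12][234]
print(nx.is_chordal(nx.cycle_graph(4)))                            # False: a chordless 4-cycle
```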
Appendix B: Proofs

We will first present some theorems that are used in the proof of Theorem 4.3, and then Theorems 4.3 and 4.4 are proved. The following theorem is similar to Corollary 2.8 in Lauritzen (1996), but it is different in that an induced subgraph is considered in that corollary while a Markovian subgraph is considered here.

Theorem B.1. Every Markovian subgraph of a decomposable graph is decomposable.

Proof. Suppose that a Markovian subgraph GA of a decomposable graph G is not decomposable. Then there must exist a chordless cycle, say C, of length ≥ 4 in GA. Denote the nodes on the cycle by ν1, · · · , νl and assume that they form a cycle in that order, where ν1 is a neighbor of νl. We need to show that C itself forms a cycle in G or is contained in a chordless cycle of length > l in G. By Lemma 4.2, there is no edge in G between any pair of non-neighboring nodes on the cycle. If C itself forms a cycle in G, our argument is done. Otherwise, we will show that the nodes ν1, · · · , νl are on a cycle which is larger than C. Without loss of generality, we may consider the case where there is no edge in G between ν1 and ν2. Since C forms a chordless cycle in GA, we have (ν1, ν2) ∈ EA, so ν1 and ν2 are not separated in G by A \ {ν1, ν2}; hence there must exist a path in G between ν1 and ν2 other than the path which passes through ν3, · · · , νl. Thus the nodes ν1, · · · , νl must lie in G on a chordless cycle of length > l. This completes the proof.

This theorem and expression (6) imply that, for a decomposable graph G, the prime separators are always given in the form of a complete subgraph in G and in its Markovian subgraphs. Lemma 4.2 states that a separator of a Markovian subgraph of G is also a separator of G. We will next see that every maximal CMS is decomposable provided that all the Markovian subgraphs, G1, · · · , Gm, are decomposable.

Theorem B.2. Let G1, · · · , Gm be decomposable. Then every maximal CMS in MXS(G1, · · · , Gm) is also decomposable.

Proof. Suppose that there is a maximal CMS, say H, which contains a chordless n-cycle (n ≥ 4), and let A be the set of the nodes on the cycle. Since H is maximal, we cannot add any edge to it. This implies that no more than three nodes of A are included in any of the Vi's, since any four or more nodes of A that are contained in a Vi would form a cycle in Gi, which is impossible due to the decomposability of the Gi's. Hence, the cycle in H may become a clique by edge-additions on the cycle, contradicting that H is maximal. Therefore, H must be decomposable.

The next theorem shows that if a set of nodes is a prime separator in a Markovian subgraph, then it is not intersected in any other Markovian subgraph.

Theorem B.3. Let G be a decomposable graph and let G1 and G2 be Markovian subgraphs of G. Suppose that a set C ∈ χ(G1) and that C ⊆ V2. Then C is not intersected in G2 by any other subset of V2.
Proof. Suppose that there are two nodes u and v in C that are separated in G2 by a set S. Then, by Lemma 4.2, we have ⟨u|S|v⟩G. Since C ∈ χ(G1) and G1 is decomposable, C is an intersection of some neighboring cliques of G1 by equation (6). So, S cannot be a subset of V1, but a proper subset of S can be. This means that there is at least one pair of nodes, v1 and v2, in G1 such that all the paths between the two nodes are intersected by C in G1, with v1 appearing in one of the neighboring cliques and v2 in another. Since v1 and v2 are in neighboring cliques, each node in C is on a path from v1 to v2 in G1. From ⟨u|S|v⟩G, it follows that there is an l-cycle (l ≥ 4) that passes through the nodes u, v, v1, and v2 in G. This contradicts the assumption that G is decomposable. Therefore, there cannot be such a separator S in G2.

This theorem states that, if G is decomposable, a prime separator in a Markovian subgraph of G is either a prime separator or a complete subgraph in any other Markovian subgraph of G. If the set of the prime separator is contained in only one clique of a Markovian subgraph, the set is embedded in the clique. For a subset V′ of V, if we put G1 = G and G2 = GV′ in Theorem B.3, we have the following corollary:

Corollary B.4. Let G be a decomposable graph and suppose that a set C ∈ χ(G) and that C ⊆ V′ ⊂ V. Then C is not intersected in a Markovian subgraph GV′ of G by any other subset of V′.

Proof of Theorem 4.3: We will first prove result (i) of the theorem. For a subset of nodes Vj, the following holds:

(i′) If Vj does not contain a subset which is a prime separator of G, then χ(Gj) = ∅.

(ii′) Otherwise, i.e., if there are prime separators, C1, · · · , Cr, of G as subsets of Vj,
(ii′-a) if there are no nodes in Vj that are separated by any of C1, · · · , Cr in G, then χ(Gj) = ∅;
(ii′-b) if there is at least one of the prime separators, say Cs, such that there is a pair of nodes, say u and v, in Vj with ⟨u|Cs|v⟩G, then χ(Gj) ≠ ∅.

We note that, since G is decomposable, the condition that Vj contains a separator of G implies that Vj contains a prime separator of G. As for (i′), every pair of nodes, say u and v, in Vj has at least one path between them that bypasses Vj \ {u, v} in the graph G, since Vj does not contain any prime separator of G. Thus, (i′) follows. On the other hand, suppose that there are prime separators, C1, · · · , Cr, of G as subsets of Vj. The result (ii′-a) is obvious, since for each of the prime separators, C1, · · · , Cr, the rest of the nodes in Vj are on one side of the prime separator in G. As for (ii′-b), let there be two nodes, u and v, in Vj such that ⟨u|Cs|v⟩G. Since G is decomposable, Cs is an intersection of neighboring cliques in G, and the nodes u and v must appear in some (not necessarily neighboring) cliques that are separated by Cs. Thus, the two nodes are separated by Cs in Gj with Cs as a prime separator in Gj. No proper subset of Cs can separate u from v in G or in any of its Markovian subgraphs. From the results (i′) and (ii′), it follows that

(iii′) if C ∈ χ(G) and C ⊆ Vj, then either C ∈ χ(Gj) or C is contained in only one clique of Gj;
(iv′) the fact that χ(Gj) = ∅ does not necessarily imply that χ(G) = ∅.

To check that χ(Gj) ⊆ χ(G) for every j ∈ {1, 2, · · · , m}, suppose that C ∈ χ(Gj) and C ∉ χ(G). This implies, by Lemma 4.2, that C is a separator but not a prime separator in G. Thus, there is a proper subset C′ of C in χ(G). By (iii′), C′ ∈ χ(Gj) or C′ is contained in only one clique of Gj. However, neither is possible, since C′ ⊂ C ∈ χ(Gj) and C is an intersection of cliques of Gj. Therefore, χ(Gj) ⊆ χ(G) for all j. This proves result (i) of the theorem.

We will now prove result (ii). If χ(G1) ∪ · · · ∪ χ(Gm) = ∅, then, since all the Gi's are decomposable by Theorem B.1, they are complete graphs themselves. So, by definition, the maximal CMS must be a complete graph of V. Thus, the equality of the theorem holds. Next, suppose that χ(G1) ∪ · · · ∪ χ(Gm) ≠ ∅. Then there must exist a marginal model structure, say Gj, such that χ(Gj) ≠ ∅. Let A ∈ χ(Gj). Then, by Theorem B.3, A is either a prime separator or embedded in a clique of Gi if A ⊆ Vi for i ≠ j. Since a prime separator is an intersection of cliques by equation (6), the prime separator itself is a complete subgraph. Thus, by the definition of maximal CMS and by Lemma 4.2, A ∈ χ(H). This implies that χ(G1) ∪ · · · ∪ χ(Gm) ⊆ χ(H).

To show that the set inclusion in the last expression comes down to equality, we will suppose that there is a set B in χ(H) \ (χ(G1) ∪ · · · ∪ χ(Gm)) and show that this leads to a contradiction to the condition that H is a maximal CMS. H is decomposable by Theorem B.2. So, B is the intersection of the cliques in CH(B), which is defined right before Lemma 4.2. By supposition, B ∉ χ(Gi) for all i = 1, · · · , m. This means either (a) that B ⊆ Vj for some j and B ⊆ C for only one clique C of Gj, by Corollary B.4, or (b) that B ⊈ Vj for all j = 1, · · · , m. In both of these situations, B need not be a prime separator in H, since the Gi are decomposable and so B ∩ Vi is complete in Gi. In other words, edges may be added to H so that the union of the cliques in CH(B) becomes a clique, which contradicts that H is a maximal CMS. This completes the proof.

Proof of Theorem 4.4: By Theorem 4.3 (i), we have χ(G1) ∪ · · · ∪ χ(Gm) ⊆ χ(G). If χ(G1) ∪ · · · ∪ χ(Gm) = χ(G), then, since G is decomposable, G itself is a maximal CMS. Otherwise, let χ0 = χ(G) − (χ(G1) ∪ · · · ∪ χ(Gm)) = {A1, · · · , Ag}. Since A1 ∉ χ(G1) ∪ · · · ∪ χ(Gm), we may add edges so that ∪C∈CG(A1) C becomes a clique, and the resulting graph G(1) becomes a CMS of G1, · · · , Gm with χ(G(1)) − (χ(G1) ∪ · · · ∪ χ(Gm)) = {A2, · · · , Ag}. We repeat the same clique-merging process for the remaining Ai's in χ0. Since each clique-merging makes the corresponding prime separator disappear into the merged, new clique while maintaining the resulting graph as a CMS of G1, · · · , Gm, the clique-merging creates a CMS of G1, · · · , Gm as an edge-supergraph of the preceding graph. Therefore, we obtain a maximal CMS, say H∗, of G1, · · · , Gm at the end of the sequence of the clique-merging processes for all the prime separators in χ0. H∗ is the desired maximal CMS as an edge-supergraph of G. Since the clique-merging begins with G and, for each prime separator in G, the set of the cliques which meet at the prime separator only is uniquely defined, the uniqueness of H∗ follows.
References

Hosmer, D.W. and Lemeshow, S. (1989). Applied Logistic Regression. New York, NY: Wiley.
Asmussen, S. and Edwards, D. (1983). Collapsibility and response variables in contingency tables, Biometrika, 70(3), 567-578.
Bergsma, W.P. and Rudas, T. (2002). Marginal models for categorical data, The Annals of Statistics, 30(1), 140-159.
Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
Chambers, J.M. and Hastie, T.J. (1992). Statistical Models in S. Pacific Grove, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
Chickering, D. (1996). Learning Bayesian networks is NP-complete, in D. Fisher and H. Lenz (Eds.), Learning from Data, Springer-Verlag, 121-130.
Chickering, D.M. and Heckerman, D. (1999). Fast learning from sparse data, Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA: Morgan Kaufmann, 109-115.
Cooper, G.F. (1999). An overview of the representation and discovery of causal relationships using Bayesian networks, in C. Glymour and G.F. Cooper (Eds.), Computation, Causation, and Discovery, The MIT Press, 3-62.
Cox, D.R. and Wermuth, N. (1999). Likelihood factorizations for mixed discrete and continuous variables, Scandinavian Journal of Statistics, 26, 209-220.
Darroch, J.N., Lauritzen, S.L. and Speed, T.P. (1980). Markov fields and log-linear interaction models for contingency tables, The Annals of Statistics, 8(3), 522-539.
Dawid, A.P. (1979). Conditional independence in statistical theory, Journal of the Royal Statistical Society, Series B, 41(1), 1-31.
Didelez, V. and Edwards, D. (2004). Collapsibility of graphical CG-regression models, Scandinavian Journal of Statistics. To appear.
Edwards, D. (1995). Introduction to Graphical Modelling. Springer-Verlag.
Fienberg, S.E. (1980). The Analysis of Cross-Classified Categorical Data. Cambridge, MA: The MIT Press.
Fienberg, S.E. and Kim, S.-H. (1999). Combining conditional log-linear structures, Journal of the American Statistical Association, 94(445), 229-239.
Friedman, N. and Goldszmidt, M. (1998). Learning Bayesian networks with local structure, in M.I. Jordan (Ed.), Learning in Graphical Models, Dordrecht, Netherlands: Kluwer, 421-460.
Heckerman, D., Geiger, D. and Chickering, D.M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data, Machine Learning, 20, 197-243.
Kiiveri, H., Speed, T.P. and Carlin, J.B. (1984). Recursive causal models, Journal of the Australian Mathematical Society, A, 36, 30-52.
Kim, S.-H. (2004). Combining decomposable model-structures, Research Report 04-15, Division of Applied Mathematics, KAIST, Daejeon, 305-701, S. Korea.
Lauritzen, S.L. (1996). Graphical Models. Oxford: Clarendon Press.
Lauritzen, S.L., Speed, T.P. and Vijayan, K. (1984). Decomposable graphs and hypergraphs, Journal of the Australian Mathematical Society, A, 36, 12-29.
Lauritzen, S.L. and Spiegelhalter, D.J. (1988). Local computations with probabilities on graphical structures and their application to expert systems, Journal of the Royal Statistical Society, Series B, 50(2), 157-224.
Leimer, H.-G. (1989). Triangulated graphs and marked vertices, in Graph Theory in Memory of G. A. Dirac (eds: L. D. Andersen et al.), Annals of Discrete Mathematics, 41, 311-324.
Neil, J.R., Wallace, C.S. and Korb, K.B. (1999). Learning Bayesian networks with restricted causal interactions, in K.B. Laskey and H. Prade (Eds.), Uncertainty in Artificial Intelligence, San Francisco, CA: Morgan Kaufmann Publishers, 486-493.
Pearl, J. (1986). Fusion, propagation and structuring in belief networks, Artificial Intelligence, 29, 241-288.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.
Pearl, J. and Paz, A. (1987). Graphoids: a graph based logic for reasoning about relevancy relations, in Advances in Artificial Intelligence II (eds: B.D. Boulay, D. Hogg, and L. Steel), Amsterdam: North-Holland, 357-363.
Robins, J.M., Scheines, R., Spirtes, P., and Wasserman, L. (2003). Uniform consistency in causal inference, Biometrika, 90(3), 491-515.
Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search, 2nd ed. Cambridge, MA: MIT Press.
Wermuth, N. (1980). Linear recursive equations, covariance selection, and path analysis, Journal of the American Statistical Association, 75(373), 963-972.
Wermuth, N. and Lauritzen, S.L. (1983). Graphical and recursive models for contingency tables, Biometrika, 70(3), 537-552.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. New York: Wiley.