Categorical data squashing by combining factor levels
Petros Dellaportas (Athens University of Economics and Business)
Claudia Tarantola (Università di Pavia)
# 153 (03-03)
Dipartimento di economia politica e metodi quantitativi Università degli studi di Pavia Via San Felice, 5 I-27100 Pavia Marzo 2003
Abstract

We deal with contingency table data that are used to examine the relationships between a set of categorical variables or factors. We assume that such relationships can be adequately described by the conditional independence structure imposed by an undirected graphical model. If the contingency table is large, a desirable simplified interpretation can be achieved by combining some categories, or levels, of the factors. We introduce conditions under which such an operation does not alter the Markov properties of the graph. Implementation of these conditions leads to likelihood ratio tests and Bayesian model uncertainty procedures based on reversible jump MCMC. The methodology is illustrated with reference to a 2 × 3 × 4 contingency table.
Keywords: Bayesian inference, Data disclosure, Graphical models, Likelihood ratio test, Loglinear models, Reversible jump MCMC.
1 Introduction
In this paper we consider the problem of reducing the dimension of a given contingency table by merging categories, or levels, of variables. Such aggregation may be natural when an ordering of the variable levels is present, but can also be desirable when nominal variables are considered. Since we do not deal with the problem of collapsing a contingency table by marginalizing over a variable, we assume that at least one of the factors in the contingency table has more than two levels. This problem has attracted the interest of many authors in the past; see Bishop (1971), Ducharme and Lepage (1986), Goodman (1981, 1985), Guo and Geng (1995), Guo, Geng and Fung (2001), Simpson (1951), Wermuth and Cox (1998), Whittemore (1978). In particular, Wermuth and Cox (1998) have extensively illustrated, through a series of real data examples, the resulting
simplified interpretation of the association between variables when level merging is possible. Apart from these obvious advantages, level aggregation may be useful in designing level borders in future studies when, for example, precise measurement is expensive. It can also be a powerful tool in data squashing, where data are compressed in such a way that a statistical analysis carried out on the squashed data provides the same outputs that would have resulted from analyzing the entire dataset (DuMouchel, 2001). Finally, level merging is the common data disclosure technique called 'table redesign' (Willenborg and de Waal, 2000). It may replace other strategies such as cell suppression, interval protection or perturbation (Fienberg, 1997), and it is especially useful when marginal cells might themselves be sensitive. Goodman (1981) examined this problem for two-way contingency tables and presented homogeneity and structural criteria that can be used to determine whether row or column categories can be combined. Wermuth and Cox (1998) presented a log-linear parameterization and a likelihood ratio test that can be applied to higher dimensional tables. The methodology we propose is based on the same log-linear parameterization, but our suggested likelihood ratio test differs from that of Wermuth and Cox (1998). Our theoretical tool for deriving criteria that allow level merging is the conditional independence structure of the full and the aggregated table. We present an aggregation rule that allows merging of variable levels without changing the undirected graphical model that fits the data before the aggregation. We investigate both classical and Bayesian inference approaches to implement our rule.
In the classical approach, a series of likelihood ratio tests can be used to decide first the undirected graphical model that fits the full set of data and then the level aggregations that result in an aggregated table that is explained by the same graphical model. In the Bayesian approach, model uncertainty can be quantified by estimating the posterior probability of each model under the full data set and under all possible aggregations. Then there are (at least) two different strategies to proceed. Similarly to the classical approach, one can pick the aggregation (if it exists) that gives highest posterior probability to the same model as the full data table. Alternatively, one can choose the aggregation whose posterior model probability distribution, when compared with that of the full data table, maximizes some score function. This requires extensive search in the space of all models/level aggregations. This space is much larger than the space of full-data models considered in, for example, Dellaportas and Forster (1999). Therefore, following their arguments, the use of efficient model-searching algorithms such as the reversible jump MCMC procedure of Green (1995) is necessary. The plan of the paper is the following. In Section 2 we present in detail the log-linear
parameterization suggested by Wermuth and Cox (1998). In Section 3 we define the notion of level-hierarchy and we look closely at the parameters of the design matrix by discussing their interpretation and importance. Section 4 contains our main results that are described by two propositions which connect level merging and graphical models. Section 5 utilizes these results and presents classical and Bayesian model choice methodologies. Section 6 contains some concluding remarks.
2 A suitable log-linear parameterization

In this section we present in some detail a log-linear parameterization proposed by Wermuth
and Cox (1998). This is particularly useful since the contrast matrix, the inverse of the design matrix, identifies parameters that are interpreted as functions of odds ratios or conditional odds ratios. We focus on these parameters since their interpretation leads to a clear exposition of our main results. We consider categorical variables, or factors, X_1, X_2, ..., X_N ∈ V, for which the outcomes are labelled by the integers in label sets I_i, i = 1, ..., N, with I_i = {1, 2, ..., |I_i|}. Assume that we are only interested in merging adjacent labels in each factor; otherwise just permute the members of I_i appropriately. The joint distribution of the variables is represented by a (|I_1||I_2|···|I_N| × 1) vector

π^{X_1X_2...X_N} = {π^{X_1X_2...X_N}_{i_1i_2...i_N}, i_j ∈ I_j, j = 1, ..., N}.

In the following, when it is clear from the context, we omit the superscript and use π_{i_1i_2...i_N} instead of π^{X_1X_2...X_N}_{i_1i_2...i_N}. The elements of π^{X_1X_2...X_N} follow a lexicographic order, so that the levels of variable X_k change faster than those of X_ℓ if k < ℓ. For example, for N = 2,

π^{X_1X_2} = (π_11, π_21, ..., π_{|I_1|1}, ..., π_{|I_1||I_2|})′.

A log-linear parameterization in terms of a suitable matrix C_{|I_1||I_2|...|I_N|} of dimension |I_1||I_2|···|I_N| × |I_1||I_2|···|I_N| is given by

γ^{X_1X_2...X_N} = C_{|I_1||I_2|...|I_N|} log π^{X_1X_2...X_N}    (1)

where γ^{X_1X_2...X_N} is a (|I_1||I_2|···|I_N| × 1) vector of identifiable log-linear parameters. The matrix C_{|I_1||I_2|...|I_N|} is called the contrast matrix because it contains weights that define contrasts of log probabilities. It can be computed via the Kronecker product

C_{|I_1||I_2|...|I_N|} = C_{|I_N|} ⊗ C_{|I_{N−1}|} ⊗ ··· ⊗ C_{|I_1|}
where the generic matrices C_{|I|} have dimension |I| × |I| with elements C_{ij} given by

C_{ij} = 1    if i = 1 or i = j,
C_{ij} = −1   if i = j + 1,
C_{ij} = 0    otherwise.
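For concreteness, the construction of the contrast matrix via Kronecker products can be sketched in Python/NumPy. The function names are ours, and the 1-based indices of the definition are translated to 0-based indexing; this is an illustration, not the authors' code.

```python
import numpy as np
from functools import reduce

def contrast_block(size):
    """The size x size matrix C_|I|: C[i, j] = 1 if i = 1 or i = j,
    -1 if i = j + 1, 0 otherwise (translated to 0-based indices)."""
    C = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            if i == 0 or i == j:
                C[i, j] = 1.0
            elif i == j + 1:
                C[i, j] = -1.0
    return C

def contrast_matrix(level_counts):
    """Kronecker product C_|I_N| (x) ... (x) C_|I_1| for factors with the
    given numbers of levels (first factor varies fastest)."""
    blocks = [contrast_block(k) for k in reversed(level_counts)]
    return reduce(np.kron, blocks)
```

For a 3 × 2 table this reproduces the 6 × 6 matrix displayed in Section 3.2.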
The contrast matrix is non-singular (the proof is given in Appendix A). Therefore, equation (1) can be rewritten as

log π^{X_1X_2...X_N} = C^{−1}_{|I_1||I_2|...|I_N|} γ^{X_1X_2...X_N},    (2)

or

log µ^{X_1X_2...X_N} = C^{−1}_{|I_1||I_2|...|I_N|} γ^{X_1X_2...X_N} + log n,    (3)
where µ^{X_1X_2...X_N} represents the corresponding cell mean vector and n is the total sample size. It is therefore possible to transform (1) into the standard log-linear parameterization.
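Since the contrast matrix is invertible, the map between cell probabilities and γ-parameters is one-to-one. The following sketch (illustrative probabilities and sample size, our variable names) checks the round trip numerically for a 2 × 2 table, together with the relation log µ = log π + log n between cell means and cell probabilities.

```python
import numpy as np

# Contrast matrix for a 2 x 2 table: C22 = C2 (Kronecker) C2.
C2 = np.array([[1.0, 1.0],
               [-1.0, 1.0]])
C22 = np.kron(C2, C2)

# Illustrative cell probabilities in lexicographic order
# (pi_11, pi_21, pi_12, pi_22), and an assumed total sample size.
pi = np.array([0.1, 0.2, 0.3, 0.4])
n = 20

gamma = C22 @ np.log(pi)                      # equation (1)
pi_back = np.exp(np.linalg.inv(C22) @ gamma)  # invert the parameterization
mu = n * pi                                   # cell means
```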
3 The γ-vector

3.1 Level hierarchy
Standard notation for log-linear interaction models associates different sets of parameters in the γ-vector in (1) with subsets of V, or sets of factors. Following Darroch, Lauritzen and Speed (1980), these are called terms of a log-linear model, are denoted by γ_a for a ⊆ V, and their parameters are appropriately constrained to ensure identifiability. Moreover, recall that the usual definition of hierarchical models imposes that if γ_a = 0 then γ_b = 0 for all b ⊇ a. We avoid the use of terms, since the ordering of the γ-vector in (1) is important in our context, but we borrow the structure of the definition of hierarchy to define level-hierarchy as follows. For each factor X_i, i = 1, ..., N, let L_i = {12, 23, ..., (|I_i| − 1)|I_i|} be the corresponding level-merging set, which has |I_i| − 1 elements. Then, for any subset a ⊆ V with a ≠ ∅, construct the cartesian product L = ∏_{X_i∈a} L_i. Each element of γ^{X_1X_2...X_N} can be identified by a set, denoted {a; ℓ}, whose elements are 2 × 1 vectors of the form (X_i, ℓ_i)′, with a ⊆ V and ℓ ∈ L. We denote the γ-parameters as γ^a_ℓ. For example, parameter γ^{X_1.X_2}_{12.23} refers to variables X_1 and X_2 at level pairs 12 and 23 respectively, and is identified by the set {(X_1, 12)′, (X_2, 23)′}. For a = ∅ we just write γ^∅. We are now in a position to provide the following definition:

Definition 1 A log-linear model is level hierarchical if γ^a_ℓ = 0 implies γ^b_r = 0 for all {b; r} ⊇ {a; ℓ}.

Note that hierarchy implies level hierarchy but not vice versa.
3.2 Interpretation
The γ-parameters correspond to main effects and interactions of different orders for specific adjacent factor levels. Thus, for example, a model for a 3 × 2 contingency table is given by γ^{X_1X_2} = C_23 log π^{X_1X_2}; specifically,

(γ^∅, γ^{X_1}_{12}, γ^{X_1}_{23}, γ^{X_2}_{12}, γ^{X_1.X_2}_{12.12}, γ^{X_1.X_2}_{23.12})′ =

    [  1   1   1   1   1   1 ] [ log π_11 ]
    [ -1   1   0  -1   1   0 ] [ log π_21 ]
    [  0  -1   1   0  -1   1 ] [ log π_31 ]
    [ -1  -1  -1   1   1   1 ] [ log π_12 ]
    [  1  -1   0  -1   1   0 ] [ log π_22 ]
    [  0   1  -1   0  -1   1 ] [ log π_32 ].
Note that the γ-parameters can be expressed as functions of odds and odds ratios. In particular, when only two variables are considered, the main effects are functions of odds whereas the interaction terms are functions of (local) odds ratios. For example,

γ^{X_1}_{12} = log (π_21 / π_11) + log (π_22 / π_12),    γ^{X_1.X_2}_{12.12} = log (π_11 π_22) / (π_21 π_12).
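These identities are easy to verify numerically; the sketch below (arbitrary illustrative probabilities, our variable names) checks both expressions against the corresponding rows of C_23.

```python
import numpy as np

C2 = np.array([[1.0, 1.0], [-1.0, 1.0]])
C3 = np.array([[1.0, 1.0, 1.0], [-1.0, 1.0, 0.0], [0.0, -1.0, 1.0]])
C23 = np.kron(C2, C3)    # contrast matrix of the 3 x 2 example

# Arbitrary positive probabilities in lexicographic order
# (pi_11, pi_21, pi_31, pi_12, pi_22, pi_32).
pi = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])
g = C23 @ np.log(pi)

# Second entry of g: a sum of two log odds; fifth entry: a log odds ratio.
gamma_12 = np.log(pi[1] / pi[0]) + np.log(pi[4] / pi[3])
gamma_12_12 = np.log(pi[0] * pi[4] / (pi[1] * pi[3]))
```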
The lexicographic order in which the indices of γ change can be made clear by the example below. Consider a 3 × 2 × 2 table, with log-linear representation γ^{X_1X_2X_3} = C_322 log π^{X_1X_2X_3}, where

γ^{X_1X_2X_3} = (γ^∅, γ^{X_1}_{12}, γ^{X_1}_{23}, γ^{X_2}_{12}, γ^{X_1.X_2}_{12.12}, γ^{X_1.X_2}_{23.12}, γ^{X_3}_{12}, γ^{X_1.X_3}_{12.12}, γ^{X_1.X_3}_{23.12}, γ^{X_2.X_3}_{12.12}, γ^{X_1.X_2.X_3}_{12.12.12}, γ^{X_1.X_2.X_3}_{23.12.12})′,

π^{X_1X_2X_3} = (π_111, π_211, π_311, π_121, π_221, π_321, π_112, π_212, π_312, π_122, π_222, π_322)′,

and C_322 = C_2 ⊗ C_2 ⊗ C_3 = C_2 ⊗ C_32 = C_22 ⊗ C_3. In this case the interaction parameters can be interpreted in terms of conditional odds ratios; for example

γ^{X_1.X_2}_{12.12} = log (π_111 π_221) / (π_211 π_121) + log (π_112 π_222) / (π_212 π_122)    (4)
is the sum of two log conditional odds ratios. The first term on the right-hand side of equation (4) is the logarithm of the conditional odds ratio of X_1 versus X_2 given X_3 = 1, with respect to the sub-table with levels (1, 2) of X_1 and levels (1, 2) of X_2; the second term is interpreted similarly. We now fix our attention on the interpretation of the second order interaction parameters that are functions of conditional odds ratios. For simplicity, consider first three generic variables X_1, X_2 and X_3. In this case the second order interaction term γ^{X_1.X_2}_{i(i+1).j(j+1)} is the sum of |I_3| log conditional odds ratios, one for each value of X_3. The conditional odds ratio of X_1 versus X_2 given X_3 = k is given by

(π_{ijk} π_{(i+1)(j+1)k}) / (π_{i(j+1)k} π_{(i+1)jk}).    (5)
If this ratio is one, there is independence between X_1 and X_2 in the 2 × 2 sub-table specified by levels i and i + 1 of X_1, j and j + 1 of X_2, and k of X_3. If the ratio in (5) is one for all values of k, then γ^{X_1.X_2}_{i(i+1).j(j+1)} = 0. Thus, absence of a second order interaction term is related to a conditional independence. Following Wermuth and Cox (1998), there is conditional independence for levels i and i + 1 of X_1 and j and j + 1 of X_2 if and only if γ^{X_1.X_2}_{i(i+1).j(j+1)} = 0 and all higher order terms involving levels i and i + 1 of X_1, and j and j + 1 of X_2, are also zero. According to these authors, such conditional independence is an adequate criterion to impose merging of levels. More precisely, they suggest merging levels i, i + 1 of X_1 if and only if γ^{X_1.X_2}_{i(i+1).j(j+1)} = 0 for all j and all higher order terms involving levels i, i + 1 and j, j + 1 are also zero. We depart from this rule of Wermuth and Cox (1998), since we require both γ^{X_1}_{i(i+1)} = 0 and level-hierarchy. For now, notice that our extra requirement seems natural: if levels i and i + 1 of variable X_1 are merged, there seems to be no need for any γ-parameter involving these two levels to exist, so γ^{X_1}_{i(i+1)} should be omitted. In the following section we give a theoretical justification of this rule based on graphical models.
4 Conditions for level merging
Since our level merging procedure is based on (undirected) graphical model theory, we give here some necessary definitions; for more details see Lauritzen (1996) and Whittaker (1990). Let G = (V, E) be an undirected graph, where V = {1, ..., N} is the vertex set and E ⊆ V × V is the edge set. Given a graph G, a path connecting two nodes X_0 and X_m is a finite sequence of distinct vertices X_0, ..., X_m such that (X_{i−1}, X_i), i = 1, ..., m, is an edge of the graph. A graph is connected if any two nodes are connected, i.e. there is a path between them, and a connectivity component is a maximal connected subgraph. If A ⊆ V is a subset of the vertex set V, it induces the subgraph G_A = (A, E_A), where E_A = E ∩ (A × A). A subgraph is complete if there is an edge between every pair of its vertices; a clique is a maximal complete subgraph. Consider a graph G = (V, E) and a random vector X = (X_1, ..., X_N) taking values in the probability spaces (X_i)_{i∈V}; X is Markov with respect to G if for any pair of vertices (i, j) with (i, j) ∉ E, X_i and X_j are independent conditionally on the remaining variables X_{V\{i,j}},
X_i ⊥⊥ X_j | X_{V\{i,j}}. A graphical model is a family of probability distributions which are Markov with respect to a graph G. A log-linear model is graphical if it is hierarchical and its maximal interaction terms correspond to the cliques of the graph. One of the advantages derived from the use of graphical models is that we can employ the modular structure of the graph in order to construct the corresponding joint distribution. This turns out to be very useful in the proof of the following proposition:

Proposition 1 Consider a variable X_k with levels labelled by the elements of I_k. A sufficient condition for merging levels j and j + 1 of X_k without modifying the Markov properties of the corresponding graph is that

π_{i_1,...,i_{k−1},j,i_{k+1},...,i_N} = π_{i_1,...,i_{k−1},j+1,i_{k+1},...,i_N};  i_ℓ ∈ I_ℓ, ℓ = 1, ..., k − 1, k + 1, ..., N.

Proof: see Appendix B.

Assume now that we are interested in merging levels j and j + 1 of variable X_k. Then the following proposition holds:

Proposition 2 The following two conditions are equivalent. For levels j and j + 1 of variable X_k,

(i) γ^{X_k}_{j(j+1)} = 0 and γ^a_ℓ = 0 for all {a; ℓ} containing (X_k, j(j+1))′ (level hierarchical principle), and

(ii) π_{i_1,i_2,...,i_{k−1},j,i_{k+1},...,i_N} = π_{i_1,i_2,...,i_{k−1},j+1,i_{k+1},...,i_N};  i_ℓ ∈ I_ℓ, ℓ = 1, ..., k − 1, k + 1, ..., N.

Proof: see Appendix C.

Thus, condition (i) of Proposition 2 allows level merging without altering the conditional independence structure of the model. This is the basis of the development of model selection procedures, more details of which are given in the next section.
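A small numerical illustration of the propositions (the probability values are ours, chosen so that X_1 ⊥⊥ X_3 | X_2 holds and condition (ii) holds for levels 1 and 2 of X_1): merging the two equal slices leaves the conditional independence intact.

```python
import numpy as np

# Illustrative 3 x 2 x 2 distribution pi[i, j, k] for (X1, X2, X3), built so
# that X1 and X3 are conditionally independent given X2 and the first two
# X1-slices coincide (condition (ii) for levels 1, 2 of X1).
a = np.array([[0.2, 0.2, 0.6],     # P(X1 = i | X2 = j); note a[j, 0] == a[j, 1]
              [0.3, 0.3, 0.4]])
b = np.array([[0.5, 0.5],          # P(X3 = k | X2 = j)
              [0.4, 0.6]])
pj = np.array([0.45, 0.55])        # P(X2 = j)
pi = np.einsum('ji,jk,j->ijk', a, b, pj)
assert np.allclose(pi[0], pi[1])   # the slices to be merged are equal

# Merge levels 1 and 2 of X1 by summing the corresponding slices.
pi_m = np.stack([pi[0] + pi[1], pi[2]])

# The conditional odds ratio of X1 versus X3 given X2 = j is still one,
# i.e. the Markov property X1 _||_ X3 | X2 survives the merging.
for j in range(2):
    odds = (pi_m[0, j, 0] * pi_m[1, j, 1]) / (pi_m[0, j, 1] * pi_m[1, j, 0])
    assert np.isclose(odds, 1.0)
```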
5 Model selection procedures
This section suggests classical and Bayesian model selection methodologies that incorporate condition (i) of Proposition 2. Our target is to develop an efficient methodology that deals simultaneously with uncertainty in both the model and the aggregation level. We consider graphical models and aggregations that obey level-hierarchy. Thus, there are 2^{N(N−1)/2} possible graphical models, since there are N(N−1)/2 possible edges in the graph. The number of level aggregations depends on whether we deal with ordered or nominal variables. For an ordinal variable X_k, with |I_k| > 2, there are (|I_k| − 1)! possible level aggregations. For a nominal variable X_k, with |I_k| > 2, there are

∏_{j=0}^{|I_k|−3} ( (|I_k| − j) choose 2 )

possible level aggregations. It is evident that, no matter whether one wishes to choose a classical or a Bayesian approach, exploration of the space of all models under all possible data aggregations is an extremely demanding computational task: for M possible graphical models and L possible level aggregations, there are M × L possible models under consideration. It is more convenient to develop our procedures by operating with the standard log-linear model formulation (3). The design matrix for the full (saturated) model is given by Z = C^{−1}_{|I_1||I_2|...|I_N|}, so we can rewrite (3) as

log µ^{X_1X_2...X_N} = log µ = Z γ^{X_1X_2...X_N} + log n.    (6)
Thus, standard likelihood ratio tests (Agresti, 1990) and Bayesian model determination MCMC procedures (Dellaportas and Forster, 1999) can both be immediately applied.
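The model and aggregation counts above can be sketched directly (function names are ours, following the formulas as stated in the text):

```python
from math import comb, factorial

def n_graphical_models(N):
    """2 to the (N choose 2): one binary edge choice per vertex pair."""
    return 2 ** comb(N, 2)

def n_ordinal_aggregations(levels):
    """(|I_k| - 1)! aggregations for an ordinal factor, as stated above."""
    return factorial(levels - 1)

def n_nominal_aggregations(levels):
    """prod_{j=0}^{|I_k|-3} binom(|I_k| - j, 2) for a nominal factor."""
    out = 1
    for j in range(levels - 2):
        out *= comb(levels - j, 2)
    return out
```

For example, N = 3 factors give 2^3 = 8 graphical models, and a nominal factor with 4 levels gives 6 × 3 = 18 aggregations.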
5.1 Likelihood ratio tests
We assume that the problem posed is to find a data aggregation that has the same graphical model as the full data set. We propose the following procedure. First, the best graphical model for the full data set (without any aggregation) is chosen through the usual likelihood ratio tests that operate on a series of nested models; see, for example, Agresti (1990). Then, for every level merging that is proposed, the corresponding γ-parameters that are derived from condition (i) of Proposition 2 are removed from the contrast matrix C|I1 ||I2 |...|IN | and the design matrix Z that corresponds to formulation (6) is derived as its inverse. Again, standard likelihood ratio tests can then be applied, since the level aggregation defines a natural hierarchy of models. This procedure has been advocated by Wermuth and Cox (1998), although, as pointed out earlier, the γ-parameters set to zero for level merging were different.
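A minimal sketch of such a deviance-based likelihood ratio test, using a hand-rolled Poisson iterative weighted least squares fit. The dummy-coded design matrix and the counts below are illustrative only, not the inverse-contrast design matrix Z of (6).

```python
import numpy as np
from scipy.stats import chi2

def fit_loglinear(y, Z, n_iter=50):
    """Fit log mu = Z gamma to Poisson counts y by iterative weighted
    least squares; returns the fitted cell means."""
    gamma = np.zeros(Z.shape[1])
    gamma[0] = np.log(y.mean())
    for _ in range(n_iter):
        mu = np.exp(Z @ gamma)
        z = Z @ gamma + (y - mu) / mu          # working response
        gamma = np.linalg.solve(Z.T @ (mu[:, None] * Z), Z.T @ (mu * z))
    return np.exp(Z @ gamma)

def deviance(y, mu):
    """Likelihood ratio statistic G2 against the saturated model."""
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

# 2 x 2 illustration: independence model with dummy coding
# (intercept, row effect, column effect); these counts are exactly independent.
y = np.array([10.0, 20.0, 30.0, 60.0])
Z = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
mu_hat = fit_loglinear(y, Z)
G2 = deviance(y, mu_hat)
df = y.size - Z.shape[1]
p_value = chi2.sf(G2, df)
```

Since the table is exactly independent, the deviance is essentially zero and the test does not reject.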
5.2 Bayesian model determination
From a Bayesian perspective, all the information required concerning the model uncertainty is contained in the posterior model probability function

f(m|y) = f(m) f(y|m) / Σ_{m′∈M} f(m′) f(y|m′),   m ∈ M,    (7)
where the data y are considered to have been generated by a model m, one of a set M of competing models, f(y|m) is the marginal likelihood of model m, and f(m) is the prior probability of model m. For contingency tables of moderate size, the number of graphical models |M| can be very large and the marginal likelihood of each model is unavailable (see, however, Madigan and York (1995) for an algorithm for the smaller class of decomposable graphical models). Therefore, reversible jump Markov chain Monte Carlo (MCMC) techniques (Green, 1995) that search the model space and provide estimates of f(m|y), m ∈ M, offer an alternative to calculating (7); see, for example, Dellaportas and Forster (1999) and King and Brooks (2001). Having calculated the posterior model probabilities f(m|y), m ∈ M, of the full data set, assume that we are asked to pick the best data aggregation. We suggest choosing the data aggregation that produces the posterior model probability function g(m|y), m ∈ M, that maximizes a (proper) score function. Since |M| is large, it is unlikely that all estimates of f(m|y), m ∈ M, are positive, so the score function must remain well defined when some probability masses are zero. We suggest the quadratic (Brier) score

− [ Σ_{m∈M} g²(m|y) − 2 Σ_{m∈M} f(m|y) g(m|y) + 1 ]    (8)

or the spherical score

Σ_{m∈M} f(m|y) g(m|y) / [ Σ_{m∈M} g²(m|y) ]^{1/2}.    (9)
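The two scores can be sketched directly (function names are ours); both remain well defined when some of the estimated probabilities are zero, unlike the logarithmic score.

```python
import numpy as np

def brier_score(f, g):
    """Quadratic (Brier) score of g against f; larger is better."""
    f, g = np.asarray(f), np.asarray(g)
    return -(np.sum(g ** 2) - 2.0 * np.sum(f * g) + 1.0)

def spherical_score(f, g):
    """Spherical score of g against f; larger is better."""
    f, g = np.asarray(f), np.asarray(g)
    return np.sum(f * g) / np.sqrt(np.sum(g ** 2))
```

Both scores are maximized (for fixed f) when g matches f, and the spherical score is zero when f and g have disjoint support.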
A second question that should be answered is whether the 'best' aggregation should be made at all or, equivalently, how large scores (8) or (9) should be to allow an aggregation. This should be based upon the relative interest in pursuing data aggregation. For example, a rule that focuses on the 'best model' could require that both f and g have the same mode. On the other hand, an aggregation that changes the 'best model' but preserves a high score would make little difference if prediction, possibly performed via model averaging, is of primary interest. To be able to calculate (8) or (9) for all models and all aggregations, we use a reversible jump MCMC algorithm that explores efficiently the space of all possible models under all possible level aggregations. This algorithm closely resembles that of Dellaportas and Forster (1999), although now the algorithm visits a model by visiting a combination of a graphical model and a level aggregation. At each step we propose either to stay in the current model or to jump to another model, with equal probability. Our proposal densities for jump moves, unlike those of Dellaportas and Forster (1999) that are derived from pilot runs, are based on maximum likelihood estimates calculated with the Newton-Raphson algorithm. This procedure is not performed for all M × L models but only for the models that are proposed to be visited by the MCMC algorithm. This is a necessary alteration from the
algorithm of Dellaportas and Forster (1999), since the design matrix Z does not possess the nice block-diagonal property of the sum-to-zero design matrix, so it is unlikely that parameter proposals from the saturated model can serve as good proposals for all models. To update the parameters within the same model, we use a block random walk Metropolis-Hastings move whose proposal density is a multivariate normal with mean equal to the current point and variance equal to the (properly scaled) maximum likelihood estimated covariance matrix; see, for example, Dellaportas and Roberts (2003). We believe that, since the maximum likelihood estimates are obtained in order to serve as proposals for moves between models, this MCMC strategy is superior to the one proposed by Dellaportas and Smith (1993) and used by Dellaportas and Forster (1999). The prior density on the model space is a discrete uniform prior with probability 1/(ML) for each model. However, the prior densities on the parameter space require special care to avoid Lindley's paradox. Dellaportas and Forster (1999) have argued that a default prior for the parameters β of a log-linear model of the form log µ = Xβ, where µ is the mean vector and X is a design matrix satisfying sum-to-zero constraints, is given by β ∼ N(θ, 2d(X^T X)^{−1}), where d is the number of cells and θ is a vector of all zeros except the first element, which is equal to log n. Direct correspondence of this model with (6) leads to Xβ = Zγ + log n, or

γ = (Z^T Z)^{−1} Z^T Xβ − (Z^T Z)^{−1} Z^T log n.    (10)

Therefore, a default prior is given by

γ ∼ N(θ − (Z^T Z)^{−1} Z^T log n, 2d(Z^T Z)^{−1}).    (11)
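A sketch of the prior construction, following formula (11) as stated, for an illustrative 2 × 2 table; the sample size and the variable names are our assumptions.

```python
import numpy as np

# Illustrative 2 x 2 table: Z is the inverse of the contrast matrix,
# d the number of cells, n an assumed total sample size.
C2 = np.array([[1.0, 1.0], [-1.0, 1.0]])
Z = np.linalg.inv(np.kron(C2, C2))
d = Z.shape[0]
n = 100

# Prior mean and covariance for gamma, as in (11).
ZtZ_inv = np.linalg.inv(Z.T @ Z)
theta = np.zeros(d)
theta[0] = np.log(n)
prior_mean = theta - ZtZ_inv @ Z.T @ np.full(d, np.log(n))
prior_cov = 2.0 * d * ZtZ_inv
```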
5.3 Alcohol data
The performance of our model selection procedures has been tested with reference to a well known data set presented by Knuiman and Speed (1988). This data set concerns a small study held in Western Australia on the relationship between Alcohol intake (A), Obesity (O) and High blood pressure (H); see Table 1. For this contingency table there exist 21 possible level aggregations, and each of them can be represented by 8 different graphs. For the full data set, likelihood ratio tests strongly support the no-interaction model H + O + A. Likelihood ratio tests against all aggregations under this model suggest that no aggregation can be accepted. The reversible jump MCMC under the full data set gives positive probabilities to only four models; see Table 2 and Dellaportas and Forster (1999). We performed a reversible jump MCMC that searches among all models and data aggregations. The resulting frequencies of visits were appropriately normalized within aggregations to give estimates of posterior model
Table 1: Knuiman and Speed's data

                          Alcohol intake (drinks/day)
  Obesity    High BP      0     1-2    3-5    6+
  Low        Yes          5      9      8     10
             No          40     36     33     24
  Average    Yes          6      9     11     14
             No          33     23     35     30
  High       Yes          9     12     19     19
             No          24     25     28     29
probabilities for every aggregation. The results, based on 100,000 iterations, are given in Table 2. Note that only four different data aggregations present positive probabilities. In Table 3 the values of the Brier score and the spherical score for these aggregations are reported; both scores are maximized by the last aggregation, which corresponds to the data structure in Table 4. The mixing of the algorithm was also very good. To verify that the space of posterior model probabilities is quickly explored, we started the algorithm 100 times from randomly chosen combinations of models and aggregations. Within 1000 iterations the algorithm had visited all combinations with the highest posterior probabilities. In every MCMC run, the frequencies reported in Table 2 were obtained with relative precision 0.01.
6 Concluding remarks
We have provided a methodology to deal with the problem of merging levels in multi-way contingency tables. To our knowledge, this is the first attempt to justify such a procedure with graphical models. There are two problems emerging in this scenario. The first seeks an answer as to whether any merging should be made at all. The second, imposed for instance by data disclosure or data squashing aspects of the data, assumes that a merging is necessary and seeks the best aggregation. We feel that the likelihood ratio test can provide an answer to the first problem, but the reversible jump MCMC algorithm is better suited to answer the second. In any case, the criteria we presented are only indicative and depend on the application at hand. We have outlined a series of fields in which level merging is useful. Although initial applications of such techniques (Goodman, 1981, Wermuth and Cox, 1998) focused on social science applications,
Table 2: RJMCMC selection among all models and data aggregations

  Data structure for variables     Model     Posterior
  O, H and A respectively                    probabilities
  (1.2.3)(1.2)(1.2.3.4)            O+H+A     0.6802
                                   OH+A      0.3132
                                   O+HA      0.0046
                                   OH+HA     0.0020
  (1.2.3)(1.2)(1.23.4)             OH+OA     0.5000
                                   OHA       0.5000
  (12.3)(1.2)(123.4)               OH+OA     0.8800
                                   OHA       0.1200
  (12.3)(1.2)(1.234)               O+H+A     0.0332
                                   OH+A      0.2056
                                   O+HA      0.0904
                                   OA+H      0.0040
                                   OH+OA     0.0221
                                   OH+HA     0.6282
                                   OA+HA     0.0120
                                   OHA       0.0045
  (1.23)(1.2)(1.234)               O+H+A     0.1238
                                   OH+A      0.1494
                                   O+HA      0.2563
                                   OA+H      0.0078
                                   OH+OA     0.0105
                                   OH+HA     0.4254
                                   OA+HA     0.0266
                                   OHA       0.0002
Table 3: Brier score and spherical score

  Data structure for variables
  O, H and A respectively      Brier score    Spherical score
  (1.2.3)(1.2)(1.23.4)         −1.5           0
  (12.3)(1.2)(123.4)           −1.7888        0
  (12.3)(1.2)(1.234)           −1.269552      0.1326148
  (1.23)(1.2)(1.234)           −1.019119      0.2491104
Table 4: The new data aggregation

                            Alcohol intake (drinks/day)
  Obesity           High BP     0     1+
  Low               Yes         5     27
                    No         40     93
  Average + High    Yes        15     84
                    No         57    170
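The aggregation itself is just summation of the merged cells; as a check, the sketch below (our code) reproduces Table 4 from the counts of Table 1.

```python
import numpy as np

# Table 1 counts: axes are (obesity: low/average/high,
# blood pressure: yes/no, alcohol: 0 / 1-2 / 3-5 / 6+).
counts = np.array([[[ 5,  9,  8, 10], [40, 36, 33, 24]],
                   [[ 6,  9, 11, 14], [33, 23, 35, 30]],
                   [[ 9, 12, 19, 19], [24, 25, 28, 29]]])

# Merge alcohol levels 1-2, 3-5 and 6+ into '1+'.
alcohol_m = np.stack([counts[..., 0], counts[..., 1:].sum(axis=-1)], axis=-1)

# Merge obesity levels 'average' and 'high' into 'average + high'.
table4 = np.stack([alcohol_m[0], alcohol_m[1] + alcohol_m[2]])
```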
it seems to us that modern data handling necessitates the development of such methods in research areas such as data squashing and data disclosure.
7 Acknowledgment
This work has been supported by EU TMR network ERB-RFMRX-CT96-0095 on “Computational and Statistical methods for the analysis of spatial data”. The second author also acknowledges support from “Progetto Giovani Ricercatori 2001”, University of Pavia.
References

Agresti, A. (1990). Categorical Data Analysis. Wiley, New York.

Bishop, Y. M. M. (1971). Effects of collapsing multidimensional contingency tables. Biometrics, 27, 545-562.

Darroch, J. N., Lauritzen, S. L. and Speed, T. P. (1980). Markov fields and log-linear interaction models for contingency tables. Ann. Statist., 8, 522-539.

Dellaportas, P. and Forster, J. J. (1999). Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika, 86, 615-633.

Dellaportas, P. and Roberts, G. (2003). Introduction to MCMC. In Spatial Statistics and Computational Methods, J. Møller editor, Springer-Verlag, New York.

Dellaportas, P. and Smith, A. F. M. (1993). Bayesian inference for generalised linear and proportional hazards models via Gibbs sampler. J. R. Statist. Soc. C, 42, 443-459.

Ducharme, G. R. and Lepage, Y. (1986). Testing collapsibility in contingency tables. J. R. Statist. Soc. B, 48, 197-205.

DuMouchel, W. (2001). Data squashing: constructing summary data sets. In Handbook of Massive Data Sets, J. Abello editor, Kluwer Academic Publishers, USA.

Fienberg, S. E. (1997). Confidentiality and disclosure limitation methodology: challenges for national statistics and statistical research. Technical report 668, Carnegie Mellon University, USA.

Goodman, L. A. (1981). Criteria for determining whether certain categories in a cross-classification table should be combined, with special reference to occupational categories in an occupational mobility table. American J. Sociology, 87, 612-650.

Goodman, L. A. (1985). The analysis of cross-classified data having ordered and/or unordered categories: association models, correlation models, and asymmetry models for contingency tables with or without missing entries. Ann. Statist., 13, 10-69.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711-732.

Guo, J. and Geng, Z. (1995). Collapsibility of logistic regression coefficients. J. R. Statist. Soc. B, 57, 263-267.

Guo, J., Geng, Z. and Fung, W. (2001). Consecutive collapsibility of odds ratios over an ordinal background variable. J. Multiv. Anal., 79, 89-98.

King, R. and Brooks, S. P. (2001). Prior induction in log-linear models for general contingency table analysis. Ann. Statist., 29, 715-747.

Knuiman, M. W. and Speed, T. P. (1988). Incorporating prior information into the analysis of contingency tables. Biometrics, 44, 1061-1071.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press, Oxford.

Madigan, D. and York, J. (1995). Bayesian graphical models for discrete data. Inter. Statist. Review, 63, 215-232.

Simpson, E. (1951). The interpretation of the interaction in contingency tables. J. R. Statist. Soc. B, 13, 238-241.

Wermuth, N. and Cox, D. R. (1998). On the application of conditional independence to ordinal data. Inter. Statist. Review, 66, 181-199.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley, Chichester.

Whittemore, A. S. (1978). Collapsibility of multidimensional contingency tables. J. R. Statist. Soc. B, 40, 328-340.

Willenborg, L. and de Waal, T. (2000). Elements of Statistical Disclosure Control. Springer-Verlag, New York.
Appendix A: Proof of the non-singularity of the contrast matrix

First note that det(C_2) = 2 > 0. Assume that det(C_{N−1}) > 0 for some positive integer N. Expanding along the last column, det(C_N) = (−1)^{N+1} det(A) + det(C_{N−1}), where A is an upper triangular matrix of dimension N − 1 with all −1 on the main diagonal. Thus det(C_N) = det(C_{N−1}) + 1 > 0. The result follows by induction and from the fact that, for positive integers N and M, det(C_N ⊗ C_M) = det(C_N)^M det(C_M)^N.
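The induction gives det(C_2) = 2 and det(C_N) = det(C_{N−1}) + 1, hence det(C_N) = N. A quick numerical check of the recursion and of the Kronecker determinant identity (our code):

```python
import numpy as np

def contrast(N):
    """The N x N block C_N: first row all ones, then +1 on the diagonal
    and -1 on the subdiagonal (0-based translation of the definition)."""
    C = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i == 0 or i == j:
                C[i, j] = 1.0
            elif i == j + 1:
                C[i, j] = -1.0
    return C

# det(C_2) = 2 and det(C_N) = det(C_{N-1}) + 1, hence det(C_N) = N.
dets = [np.linalg.det(contrast(N)) for N in range(2, 8)]
```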
Appendix B: Proof of Proposition 1

Preliminaries. Our proof is based on the concepts of graph decomposition and maximal irreducible components. We start by providing the basic terminology and an important result. Given a random vector X^V, there exists a decomposition, or equivalently X^V is reducible, if and only if there exists a partition of X^V into three subsets (X^A, X^B, X^C) such that X^B ⊥⊥ X^C | X^A, neither B nor C is empty, and the subgraph on A is complete. If such a decomposition does not exist, then X^V is said to be irreducible. The random vectors X^{d_1}, ..., X^{d_m} are called the maximal irreducible components of X^V if and only if each vector X^{d_i}, i = 1, ..., m, is an irreducible component of X^V, no subset d_i is a proper subset of any other d_j, and d_1 ∪ d_2 ∪ ··· ∪ d_m = V. Then (see, for example, Whittaker, 1990, p. 385) the maximal irreducible components of X^V corresponding to the subsets {d_1, d_2, ..., d_m} are unique, and the density function of X^V, π^V, factorizes uniquely into

π^V = (π^{d_1} π^{d_2} ··· π^{d_m}) / g,    (12)

where g is a product of marginal density functions, g = ∏ π^s, in which each subset s is an intersection of irreducible components and is complete.

Proof. We assume throughout that the examined graph is connected; if it is not, each single connected component can be dealt with independently. Suppose, without loss of generality, that we want to merge levels j and j + 1 of variable X_1. The new level created after the merging is denoted by j*. Clearly, the merging will influence π^{X_1,...,X_N} through the factors in (12) that involve π_{j,i_2,...,i_N} and π_{j+1,i_2,...,i_N} for all possible values of i_2, ..., i_N. If π_{j,i_2,...,i_N} = π_{j+1,i_2,...,i_N} then

π_{j*,i_2,...,i_N} = 2 π_{j,i_2,...,i_N}.    (13)
We now proceed by examining two different situations separately.

(1) In the graph there is only one maximal irreducible component. If the graph is complete, the joint probability density cannot be factorized either before or after the merging, so the graphical structure does not change. If the graph is not complete, there is at least one node, say $X_\ell$, such that $X_1 \perp\!\!\!\perp X_\ell \mid X_{V\setminus\{1,\ell\}}$. Therefore, the joint probability density factorizes as
$$\pi^{X_1,\ldots,X_N} = \pi^{X_1 \mid X_{V\setminus\{1,\ell\}}}\, \pi^{X_\ell \mid X_{V\setminus\{1,\ell\}}}\, \pi^{X_{V\setminus\{1,\ell\}}} \qquad (14)$$
and after the merging we obtain
$$\pi_{j^*,i_2,\ldots,i_N} = 2\pi_{j,i_2,\ldots,i_N} = 2\, \pi^{X_1 \mid X_{V\setminus\{1,\ell\}}}_{j \mid i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}\, \pi^{X_\ell \mid X_{V\setminus\{1,\ell\}}}_{i_\ell \mid i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}\, \pi^{X_{V\setminus\{1,\ell\}}}_{i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}. \qquad (15)$$
Now notice that the first factor on the right-hand side above can be written, using the merging condition, as
$$2\, \pi^{X_1 \mid X_{V\setminus\{1,\ell\}}}_{j \mid i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N} = \frac{2\, \pi^{V\setminus\ell}_{j,i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}}{\pi^{V\setminus\{1,\ell\}}_{i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}} = \frac{\pi^{V\setminus\ell}_{j,i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N} + \pi^{V\setminus\ell}_{j+1,i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}}{\pi^{V\setminus\{1,\ell\}}_{i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}} = \frac{\pi^{V\setminus\ell}_{j^*,i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}}{\pi^{V\setminus\{1,\ell\}}_{i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}} = \pi^{X_1 \mid X_{V\setminus\{1,\ell\}}}_{j^* \mid i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N},$$
so (15) can be written as
$$\pi_{j^*,i_2,\ldots,i_N} = \pi^{X_1 \mid X_{V\setminus\{1,\ell\}}}_{j^* \mid i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}\, \pi^{X_\ell \mid X_{V\setminus\{1,\ell\}}}_{i_\ell \mid i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}\, \pi^{X_{V\setminus\{1,\ell\}}}_{i_2,\ldots,i_{\ell-1},i_{\ell+1},\ldots,i_N}, \qquad (16)$$
which implies that the conditional independence structure has not changed.

(2) In the graph there is more than one maximal irreducible component. Assume that there are $m > 1$ maximal irreducible components. In this case factorization (12) holds, and every variable, in particular $X_1$, appears in $k$ ($k = 1, \ldots, m$) maximal irreducible components and in $k-1$ separators. We first consider the case $k = 2$ and, without loss of generality, let $d_1 = \{1, 2, \ldots, \xi\}$ and $d_2 = \{1, \xi+1, \xi+2, \ldots, N\}$. Then we obtain
$$\pi^{X_1,X_2,\ldots,X_N}_{j,i_2,\ldots,i_N} = \frac{\pi^{X_1,X_2,\ldots,X_\xi}_{j,i_2,\ldots,i_\xi}\; \pi^{X_1,X_{\xi+1},\ldots,X_N}_{j,i_{\xi+1},\ldots,i_N}}{\pi^{X_1}_j}. \qquad (17)$$
By aggregating levels $j$ and $j+1$ of variable $X_1$ we obtain the following relations:
(i) $\pi^{X_1,X_2,\ldots,X_N}_{j^*,i_2,\ldots,i_N} = 2\,\pi^{X_1,X_2,\ldots,X_N}_{j,i_2,\ldots,i_N}$
(ii) $\pi^{X_1,X_2,\ldots,X_\xi}_{j^*,i_2,\ldots,i_\xi} = 2\,\pi^{X_1,X_2,\ldots,X_\xi}_{j,i_2,\ldots,i_\xi}$
(iii) $\pi^{X_1,X_{\xi+1},\ldots,X_N}_{j^*,i_{\xi+1},\ldots,i_N} = 2\,\pi^{X_1,X_{\xi+1},\ldots,X_N}_{j,i_{\xi+1},\ldots,i_N}$
(iv) $\pi^{X_1}_{j^*} = 2\,\pi^{X_1}_j$
Inserting equations (i)–(iv) in (17) we have
$$\pi^{X_1,\ldots,X_N}_{j^*,i_2,\ldots,i_N} = \frac{\pi^{X_1,X_2,\ldots,X_\xi}_{j^*,i_2,\ldots,i_\xi}\; \pi^{X_1,X_{\xi+1},\ldots,X_N}_{j^*,i_{\xi+1},\ldots,i_N}}{\pi^{X_1}_{j^*}}, \qquad (18)$$
which implies that the conditional independence structure has remained the same after the merging.
If $k > 2$, the right-hand side of (12) has $k$ terms in the numerator and $k-1$ terms in the denominator involving $X_1$. Thus $2k$ expressions similar to (i)–(iv) can be obtained, resulting in an equation of the form (18), which proves that the graphical structure remains unaltered.
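A small numerical illustration of Proposition 1 may help (a sketch with made-up numbers, not part of the proof). For a hypothetical $3 \times 2 \times 2$ table satisfying $X_2 \perp\!\!\!\perp X_3 \mid X_1$, in which levels 1 and 2 of $X_1$ satisfy the merging condition $\pi_{1,i_2,i_3} = \pi_{2,i_2,i_3}$, the conditional independence survives the merging:

```python
import numpy as np

# Hypothetical 3x2x2 joint table built so that X2 and X3 are conditionally
# independent given X1: pi(x1, x2, x3) = pi(x1) pi(x2 | x1) pi(x3 | x1).
p1 = np.array([0.25, 0.25, 0.50])       # levels 1 and 2 of X1 equiprobable
p2 = np.array([[0.3, 0.7],              # pi(x2 | x1), one row per X1 level;
               [0.3, 0.7],              # rows 1 and 2 are identical, so the
               [0.6, 0.4]])             # condition pi_{1,i2,i3} = pi_{2,i2,i3}
p3 = np.array([[0.8, 0.2],              # of (13) holds cell by cell
               [0.8, 0.2],
               [0.1, 0.9]])
pi = p1[:, None, None] * p2[:, :, None] * p3[:, None, :]
assert np.isclose(pi.sum(), 1.0)

# Merge levels 1 and 2 of X1 into the single level j*.
pi_merged = np.concatenate([pi[:2].sum(axis=0, keepdims=True), pi[2:]], axis=0)

# X2 and X3 remain conditionally independent given the merged X1.
for slab in pi_merged:
    cond = slab / slab.sum()            # pi(x2, x3 | x1)
    assert np.allclose(cond, np.outer(cond.sum(axis=1), cond.sum(axis=0)))
```

If the two merged slices of the table are not proportional in this way, the outer-product check fails and the merged table no longer respects the original graph.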
Appendix C: Proof of Proposition 2

Let $X_1, X_2, \ldots, X_N$ be a set of variables with levels taking values in the sets $I_1, I_2, \ldots, I_N$, and let $C_{|I_1|}, C_{|I_2|}, \ldots, C_{|I_N|}$ be the corresponding set of contrast matrices. Assume, without loss of generality, that merging of levels $j$ and $j+1$ of variable $X_1$ is under question.

Define the subvectors of $\gamma$ and $\pi$ as follows: $\tilde\gamma = \{\gamma_{ab} : ab \supseteq (j(j+1))^{X_1}\}$, the elements of $\gamma$ whose index includes the contrast between levels $j$ and $j+1$ of $X_1$; $\tilde\pi$ has all elements of the vector $\tilde\pi_j = \{\pi_{j,i_2,\ldots,i_N};\; i_k = 1, \ldots, |I_k|;\; k = 2, \ldots, N\}$ and all elements of the vector $\tilde\pi_{j+1} = \{\pi_{j+1,i_2,\ldots,i_N};\; i_k = 1, \ldots, |I_k|;\; k = 2, \ldots, N\}$, so that $\tilde\pi$ is a vector of dimension $(2|I_2| \cdots |I_N|) \times 1$. Both $\tilde\gamma$ and $\tilde\pi$ respect the initial ordering of the elements of $\gamma$ and $\pi$. The submatrix $\tilde C$ of $C$, constructed from the rows of $C$ corresponding to the elements of $\tilde\gamma$ and the columns of $C$ corresponding to the elements of $\tilde\pi$, is such that
$$\tilde\gamma = \tilde C \log_e \tilde\pi. \qquad (19)$$

Furthermore, we partition the matrix $\tilde C$ into the two submatrices $\tilde C(j) = C_{|I_N|} \otimes \cdots \otimes C_{|I_2|} \otimes C^{j+1,j}_{|I_1|}$ and $\tilde C(j+1) = C_{|I_N|} \otimes \cdots \otimes C_{|I_2|} \otimes C^{j+1,j+1}_{|I_1|}$; recall that superscripts on the $C$ matrices denote their elements. Since $C^{j+1,j}_{|I_1|} = -1$ and $C^{j+1,j+1}_{|I_1|} = 1$ by the definition of the matrices $C$, we have $\tilde C(j) + \tilde C(j+1) = 0$, so (19) can be written as $\tilde\gamma = \tilde C(j) \log_e \tilde\pi_j + \tilde C(j+1) \log_e \tilde\pi_{j+1}$.

Now notice that, by construction, $\tilde C(j)$ has rows corresponding to the elements of $\tilde\gamma$ and columns corresponding to the elements of $\tilde\pi_j$, whereas $\tilde C(j+1)$ has rows corresponding to the elements of $\tilde\gamma$ and columns corresponding to the elements of $\tilde\pi_{j+1}$. Both matrices are non-singular; see Appendix A. Thus, if $\tilde\gamma = 0$, the equation $\tilde C(j) \log_e \tilde\pi_j = -\tilde C(j+1) \log_e \tilde\pi_{j+1}$ determines $\tilde\pi_{j+1}$ uniquely from $\tilde\pi_j$; since $\tilde\pi_{j+1} = \tilde\pi_j$ satisfies the equation, it is the unique solution. This proves the proposition.
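The contrast matrices $C$ are defined in the main text and not reproduced here. The sketch below assumes an adjacent-difference form (first row all ones; row $i$ has $-1$ in column $i-1$ and $1$ in column $i$), which is consistent with $\det(C_2) = 2$, with the recursion $\det(C_N) = \det(C_{N-1}) + 1$ of Appendix A, and with $C^{j+1,j} = -1$, $C^{j+1,j+1} = 1$; it is an assumption for illustration only. Under that assumption it checks, for a $3 \times 2$ table, that the elements of $\gamma$ involving the $(j, j+1)$ contrast of $X_1$ vanish when $\tilde\pi_j = \tilde\pi_{j+1}$:

```python
import numpy as np

def contrast_matrix(n):
    # Assumed adjacent-difference contrast matrix: first row all ones,
    # row i (i >= 2) has -1 in column i-1 and 1 in column i.
    C = np.zeros((n, n))
    C[0, :] = 1.0
    for i in range(1, n):
        C[i, i - 1], C[i, i] = -1.0, 1.0
    return C

C3, C2 = contrast_matrix(3), contrast_matrix(2)
# Consistency with Appendix A: det(C_2) = 2 and det(C_3) = det(C_2) + 1.
assert round(np.linalg.det(C2)) == 2 and round(np.linalg.det(C3)) == 3

# A 3x2 table (the X1 index varying fastest, matching C_{|I2|} (x) C_{|I1|})
# with pi_{1,i} = pi_{2,i}: levels 1 and 2 of X1 satisfy the merging condition.
pi = np.array([0.10, 0.10, 0.15, 0.20, 0.20, 0.25])
gamma = np.kron(C2, C3) @ np.log(pi)

# The gamma elements built from row 2 of C3 (the contrast between levels
# 1 and 2 of X1) play the role of gamma-tilde in the proof; they vanish.
gamma_tilde = gamma[[b * 3 + 1 for b in range(2)]]
assert np.allclose(gamma_tilde, 0.0)
```

Perturbing the second triple of cell probabilities so that the two $X_1$ slices are no longer equal makes `gamma_tilde` nonzero, matching the uniqueness argument above.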