Computing Exact P-values in Incomplete Multi-way Tables∗

Technical Report no. 548
Department of Statistics, University of Washington

Adrian Dobra
Department of Statistics, University of Washington, Seattle, WA 98195
[email protected]

January 16, 2009

Abstract

I develop a new Markov chain algorithm for sampling from sets of multi-way contingency tables defined by an arbitrary set of fixed marginals and by lower and upper bounds constraints on cell counts. My procedure is called the Bounds Sampling Algorithm (BSA) and it relies on the existence of a method to calculate lower and upper bounds for cell entries. BSA accommodates any pattern of structural zeros or, more generally, of missing cells. I investigate the validity of my approach in several examples, some of which have not been previously analyzed with exact testing methods.

Keywords: Contingency tables, Exact tests, Markov chain Monte Carlo, Structural zeros, Volume test.

1 Introduction

Sampling from sets of contingency tables is key for performing exact conditional tests. Such tests arise by eliminating nuisance parameters through conditioning on their minimal sufficient statistics (Agresti, 1992). They are needed when the validity of asymptotic approximations to the null distributions of test statistics of interest is questionable or when no such approximations

∗ Adrian Dobra is Assistant Professor, Department of Statistics, University of Washington, Seattle, WA 98195 (email: [email protected]).


are available. For example, Kreiner (1987) argues against the use of large-sample χ2 approximations for goodness-of-fit tests in large sparse tables, while Haberman (1988) raises similar concerns for tables having expected cell counts that are both small and large. The problem is further compounded by the existence of structural zeros or by limits on the values allowed in each cell, e.g. occurrence matrices in ecological studies described in Chen et al. (2005). One of the earliest algorithms for sampling two-way contingency tables with fixed row and column totals is due to Mehta and Patel (1983). Since then many other valuable procedures have been proposed. In particular, key developments include the importance sampling approach of Booth and Butler (1999) as well as the sequential importance sampling approaches of Chen et al. (2005) and Chen et al. (2006). Various Markov chain algorithms have been proposed by Besag and Clifford (1989), Guo and Thompson (1992), Forster et al. (1996) and Caffo and Booth (2001). A very good review is presented in Caffo and Booth (2003). One of the central contributions to the literature was the seminal paper by Diaconis and Sturmfels (1998). They generate tables in a reference set T through a Markov basis. The fundamental concept behind a Markov basis is easily understood by considering all the possible differences of tables in T, i.e. M = {n′ − n″ : n′, n″ ∈ T}. The elements f ∈ M are called moves. Any table n′ ∈ T can be transformed into another table n″ ∈ T by applying the move f = n″ − n′ ∈ M. Clearly not all the moves in M are needed to connect any two tables in T through a series of moves. A Markov basis MB for T is obtained by eliminating some of the moves in M such that the remaining moves still connect T. Generating a Markov basis is, in the most general case, a computationally difficult task that is solved using computational algebraic techniques.
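The move mechanics are easy to see in code. A minimal pure-Python sketch (my own illustration, not the paper's software): the difference of two tables with the same fixed margins is a move, and applying it to a table leaves those margins unchanged.

```python
def margins(t):
    """Row and column totals of a two-way table stored as nested lists."""
    rows = [sum(r) for r in t]
    cols = [sum(c) for c in zip(*t)]
    return rows, cols

def apply_move(t, f):
    """Apply a move f = n'' - n' to a table; the fixed margins are preserved."""
    return [[t[i][j] + f[i][j] for j in range(len(t[0]))] for i in range(len(t))]

# Two 2x2 tables with the same row sums (3, 4) and column sums (5, 2).
n1 = [[1, 2], [4, 0]]
n2 = [[3, 0], [2, 2]]

# The difference of two tables in T is a move: it sums to zero in every margin.
f = [[n2[i][j] - n1[i][j] for j in range(2)] for i in range(2)]

n3 = apply_move(n1, f)
assert margins(n3) == margins(n1)   # margins unchanged
assert n3 == n2                     # the move carries n1 to n2
```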
The simplest Markov basis contains only moves with two entries equal to 1, two entries equal to −1 and the remaining entries equal to zero. It connects all the two-way tables with fixed row and column totals (Diaconis and Sturmfels, 1998). These primitive moves extend to decomposable log-linear models as described in Dobra (2003). A divide-and-conquer technique for the determination of Markov bases for reducible log-linear models is given in Dobra and Sullivant (2004). There is less literature on tables with structural zeros. The problem of sampling zero-one two-way tables with fixed one-way marginals has been studied in Chen et al. (2005), who also discuss earlier related results. Sampling two-way tables with structural zeros can be performed using the importance sampling algorithm of Chen (2007) or using the Markov bases developed by Aoki and Takemura (2005). Markov bases for incomplete multi-way tables have been studied in Rapallo (2006) from a theoretical point of view. However, the applicability of all these methods to higher-dimensional incomplete tables has yet to be proven. In this paper I focus on the very general case of sampling from sets of multi-way tables that are determined by a set of fixed marginals and by lower and upper bounds constraints on each cell count. These constraints can render the table incomplete by fixing some cell counts at their observed values. In addition some counts could be missing altogether. I develop a Metropolis-Hastings sampling algorithm that is based on the computation of lower and upper bounds. My method can generate elements from such general sets of tables without any additional requirements (e.g., the computation of a Markov basis). The structure of the paper is as follows. In Section 2 I introduce the notations and formally

define the questions of interest. In Section 3 I define exact p-values and introduce their Monte Carlo estimates. In Section 4 I present the Bounds Sampling Algorithm (BSA) and prove its validity. In Section 5 I talk about the computation of bounds. In Section 6 I illustrate the applicability of BSA in several datasets, while in Section 7 I discuss assessing goodness-of-fit for log-linear models in large sparse possibly incomplete contingency tables. In Section 8 I make some concluding remarks.

2 Notations and Framework

The observed data is summarized in a k-way contingency table n = {n(i)}i∈I, where I = I1 × . . . × Ik and Ij = {1, 2, . . . , Ij}, Ij ≥ 2, for j ∈ K = {1, . . . , k}. For C ⊂ K, we denote by nC = {nC(iC)}iC∈IC the C-marginal of n, where IC = ×j∈C Ij. The grand total of n is n∅. Table n has m = I1 · . . . · Ik cells that can be ordered lexicographically with the index of the last variable varying fastest, so that the cells of n can be written as I = {i1, i2, . . . , im}. Consider two other k-way tables nL and nU that define lower and upper bounds for n. The role of these bounds is to specify various constraints that might exist on the cell entries of n. For example, a structural zero in cell i ∈ I is expressed as nL(i) = nU(i) = 0. Zero-one tables can be expressed by taking nL(i) = 0 and nU(i) = 1 for all i ∈ I. Other bounds constraints might arise in a disclosure limitation context from the information deemed to be public at a certain time. Furthermore, let A be a log-linear model whose minimal sufficient statistics are {nC : C ∈ C}, where C = {C1, . . . , Cq}, with Cj ⊂ K for j = 1, . . . , q. I define the set of tables that are consistent with the minimal sufficient statistics of A and with the bounds nL and nU, i.e.

T = {n′ = {n′(i)}i∈I : n′Cj = nCj for j = 1, . . . , q, and nL(i) ≤ n′(i) ≤ nU(i) for i ∈ I}.   (1)

I assume that n ∈ T, that is, the bounds constraints are not at odds with the observed data. The set T induces bounds constraints L(i) and U(i) on each cell entry n(i), i ∈ I:

L(i) = min{n′(i) : n′ ∈ T},   U(i) = max{n′(i) : n′ ∈ T}.

These bounds are tighter than the initial bounds nL and nU. They can be determined by integer programming algorithms or, for example, with the Generalized Shuttle Algorithm (GSA, henceforth) of Dobra and Fienberg (2008). A very important remark is that the bounds L = {L(i)}i∈I and U = {U(i)}i∈I can render table n incomplete. More explicitly, let S ⊂ I be the set of cells such that L(i) < U(i). This means that all the tables in T have the same counts for the cells in I \ S, i.e. n′(i) = n(i) for all n′ ∈ T and i ∈ I \ S. These cells should be eliminated from any subsequent statistical analysis since their probabilities can be regarded as nuisance parameters. Bishop et al. (1975) discard cells that are structural zeros; cells with positive counts should be treated in a similar manner if they cannot vary in a particular set of tables. Therefore, the tables in T can be regarded as incomplete, with missing cells that belong to I \ S.
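For intuition, in the special case of a two-way table with fixed row and column totals and no extra constraints, the induced bounds L(i) and U(i) have a classical closed form (the Fréchet bounds). A pure-Python sketch, not part of the paper's machinery:

```python
def frechet_bounds(row_totals, col_totals):
    """Sharp integer bounds for the cells of a two-way table with the given
    fixed margins: max(0, r_i + c_j - N) <= n_ij <= min(r_i, c_j)."""
    N = sum(row_totals)
    assert N == sum(col_totals), "margins must be consistent"
    lower = [[max(0, r + c - N) for c in col_totals] for r in row_totals]
    upper = [[min(r, c) for c in col_totals] for r in row_totals]
    return lower, upper

# Example: row totals (3, 4), column totals (5, 2); grand total N = 7.
L, U = frechet_bounds([3, 4], [5, 2])
assert L == [[1, 0], [2, 0]]
assert U == [[3, 2], [4, 2]]   # e.g. cell (1,1) can range over {1, 2, 3}
```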

Two distributions defined on T play a key role in statistical analyses. They are the uniform and the hypergeometric distributions

PU(n′ | T) = 1/|T|,   and   PH(n′ | T) = [∏i∈S n′(i)!]^(−1) / ∑n″∈T [∏i∈S n″(i)!]^(−1),   (2)

for n′ ∈ T. In the most general case, the normalizing constants of PH and PU can be computed only if T can be enumerated. Note that Sundberg (1975) has developed a formula for the normalizing constant of PH if A is decomposable and there are no bounds constraints (i.e. nL(i) = 0 and nU(i) = n∅ for all i ∈ I). Sampling from PU is relevant for estimating the number of tables in T (see Chen et al. (2005) or Dobra et al. (2006)) or for performing the conditional volume test of Diaconis and Efron (1985). The hypergeometric distribution PH arises by conditioning on the log-linear model A and the set of tables T under multinomial sampling. Haberman (1974) proved that the log-linear interaction terms cancel out, which leads to the form of PH given in (2). Sampling from PU and PH is trivial if T can be explicitly determined. Unfortunately this task is computationally infeasible for most real datasets. As such, the goal of this paper is to develop a procedure for sampling from PH and PU for any set of tables T induced by a set of fixed marginals and a pair of lower and upper bounds arrays.
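When T is small enough to enumerate, the normalizing constant of PH in (2) can be computed by brute force. A sketch for 2 × 2 tables with fixed margins (the helper names are mine, not the paper's):

```python
from math import factorial

def enumerate_tables(row_totals, col_totals):
    """All 2x2 nonnegative integer tables with the given fixed margins."""
    r1, r2 = row_totals
    c1, c2 = col_totals
    assert r1 + r2 == c1 + c2, "margins must share the same grand total"
    tables = []
    for a in range(min(r1, c1) + 1):
        b, c, d = r1 - a, c1 - a, r2 - (c1 - a)
        if b >= 0 and c >= 0 and d >= 0:
            tables.append([[a, b], [c, d]])
    return tables

def hypergeometric_probs(tables):
    """P_H(n' | T): weights 1 / prod_i n'(i)!, normalized over T."""
    weights = [1.0 / (factorial(t[0][0]) * factorial(t[0][1])
                      * factorial(t[1][0]) * factorial(t[1][1])) for t in tables]
    total = sum(weights)
    return [w / total for w in weights]

T = enumerate_tables([3, 4], [5, 2])
probs = hypergeometric_probs(T)
assert len(T) == 3                       # cell (1,1) ranges over {1, 2, 3}
assert abs(sum(probs) - 1.0) < 1e-12     # normalized over T
```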

3 Computing Exact P-values

Let P be a distribution on T and H : T → R be a real-valued function. The exact p-value associated with H with respect to P is given by the sum of the probabilities of all tables that are at least as extreme as the observed table n:

p(n) = ∑{n′ : H(n′) ≥ H(n)} P(n′).   (3)

Usually H is a test statistic such as the Pearson χ2 statistic or the likelihood-ratio G2 statistic, i.e.

X2(n) = ∑i∈I (n(i) − n̂(i))2 / n̂(i),   G2(n) = 2 ∑i∈I n(i) log(n(i)/n̂(i)).

Here n̂ = {n̂(i)}i∈I are the expected cell values under the log-linear model A. Asymptotically both X2 and G2 have a χ2 distribution with a number of degrees of freedom given by the difference between the number of cells m and the number of parameters fitted for A. This approximation holds only if n does not contain any missing cells. In the case when I \ S ≠ ∅, the number of degrees of freedom needs to be adjusted as described in Bishop et al. (1975) or Fienberg (2007). The same references also discuss how to modify the iterative proportional

fitting algorithm to account for the missing cells when determining the expected table n̂ under A. As can be seen in the examples from Section 6, adjusting the number of degrees of freedom with respect to a pattern of missing cells has considerable merit. On the other hand, to the best of my knowledge, there are no formal theoretical arguments that justify these adjustments – see Kreiner (1987). An exact p-value for the goodness-of-fit of model A is determined by taking H to be X2 or G2 and P to be the hypergeometric distribution PH in (3). An exact p-value corresponding to the volume test of Diaconis and Efron (1985) is given by

p0(n) = ∑{n′ : X2(n′) ≤ X2(n)} PU(n′).   (4)

This represents the ratio of the volume of the set of tables with Pearson's χ2 statistic smaller than the observed value X2(n) to the volume of the entire set T. Estimating the p-value (4) is especially relevant since there do not seem to be asymptotic results for the distribution of X2 under PU that go beyond two-way tables with no missing cells. Other test statistics and their corresponding exact p-values are discussed in numerous papers – for a detailed review see Agresti (2001). The exact p-value (3) can be written as the expected value under P of the indicator function 1{H(n′) ≥ H(n)}. Therefore a Monte Carlo estimate of p(n) is obtained by sampling N i.i.d. tables n1, . . . , nN from T according to P, counting how many of these tables lead to values of H at least as large as H(n), and finally dividing this count by N. Similarly, a Monte Carlo estimate of p0(n) from (4) is obtained by counting how many tables have a Pearson's χ2 statistic smaller than the observed value X2(n) and dividing this count by N. Since i.i.d. samples are typically hard to obtain, alternative sampling algorithms have been devised – see Section 1. In particular, if feasible tables are sampled with a Markov chain, the resulting samples are no longer independent. The Monte Carlo estimation of p(n) and p0(n) can subsequently be improved through the sequential p-values of Besag and Clifford (1991).
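The Monte Carlo estimate described above can be sketched in a few lines, assuming sampled tables and expected counts are already in hand (all names here are hypothetical, and tables are flattened to lists of counts):

```python
def chi_square(table, expected):
    """Pearson X^2 statistic over the cells of a flattened table."""
    return sum((o - e) ** 2 / e for o, e in zip(table, expected) if e > 0)

def mc_p_value(observed, samples, expected):
    """Estimate p(n) = P(H(n') >= H(n)) from sampled tables, with H = X^2."""
    h_obs = chi_square(observed, expected)
    hits = sum(1 for t in samples if chi_square(t, expected) >= h_obs)
    return hits / len(samples)

# Toy data: expected values under some model, the observed table,
# and four tables standing in for Monte Carlo draws from T.
expected = [2.0, 2.0, 2.0, 2.0]
observed = [4, 0, 0, 4]
samples = [[2, 2, 2, 2], [3, 1, 1, 3], [4, 0, 0, 4], [1, 3, 3, 1]]

p_hat = mc_p_value(observed, samples, expected)
assert p_hat == 0.25   # only one sampled table is as extreme as n
```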

4 The Bounds Sampling Algorithm

I develop a Metropolis-Hastings sampler called the Bounds Sampling Algorithm (BSA) as follows. Let nc ∈ T be the current state of the Markov chain. Let nbd(nc) denote the subset of tables in T that can be reached from nc at the current iteration of BSA. The sampler repeats the steps:

1. Select a table n′ from nbd(nc) using the Neighborhood procedure described below.
2. At the next iteration the chain moves to n′ with probability min{1, P(n′)/P(nc)}.

The resulting Markov chain is irreducible if the neighborhoods of tables in T connect T. That is, for any n′, n″ ∈ T, there exists a path n′ = n1, n2, . . . , nJ = n″ such that nj ∈ nbd(nj−1) for j = 2, . . . , J. It must also be true that n′ ∈ nbd(nc) if and only if nc ∈ nbd(n′). Denote the number of cells in S ⊂ I by s ≤ m. As before, consider the arrays of lower and upper bounds L and U associated with the set of tables T. The procedure below generates

a candidate table in T given the current table nc:

procedure Neighborhood(L, U, nc)
Repeat the following steps until a candidate table n′ ∈ T has been identified:
• (A) Generate a random ordering σ of S. Under this ordering, the j-th cell of S is iσ(j).
• (B) For smax = s − 1, s − 2, . . . , 0 do:
  ◦ Consider two new sets of lower and upper bounds L′ and U′. For every j = 1, . . . , smax, set L′(iσ(j)) = U′(iσ(j)) = nc(iσ(j)). For every j = smax + 1, . . . , s, set L′(iσ(j)) = L(iσ(j)) and U′(iσ(j)) = U(iσ(j)).
  ◦ Update the bounds L′ and U′ to account for the new smax constraints as described in Section 5. The updated bounds will be such that L(iσ(j)) ≤ L′(iσ(j)) ≤ U′(iσ(j)) ≤ U(iσ(j)) for j = smax + 1, . . . , s.
  ◦ If max{U′(iσ(j)) − L′(iσ(j)) : j = smax + 1, . . . , s} ≥ 1, do: call the procedure BoundsTable(L′, U′, σ). If BoundsTable returns a table n′ ∈ T and n′ ≠ nc, then stop and return n′. If BoundsTable does not return a candidate table, go to (A) and generate another permutation σ. Continue with the “for” loop (B) as before. 

A new table in T is determined by setting the lower and upper bounds of most cells in S equal to the counts in the same cells of nc. The bounds for the remaining cells are updated. For an ordering σ of S, the call Neighborhood(L, U, nc) returns a table in the set

T_σ^{L,U} = {n″ ∈ T : n″(iσ(j)) = nc(iσ(j)), j = 1, . . . , smax}.

These are the tables in T having the maximum number of cell counts in common with nc given the cell ordering σ. This implies that Neighborhood(L, U, nc) will identify with positive probability any table in T having the maximum number of counts in common with nc. Thus nbd(nc) includes this set of tables but is not limited to it. Typically T_σ^{L,U} contains only a small number of tables since the lower and upper bounds of most cells are equal. If T_σ^{L,U} contains at least one table, the procedure BoundsTable finds it. If it contains more than one table, BoundsTable randomly generates a table in T_σ^{L,U}. If BoundsTable reports that T_σ^{L,U} is empty, a new ordering of the cells is randomly generated. If S ≠ ∅ (i.e., T potentially contains other tables in addition to the observed table n), the “for” loop (B) stops before smax reaches 0. Sequentially decreasing smax means that the bounds L′ and U′ become closer to L and U. Hence it must be that the lower bound of at least one cell is strictly smaller than the corresponding upper bound.

procedure BoundsTable(L′, U′, σ)
• Find j0 = min{j : 1 ≤ j ≤ s, L′(iσ(j)) < U′(iσ(j))}.
• If there is no such j0, a new table n′ has been determined. Its counts are n′(i) = L′(i) = U′(i), for i ∈ I. If n′ actually belongs to T, stop and return n′.
• Otherwise, we have L′(iσ(j0)) < U′(iσ(j0)). This means that the current set of possible values

P = {L′(iσ(j0)), L′(iσ(j0)) + 1, . . . , U′(iσ(j0))} for cell iσ(j0) is not empty.
• Generate a random ordering {g1, g2, . . . , gv} of P, where v = U′(iσ(j0)) − L′(iσ(j0)) + 1.
• For v′ = 1, 2, . . . , v do:
  ◦ Define two new sets of lower and upper bounds L″ and U″. Set L″(i) = L′(i) and U″(i) = U′(i) for all i ∈ I \ {iσ(j0)}. Set L″(iσ(j0)) = U″(iσ(j0)) = gv′.
  ◦ Update the bounds L″ and U″ as described in Section 5.
  ◦ Call the procedure BoundsTable(L″, U″, σ).
  ◦ Stop if BoundsTable identifies a table n′ ∈ T. Return n′. 

The procedure BoundsTable is recursive and finds a table in T by sequentially fixing cells at a permissible random value, then updating the bounds for the rest of the cells. After making the lower and upper bounds of a cell equal to some possible value, the bounds for the rest of the cells become tighter. BoundsTable ends after identifying one table n′ ∈ T such that L′(i) ≤ n′(i) ≤ U′(i) for all i ∈ I or after determining that no such table exists. The next result establishes the irreducibility of the proposed Markov chain.

Proposition 1. The neighborhoods generated by BSA connect the set of tables T.

The proof of Proposition 1 is given in Appendix A.
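The accept/reject step of BSA is a standard Metropolis-Hastings update. The sketch below illustrates only that step: it replaces the Neighborhood proposal with primitive two-way moves (so it is not BSA itself) and targets the hypergeometric distribution on 2 × 2 tables with fixed margins:

```python
import random
from math import factorial

def hyper_weight(t):
    """Unnormalized hypergeometric weight: 1 / product of cell factorials."""
    w = 1.0
    for row in t:
        for x in row:
            w /= factorial(x)
    return w

def mh_step(t, rng):
    """One Metropolis-Hastings step using a primitive +1/-1 move on a 2x2 table.
    The proposal is symmetric, so the acceptance probability is min{1, P(n')/P(nc)}."""
    sign = rng.choice([1, -1])
    prop = [[t[0][0] + sign, t[0][1] - sign],
            [t[1][0] - sign, t[1][1] + sign]]
    if min(min(row) for row in prop) < 0:   # proposal falls outside T
        return t
    if rng.random() < min(1.0, hyper_weight(prop) / hyper_weight(t)):
        return prop
    return t

rng = random.Random(0)
table = [[1, 2], [4, 0]]
rows = [sum(r) for r in table]
cols = [sum(c) for c in zip(*table)]
for _ in range(1000):
    table = mh_step(table, rng)

# Every visited state stays in T: margins fixed, entries nonnegative.
assert [sum(r) for r in table] == rows
assert [sum(c) for c in zip(*table)] == cols
```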

5 Computing Bounds

The most straightforward method to determine the bounds required by BSA is linear programming (e.g., the simplex algorithm). While the resulting real bounds might not be sharp (i.e., they could differ from the actual integer bounds), the relatively small difference between the two sets of bounds does not constitute a significant burden on the efficiency of the sampling procedure. I have empirically observed that BSA with linear programming (LP) bounds updates works very well in the case of two-way tables. The two-dimensional examples discussed in Section 6 use LP bounds updates. Similar conclusions have been reached, for example, in Chen et al. (2005; 2006) and Chen (2007). On the other hand, I found that LP bounds updates are less desirable for higher-dimensional contingency tables. There are two major reasons for this decrease in performance. First of all, the existing set of bounds cannot simply be updated after another cell has been fixed at a certain value. In my current implementation, the LP bounds need to be recalculated from scratch after each step. Moreover, the bounds for each cell that is not currently fixed are updated independently, which inevitably leads to increased computing times. What is even more worrisome is that BSA with LP updates can generate integer tables that do not belong to the target set T. Checking the validity of each table and rejecting those tables that turn out not to belong to T are computationally demanding tasks that should be avoided for multi-way tables. An innovative idea for computing bounds has been proposed by Buzzigoli and Giusti (1999). They observed that the bounds for some cells correspond with bounds for some other cells.

Their procedure sequentially tightens the upper and lower bounds until no further adjustment can be made. Dobra and Fienberg (2008) devised the Generalized Shuttle Algorithm (GSA), which is based on the same basic principles. GSA takes advantage of the special structure of the categorical data and keeps track of the bounds for all the blocks of cells that can be formed by joining two or more categories of each of the k variables recorded in table n. Such blocks belong to the set

Q = {(J1, . . . , Jk) : ∅ ≠ Jl ⊆ Il, for l = 1, . . . , k}.

Remark that Q contains the cells in table n as well as the cells in any marginal of n and the cells in any table formed by collapsing n across categories within variables. As such, specifying the bounds constraints that define the set of tables T is straightforward since it involves setting the values of lower and upper bounds for relevant blocks in Q. Each block b = (J1, . . . , Jk) ∈ Q can be split along a dimension k0 (1 ≤ k0 ≤ k) to form two other blocks b1 = (J1, . . . , Jk0−1, Jk0^1, Jk0+1, . . . , Jk) ∈ Q and b2 = (J1, . . . , Jk0−1, Jk0^2, Jk0+1, . . . , Jk) ∈ Q such that Jk0^1 ∪ Jk0^2 = Jk0. Denote by bL and bU the current lower and upper bounds for b, with similar notations for b1 and b2. The following relations need to be satisfied:

bU ≤ b1U + b2U,   bL ≥ b1L + b2L,   b1L ≥ bL − b2U,   b1U ≤ bU − b2L,   (5)

and so on. Therefore the bounds for the blocks in Q are interlinked and they can be sequentially updated through the block relations (5). One of the most important features of GSA is that it correctly determines whether there are no integer tables in a set T as given in (1); in particular, if GSA finds that all the lower bounds for blocks in Q equal the corresponding upper bounds, then a table in T has been determined. The second most important feature is that a current set of bounds can be efficiently updated once a new cell is fixed. Dobra and Fienberg (2008) prove that GSA converges to the actual sharp bounds L and U if A is decomposable or if n is a dichotomous table and A is defined by the (k − 1)-dimensional marginals. While GSA can be further modified through an additional step to find sharp bounds for any set of tables T, I observed that computing sharp bounds brings very little to BSA while consuming a significant amount of time. The multi-way examples from Section 6 have been performed with the version of GSA that does not give sharp bounds and allows for quick updates for increased computational efficiency.
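The block relations (5) suggest a simple fixed-point iteration: tighten any bound that violates one of the inequalities until nothing changes. A toy sketch of one parent/children update (my own illustration, far simpler than GSA, and assuming the bounds are consistent so the loop terminates):

```python
def shuttle_update(parent, child1, child2):
    """Shuttle-style tightening for bounds stored as mutable [lower, upper] pairs.
    Enforces b^U <= b1^U + b2^U and b^L >= b1^L + b2^L on the parent, and the
    implied child updates b1^L >= b^L - b2^U, b1^U <= b^U - b2^L (symmetrically
    for b2), repeating until no bound changes."""
    changed = True
    while changed:
        changed = False
        new_l = max(parent[0], child1[0] + child2[0])
        new_u = min(parent[1], child1[1] + child2[1])
        if [new_l, new_u] != parent:
            parent[0], parent[1] = new_l, new_u
            changed = True
        for a, b in ((child1, child2), (child2, child1)):
            nl = max(a[0], parent[0] - b[1])
            nu = min(a[1], parent[1] - b[0])
            if [nl, nu] != a:
                a[0], a[1] = nl, nu
                changed = True
    return parent, child1, child2

# A block total fixed at 10, split into two cells with loose bounds [0, 8].
parent, c1, c2 = [10, 10], [0, 8], [0, 8]
shuttle_update(parent, c1, c2)
assert c1 == [2, 8] and c2 == [2, 8]   # each cell must be at least 10 - 8 = 2
```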

6 Examples

In this section I illustrate the performance of BSA in several examples. The standard deviations of the estimates produced are calculated across five different runs and are given after the “±” sign.
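For reference, the reported figures have the form mean ± standard deviation across runs; a one-liner reproduces them (the run values below are hypothetical, not the paper's actual estimates):

```python
from statistics import mean, stdev

# Hypothetical p-value estimates from five independent runs.
runs = [0.357, 0.356, 0.358, 0.357, 0.356]

summary = f"{mean(runs):.3f} ± {stdev(runs):.4f}"
assert summary == "0.357 ± 0.0008"
```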

6.1 Health Concerns of Teenagers Data

The data in Table 1 is a 4 × 2 × 2 cross-classification of health concerns in teenagers (H), their sex (S) and their age group (A). It was originally analyzed by Grizzle and Williams (1972),

followed by Fienberg (2007), page 148. This table contains two structural zeros induced by the absence of menstrual problems in males. The marginal [HS] (i.e., health concerns by sex) also contains a structural zero. Fienberg (2007) argues that the number of degrees of freedom associated with a log-linear model needs to be reduced by two unless the [HS] marginal is fitted; in that case the number of degrees of freedom should only be reduced by one. The asymptotic χ2 p-values for the log-linear models [HS][AH][SA], [AH][SA], [HS][SA] and [HS][AH] corresponding to the Pearson X2 statistic are given in Table 2 and are based on the calculations reported in Table 8.4, page 149 of Fienberg (2007). Due to the relatively small sample size, I was able to exhaustively enumerate all the feasible tables associated with the four log-linear models using the GSA algorithm of Dobra and Fienberg (2008). An examination of the complete set of feasible tables shows that two additional cell counts corresponding to menstrual problems in females are fixed at their observed values of 4 and 8 given the sets of marginals [HS][AH][SA], [AH][SA] and [HS][AH]. This implies that the fit of these three models can be assessed in the reduced 3 × 2 × 2 table obtained by eliminating the category “Menstrual problems” altogether. Since the two structural zeros also vanish in this smaller table, there is no need to make any adjustments to the number of degrees of freedom. Usual calculations lead to the same asymptotic p-values obtained in the full 4 × 2 × 2 table.

Table 1: Health concerns of teenagers data from Grizzle and Williams (1972).

                           Male            Female
Health Concerns        12-15   16-17   12-15   16-17
Sex, Reproduction          4       2       9       7
Menstrual Problems         –       –       4       8
How Healthy I am          42       7      19      10
Nothing                   57      20      71      31

Table 2: Assessing the fit of four log-linear models for the health concerns of teenagers data.

Log-linear     Asymptotic    Exact               Number of
model          χ2 p-value    χ2 p-value          tables
[HS][AH][SA]   0.362         0.357 ± 0.001       126
[AH][SA]       0.0107        0.01 ± 0.0002       156240
[HS][SA]       0.087         0.085 ± 0.0004      1252881
[HS][AH]       0.173         0.175 ± 0.0005      6552

I used BSA to sample from the hypergeometric distribution on the set of tables defined by the two structural zeros and the minimal sufficient statistics of each of these four log-linear models. I performed five runs of one million iterations with a burn-in of 1000 iterations. The estimated exact p-values for the χ2 test are given in Table 2. Remark that the exact and asymptotic p-values are almost equal.

6.2 NBER data

Table 3 is a 4 × 5 × 4 cross-classification of 4345 individuals by occupational groups (O1 – “self-employed, business”, O2 – “self-employed, professional”, O3 – “teacher”, O4 – “salary-employed”), aptitude levels (A) and educational levels (E). It was collected in a 1969 survey of the National Bureau of Economic Research (NBER) – see Table 3-6, page 45 of Fienberg (2007). The horizontal lines denote structural zeros. The ten structural zeros under O3 and E1, E2 are associated with teachers being required to have higher education levels. The other two structural zeros under O2 can be motivated in a similar manner. I want to assess the fit of the all two-way interaction log-linear model. The number of degrees of freedom for this model is calculated by subtracting the number of structural zeros from 36 – the number of degrees of freedom corresponding to a 4 × 5 × 4 table without structural zeros. One needs to add back the number of structural zeros that are present in marginal tables that are among the minimal sufficient statistics of the log-linear model considered (Bishop et al., 1975). In this case there are two such counts present in the aptitude by educational levels marginal. The resulting number of degrees of freedom is 36 − 12 + 2 = 26. The observed value of the likelihood-ratio test statistic is G2 = 15.91, which leads to an asymptotic p-value for the all two-way interaction model of 0.938. I used BSA to sample from the hypergeometric distribution on the set of tables T defined by the three two-way marginals of the NBER data and the 12 structural zeros. I performed five runs of 100000 iterations with a burn-in of 5000 iterations. The rejection rates across the different runs were 0.648 ± 0.001. The algorithm converges to an estimate of the exact p-value for the likelihood-ratio test of 0.968 ± 0.002. The observed value of the χ2 test statistic is 17.1, which leads to an asymptotic p-value of 0.906.
My estimate of the corresponding exact p-value is 0.918 ± 0.004. Note that the large sample size of the NBER data leads to good agreement between the asymptotic and the exact p-values.

Table 3: NBER data. The grand total of this table is 4345. The dashes denote structural zeros.

          E1    E2    E3    E4              E1    E2    E3    E4
O1  A1    42    55    22     3    O3  A1     –     –     1    19
    A2    72    82    60    12        A2     –     –     3    60
    A3    90   106    85    25        A3     –     –     5    86
    A4    27    48    47     8        A4     –     –     2    36
    A5     8    18    19     5        A5     –     –     1    14
O2  A1     1     2     8    19    O4  A1   172   151   107    42
    A2     1     2    15    33        A2   208   198   206    92
    A3     2     5    25    83        A3   279   271   331   191
    A4     2     2    10    45        A4    99   126   179    97
    A5     –     –    12    19        A5    36    35    99    79

6.3 High-tech Managers Data

Krackhardt (1987) surveyed 21 high-tech managers and asked each of them whether they consider the others to be their friends. The resulting 21 × 21 contingency table n has its nij entry equal to one if manager i thinks manager j is their friend and zero otherwise. This table has its diagonal entries constrained to zero. The test statistic of interest is

M(n′) = ∑i<j n′ij n′ji,

… 38 (no,yes); c, husband unemployed (no,yes); d, child ≤ 4 (no,yes); e, wife’s education, high-school+ (no,yes); f, husband’s education, high-school+ (no,yes); g,

Asian origin (no,yes); h, other household member working (no,yes). There are 665 individuals cross-classified in 256 cells, which means that the mean number of observations per cell is 2.6. The table has 165 counts of zero and 217 other cells contain at most three observations. Whittaker (1990) argues that such a table is sparse and consequently that the applicability of any asymptotic results relating to the limiting distributions of goodness-of-fit statistics for log-linear models becomes questionable due to the zeros present in marginals of dimension three or more. For example, the likelihood-ratio test statistic for the all two-way interaction log-linear model is 144.59 with 219 degrees of freedom, which leads to a corresponding asymptotic p-value of 0.999. While lower-order interactions are indeed expected to define log-linear models that fit this dataset well (Whittaker, 1990), such a high p-value for the all two-way interaction model should be interpreted with care. Determining whether a contingency table is sparse is crucial for a statistician who needs to choose whether to consider only log-linear models with lower-order interactions or to assess goodness-of-fit based exclusively on exact tests. Unfortunately this is currently a rather subjective judgement since there does not seem to exist any clear definition of sparseness in the literature. For example, no sensible statistician would argue that the NBER data from Table 3 is sparse. On the other hand, an experienced data analyst would agree with Whittaker’s judgement on the sparsity of Table 5. As such, it is fair to ask what percentage of zeros or what mean number of observations per cell actually makes a table sparse. While many heuristic criteria can be proposed, I am not aware of any theoretical results that would offer an educated answer to this very important question. A possible solution might emerge from examining the bounds induced by the 28 two-way marginals of the Rochdale data.
The lower bounds for all the cells are zero, while the upper bounds are given in the lower panel of Table 5. We see that the upper bounds are quite large even for cells containing counts of zero. Thus the corresponding set of tables T consistent with the 28 two-way marginals contains arrays having a pattern of small counts that is very different from the actual observed pattern. The relatively small number of available samples is spread around cells about which no information is actually available. The set T could contain tables with very little overlap of positive counts with the Rochdale data. This issue becomes even more serious for higher-dimensional tables in which most cells contain counts of zero.

I propose an alternative way of analyzing sparse tables that conditions on the observed pattern of positive counts, i.e. I assume that only the cells with positive observed counts exist and the others are missing. This is equivalent to treating the counts of zero as structural zeros. Given this assumption, the Rochdale data is an incomplete table with 91 cells obtained by dropping the 165 counts of zero. The mean number of observations in this incomplete table is 7.31. The corresponding upper bounds are given in parentheses in the upper panel of Table 5. We notice a decrease in the upper bounds, as we would expect. All the lower bounds are still zero. The number of degrees of freedom corresponding to any log-linear model involving two-way marginals needs to be reduced by 165, the number of cells with zero counts we conditioned on. Table 6 shows p-value estimates for the all two-way interaction model and for the log-linear model proposed by Whittaker (1990):

fg|ef|dh|dg|cg|cf|ce|bh|be|bd|ag|ae|ad|ac.  (7)
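The degrees-of-freedom adjustment described above is just a subtraction, but it changes the asymptotic reference distribution substantially. A hedged sketch using figures quoted in the text (219 degrees of freedom for the all two-way model, 165 conditioned zero cells, and the incomplete-table statistic G2 = 48.82 from Table 6), again with the Wilson-Hilferty approximation standing in for an exact chi-square routine:

```python
import math

def chi2_sf_wh(x, df):
    """Upper-tail chi-square probability via the Wilson-Hilferty
    normal approximation."""
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) \
        / math.sqrt(2.0 / (9.0 * df))
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Degrees of freedom for the all two-way interaction model,
# before and after conditioning on the 165 observed zero cells.
df_full = 219
df_incomplete = df_full - 165    # 54

# Asymptotic p-value of the incomplete-table likelihood-ratio
# statistic reported in Table 6.
p_asym = chi2_sf_wh(48.82, 54)   # roughly 0.67
```

The resulting value is far from the extreme 0.999 obtained for the full table, which illustrates why conditioning on the zero pattern matters.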

The number of degrees of freedom for the log-linear model (7) is 233. The exact p-values have been determined by running five instances of BSA for 100000 iterations, with a burn-in of 1000 iterations, for each of the two log-linear models considered. The p-values obtained by conditioning on the observed positive counts are given in the lower panel of Table 6, while the upper panel gives the p-values obtained without this additional assumption. We see that the most striking discrepancy between the exact and the asymptotic p-values occurs for the likelihood-ratio test when the Rochdale data is treated as complete. However, the exact p-values for the likelihood-ratio and the Pearson χ2 tests are still in agreement. What is even more interesting is that the exact and asymptotic p-values for the Pearson χ2 test are almost equal in three out of the four calculations.

Table 5: Upper panel: Rochdale data from Whittaker (1990). The cell counts are written in lexicographical order with h varying fastest and a varying slowest. The numbers in parentheses represent the upper bounds induced by the 28 two-way marginals and the 165 counts of zero. Lower panel: upper bounds for the Rochdale data induced by the 28 two-way marginals alone. The bounds were calculated with GSA.

[Table 5 body omitted: the 256 cell counts with their upper bounds.]

Table 6: P-value estimates for the Rochdale data.

                                All two-way                       Whittaker
                       Observed  Asymptotic  Exact       Observed  Asymptotic  Exact
Model                  value     p-value     p-value     value     p-value     p-value
Full table        G2   144.54    0.999       0.159       160.64    0.999       0.154
                  X2   258.21    0.036       0.160       247.13    0.251       0.264
Incomplete table  G2   48.82     0.674       0.810       65.21     0.573       0.743
                  X2   62.50     0.2         0.197       69.77     0.418       0.410
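The exact p-values in Table 6 are Monte Carlo estimates: the proportion of sampled tables whose test statistic is at least as extreme as the observed one. A minimal, generic sketch of this estimator, where `sample_table` and `statistic` are placeholders for a chain sampler and a test statistic, not the actual BSA implementation:

```python
import random

def mc_pvalue(sample_table, statistic, observed_stat,
              iterations=100_000, burn_in=1_000):
    """Estimate an exact p-value from a Markov chain sampler.

    sample_table() advances the chain and returns the current table;
    statistic(table) evaluates the test statistic on a table.
    """
    exceed = total = 0
    for step in range(iterations):
        table = sample_table()
        if step < burn_in:          # discard burn-in draws
            continue
        total += 1
        if statistic(table) >= observed_stat:
            exceed += 1
    return exceed / total

# Toy illustration: uniform draws stand in for statistic values
# of sampled tables, so the estimate should be close to 0.2.
draws = iter(random.uniform(0, 1) for _ in range(100_000))
p = mc_pvalue(lambda: next(draws), lambda t: t, observed_stat=0.8)
```

Averaging several independent runs, as done with the five BSA instances above, reduces the Monte Carlo error of such estimates.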

Conclusions

The relevance of the work presented in this paper is twofold. First, I proposed extending the definition of incomplete tables to sparse tables and to tables that contain positive counts that are uniquely determined from the fixed marginals and the other bounds constraints. This is significant not only for applying exact methods for assessing goodness-of-fit, but also for establishing the validity of asymptotic approximations. So far the literature has attempted to determine when asymptotic approximations can be applied based on the number of zero counts or on the pattern of cells containing large versus small counts. I argue that a more direct and perhaps more suitable way to answer these questions can be found by examining the relevant set of tables consistent with the assumed constraints. The existence of a widely applicable sampling algorithm such as BSA will allow an in-depth investigation of various examples that could help elucidate major statistical issues related to the validity of asymptotic approximations for large sparse contingency tables with structural zeros, bounds constraints or other missing cell counts. Other related issues involve the justification of the use of exact tests that correspond to eliminating nuisance parameters by conditioning on the observed minimal sufficient statistics of these parameters.

Second, the sampling approach I proposed seems to work as well as the sequential importance sampling method of Chen et al. (2005) and Chen (2007) for two-way tables with or without structural zeros. Moreover, BSA made it possible to analyze with exact methods higher-dimensional tables that are sparse or contain structural zeros. I am not aware of any other sampling algorithm that can be used off-the-shelf and can accommodate any type of bounds constraints on the cell entries as well as any set of fixed marginal totals.
Of course, the computation of bounds remains the largest computational effort required by my sampling method. However, this effort is far less demanding than the task of computing a Markov basis or the task of assessing the efficiency of an importance sampling method – see Bezáková et al. (2006) for an evaluation of sequential importance sampling for zero-one two-way tables. The MCMC algorithm I proposed in this paper can be made more efficient by developing new methods for computing bounds. No other changes should be required as such methods become available.

I do not make any claims related to the speed of my proposed approach. Since the relative speed of two competing methods can be assessed only on extremely well-developed, speed-optimized implementations run on the same computer, I do not find it straightforward to perform such comparisons. I can report that BSA is certainly slower than the SIS methods of Chen et al. (2005) and Chen (2007). Since it requires the computation of bounds at each step, it is also slower than the Diaconis and Sturmfels (1998) sampling algorithm in the few cases when a Markov basis can be produced. As such, BSA should be used not because of its speed, but because of its generality. The numerical examples presented in the paper show how easy it is to successfully employ it for multi-way tables with various structures.
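As a concrete illustration of what "computing bounds" involves, the classical Fréchet bounds for a single cell of a two-way table with fixed row and column totals can be sketched as follows. This is only the simplest special case, not the multi-way GSA bounds used in the paper, and is given purely to fix ideas:

```python
def frechet_bounds(row_total, col_total, grand_total):
    """Fréchet bounds for cell (i, j) of a two-way table with
    fixed margins: max(0, r_i + c_j - N) <= n_ij <= min(r_i, c_j)."""
    lower = max(0, row_total + col_total - grand_total)
    upper = min(row_total, col_total)
    return lower, upper

# Example: row total 40, column total 25, grand total 50.
lo, hi = frechet_bounds(40, 25, 50)  # lo = 15, hi = 25
```

Multi-way generalizations of these bounds under an arbitrary set of fixed marginals are much harder to compute, which is why improved bound algorithms translate directly into a faster BSA.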

A. Proof of Proposition 1

Let nc ∈ T be the current state of the chain and let n′ ∈ T be some other table. Assume that nc(i) = n′(i) for i ∈ S0 ⊂ S and that nc(i) ≠ n′(i) for i ∈ S \ S0. Define the subset of all tables in T sharing the same counts for the cells in S0:

T0 = {n″ ∈ T : n″(i) = nc(i), for i ∈ S0}.

Let i0 be the cell at which the minimum strictly positive absolute distance between the counts in tables nc and n′ is attained, i.e. i0 = argmin{|nc(i) − n′(i)| : i ∈ S \ S0}. Without loss of generality, assume that nc(i0) < n′(i0). It is also possible to assume that there does not exist another table n″ ∈ T0 such that nc(i0) < n″(i0) < n′(i0); otherwise the argument that follows can be applied to prove that BSA can reach n″ from nc and n′ from n″. Consider the subset of T0:

T1 = {n″ ∈ T0 : n″(i0) ∈ {nc(i0), n′(i0)}}.

Denote by S1 the cells that do not change in T1, i.e. S1 = {i ∈ S : n″(i) = nc(i), for all n″ ∈ T1}. We have S0 ⊂ S1 and i0 ∉ S1. Let L1 and U1 be the lower and upper bounds determined by the set of tables T1. Moreover, we write T1 = T1′ ∪ T1″, where T1′ contains the tables in T1 with count nc(i0) and T1″ contains the tables with count n′(i0).

Consider the orderings Σ of S such that the cells in S1 come before the cells in S \ S1 and such that the last cell is i0. With positive probability, it could happen that during the call Neighborhood(L1, U1, nc) an ordering σ ∈ Σ is sampled. A call Neighborhood(L1, U1, nc) always generates tables in T1. We want to prove that BSA is able to reach any table in T1 in one, two or possibly more steps through successive calls of Neighborhood(L1, U1, ·). Order the tables in T1 in increasing order with respect to their number of common counts with nc, with ties broken at random. Let nmin be the first table in this ordering. There exists σmin ∈ Σ such that nmin(iσmin(j)) = nc(iσmin(j)) for j = 1, 2, . . . , s′max, with s′max ≤ s − 1. Since there does not exist a table with more counts in common with nc than nmin, it follows that during the call Neighborhood(L1, U1, nc) in which the ordering σmin is chosen, the smallest value of smax that leads to the determination of a feasible table is s′max. In this case table nmin can be returned with positive probability. Now a call Neighborhood(L1, U1, nmin) could lead with positive probability to the identification of the second table in the ordering of T1 with respect to the number of common counts with nc. By induction it follows that BSA can reach any table in T1 in a finite number of steps. Therefore it must be the case that a table n∗ in T1″ is eventually returned. This implies that BSA transforms nc into another table n∗ that has at least one additional common count with n′. An induction argument shows that there exists a sequence of tables in T, each having at least one more count in common with n′ than the previous table in the sequence, while preserving the other common counts. Therefore BSA is able to reach n′ from nc.

Acknowledgments

This work was partially supported by Grant R01 HL092071 from the National Institutes of Health and by a seed grant from the Center for Statistics and the Social Sciences, University of Washington.

References

Agresti, A. (1992). "A Survey of Exact Inference for Contingency Tables." Statistical Science, 7, 131–153.

— (2001). "Exact Inference for Categorical Data: Recent Advances and Continuing Controversies." Statistics in Medicine, 20, 2709–2722.

Aoki, S. and Takemura, A. (2005). "Markov Chain Monte Carlo Exact Tests for Incomplete Two-Way Contingency Tables." Journal of Statistical Computation and Simulation, 75, 787–812.

Besag, J. and Clifford, P. (1989). "Generalized Monte Carlo Significance Tests." Biometrika, 76, 633–642.

— (1991). "Sequential Monte Carlo p-values." Biometrika, 78, 301–304.

Bezáková, I., Sinclair, A., Štefankovič, D., and Vigoda, E. (2006). "Negative Examples for Sequential Importance Sampling of Binary Contingency Tables." In Proceedings of the 14th Annual European Symposium on Algorithms, Vol. 4168 of Lecture Notes in Computer Science, 136–147. Springer-Verlag, London, UK.

Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. M.I.T. Press, Cambridge, MA.

Booth, J. G. and Butler, J. W. (1999). "An Importance Sampling Algorithm for Exact Conditional Tests in Log-linear Models." Biometrika, 86, 321–332.

Buzzigoli, L. and Giusti, A. (1999). "An Algorithm to Calculate the Lower and Upper Bounds of the Elements of an Array Given its Marginals." In Statistical Data Protection (SDP'98) Proceedings, 131–147. Eurostat, Luxembourg.

Caffo, B. S. and Booth, J. G. (2001). "A Markov Chain Monte Carlo Algorithm for Approximating Exact Conditional Probabilities." Journal of Computational and Graphical Statistics, 10, 730–745.

— (2003). "Monte Carlo Conditional Inference for Log-linear and Logistic Models: A Survey of Current Methodology." Statistical Methods in Medical Research, 12, 109–123.

Chen, Y. (2007). "Conditional Inference on Tables With Structural Zeros." Journal of Computational and Graphical Statistics, 16, 445–467.

Chen, Y., Diaconis, P., Holmes, S. P., and Liu, J. S. (2005). "Sequential Monte Carlo Methods for Statistical Analysis of Tables." Journal of the American Statistical Association, 100, 109–120.

Chen, Y., Dinwoodie, I. H., and Sullivant, S. (2006). "Sequential Importance Sampling for Multiway Tables." The Annals of Statistics, 34, 523–545.

Diaconis, P. and Efron, B. (1985). "Testing for Independence in a Two-Way Table: New Interpretations of the Chi-Square Statistic." The Annals of Statistics, 13, 845–874.

Diaconis, P. and Sturmfels, B. (1998). "Algebraic Algorithms for Sampling From Conditional Distributions." The Annals of Statistics, 26, 363–397.

Dobra, A. (2003). "Markov Bases for Decomposable Graphical Models." Bernoulli, 9, 1–16.

Dobra, A. and Fienberg, S. E. (2008). "The Generalized Shuttle Algorithm." In Algebraic and Geometric Methods in Statistics Dedicated to Professor Giovanni Pistone, eds. P. Gibilisco, E. Riccomagno, M. P. Rogantin, and H. P. Wynn. Cambridge University Press.

Dobra, A. and Sullivant, S. (2004). "A Divide-and-Conquer Algorithm for Generating Markov Bases of Multi-way Tables." Computational Statistics, 19, 347–366.

Dobra, A., Tebaldi, C., and West, M. (2006). "Data Augmentation in Multi-Way Contingency Tables with Fixed Marginal Totals." Journal of Statistical Planning and Inference, 136, 355–372.

Fienberg, S. E. (2007). The Analysis of Cross-Classified Categorical Data. 2nd ed. Springer Science, New York.

Forster, J. J., McDonald, J. W., and Smith, P. W. F. (1996). "Monte Carlo Exact Conditional Tests for Log-linear and Logistic Models." Journal of the Royal Statistical Society, Series B, 58, 445–453.

Grizzle, J. E. and Williams, O. D. (1972). "Log-linear Models and Tests of Independence for Contingency Tables." Biometrics, 28, 137–156.

Guo, S. W. and Thompson, E. A. (1992). "Performing the Exact Test of Hardy-Weinberg Proportion for Multiple Alleles." Biometrics, 48, 361–372.

Haberman, S. J. (1974). The Analysis of Frequency Data. University of Chicago Press, Chicago.

— (1988). "A Warning on the Use of Chi-Squared Statistics With Frequency Tables With Small Expected Cell Counts." Journal of the American Statistical Association, 83, 555–560.

Krackhardt, D. (1987). "Cognitive Social Structures." Social Networks, 9, 109–134.

Kreiner, S. (1987). "Analysis of Multidimensional Contingency Tables by Exact Conditional Tests: Techniques and Strategies." Scandinavian Journal of Statistics, 14, 97–112.

Mehta, C. R. and Patel, N. R. (1983). "A Network Algorithm for Performing Fisher's Exact Test in r × c Contingency Tables." Journal of the American Statistical Association, 78, 427–434.

Ploog, D. W. (1967). "The Behaviour of Squirrel Monkeys (Saimiri sciureus) as Revealed by Sociometry, Bioacoustics, and Brain Stimulation." In Social Communication Among Primates, ed. S. Altmann, 149–184. University of Chicago Press, Chicago.

Rao, A. R., Jana, R., and Bandyopadhyay, S. (1996). "A Markov Chain Monte Carlo Method for Generating Random (0, 1)-Matrices with Given Marginals." Sankhyā, Series A, 58, 225–242.

Rapallo, F. (2006). "Markov Bases and Structural Zeros." Journal of Symbolic Computation, 41, 164–172.

Snijders, T. A. B. (1991). "Enumeration and Simulation Methods for 0-1 Matrices With Given Marginals." Psychometrika, 56, 397–417.

Sundberg, R. (1975). "Some Results About Decomposable (or Markov-type) Models for Multidimensional Contingency Tables: Distributions of Marginals and Partitioning of Tests." Scandinavian Journal of Statistics, 2, 71–79.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. John Wiley & Sons.
