Simulate and Reject Monte Carlo Exact Conditional Tests for Quasi-independence
Peter W. F. Smith and John W. McDonald
Department of Social Statistics, University of Southampton, Southampton, SO17 1BJ, United Kingdom
1 Introduction
In a two-way contingency table, the null hypothesis of quasi-independence (QI) usually arises for one of two reasons: 1) some cells involve structural zeros, or 2) interest is focused on part of the table, e.g., the off-diagonal cells. Consider Table 1, analyzed by Becker (1990), which cross-classifies two independent interpretations of sputum cytology slides for lung cancer. Since the two interpretations tend to agree, most of the observations lie on the main diagonal and the hypothesis of independence is rejected. The hypothesis of QI for the off-diagonal cells, i.e., that the interpretations are independent given that they differ, is considered. However, the sparseness of the off-diagonal cells causes concern about the validity of using asymptotic tests, and an exact conditional test is used.

Table 1: Cross-classification of first and second independent interpretations of sputum cytology slides for lung cancer (Source: Archer et al., 1966)

                                   Second interpretation
First interpretation             N     A     S     P     T   Total
Negative                        26    19     1     0     7      53
Ambiguous cells                  2    11     5     3     4      25
Suspect                          0     1     6     6     0      13
Positive                         0     0     0     4     1       5
Technically unsatisfactory       1     1     0     0     2       4
Total                           29    32    12    13    14     100
In order to perform an exact test of quasi-independence, the null distribution of an appropriate test statistic must be calculated or simulated. For both independence and quasi-independence, calculating the required distribution is often computationally infeasible, so simulation is used and a Monte Carlo exact conditional test is performed.
A Monte Carlo exact conditional test for independence is described by Agresti, Wackerly and Boyett (1979), Kreiner (1987) and Whittaker (1990). Briefly, one generates a random sample of tables according to the conditional distribution of the table counts given the marginal totals. For each generated table, an appropriate test statistic is calculated, and the exact conditional p-value is estimated by the proportion of generated tables which are at least as discrepant from the null as the observed table. The accuracy of this unbiased estimate may be evaluated using binomial confidence intervals. The problem, when using this approach to test for quasi-independence, is how to generate a random sample of tables from the null distribution since, as shown by Smith and McDonald (1993), the null distribution has a normalizing constant which is very difficult to evaluate. In the next section, we introduce a simulate and reject procedure based on simulating tables under independence. We then suggest some modifications which dramatically reduce the rejection rate and so make the procedure viable.
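The Monte Carlo test for independence outlined above can be sketched as follows. This is a minimal illustration only: the sampler shuffles column labels across units, which is a simple stand-in for the algorithms in the cited papers, and the function names `random_table`, `lr_statistic` and `monte_carlo_p_value` are hypothetical. The likelihood ratio statistic is used as the discrepancy measure, which is one possible choice.

```python
import math
import random

def random_table(row_totals, col_totals, rng):
    """Draw one table with the given margins under independence.

    Shuffle-based stand-in: each unit keeps its row label and receives a
    randomly permuted column label, which fixes both sets of margins.
    """
    rows = [i for i, n in enumerate(row_totals) for _ in range(n)]
    cols = [j for j, n in enumerate(col_totals) for _ in range(n)]
    rng.shuffle(cols)
    table = [[0] * len(col_totals) for _ in row_totals]
    for i, j in zip(rows, cols):
        table[i][j] += 1
    return table

def lr_statistic(table):
    """Likelihood ratio statistic for independence: 2 * sum x * log(x / fitted)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    return 2.0 * sum(
        x * math.log(x * n / (row[i] * col[j]))
        for i, r in enumerate(table)
        for j, x in enumerate(r)
        if x > 0
    )

def monte_carlo_p_value(observed, n_sim, rng):
    """Proportion of simulated tables at least as discrepant as the observed one."""
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    threshold = lr_statistic(observed) - 1e-9  # tolerance for floating-point ties
    hits = sum(
        lr_statistic(random_table(row, col, rng)) >= threshold
        for _ in range(n_sim)
    )
    return hits / n_sim

# Table 1 (rows: first interpretation, columns: second interpretation).
TABLE1 = [
    [26, 19, 1, 0, 7],
    [2, 11, 5, 3, 4],
    [0, 1, 6, 6, 0],
    [0, 0, 0, 4, 1],
    [1, 1, 0, 0, 2],
]
p_hat = monte_carlo_p_value(TABLE1, 1000, random.Random(1))
print(p_hat)
```

For Table 1 the estimate should be very small, in line with the rejection of independence noted in the Introduction. The same skeleton does not apply directly to QI, because the sampler above does not hold the fixed cells at their observed counts.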
2 Simulate and Reject Procedure
Let $X = \{X_{ij} : ij \in \mathcal{I} = (1, \ldots, r) \times (1, \ldots, c)\}$ be an $r \times c$ contingency table, and let $I$ be a proper subset of the index set $\mathcal{I}$. We call the cells in $I$ the cells of interest and the cells not in $I$ fixed. For the $5 \times 5$ Table 1, $I$ refers to the off-diagonal cells. The saturated log-linear model for $m_{ij} = E(X_{ij})$ has the form
$$\log m_{ij} = \lambda + \lambda^{1}_{i} + \lambda^{2}_{j} + \lambda^{12}_{ij}.$$
The hypothesis of quasi-independence over $I$ corresponds to $\lambda^{12}_{ij} = 0$ for $ij \in I$. Now $\lambda^{12}_{ij}$ for $ij \notin I$ are nuisance parameters with sufficient statistics $x_{ij}$, $ij \notin I$. Therefore, an exact conditional test for QI is constructed using the conditional distribution of the table counts, given the margins and the observed counts in the fixed cells. Hence, tables under QI can be generated by simulating tables under independence and only retaining those where the counts in the fixed cells match the observed values.

For Table 1, we simulate tables from a multivariate hypergeometric distribution, thus maintaining the margins, and reject all tables which do not match the diagonal (26, 11, 6, 4, 2). Methods for simulating from a multivariate hypergeometric distribution are given by Agresti, Wackerly and Boyett (1979) and Patefield (1981). Alas, this naive simulate and reject procedure is not computationally viable, since we failed to simulate under independence a table with a matching diagonal in over one billion attempts!

All is not lost. Smith and McDonald (1993) show that the distribution of the cells of interest under quasi-independence does not depend on the observed values of the fixed cells. By replacing the values in the fixed cells with any counts and adjusting the margins accordingly, the simulate and reject procedure yields the correct null distribution. Therefore, the rejection rate can be significantly reduced by replacing the counts in the fixed cells by those closest to independence, based on the adjusted margins. For Table 1 we replace the diagonal with (3, 13, 1, 0, 1). Note that the row and column margins for this adjusted table are $(x_{i+}) = (30, 27, 8, 1, 3)$ and $(x_{+i}) = (6, 34, 7, 9, 13)$, respectively, and that now $x_{ii}$ equals the nearest integer to $x_{i+} x_{+i} / x_{++}$, where $+$ denotes summation over a subscript. Using this adjusted table, in order to obtain 2000 tables with matching diagonal, 234,595 tables were simulated under independence. The rejection rate of 99.15% is very large, but this adjusted-margins simulate and reject procedure is now computationally feasible.

Patefield (1981) simulates the required multivariate hypergeometric distribution by simulating cell by cell and row by row from univariate hypergeometrics, based on a factorization of the multiple hypergeometric mass function. Note that each $r \times c$ table requires $(r - 1)(c - 1)$ simulated counts (the others are obtained by subtraction). For Table 1, $234{,}595 \times 16 = 3{,}753{,}520$ simulations were required to obtain 2000 tables with matching diagonal, i.e., an average of 1877 simulations per retained table. We now propose various ways of reducing the average number of simulated cell counts required per retained table, by modifying Patefield's algorithm.
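The adjusted diagonal and the cost figures quoted above can be checked with a few lines of arithmetic. The sketch below only verifies the reported values; it does not show how the authors searched for the adjusted diagonal, and the helper names `with_diagonal`, `margins` and `closest_to_independence` are hypothetical.

```python
import math

TABLE1 = [
    [26, 19, 1, 0, 7],
    [2, 11, 5, 3, 4],
    [0, 1, 6, 6, 0],
    [0, 0, 0, 4, 1],
    [1, 1, 0, 0, 2],
]

def with_diagonal(table, diagonal):
    """Copy of `table` with its diagonal replaced; the margins change accordingly."""
    adjusted = [row[:] for row in table]
    for i, value in enumerate(diagonal):
        adjusted[i][i] = value
    return adjusted

def margins(table):
    """Row totals and column totals of a table."""
    return [sum(r) for r in table], [sum(c) for c in zip(*table)]

def closest_to_independence(table):
    """Nearest integer to x_{i+} x_{+i} / x_{++} for each diagonal cell."""
    rows, cols = margins(table)
    n = sum(rows)
    return [math.floor(rows[i] * cols[i] / n + 0.5) for i in range(len(table))]

adjusted = with_diagonal(TABLE1, [3, 13, 1, 0, 1])
row_margins, col_margins = margins(adjusted)
print(row_margins, col_margins)           # adjusted margins
print(closest_to_independence(adjusted))  # should reproduce the adjusted diagonal

# Cost of the adjusted-margins procedure, using the counts reported in the text.
tables_simulated, tables_retained = 234_595, 2000
cells_per_table = (5 - 1) * (5 - 1)
total_cells = tables_simulated * cells_per_table
rejection_rate = 1 - tables_retained / tables_simulated
print(total_cells, total_cells / tables_retained, rejection_rate)
```

The check confirms that the diagonal (3, 13, 1, 0, 1) is a fixed point: recomputing the nearest-integer independence fit from the adjusted margins returns the same diagonal.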
2.1 Rejecting Partly Simulated Tables
Patefield's algorithm simulates tables cell by cell, so a mismatch can be identified immediately after the count for a fixed cell has been simulated, thus eliminating unnecessary simulation of the remaining cell counts in the table. For Table 1, after adjusting the diagonal and margins, we would repeatedly simulate the (1,1) cell count until a match of 3 occurs, then simulate the (1,2) to (1,4) cell counts and obtain the (1,5) cell count by subtraction. However, since the number of rejections does not affect the distribution of the tables retained, the (1,1) cell count can be set at its observed value and the rest of the row obtained as described. Next the (2,1) and (2,2) cell counts are simulated. If the simulated (2,2) cell count matches the observed value of 13, the rest of the row can be simulated. If not, the whole table must be rejected and a new table started. Once we have a successful match for the (2,2) cell count, we can continue simulating the table until the count in the next fixed cell is simulated, the (3,3) cell here. Again, if we have a match, we continue simulating the table; a mismatch means that we must reject the table and start again. We continue in this manner until we have simulated a table with the required matching counts for all fixed cells, remembering to check for matches where the count is obtained by subtraction. Partly simulated tables are now rejected, so efficiency is measured by the number of cell counts simulated per retained table. By fixing the first cell count and rejecting partly simulated tables, 481,605 simulations were required to obtain 2000 tables, an average of 241 per retained table (versus 1877 without these improvements).
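A minimal sketch of row-by-row simulation with early rejection is given below. It is not Patefield's Algorithm AS 159: each cell is drawn from its conditional univariate hypergeometric distribution by a plain inverse-CDF loop, which is only practical for small totals. The names `hypergeom_draw` and `simulate_with_early_rejection` are hypothetical, and the sketch assumes that the (1,1) cell is among the fixed cells and that its value is feasible for the margins.

```python
import math
import random

def hypergeom_draw(population, successes, draws, rng):
    """Inverse-CDF draw: successes seen in `draws` draws without replacement."""
    lo = max(0, draws - (population - successes))
    hi = min(draws, successes)
    u = rng.random() * math.comb(population, draws)
    acc = 0
    for k in range(lo, hi + 1):
        acc += math.comb(successes, k) * math.comb(population - successes, draws - k)
        if u < acc:
            return k
    return hi  # guards against floating-point rounding at the upper end

def simulate_with_early_rejection(row_totals, col_totals, fixed, rng):
    """Return (table, n_simulated) for one retained table.

    `fixed` maps 0-based (i, j) to the required count. A table is abandoned
    at the first fixed cell that fails to match. `n_simulated` counts the
    simulated cells over all attempts; cells in the last row or column are
    determined by subtraction and are not counted.
    """
    r, c = len(row_totals), len(col_totals)
    n_simulated = 0
    while True:
        col_left = list(col_totals)  # column totals not yet used by earlier rows
        table, matched = [], True
        for i in range(r):
            row_left = row_totals[i]
            pool = sum(col_left)     # units available to columns j, ..., c-1
            row = []
            for j in range(c):
                if (i, j) == (0, 0) and (0, 0) in fixed:
                    x = fixed[(0, 0)]  # first fixed cell is set, not simulated
                else:
                    x = hypergeom_draw(pool, col_left[j], row_left, rng)
                    if i < r - 1 and j < c - 1:
                        n_simulated += 1
                if (i, j) in fixed and x != fixed[(i, j)]:
                    matched = False
                    break
                row.append(x)
                pool -= col_left[j]
                row_left -= x
                col_left[j] -= x
            if not matched:
                break
            table.append(row)
        if matched:
            return table, n_simulated

# Adjusted margins and diagonal for Table 1, as given in the text.
ROW = [30, 27, 8, 1, 3]
COL = [6, 34, 7, 9, 13]
FIXED = {(i, i): v for i, v in enumerate([3, 13, 1, 0, 1])}

rng = random.Random(2)
counts = [simulate_with_early_rejection(ROW, COL, FIXED, rng)[1] for _ in range(50)]
print(sum(counts) / len(counts))  # average simulated cell counts per retained table
```

The printed average is a rough Monte Carlo figure from 50 retained tables and will vary with the seed; it is meant for comparison with the order of magnitude reported in the text, not as a reproduction of it.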
2.2 Changing the Order of Cell Count Simulation
A further improvement is to permute the rows and columns of the table in order to attempt to match the counts in the fixed cells as early as possible, which on average reduces the number of wasted simulations. For example, if the only fixed cell in an $r \times c$ table is the $(r, c)$ cell, we must simulate the whole table before checking for a match for the last cell. By permuting the rows and columns so that the fixed cell becomes the (1,1) cell, we can set the count in the (1,1) cell at its observed value and simulate the rest of the table. Therefore, no rejection is required. McDonald and Smith (1994) extend this idea to triangular tables and propose an algorithm where no rejection is necessary. However, when simulating cell counts row by row, no such permutation is possible for tables with only diagonal fixed cells.

We now discuss the important and common situation of testing for off-diagonal QI in an $r \times r$ table. Recall that Patefield's algorithm simulates cell by cell, row by row. However, one can show that in order to simulate the $(i, j)$ cell only cells above and to the left need to have been simulated, i.e., the cells $(k, l)$, $k = 1, \ldots, i$, $l = 1, \ldots, j$, $k \neq l$. Note that these cells plus the cell whose count is being simulated form a rectangle. Therefore, we can change the order in which the cells are simulated so as to attempt to match the counts in the fixed cells as early as possible. When matching on the diagonal, we can set the (1,1) cell count to the (adjusted) observed value. The next fixed count to match on is in the (2,2) cell, so we need only simulate cell counts above and to the left before checking that the simulated count for the (2,2) cell equals the (adjusted) observed value. Here we have only simulated 3 cell counts before checking for a match. If we have a mismatch, we have saved $r - 3$ unnecessary simulations for the first row.
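The saving of $r - 3$ simulations can be illustrated by listing the cells in the order they are visited. The sketch below only enumerates visiting orders; it does not draw any counts. The order used within each rectangle (column above the diagonal cell first, then the row to its left) is an assumption, since the text does not specify it, and the function names are hypothetical.

```python
def row_by_row_order(r):
    """Simulated cells (1-based) in row-by-row order for an r x r table.

    The (1,1) cell is set rather than simulated, and the last row and last
    column are obtained by subtraction, so none of these appear in the list.
    """
    return [(i, j) for i in range(1, r) for j in range(1, r) if (i, j) != (1, 1)]

def reordered(r):
    """Reordered scheme: before each diagonal cell (k,k), visit only the
    cells above it in column k and to its left in row k."""
    order = []
    for k in range(2, r):
        order += [(i, k) for i in range(1, k)]  # cells above (k,k)
        order += [(k, j) for j in range(1, k)]  # cells to the left of (k,k)
        order.append((k, k))
    return order

def cells_before_check(order, cell):
    """Number of cell counts simulated up to and including `cell`."""
    return order.index(cell) + 1

r = 5
print(reordered(r)[:3])
print(cells_before_check(row_by_row_order(r), (2, 2)))  # row by row
print(cells_before_check(reordered(r), (2, 2)))         # reordered
```

For $r = 5$ the row-by-row scheme simulates $r = 5$ counts before the (2,2) check and the reordered scheme simulates 3, a difference of $r - 3 = 2$. Both schemes visit the same set of cells when a table is retained; only the position of the checks changes.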
If we have a match, we continue by simulating the counts of the cells above and to the left of the (3,3) cell, which reduces the number of simulations required before the second match is attempted. After each successful match we continue through the table in this manner. We call this the expanding-rectangle algorithm. For Table 1, this algorithm reduced the average number of simulations per retained table to 172 (from 241 when simulating row by row).

For an $r \times r$ table with fixed diagonal, the counts in the fixed cells can be reordered by permuting the rows and columns using the same permutation. For the $r!$ possible reorderings, the average number of simulations per retained table varies. In our experience, attempting the "hardest" matches first reduces the average number of simulations per retained table. For example, if the $(r, r)$ cell count is the "hardest", we would simulate the whole table only to have to reject it frequently, because the final match is the "hardest". On the other hand, if the "hardest" match is the (1,1) cell count, this count is set to the (adjusted) observed value and the "hardest" match is never attempted. Our measure of hardness of match for the $(i, i)$ cell count is the conditional probability of a match, given that we have matched on the $(k, k)$, $k = 1, \ldots, i - 1$, cell counts.
When trying to determine the optimum permutation of the diagonal, the problem is how to calculate the conditional probability of a match, i.e., the hardness of a match. However, our experience suggests that the conditional probability of a match is approximately equal to the "marginal" probability of a match, i.e., the probability of a match if we were simulating the whole table before checking for matches. This is easily calculated for each diagonal cell since, as shown by Patefield's factorization and used in his algorithm, the marginal distribution of each cell count is hypergeometric. For Table 1 with diagonal (3, 13, 1, 0, 1), the marginal probabilities of a match are 0.3095, 0.1925, 0.4117, 0.8696 and 0.3821, respectively. We permute the rows (and columns) using the permutation (2, 1, 5, 3, 4) so that these probabilities are in increasing order for the rearranged table. Now using the expanding-rectangle algorithm on the permuted table, the average number of simulations per retained table is reduced to 104 (from 172 before permuting).
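The marginal match probabilities and the resulting permutation can be reproduced from the adjusted margins with the hypergeometric mass function. In this sketch the function names `hypergeom_pmf`, `match_probabilities` and `hardest_first` are hypothetical; the margins and diagonal are those given in the text.

```python
import math

def hypergeom_pmf(k, population, successes, draws):
    """P(k successes) when drawing `draws` units without replacement."""
    return (
        math.comb(successes, k)
        * math.comb(population - successes, draws - k)
        / math.comb(population, draws)
    )

def match_probabilities(row_totals, col_totals, diagonal):
    """Marginal probability that each diagonal cell equals its required count."""
    n = sum(row_totals)
    return [
        hypergeom_pmf(d, n, c, r)
        for r, c, d in zip(row_totals, col_totals, diagonal)
    ]

def hardest_first(probabilities):
    """1-based row/column order with the smallest match probability first."""
    order = sorted(range(len(probabilities)), key=probabilities.__getitem__)
    return [i + 1 for i in order]

ROW = [30, 27, 8, 1, 3]    # adjusted row margins
COL = [6, 34, 7, 9, 13]    # adjusted column margins
DIAG = [3, 13, 1, 0, 1]    # adjusted diagonal

probs = match_probabilities(ROW, COL, DIAG)
print([round(p, 4) for p in probs])
print(hardest_first(probs))
```

Sorting by the marginal probability is only a proxy for the conditional hardness measure defined in the text, as the authors themselves note.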
2.3 Estimated P-values
The likelihood ratio test statistic for quasi-independence for Table 1 is 37.194, with an estimated exact p-value of 0.00005 and an associated 99% confidence interval of (0.00000, 0.00018), based on 20,000 tables generated under QI. While the observed test statistic and associated p-value are extreme, note that the rejection rate does not depend on their values.
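The reported interval is consistent with a normal-approximation binomial interval truncated at zero. The sketch below rests on two assumptions that the text does not state: that exactly one of the 20,000 generated tables was at least as discrepant as the observed table (0.00005 × 20,000 = 1), and that the authors used this type of interval. The function name `normal_approx_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def normal_approx_interval(hits, n_sim, level):
    """Normal-approximation binomial interval for a Monte Carlo p-value,
    truncated to [0, 1]."""
    p_hat = hits / n_sim
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n_sim)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Assumed: 1 of the 20,000 tables was at least as extreme as the observed one.
p_hat, lower, upper = normal_approx_interval(1, 20_000, 0.99)
print(f"{p_hat:.5f} ({lower:.5f}, {upper:.5f})")
```

With so few exceedances the normal approximation is crude, and an exact binomial interval would give somewhat different bounds, so this should be read as a plausibility check on the reported figures only.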
3 Discussion
In this paper, we propose improvements to a naive simulate and reject procedure for generating $r \times c$ tables under quasi-independence for an arbitrary pattern of fixed cells. Although some of the algorithmic improvements are described for generating under QI for the off-diagonal cells of a square table, the ideas are applicable to other patterns of fixed cells. Apart from complete enumeration, which is only viable for small tables, the simulate and reject procedure is currently the only method for generating independent tables from the exact null distribution under QI. Our improvements to the naive procedure greatly increase its efficiency.

Smith, McDonald and Forster (1994) discuss another method for generating tables under QI using a Gibbs sampling approach, based on theoretical results in Forster, McDonald and Smith (1994). However, the generated tables are not necessarily independent and are only realizations from an approximation to the exact null distribution. When using a single Markov chain, the observed table is the obvious starting value. For multiple chains, obtaining other starting values with the same sufficient statistics for the nuisance parameters as the observed data is problematic. A possible solution is to generate a small number of independent starting values using the simulate and reject algorithms proposed.
Acknowledgements
This work was supported by Economic and Social Research Council award H519255005 as part of the Analysis of Large and Complex Datasets Programme.
References
Agresti, A., Wackerly, D. and Boyett, J. M. (1979). Exact conditional tests for cross-classifications: approximation of attained significance levels. Psychometrika, 44, 75-83.
Archer, P. G., Koprowska, I., McDonald, J. R., Naylor, B., Papanicolaou, G. N. and Umiker, W. O. (1966). A study of variability in the interpretation of sputum cytology slides. Cancer Res., 26, 2122-2144.
Becker, M. P. (1990). Quasisymmetric models for the analysis of square contingency tables. J. R. Statist. Soc. B, 52, 369-378.
Forster, J. J., McDonald, J. W. and Smith, P. W. F. (1994). Monte Carlo exact conditional tests for log-linear and logistic models. Working Paper, University of Southampton.
Kreiner, S. (1987). Analysis of multi-dimensional contingency tables by exact conditional tests: techniques and strategies. Scand. J. Statist., 14, 97-112.
McDonald, J. W. and Smith, P. W. F. (1994). Exact conditional tests of quasi-independence for triangular contingency tables: estimating attained significance levels. Appl. Statist., (to appear).
Patefield, W. M. (1981). Algorithm AS 159: An efficient method of generating random R × C tables with given row and column totals. Appl. Statist., 30, 91-97.
Smith, P. W. F. and McDonald, J. W. (1993). Exact conditional tests for incomplete contingency tables: estimating attained significance levels. Working Paper, University of Southampton.
Smith, P. W. F., McDonald, J. W. and Forster, J. J. (1994). Monte Carlo exact conditional tests for quasi-independence using Gibbs sampling. Working Paper, University of Southampton.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Chichester: Wiley.