DOTTORATO DI RICERCA IN MATEMATICA E APPLICAZIONI (XV CICLO)
Sede Amministrativa: Università di Genova
Sedi consorziate: Università di Torino, Politecnico di Torino

LOG-LINEAR MODELS AND TORIC IDEALS

Fabio Rapallo

Thesis advisor: Prof. GIOVANNI PISTONE (Politecnico di Torino)
PhD Coordinator: Prof. CLAUDIO PEDRINI (Università di Genova)

Thesis presented for the degree of Dottore di Ricerca in Matematica e Applicazioni, April 2003
Contents

Introduction 3

1 Statistical and algebraic background 7
  1.1 Log-linear models 8
    1.1.1 Definitions and basic properties 8
    1.1.2 Maximum likelihood estimates 12
    1.1.3 Goodness of fit tests 14
    1.1.4 An important remark 16
    1.1.5 Exact methods 17
  1.2 Algebraic theory of toric ideals 21
    1.2.1 Definitions and first properties 22
    1.2.2 Computation of toric ideals. First method 24
    1.2.3 Navigating inside the set of solutions of an integer linear system 25
    1.2.4 Computation of toric ideals. Second method 27

2 The Diaconis-Sturmfels algorithm 29
  2.1 Hypergeometric distribution 29
  2.2 The algorithm 31
  2.3 Rates of convergence 36
  2.4 Two examples 36

3 Two-way contingency tables 39
  3.1 Two-way contingency tables: Independence 40
  3.2 Some models for incomplete and square tables 44
    3.2.1 Incomplete tables 44
    3.2.2 The quasi-independence model 45
    3.2.3 The symmetry model 47
    3.2.4 The quasi-symmetry model 47
    3.2.5 Computational notes 48
  3.3 Examples 49
  3.4 An application. The h-sample problem 52

4 Rater agreement models 55
  4.1 A medical problem 55
  4.2 Multidimensional rater agreement models 56
  4.3 Exact inference 61
  4.4 Conditioning 62
  4.5 Computations 64
  4.6 Example 70

5 Sufficiency and estimation 73
  5.1 Introduction 74
  5.2 Sufficiency and sampling 75
  5.3 Geometry of toric models 78
  5.4 Algebraic representation of the sufficient statistics 84
  5.5 Minimum-variance estimator 85
  5.6 Algebraic method 87

A Programs 89

Bibliography 98
Introduction

It is useful to start this work with two quotations from a survey paper by Agresti (2001) which recently appeared in "Statistics in Medicine".

"For the null hypothesis of independence and conditional independence, the relevant conditional distribution is simple, which is why ordinary Monte Carlo applies so easily. This is not true of more complex hypotheses, such as quasi-independence, quasi-symmetry, and others for square and triangular contingency tables that were until recently given little attention in the exact literature."

"If statisticians cannot agree how to analyze even a 2 × 2 table, with no one approach being obviously best, what hope is there for a consensus on more complex analyses? The probability is probably decreasing to 0 with time, . . ."

These two claims concisely convey how much statisticians need simple and comprehensive procedures for analyzing complex problems involving contingency tables. In this thesis we use techniques from Computational Commutative Algebra to solve complex problems in estimation and hypothesis testing for discrete data. The application of Computational Commutative Algebra to statistics and probability is a recent topic, and a first book in the field is Pistone et al. (2001a). In this thesis, starting from the work by Diaconis & Sturmfels (1998), we study log-linear models through polynomial algebra, with important applications to statistical problems such as goodness-of-fit tests for contingency tables. In particular, we apply the algebraic theory of toric ideals to study a Markov Chain Monte Carlo method for sampling from conditional distributions, through the notion of Markov basis.

Throughout the thesis we consider a wide range of statistical problems: models for two-way and multi-way contingency tables, rater agreement models, and contingency tables with structural zeros. In some cases we characterize the relevant Markov basis, while in
other cases we give results which lead to a fundamental simplification of the symbolic computations. A number of examples will show the relevance, efficiency and versatility of these algebraic procedures. The relevance of the topics considered here is confirmed by the increasing number of specific workshops, conferences and lectures on this subject. Moreover, the number of researchers interested in the field is growing rapidly, and includes Professor Bernd Sturmfels and Professor Stephen Fienberg.

In Chapter 1, we introduce background material about log-linear models and toric ideals, which play a key role in the analysis of contingency tables. In Chapter 2, we present a slight generalization of the Diaconis-Sturmfels algorithm, while in Chapter 3 we analyze the case of two-way contingency tables, with some results characterizing the relevant toric ideals and others useful for simplifying their computation. In Chapter 4 we study an application of the previous theory to rater agreement models, which are widely used in biomedical and pharmaceutical research; we also give results on the characterization of toric ideals for multi-way contingency tables. Finally, in Chapter 5 we study the algebraic properties of the notions of sufficiency and maximum likelihood estimation in statistical models on a finite sample space, with results about the computation of the sufficient statistic of a model and its geometric representation. The programs, both symbolic and numerical, presented in Appendix A show the practical applicability of the algorithms, together with the examples on real data sets presented in all chapters.

The results presented in Chapters 3, 4 and 5 represent original work, as does the generalization of the algorithm presented in Chapter 2 and the programs in Appendix A. Parts of the thesis have been published in preliminary versions.
In particular, a part of Chapter 2 is in press in the "Scandinavian Journal of Statistics", while parts of Chapters 4 and 5 are the subject of a preprint of the Department of Mathematics of Genova. Two papers are currently submitted to international journals. The topics presented here have been the subject of presentations at conferences and international workshops; among others, we cite the "Grostat" series and the Meetings of the Italian Statistical Society. Moreover, the tutorial for the course "Statistical Toric Models", at the "First International School on Algebraic Statistics", has been based on the work
presented in this thesis. Finally, the theory and the applications developed in Chapter 4 have been the starting point for an ongoing cooperation with the pharmaceutical company "Schering" in Berlin.
Acknowledgements

During the writing of this work, over the last two years, I have benefited from the help of many people, both scientifically and personally. In alphabetical order, a minimal list is formed by Daniele De Martini (University of "Piemonte Orientale", Novara), Franco Fagnola (University of Genova), Mauro Gasparini (Politecnico di Torino), my advisor Giovanni Pistone (Politecnico di Torino), Eva Riccomagno (University of Warwick), Lorenzo Robbiano (University of Genova), Maria Piera Rogantin (University of Genova), Ivano Repetto (University of Genova), and Richard Vonk (Schering, Berlin). Outside academia, I want to acknowledge the constant support of Matteo Cella, my cousin, and Andrea Giacobbe. Their support has been essential in the most critical moments.
Chapter 1

Statistical and algebraic background

In this chapter we introduce the main background theory we need, from both the statistical and the algebraic point of view. In Section 1.1 we recall the properties of log-linear models for the analysis of qualitative data, with particular attention to maximum likelihood estimation and goodness of fit tests in the classical asymptotic approach. Moreover, we emphasize how simple algebraic objects, such as power products and binomials, naturally appear in the classical analysis of log-linear models. Finally, in Section 1.1.5, we review the main exact methods for hypothesis testing for contingency tables. Exact methods date back to the work of Fisher in the 1940s, where an exact test for independence in 2 × 2 contingency tables was introduced. A wide range of exact procedures for inference on contingency tables is now available, and in most cases the increased speed and computing power of computers has played a fundamental role in the practical applicability of such procedures. We describe here the main methods for the exact analysis of contingency tables.

In Section 1.2, we introduce the algebraic theory of toric ideals, with special attention to the statistical meaning of the algebraic procedures. A brief review of some relevant concepts from Computational Commutative Algebra is included in order to keep the presentation as self-contained as possible. In particular, we deal with the computation of toric ideals, which will represent a key step in the algorithms presented in the next chapters.

In this chapter the proofs of the results are omitted, with references to the literature, in particular Haberman (1974), Bishop et al. (1975) and Agresti (2002) for the theory of log-linear models, and Kreuzer & Robbiano (2000) and Bigatti & Robbiano (2001) for the theory of toric ideals.
1.1 Log-linear models

1.1.1 Definitions and basic properties
The material presented in this section is mainly taken from classical works in the field of log-linear models. The aim of this section is to show that in the classical representation of log-linear models, i.e. using vector space theory, polynomial equations play a fundamental role. Some simple examples from this section will be used many times throughout the thesis. The main references are Haberman (1974), Bishop et al. (1975), Agresti (1996), Fienberg (1980) and Dobson (1990). In particular we follow the notation in Haberman (1974), as it allows us to give a unified presentation of the theory of log-linear models in a general setting. Some examples and useful remarks are taken from the other references.

Log-linear models are statistical models for the analysis of qualitative data, so the sample space is finite and the data are usually presented in a contingency table.

Definition 1.1 A contingency table is a collection f of frequencies (f(x) ∈ N : x ∈ X), where X is the sample space.

A contingency table contains the joint observations of d random variables on N subjects. A first classification of contingency tables can be made by looking at the sample space X. In particular, the contingency table is said to be a d-way contingency table if f is the cross-classification of the subjects with respect to d random variables. Consequently, the sample space X is a finite subset of N^d, possibly after a suitable coding of the levels of the random variables. Moreover, the contingency table f is complete if the sample space is a Cartesian product of one-dimensional sample spaces, and incomplete otherwise.

Log-linear models are used to describe the stochastic structure of the joint probability distribution underlying the sample, i.e. to determine stochastic relations between the observed variables, such as independence, non-interaction, symmetry and other more complex relations.
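The distinction between complete and incomplete sample spaces can be made concrete with a small sketch (plain Python; the level sets and the omitted cells are made-up choices for illustration):

```python
from itertools import product

# A complete 2-way sample space is the Cartesian product of the level sets.
levels_X = [1, 2]
levels_Y = [1, 2, 3]
complete = set(product(levels_X, levels_Y))

# An incomplete table omits some cells, here (arbitrarily) the diagonal ones,
# as happens for instance with structural zeros.
incomplete = {cell for cell in complete if cell[0] != cell[1]}

print(len(complete), len(incomplete))  # 6 4
```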
If we denote the cardinality of the sample space X by q, the table f is an element of the q-dimensional space R^q, with the standard inner product defined by

$$\langle f_1, f_2 \rangle = \sum_{x \in X} f_1(x) f_2(x)$$
and the canonical basis (e_x)_{x ∈ X}. In a log-linear model, f is assumed to be the realization of a random vector F with mean m = (m(x) : x ∈ X), where m(x) > 0 for all x ∈ X. Here and throughout we denote observed tables by lowercase letters and random vectors by the corresponding uppercase letters. Since we assume that the probabilities of the sample points are positive, log m(x) is well defined for each x ∈ X, and the means m(x) are a parameterization of the model. Therefore, if µ = (log m(x) : x ∈ X), then µ ∈ R^q. In a log-linear model it is assumed that µ ∈ M, where M is a p-dimensional linear manifold contained in R^q. When q = p, the corresponding log-linear model is said to be saturated.

Example 1.2 Consider a 2-way contingency table with two levels for each random variable, i.e. a 2 × 2 contingency table. The space R^4 is spanned by the four orthogonal vectors

$$v_1 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 1 & 1 \\ -1 & -1 \end{pmatrix}, \quad v_3 = \begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix}, \quad v_4 = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.$$
We can consider the model M = span(v_1, v_2, v_3). In this case, Bishop et al. (1975) showed that the model can be written in the form

$$\log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j \qquad (1.1)$$

for i, j = 1, 2, with the constraints

$$\lambda^X_1 + \lambda^X_2 = \lambda^Y_1 + \lambda^Y_2 = 0. \qquad (1.2)$$

If we denote by m_{i+} and m_{+j} the marginal totals, one can write

$$m_{i+} = e^{\lambda} e^{\lambda^X_i} (e^{\lambda^Y_1} + e^{\lambda^Y_2}), \qquad m_{+j} = e^{\lambda} e^{\lambda^Y_j} (e^{\lambda^X_1} + e^{\lambda^X_2}),$$
$$N = e^{\lambda} (e^{\lambda^X_1} + e^{\lambda^X_2}) (e^{\lambda^Y_1} + e^{\lambda^Y_2}).$$

Hence we obtain

$$m(i,j) = m_{i+} m_{+j} / N. \qquad (1.3)$$

Moreover, since v_4 is orthogonal to M, we obtain the relation

$$\log \frac{m(1,1)\, m(2,2)}{m(1,2)\, m(2,1)} = 0 \qquad (1.4)$$
from which we obtain

$$m(1,1)\, m(2,2) - m(1,2)\, m(2,1) = 0. \qquad (1.5)$$
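The factorization (1.3) and the binomial relation (1.5) can be checked numerically. The following sketch (plain Python, with arbitrary illustrative values of the λ parameters satisfying the constraints (1.2)) builds the cell means from the log-linear form (1.1) and verifies both:

```python
import math

# Hypothetical parameters for the model (1.1)-(1.2); each of lambda^X and
# lambda^Y sums to zero, as the constraints require.
lam = 0.5
lam_X = [0.3, -0.3]
lam_Y = [-0.8, 0.8]

# Cell means m(i, j) = exp(lambda + lambda^X_i + lambda^Y_j).
m = [[math.exp(lam + lam_X[i] + lam_Y[j]) for j in range(2)] for i in range(2)]

N = sum(map(sum, m))
row = [sum(m[i]) for i in range(2)]            # marginal totals m_{i+}
col = [m[0][j] + m[1][j] for j in range(2)]    # marginal totals m_{+j}

# Equation (1.3): m(i, j) = m_{i+} m_{+j} / N.
for i in range(2):
    for j in range(2):
        assert abs(m[i][j] - row[i] * col[j] / N) < 1e-12

# Equation (1.5): the binomial m(1,1)m(2,2) - m(1,2)m(2,1) vanishes.
assert abs(m[0][0] * m[1][1] - m[0][1] * m[1][0]) < 1e-12
print("factorization and binomial relation verified")
```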
Equation (1.5) represents an algebraic relation between the means of the cells, and in particular it is a binomial. We will see later the relevance of such binomials in our framework. A detailed representation of the vector space bases of log-linear models is presented in Collombier (1980), with special attention to the meaning of such representations in terms of hypothesis testing.

For statistical inference, it is now essential to specify the underlying probability distribution of the table f. The main models in the literature are the Poisson model and the multinomial model.

The Poisson model

In the Poisson model, the elements of F are independent Poisson random variables with E[F(x)] = m(x) and m(x) > 0 for all x ∈ X. The log-likelihood is

$$\log L(f, \mu) = \sum_{x \in X} \big( f(x) \log m(x) - m(x) \big) = \langle f, \mu \rangle - \sum_{x \in X} \exp(\mu(x)).$$

If P_M is the orthogonal projection from R^q onto M, then since µ ∈ M and P_M is a symmetric operator, we have

$$\log L(f, \mu) = \langle P_M f, \mu \rangle - \sum_{x \in X} \exp(\mu(x)). \qquad (1.6)$$
Therefore, the family of Poisson models with µ ∈ M is an exponential family. By classical results on exponential families, the projection P_M f is a complete minimal sufficient statistic; see for example Lehmann (1983). Moreover, any nonsingular linear transformation of P_M f is again a complete minimal sufficient statistic. Traditionally, Equation (1.6) is presented as the following theorem.

Theorem 1.3 Under the Poisson model, the sufficient statistic is expressible as a linear combination of the cell counts (f(x) : x ∈ X).
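For a concrete instance of Theorem 1.3, consider the independence model on an I × J table: the sufficient statistic can be taken to be the vector of row and column totals, a linear map A f of the cell counts. A small sketch with made-up counts:

```python
# Sufficient statistic of the independence model as a linear function A f of
# the cell counts (Theorem 1.3). Illustrative sketch; the counts are made up.
I, J = 2, 3
f = [4, 1, 3,
     2, 5, 0]          # cell counts f(i, j), stored row-major

# Matrix A: one row per margin, with 0/1 entries selecting the cells.
A = []
for i in range(I):     # rows of A giving the row totals
    A.append([1 if cell // J == i else 0 for cell in range(I * J)])
for j in range(J):     # rows of A giving the column totals
    A.append([1 if cell % J == j else 0 for cell in range(I * J)])

t = [sum(a * x for a, x in zip(row, f)) for row in A]
print(t)  # [8, 7, 6, 6, 3] -> row totals (8, 7) and column totals (6, 6, 3)
```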
For the proof, see Agresti (2002), page 136.

The random vector F has mean m and variance-covariance matrix D(µ), where D(µ) is the diagonal matrix with the variances m(x) on the main diagonal. We denote convergence in probability and in distribution by →_p and →_d, respectively, and by N the Gaussian distribution. The first asymptotic results are summarized in the following proposition.

Proposition 1.4 If (F^{(N)})_{N>0} is a sequence of random vectors from a Poisson model with respective means (m^{(N)})_{N>0} and if N^{-1} m^{(N)} → m, then

$$N^{-1} F^{(N)} \xrightarrow{p} m \qquad (1.7)$$

and

$$N^{-1/2} (F^{(N)} - m^{(N)}) \xrightarrow{d} N(0, D(\mu)). \qquad (1.8)$$

For the proof, see Haberman (1974), page 7.

The multinomial model
In the multinomial model, the vector F consists of one or more independent multinomial random vectors (F_i : i ∈ X_k) with means (m_i : i ∈ X_k). The sets X_k, k = 1, ..., r, are disjoint with union X, and m(x) > 0 for all x ∈ X. We suppose that the model M is such that for each k = 1, ..., r

$$\nu^{(k)} = \big( I_{X_k}(x) : x \in X \big) \in M.$$

For the determination of the complete minimal sufficient statistic, we consider the decomposition of M into M_1 and M_2, where M_1 is the linear manifold generated by (ν^{(k)} : k = 1, ..., r) and M_2 is its orthogonal complement in M. We denote the sample size of (f(x) : x ∈ X_k) by N_k. The log-likelihood is

$$\log L(f, \mu) = \langle P_{M_2} f, P_{M_2} \mu \rangle - \sum_{k=1}^{r} N_k \log\big( \langle m(P_{M_2}\mu), \nu^{(k)} \rangle \big) + \sum_{k=1}^{r} \log N_k!. \qquad (1.9)$$

Thus, P_{M_2} f is a complete minimal sufficient statistic. As any nonsingular linear transformation of P_{M_2} f is again a complete minimal sufficient statistic, we can use the statistic P_M f as in the Poisson model. The random vector F has mean m and variance-covariance matrix

$$\Sigma(\mu) = D(\mu)\big( I - P^{*}_{M_1}(\mu) \big)$$
where P^{*}_{M_1} is the orthogonal projection from R^q onto M_1 and I is the identity matrix. We summarize the basic convergence results in the following proposition.

Proposition 1.5 If (F^{(N)})_{N>0} is a sequence of random vectors from a multinomial model with respective means (m^{(N)})_{N>0} and if N^{-1} m^{(N)} → m, then

$$N^{-1} F^{(N)} \xrightarrow{p} m \qquad (1.10)$$

and

$$N^{-1/2} (F^{(N)} - m^{(N)}) \xrightarrow{d} N\big(0, D(\mu)(I - P^{*}_{M_1}(\mu))\big). \qquad (1.11)$$
For the proof, see Haberman (1974), page 13.

The close connection between the Poisson model and the multinomial model is studied in Haberman (1974), where the author proves that many results about maximum likelihood estimation and hypothesis testing are the same under both models. Other models can also be considered, such as the product-multinomial model and the conditional Poisson model; see Haberman (1974) and Agresti (2002) for details. Moreover, the different meanings of the models from an applied point of view are widely discussed in Chapters 2 and 3 of Fienberg (1980).
1.1.2 Maximum likelihood estimates
We start by considering maximum likelihood estimation for the Poisson model. The maximum likelihood estimate µ̂ of µ is the value such that

$$\log L(f, \hat{\mu}) = \sup_{\mu \in M} \log L(f, \mu). \qquad (1.12)$$

We briefly recall the basic theorems about the existence and uniqueness of the maximum likelihood estimate.

Theorem 1.6 If a maximum likelihood estimate µ̂ exists, then it is unique and satisfies the equation

$$P_M \hat{m} = P_M f. \qquad (1.13)$$

Conversely, if Equation (1.13) is satisfied for some µ̂ ∈ M with m̂ = exp(µ̂), then µ̂ is the maximum likelihood estimate of µ.
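For the independence model, Equation (1.13) amounts to requiring that the fitted means reproduce the row and column totals of f. A minimal iterative proportional scaling sketch (hypothetical counts; for this model one full cycle already reaches the closed form m̂(i, j) = f_{i+} f_{+j} / N):

```python
# Solving (1.13) for the independence model on a 2x2 table by iterative
# proportional scaling; the counts below are made up for illustration.
f = [[10, 4],
     [6, 12]]
I, J = 2, 2
N = sum(map(sum, f))

m = [[N / (I * J)] * J for _ in range(I)]      # start from the uniform table
for _ in range(5):                             # a few scaling cycles
    for i in range(I):                         # match the row totals
        s = sum(m[i])
        m[i] = [v * sum(f[i]) / s for v in m[i]]
    for j in range(J):                         # match the column totals
        s = sum(m[i][j] for i in range(I))
        t = sum(f[i][j] for i in range(I))
        for i in range(I):
            m[i][j] *= t / s

closed = [[sum(f[i]) * (f[0][j] + f[1][j]) / N for j in range(J)]
          for i in range(I)]
assert all(abs(m[i][j] - closed[i][j]) < 1e-9 for i in range(I) for j in range(J))
print(m)  # [[7.0, 7.0], [9.0, 9.0]]
```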
Theorem 1.7 A necessary and sufficient condition for the maximum likelihood estimate µ̂ of µ to exist is that there exists δ ∈ M^⊥ such that f(x) + δ(x) > 0 for all x ∈ X.

Corollary 1.8 If f(x) > 0 for all x ∈ X, then the maximum likelihood estimate µ̂ exists.

Theorem 1.9 A necessary and sufficient condition for the maximum likelihood estimate µ̂ to exist is that there is no µ ∈ M with µ ≠ 0, µ(x) ≤ 0 for all x ∈ X, and ⟨f, µ⟩ = 0.

For the multinomial model, the relevant theorem is the following, which states that the maximum likelihood estimate is the same under the Poisson model and the multinomial model.

Theorem 1.10 If µ̂^{(m)} is the maximum likelihood estimate for a multinomial model with µ ∈ M and µ̂ is the corresponding maximum likelihood estimate for a Poisson model with µ ∈ M, then µ̂^{(m)} = µ̂, in the sense that when one side of the equation exists, the other side exists and the two sides are equal.

For the proofs of all the above theorems, see Haberman (1974), pages 35-41. In view of Theorem 1.10, the necessary and sufficient conditions stated for the Poisson model also apply to the multinomial model.

In many cases the maximum likelihood estimate cannot be computed directly and we resort to numerical approximations. Among the numerical methods, the most important ones in this field are the Newton-Raphson algorithm and the Iterative Proportional Fitting algorithm; details can be found in Haberman (1974) and Agresti (2002). We will return later to the numerical computation of the maximum likelihood estimate.

Now we analyze the asymptotic behavior of the maximum likelihood estimate. Suppose that (F^{(N)})_{N>0} is a sequence of random vectors with means (m^{(N)})_{N>0}, and denote by (µ^{(N)})_{N>0} the sequence of the logarithms of the means. With the same notation as above, the main asymptotic results are summarized in the following theorem.
Theorem 1.11 Let m* = lim_{N→+∞} N^{-1} m^{(N)}, assume m* > 0, and let µ* = log m*. The following relations hold as N goes to +∞:

$$\hat{\mu}^{(N)} - \mu^{(N)} \xrightarrow{p} 0 \qquad (1.14)$$
$$N^{-1} \hat{m}^{(N)} \xrightarrow{p} m^{*} \qquad (1.15)$$
$$N^{1/2} (\hat{\mu}^{(N)} - \mu^{(N)}) \xrightarrow{d} N\Big(0, \big(P_M(\mu^{*}) - P^{*}_{M_1}(\mu^{*})\big) D^{-1}(\mu^{*})\Big) \qquad (1.16)$$
$$N^{1/2} (\hat{m}^{(N)} - m^{(N)}) \xrightarrow{d} N\Big(0, D(\mu^{*}) \big(P_M(\mu^{*}) - P^{*}_{M_1}(\mu^{*})\big)\Big) \qquad (1.17)$$
For the proof, see Agresti (2002), page 584.

The previous result, and in particular Equations (1.16) and (1.17), allows us to compute asymptotic confidence intervals for any linear combination of the components of µ and m, and simple hypothesis tests for such quantities, applying the theory described in Lehmann (1983) and Lehmann (1987).

Remark 1.12 For historical reasons, log-linear models are usually expressed in terms of the expected cell counts rather than in terms of the cell probabilities. However, the representation of our models in terms of the cell probabilities follows by noting that in our notation f is the vector of observed frequencies, while N^{-1} f is the vector of empirical probabilities. Thus m^{(N)} represents the vector of expected frequencies, and N^{-1} m^{(N)}, as well as m*, represents the vector of cell probabilities. At times in the thesis we will refer to the vector of cell probabilities as p = (p(x) : x ∈ X).
1.1.3 Goodness of fit tests
Hypothesis tests for log-linear models are generally based on test statistics which have an asymptotic chi-squared distribution under the null hypothesis. We consider here tests of the form H_0: µ ∈ M_0 ⊂ M_1 against the alternative hypothesis µ ∈ M_1. Such tests are commonly called goodness of fit tests, in the sense that they test whether a smaller model M_0 can replace a bigger model M_1 without a significant loss of information. Moreover, in many situations these tests give information about the stochastic relations between the variables. Goodness of fit tests are based on the Pearson test statistic

$$C(\hat{\mu}^{(1)}, \hat{\mu}^{(0)}) = \sum_{x \in X} \frac{\big( \hat{m}^{(1)}(x) - \hat{m}^{(0)}(x) \big)^2}{\hat{m}^{(0)}(x)} \qquad (1.18)$$
or on the log-likelihood ratio test statistic

$$\Delta(\hat{\mu}^{(1)}, \hat{\mu}^{(0)}) = \log L(f, \hat{\mu}^{(0)}) - \log L(f, \hat{\mu}^{(1)}). \qquad (1.19)$$

Suppose that M_0 has dimension p_0 and M_1 has dimension p_1. The relevant result about hypothesis testing is the following.

Theorem 1.13 Consider a sequence of null hypotheses H_0: µ^{(N)} ∈ M_0 and alternatives H_1: µ^{(N)} ∈ M_1. Suppose that µ̂^{(N,1)} and µ̂^{(N,0)} are the maximum likelihood estimates of µ^{(N)} under H_1 and H_0, respectively. Then under H_0 the statistics −2∆ and C are asymptotically equivalent and

$$\lim_{N \to +\infty} P[-2\Delta(\hat{\mu}^{(N,1)}, \hat{\mu}^{(N,0)}) > c^{*}(\alpha)] = \lim_{N \to +\infty} P[C(\hat{\mu}^{(N,1)}, \hat{\mu}^{(N,0)}) > c^{*}(\alpha)] = \alpha, \qquad (1.20)$$

where c*(α) is the upper α quantile of the chi-square distribution with p_1 − p_0 degrees of freedom.

For the proof, see Bishop et al. (1975), page 514.

If M_1 is the saturated model, then m̂^{(N,1)} = f and the test statistics in Equations (1.18) and (1.19) take the more familiar forms

$$C(f, \hat{\mu}^{(0)}) = \sum_{x \in X} \frac{\big( f(x) - \hat{m}^{(0)}(x) \big)^2}{\hat{m}^{(0)}(x)} \qquad (1.21)$$

and

$$\Delta(f, \hat{\mu}^{(0)}) = \sum_{x \in X} \hat{m}^{(0)}(x) \log \frac{f(x)}{\hat{m}^{(0)}(x)}. \qquad (1.22)$$
A detailed discussion of these test statistics can be found in Dobson (1990) or Haberman (1978). The small sample properties of the asymptotic tests are presented in Fienberg (1980), Appendix IV; we will consider the small sample properties in detail later in the thesis.

Remark 1.14 In applications, goodness of fit tests are also performed in those cases where the maximum likelihood estimate does not exist. In this case we can use the notion of extended maximum likelihood estimate; see Appendix B of Haberman (1974) for details.
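As a worked instance of the statistics compared in Theorem 1.13, against a saturated alternative, the following sketch (made-up 2 × 2 counts) computes the Pearson statistic and the likelihood-ratio chi-square statistic in its common form 2 Σ f(x) log(f(x)/m̂⁽⁰⁾(x)), together with the asymptotic p-value for one degree of freedom:

```python
import math

# Pearson X^2 and likelihood-ratio G^2 for the test of independence in a 2x2
# table (made-up counts); under H0 both are asymptotically chi-square with
# one degree of freedom.
f = [[10, 4],
     [6, 12]]
N = sum(map(sum, f))

# Fitted means under independence: mhat(i, j) = f_{i+} f_{+j} / N.
mhat = [[sum(f[i]) * (f[0][j] + f[1][j]) / N for j in range(2)]
        for i in range(2)]

X2 = sum((f[i][j] - mhat[i][j]) ** 2 / mhat[i][j]
         for i in range(2) for j in range(2))
G2 = 2 * sum(f[i][j] * math.log(f[i][j] / mhat[i][j])
             for i in range(2) for j in range(2))

# For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x/2)).
p_X2 = math.erfc(math.sqrt(X2 / 2))
print(round(X2, 3), round(G2, 3), round(p_X2, 4))  # X^2 ~ 4.571, G^2 ~ 4.695
```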
1.1.4 An important remark
As we have seen in Example 1.2, vector space theory leads to a canonical representation of log-linear models in the form

$$\log m(i) = A \Lambda, \qquad (1.23)$$

where A is the model matrix and Λ is the vector of the model parameters. In many cases, see for example Haberman (1978), the matrix A has non-negative integer entries and the model can be written in the form

$$\log m(i) = \sum_{h=1}^{s} A(h, i)\, \lambda(h). \qquad (1.24)$$

This is the classical representation of log-linear models and it allows a simple interpretation of the statistical relations among the variables; see Fienberg (1980), Fingleton (1984) and Agresti (1996). We present here a detailed analysis of the independence model for two categorical variables. This model is simple enough to be analyzed in a few pages in full detail, and it contains all the relevant algebraic objects pointed out in the previous discussion. More complex models, involving a greater number of categorical variables and different stochastic relations among the variables, will be presented in the next chapters.

Example 1.15 Consider the cross-classification of two categorical random variables X and Y with supports {1, ..., I} and {1, ..., J}, respectively. The saturated model is written in the form

$$\log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^{XY}_{ij}$$

with the constraints

$$\sum_{i=1}^{I} \lambda^X_i = \sum_{j=1}^{J} \lambda^Y_j = \sum_{i=1}^{I} \lambda^{XY}_{ij} = \sum_{j=1}^{J} \lambda^{XY}_{ij} = 0,$$

while the independence model, where we impose no interaction between the variables, takes the form

$$\log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j$$

with the constraints

$$\sum_{i=1}^{I} \lambda^X_i = \sum_{j=1}^{J} \lambda^Y_j = 0.$$
This representation follows immediately from the vector space representation presented in Example 1.2. Note that a vector space basis is easy to find in the two-way case, while some difficulties arise in the multi-way case.

Of great interest from our point of view is the multiplicative form of the log-linear model, obtained by exponentiating the log-linear form. For the saturated model we obtain the expression

$$m(i,j) = \zeta_0\, \zeta^X_i\, \zeta^Y_j\, \zeta^{XY}_{ij} \qquad (1.25)$$

and for the independence model we can write

$$m(i,j) = \zeta_0\, \zeta^X_i\, \zeta^Y_j \qquad (1.26)$$

where the ζ parameters satisfy appropriate constraints. Note that our notation is consistent with the notation in Chapter 6 of Pistone et al. (2001a). Manipulating Equation (1.26), it is easy to check that under the independence model the means m(i, j) must satisfy the relations

$$m(i_1, j_1)\, m(i_2, j_2) - m(i_1, j_2)\, m(i_2, j_1) = 0 \qquad (1.27)$$

for all i_1, i_2 ∈ {1, ..., I} and j_1, j_2 ∈ {1, ..., J}. Of course, the same relations hold for the cell probabilities p(i, j). Thus, the independence model is represented by the zero set of the polynomial system formed by the binomials in Equation (1.27). We will see in the next section the correspondence between algebraic restrictions of this form and geometric restrictions on the space of the mean parameters, or equivalently on the space of the cell probabilities. The multiplicative form of log-linear models has been widely studied in Goodman (1979a), Goodman (1979b) and Goodman (1985). We will see in the next sections the relevance of this representation.
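The binomial conditions (1.27) can be enumerated and checked mechanically. The sketch below (plain Python, with illustrative row and column factors) builds a table of product form as in (1.26) and verifies that every 2 × 2 minor vanishes:

```python
from itertools import combinations

# The independence model on an I x J table lies in the zero set of the
# 2x2-minor binomials (1.27). The factors a and b are made-up values.
I, J = 3, 4
a = [0.2, 0.5, 0.3]            # illustrative row factors
b = [0.1, 0.4, 0.3, 0.2]       # illustrative column factors
m = [[a[i] * b[j] for j in range(J)] for i in range(I)]

count = 0
for (i1, i2) in combinations(range(I), 2):
    for (j1, j2) in combinations(range(J), 2):
        minor = m[i1][j1] * m[i2][j2] - m[i1][j2] * m[i2][j1]
        assert abs(minor) < 1e-15
        count += 1
print("all", count, "minors vanish")  # C(3,2) * C(4,2) = 18 binomials
```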
1.1.5 Exact methods
Exact statistics can be useful in situations where the asymptotic assumptions are not met, so that the asymptotic p-values are not close approximations to the true p-values. Standard asymptotic methods rely on the assumption that the test statistic follows a particular distribution when the sample size is sufficiently large. When the sample size is not large, asymptotic results may not be valid, with the asymptotic p-values differing substantially from the exact p-values. Asymptotic results may also be unreliable when the distribution of
the data is sparse. Examples can be found in Agresti (1996) and Bishop et al. (1975). We will also see some examples in our framework later in the thesis. Exact computations are based on the statistical theory of exact conditional inference for contingency tables, reviewed by Agresti (2001). Following such exact conditional methods, one computes the exact distribution of the test statistic given the sufficient statistic. These procedures often arise for independence and goodness of fit tests, as well as in the construction of uniformly most powerful tests and accurate confidence interval computations through the Rao-Blackwell theorem, see Lehmann (1987). Thus, the analysis is restricted to the set of tables with the same value t of the sufficient statistic TN as the observed table Ft = {f : X −→ N : TN (f ) = t} .
(1.28)
This set is often called the reference set. Throughout the thesis, we will assume that Ft is a finite set. In the literature, there exist many exact algorithms for a wide range of exact tests, such as the Pearson chi-square, likelihood-ratio chi-square, Mantel-Haenszel chi-square, Fisher's exact test and McNemar's test, but in general such algorithms are limited to two-way contingency tables. A number of exact tests for contingency tables are presented for example in Agresti (2001). These algorithms are also implemented in statistical software packages, see for example SAS/STAT User's Guide (2000). These exact algorithms compute exact p-values for general I × J tables using the network algorithm developed by Mehta & Patel (1983). This algorithm provides a substantial advantage over direct enumeration of the reference set Ft, which can be very time-consuming and feasible only for small problems. Refer to Agresti (1992) for a review of algorithms for the computation of exact p-values, and to Mehta et al. (1991) for information on the performance of the network algorithm.

Limiting ourselves to the basic steps, the network algorithm proceeds as follows. Corresponding to the reference set Ft, the algorithm forms a directed acyclic network consisting of nodes in a number of stages. A path through the network corresponds to a distinct table in the reference set. The distances between nodes are defined so that the total distance of a path through the network is the corresponding value of the test statistic. At each node, the algorithm computes the shortest and longest path distances for all the paths that pass through that node. The exact computation of these quantities for statistics which are linear combinations of the cell counts is presented in Agresti et al. (1990). For statistics of other forms, the computation of an upper bound for the longest path and a lower bound for the shortest path is needed. The longest and shortest path distances or bounds for a node are compared to the value of the test statistic to determine whether all paths through the node contribute to the p-value, none of the paths through the node contribute, or neither of these situations occurs. If all paths through the node contribute, the p-value is incremented accordingly, and these paths are eliminated from further analysis. If no paths contribute, these paths are eliminated from the analysis. Otherwise, the algorithm continues, still processing this node and the associated paths. The algorithm finishes when all nodes have been accounted for, incrementing the p-value accordingly, or eliminated. For each possible table, the algorithm compares its test statistic value with the corresponding value for the observed table. If the value for the table is greater than or equal to the observed test statistic value, it increments the exact p-value by the probability of that table, which is calculated under the null hypothesis using the multinomial frequency distribution. For several tests we meet in the analysis of categorical data, the test statistic is nonnegative, and large values of the test statistic indicate a departure from the null hypothesis. Such tests include the Pearson chi-square, the likelihood-ratio chi-square, and others, such as the Mantel-Haenszel chi-square, Fisher's exact test for tables larger than 2 × 2, and McNemar's test. The exact p-value for these tests is the sum of probabilities for those tables having a test statistic greater than or equal to the value of the observed test statistic.
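The exact p-value computation just described can be illustrated, in miniature, by direct enumeration of the reference set for a 2 × 2 table with fixed margins — the brute-force approach that the network algorithm is designed to avoid. The following Python sketch (the function name and the example table are ours, not taken from the literature cited above) enumerates all tables with the observed margins, computes Pearson's chi-square for each, and sums the hypergeometric probabilities of the tables at least as extreme as the observed one.

```python
from math import comb

def exact_pvalue_2x2(table):
    """Exact p-value for independence in a 2x2 table by direct
    enumeration of the reference set (fixed row/column margins),
    using Pearson's chi-square as the test statistic."""
    (a, b), (c, d) = table
    r1, r2 = a + b, c + d
    c1, c2 = a + c, b + d
    n = r1 + r2

    def chi2(k):
        # the table is determined by k = f11 once the margins are fixed
        cells = [k, r1 - k, c1 - k, n - r1 - c1 + k]
        exps = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
        return sum((o - e) ** 2 / e for o, e in zip(cells, exps))

    def prob(k):
        # hypergeometric probability of the table with f11 = k
        return comb(c1, k) * comb(c2, r1 - k) / comb(n, r1)

    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    obs = chi2(a)
    return sum(prob(k) for k in range(lo, hi + 1) if chi2(k) >= obs - 1e-12)

print(exact_pvalue_2x2([[3, 1], [1, 3]]))  # 34/70, about 0.4857
```

The reference set here has only min(r1, c1) − max(0, r1 + c1 − n) + 1 elements; for larger tables its cardinality explodes, which is precisely the motivation for the network algorithm and for the Monte Carlo methods of Chapter 2.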
These recently developed algorithms for two-way tables, together with improvements in computer power, now make it feasible to perform exact computations for data sets where previously only asymptotic methods could be applied. Nevertheless, there are still large problems that may require a prohibitive amount of time and memory for exact computations. As noted by Diaconis & Sturmfels (1998), the reference set Ft is very large and thus difficult to enumerate even for very small data sets, where the asymptotic approximation is highly inadequate.

Recently, a software package has been developed in order to compute efficiently the cardinality of the reference set Ft also in non-trivial cases. This software, called LattE, an acronym for "Lattice Point Enumeration", is presented in De Loera et al. (2003) and uses non-standard algebraic results about convex polytopes. For example, using this software, the authors are able to compute the cardinality of the reference set Ft for Table 1.1 with two-dimensional marginal totals as components of the sufficient statistic, concluding that this cardinality is equal to 441, while for a table whose two-dimensional marginal totals are displayed in Table 1.2, the cardinality of the reference set is about 2.2498 · 10^40.

X3 = 1
          X2=1  X2=2  X2=3
  X1=1      96    72   161
  X1=2      10     7     6
  X1=3       1     1     2

X3 = 2
          X2=1  X2=2  X2=3
  X1=1     186   127    51
  X1=2      11     7     3
  X1=3       0     1     0

Table 1.1: A 3 × 3 × 2 example.
X1 × X2 marginal totals
           X2=1      X2=2      X2=3
  X1=1   164424    324745    127239
  X1=2   262784    601074   9369116
  X1=3   149654   7618489   1736281

X1 × X3 marginal totals
           X3=1      X3=2      X3=3
  X1=1   163445     49395    403568
  X1=2  1151824    767866   8313284
  X1=3 16099500   6331023   1563901

X2 × X3 marginal totals
           X3=1      X3=2      X3=3
  X2=1   184032    123585    269245
  X2=2   886393   6722333    935582
  X2=3  1854344    302366   9075926

Table 1.2: Two-dimensional marginal totals for a 3 × 3 × 3 example.
As the software LattE is currently under development, only a demo version is available to us; for this reason we present examples taken from De Loera et al. (2003) instead of new ones.
However, it is worth noting that for the statistical applications it is often sufficient to have an approximation of the cardinality of Ft. When asymptotic methods are not adequate for such large problems, we can use Monte Carlo estimation of the exact p-values. No formula exists that can predict in advance how much time and memory are needed to compute an exact p-value for a given problem. The time and memory required depend on several factors, including which test is being performed, the total sample size, the number of rows and columns, and the specific arrangement of the observations into table cells. Moreover, for a fixed sample size, time and memory requirements tend to increase as the number of rows and columns increases, since this corresponds to an increase in the number of tables in the reference set. Also for a fixed sample size, time and memory requirements increase as the marginal row and column totals become more homogeneous, see Agresti et al. (1990) for details.
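As a minimal sketch of the Monte Carlo alternative mentioned above (the function is a generic illustration of the estimator, not a specific published algorithm): given the values of the test statistic on tables sampled from the reference set, the exact p-value is estimated by the fraction of sampled values at least as large as the observed one, together with its binomial standard error.

```python
from math import sqrt

def mc_pvalue(sampled_stats, observed):
    """Monte Carlo estimate of an exact p-value from tables sampled in
    the reference set: the fraction of sampled test statistic values at
    least as large as the observed one, with a binomial standard error."""
    n = len(sampled_stats)
    p = sum(s >= observed for s in sampled_stats) / n
    return p, sqrt(p * (1 - p) / n)

print(mc_pvalue([0.5, 1.5, 2.5, 3.5], 2.0))  # (0.5, 0.25)
```

The standard error shrinks as 1/sqrt(n), so the precision of the estimate is controlled by the number of sampled tables rather than by the cardinality of the reference set.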
1.2 Algebraic theory of toric ideals
Following the description of the log-linear models in Section 1.1, we remark that binomials and power products play a fundamental role in the formal representation of log-linear models. Experts in Commutative Algebra have studied extensively the algebraic relations of power products, leading to the theory of toric ideals. In this section we briefly review the theory of toric ideals, with particular emphasis on the results most relevant for the statistical applications. Moreover, we present some recent computational techniques which lead to a feasible computation of toric ideals. In the following, some basic notions in Computational Commutative Algebra are needed. In particular, we will use the notions of polynomial ideal, Gröbner basis, elimination ideal and saturation. The reader can refer to Kreuzer & Robbiano (2000) or Cox et al. (1992). Many of the applications of the theory of toric ideals in Probability and Statistics come from the fundamental work by Sturmfels (1996). For the theory of toric ideals we mainly refer to the papers Bigatti & Robbiano (2001) and Bigatti et al. (1999). The computations in the examples are obtained with the free software CoCoA, see Capani et al. (2000).
1.2.1 Definitions and first properties
Let K be a numeric field and let K[y1, . . . , ys] and K[ξ1, . . . , ξq] be the polynomial rings in the indeterminates y1, . . . , ys and ξ1, . . . , ξq, respectively. Recall that a term in a polynomial ring is a power product and that a binomial is a difference of terms. We denote by T^s the set of all terms in s indeterminates.

Definition 1.16 Let t1, . . . , tq be terms in K[y1, . . . , ys]. The toric ideal I = I(t1, . . . , tq) ⊆ K[ξ1, . . . , ξq] associated to the set {t1, . . . , tq} is the ideal of all polynomials p such that p(t1, . . . , tq) = 0.

Note that different sets of power products may produce the same toric ideal. This remark will be analyzed from the point of view of probability models later in the thesis. In the following, we use the multi-index notation, i.e., y^d means y1^{d1} · · · ys^{ds}. Given D = {d1, . . . , dq} ⊂ N^s such that y^{di} = ti for i = 1, . . . , q, consider the semigroup homomorphism

T : N^q −→ N^s ,   ei 7−→ di .   (1.29)

We can associate to this map a ring homomorphism

π : K[ξ1, . . . , ξq] −→ K[y1, . . . , ys] ,   ξi 7−→ ti .   (1.30)

With these definitions, we refer to the toric ideal I as the toric ideal associated to T or to π. It is worth noting that the toric ideal I is the kernel of the ring homomorphism π. Note that we have π(ξ^u) = y^v = y^{T(u)}, with u ∈ N^q and v ∈ N^s such that v = Σ_{i=1}^q ui di.

We report in the following results the main properties of toric ideals of statistical interest.

Proposition 1.17 Let T : N^q −→ N^s be a semigroup homomorphism and let us denote by π : K[ξ1, . . . , ξq] −→ K[y1, . . . , ys] the corresponding ring homomorphism. The toric ideal I associated to π is generated, as a vector space, by the binomials of the set

{ξ^u − ξ^v : T(u) = T(v)} .   (1.31)
Proof. Choose a term-ordering τ on T^s and suppose there exists a polynomial g in I which cannot be written as a linear combination of polynomials in {ξ^u − ξ^v : T(u) = T(v)}. Choose g ≠ 0 such that the leading term LT(g) = ξ^a is minimal with respect to τ. As g ∈ ker(π), we have g(t1, . . . , tq) = 0. In particular, the term y^{T(a)} must be eliminated, so in g there is another monomial ξ^b ≺_τ ξ^a such that T(b) = T(a). Now, the polynomial g′ = g − ξ^a + ξ^b is non-zero, LT(g′) ≺_τ LT(g), and it cannot be written as a linear combination of polynomials in {ξ^u − ξ^v : T(u) = T(v)}. This is a contradiction, because we supposed LT(g) minimal.

Now we define

T̃ : Z^q −→ Z^s ,   (1.32)

the extension of T to the group Z^q. As any u ∈ Z^q can be written in the form u = u^+ − u^−, where u^+ and u^− are the positive and negative parts of u respectively, the above result together with the standard theory of Gröbner bases leads to the following proposition.

Proposition 1.18 For any term-ordering τ on T^q, there exists a finite set of vectors G_τ ⊂ ker(T̃) such that the reduced Gröbner basis of I with respect to τ is

{ξ^{u^+} − ξ^{u^−} : u ∈ G_τ} .   (1.33)

Proof. The result follows from Proposition 1.17 and from standard results about Gröbner bases.

Theorem 1.19 Let t1, . . . , tq be terms in the indeterminates y1, . . . , ys and let J be the ideal in K[ξ1, . . . , ξq, y1, . . . , ys] generated by {ξ1 − t1, . . . , ξq − tq}. We have that

1. I = J ∩ K[ξ1, . . . , ξq];

2. if G_τ is a Gröbner basis of J with respect to an elimination ordering τ for {y1, . . . , ys}, then I is generated by the polynomials in G_τ ∩ K[ξ1, . . . , ξq].

For the proof, see Bigatti & Robbiano (2001), Proposition 1.4. Moreover, further nice properties of toric ideals are summarized in the following theorem. We will see later in the thesis the relevance of these properties.
Theorem 1.20 In the previous settings, consider any grading on the polynomial ring K[ξ1, . . . , ξq, y1, . . . , ys] such that the degrees of the yj are arbitrary integers and deg(ξi) = deg(ti), i = 1, . . . , q. Then

1. the ideal I is prime;

2. the ideal I is generated by pure binomials, i.e.,

I = Ideal(t1 − t2 : π(t1) = π(t2), gcd(t1, t2) = 1) ;   (1.34)

3. the ideal I is homogeneous.

For the proof, see Bigatti & Robbiano (2001), Theorem 1.6.

The variety Variety(I) is the geometric counterpart of the ideal I. A point (ξ1, . . . , ξq) belongs to Variety(I) if and only if p(ξ1, . . . , ξq) = 0 for all polynomials p ∈ I. This condition can be verified by making use of a system of generators of the ideal I. For details about the relationships between ideals and varieties, see for example Cox et al. (1992).
1.2.2 Computation of toric ideals. First method
A simple algorithm to compute the Gröbner basis of the toric ideal I = I(t1, . . . , tq) is based on the elimination algorithm as described in Theorem 1.19. We first consider the homomorphism π : K[ξ1, . . . , ξq] −→ K[y1, . . . , ys] defined by ξi 7−→ ti; then, we consider the ideal J in K[ξ1, . . . , ξq, y1, . . . , ys] generated by the set of polynomials

{ξ1 − t1, . . . , ξq − tq} .   (1.35)

Using the results of Theorem 1.19, a (reduced) Gröbner basis of I is obtained by computing a (reduced) Gröbner basis of J with respect to an elimination term-ordering for y1, . . . , ys and taking only the polynomials not involving the y's.

Example 1.21 Consider the polynomial ring K[y1, . . . , y5] and the following set of six terms:

{y1 y4, y1 y5, y2 y4, y2 y5, y3 y4, y3 y5} .   (1.36)

We will consider this example later in the thesis, viewed from a statistical point of view.
In order to compute the toric ideal I = I(y1 y4, y1 y5, y2 y4, y2 y5, y3 y4, y3 y5) we introduce the polynomial ring K[ξ1, . . . , ξ6] and we define the ring homomorphism π as described in Equation (1.30). The toric ideal of interest I is the kernel of π. To compute this kernel, we define the ideal

J = Ideal(ξ1 − y1 y4, ξ2 − y1 y5, ξ3 − y2 y4, ξ4 − y2 y5, ξ5 − y3 y4, ξ6 − y3 y5)

in the polynomial ring K[ξ1, . . . , ξ6, y1, . . . , y5]. Using CoCoA, the Gröbner basis of this ideal with respect to an elimination term-ordering for the indeterminates y1, . . . , y5 is

G(J) = {−y3 y5 + ξ6, −y3 y4 + ξ5, −y2 y5 + ξ4, −y2 y4 + ξ3, −y1 y5 + ξ2, −y1 y4 + ξ1, ξ5 y5 − ξ6 y4, ξ3 y5 − ξ4 y4, ξ1 y5 − ξ2 y4, ξ4 y3 − ξ6 y2, ξ2 y3 − ξ6 y1, ξ2 y2 − ξ4 y1, ξ3 y3 − ξ5 y2, ξ1 y3 − ξ5 y1, ξ1 y2 − ξ3 y1, ξ4 ξ5 − ξ3 ξ6, ξ2 ξ5 − ξ1 ξ6, ξ2 ξ3 − ξ1 ξ4}

and the only polynomials which do not involve the y's are the last three. Thus

I = Ideal(ξ4 ξ5 − ξ3 ξ6, ξ2 ξ5 − ξ1 ξ6, ξ2 ξ3 − ξ1 ξ4)   (1.37)

and the Gröbner basis of the relevant toric ideal consists of three binomials.

This first method, although intuitive, is not the best from the computational point of view. Better methods for computing toric ideals can be found in Bigatti et al. (1999). These methods are based on the theory of saturation. We will illustrate such methods in Section 1.2.4.
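The elimination computation of Example 1.21 can be reproduced with a general-purpose computer algebra system; the thesis uses CoCoA, but the following Python/SymPy sketch performs the same computation (the variable names are ours). A lexicographic order with the y's listed first is an elimination order for them, so the basis elements free of the y's generate the toric ideal.

```python
from sympy import symbols, groebner, expand

# Indeterminates: x1..x6 play the role of xi_1..xi_6, y1..y5 of y_1..y_5.
x = symbols('x1:7')
y = symbols('y1:6')
# The six power products of Example 1.21: y1*y4, y1*y5, ..., y3*y5.
terms = [y[0]*y[3], y[0]*y[4], y[1]*y[3], y[1]*y[4], y[2]*y[3], y[2]*y[4]]

# Ideal J generated by xi_i - t_i; lex order with the y's first
# eliminates them (Theorem 1.19).
J = [xi - t for xi, t in zip(x, terms)]
G = groebner(J, *y, *x, order='lex')

# The toric ideal I is generated by the basis elements free of the y's.
I_gens = [g for g in G.exprs if not g.free_symbols & set(y)]
print(I_gens)  # three binomials, as in Equation (1.37)
```

Each reported generator can be checked directly: substituting xi_i = t_i must give the zero polynomial, since the toric ideal is the kernel of the substitution homomorphism.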
1.2.3 Navigating inside the set of solutions of an integer linear system
The saturation of an ideal with respect to a polynomial is given by the following construction. Consider a polynomial p in K[ξ1, . . . , ξr] and an ideal B ⊂ K[ξ1, . . . , ξr]. The ideal I is the saturation of B with respect to p if

I = Elim(v, B + Ideal(pv − 1)) ,   (1.38)

where v is a new indeterminate and the computations are made in the polynomial ring K[ξ1, . . . , ξr, v].

Let t1, . . . , tq be power products in the polynomial ring K[y1, . . . , ys], i.e.,

ti = y1^{a_{1i}} · · · ys^{a_{si}} .   (1.39)
We can define the matrix A ∈ Mat_{s×q}(N) associated to the power products.
The matrix A has non-negative entries. Conversely, any matrix A defines a set of power products via Equation (1.39).

Definition 1.22 Let A ∈ Mat_{s×q}(N). We define the toric ideal associated to A as the toric ideal I_A = I(t1, . . . , tq), where ti is defined in Equation (1.39).

Now we investigate the relation between the toric ideal I_A and the kernel of A viewed as a Z-module. First, we need some technical definitions. Consider the homogeneous system of Diophantine equations associated to A:

a_{11} z1 + · · · + a_{1q} zq = 0
· · ·
a_{s1} z1 + · · · + a_{sq} zq = 0   (1.40)
where a_{ij} is the (i, j)-th element of the matrix A. The set ker(A) of integer solutions of such a system is a Z-module. In the sequel, we will use the decomposition of a vector into its positive and negative parts, that is, v = v^+ − v^−. Now, let Bin(K[ξ1, . . . , ξq]) be the set of all binomials in K[ξ1, . . . , ξq] and define the maps ρ′ : Z^q −→ K[ξ1, . . . , ξq] and η′ : Bin(K[ξ1, . . . , ξq]) −→ Z^q as

ρ′(u) = ξ^{u^+} − ξ^{u^−}   (1.41)

and

η′(ξ^u − ξ^v) = u − v .   (1.42)

Definition 1.23 Let B = {v1, . . . , vr} be a set of vectors. If B generates ker(A) as a Z-module, the ideal I(B) = I(ρ′(v1), . . . , ρ′(vr)) is called a lattice ideal associated to ker(A).

Let π = ∏_{i=1}^q ξi be the product of all the ξ indeterminates. The main theorem about the application of saturation for computing toric ideals is the following.

Theorem 1.24 Let B = {v1, . . . , vr} ⊆ ker(A). The following conditions are equivalent:

a) I(B) : π^∞ = I_A;

b) I(B)P_π = I_A P_π, where P_π denotes the localization of the polynomial ring at π;
c) B is a set of generators of ker(A), i.e., I(B) is a lattice ideal.

For the proof, see Bigatti & Robbiano (2001), Theorem 2.10. For further details about Theorem 1.24, see also Sturmfels (1996).

Now, let A ∈ Mat_{s×q}(N) and let b = (b1, . . . , bs) be a vector in N^s. We want to find the non-negative solutions of the system

a_{11} z1 + · · · + a_{1q} zq = b1
· · ·
a_{s1} z1 + · · · + a_{sq} zq = bs   (1.43)
The following result leads easily to a method for navigating inside the set of solutions of a Diophantine system.

Theorem 1.25 Let S be a Diophantine system as defined in Equation (1.43). Let ti, i = 1, . . . , q, be the power products associated to the matrix A and let t = y1^{b1} · · · ys^{bs}. Moreover, let (α1, . . . , αq) ∈ N^q. Let J be the ideal in the polynomial ring K[ξ1, . . . , ξq, y1, . . . , ys] generated by the set of binomials {ξ1 − t1, . . . , ξq − tq}. The following conditions are equivalent:

a) the vector (α1, . . . , αq) is a solution of S;

b) there is an equality of power products t1^{α1} · · · tq^{αq} = y1^{b1} · · · ys^{bs};

c) the binomial y1^{b1} · · · ys^{bs} − ξ1^{α1} · · · ξq^{αq} is in J.

For the proof, see Bigatti & Robbiano (2001), Proposition 3.2. We will see an important example of this kind of navigation in the following chapter, in the framework of contingency tables.
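Condition (b) of Theorem 1.25 gives a purely combinatorial test: a candidate vector α is a solution of S if and only if the exponent vectors of the ti, weighted by α, reproduce (b1, . . . , bs). A small Python check for the terms of Example 1.21 (the helper function and the sample inputs are ours):

```python
# Exponent vectors of the terms y1*y4, y1*y5, y2*y4, y2*y5, y3*y4, y3*y5
# in the indeterminates (y1, y2, y3, y4, y5).
terms = [
    (1, 0, 0, 1, 0), (1, 0, 0, 0, 1), (0, 1, 0, 1, 0),
    (0, 1, 0, 0, 1), (0, 0, 1, 1, 0), (0, 0, 1, 0, 1),
]

def is_solution(alpha, b):
    """Condition (b) of Theorem 1.25: the alpha-weighted sum of the
    exponent vectors of the terms must equal the right-hand side b."""
    weighted = [sum(a * t[j] for a, t in zip(alpha, terms))
                for j in range(len(b))]
    return tuple(weighted) == tuple(b)

# alpha = (1, 0, 0, 1, 0, 0) corresponds to t1 * t4 = y1 y2 y4 y5.
print(is_solution((1, 0, 0, 1, 0, 0), (1, 1, 0, 1, 1)))  # True
```

The same check, phrased multiplicatively, is the equality of power products t1^{α1} · · · tq^{αq} = y1^{b1} · · · ys^{bs} in the theorem.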
1.2.4 Computation of toric ideals. Second method
In view of the results in Section 1.2.3, and in particular Theorem 1.24, we can compute the toric ideal with the following steps:

• consider the matrix representation A of the power products;

• compute a lattice basis B of ker(A) and then the lattice ideal I(B);

• compute the saturation of I(B) with respect to the polynomial π = ∏_{i=1}^q ξi.
Example 1.26 Consider again the problem of Example 1.21. Now we apply the saturation-based algorithm. The matrix representation of the power products in Equation (1.36) is

    | 1 1 0 0 0 0 |
    | 0 0 1 1 0 0 |
A = | 0 0 0 0 1 1 |   (1.44)
    | 1 0 1 0 1 0 |
    | 0 1 0 1 0 1 |

A lattice basis of ker(A) is formed by the two vectors (1, −1, −1, 1, 0, 0) and (1, −1, 0, 0, −1, 1). Thus, the lattice ideal I(B) is Ideal(ξ1 ξ4 − ξ2 ξ3, ξ1 ξ6 − ξ2 ξ5). The ideal J = I(B) + Ideal(πv − 1) is generated by the polynomials

G(J) = {−ξ2 ξ5 + ξ1 ξ6, −ξ2 ξ3 + ξ1 ξ4, −ξ1^2 ξ3 ξ4 ξ6^2 v + 1, −ξ1^2 ξ3^2 ξ6^3 v + ξ5, −ξ1^3 ξ4^2 ξ6^2 v + ξ2, ξ4 ξ5 − ξ3 ξ6}   (1.45)

and, eliminating the v indeterminate, we obtain the same result as in Example 1.21, but with a significant computational improvement. We will consider Examples 1.21 and 1.26 again after the discussion of the Diaconis-Sturmfels algorithm.

Note that the first step of the algorithm, i.e., the computation of a basis of ker(A), can be carried out using the theory of vector spaces, with a notable saving of time. A comparison of the timings of the algorithms introduced here can be found in Bigatti et al. (1999). The saturation-based algorithm is implemented in the CoCoA function Toric, which computes a system of generators of the toric ideal starting from the matrix representation of the power products. We will use CoCoA for all symbolic computations presented in this thesis.
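The first step of the algorithm — a lattice basis of ker(A) — can be checked numerically. The following Python/SymPy sketch verifies that the matrix of Example 1.26 has rank 4, so that ker(A) has dimension 6 − 4 = 2, and that the two stated vectors indeed lie in the kernel.

```python
from sympy import Matrix

# Exponent matrix A of Equation (1.44): rows indexed by y1..y5,
# columns by the six power products of Equation (1.36).
A = Matrix([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
])

print(A.rank())  # 4, so ker(A) has dimension 2

# The two lattice vectors of Example 1.26 lie in ker(A):
for v in [(1, -1, -1, 1, 0, 0), (1, -1, 0, 0, -1, 1)]:
    print((A * Matrix(v)).T)  # the zero row vector
```

Since the two vectors are visibly linearly independent and the kernel has dimension 2, they form a lattice basis of ker(A) as a vector space; the saturation step then repairs the possible discrepancy between this lattice ideal and the full toric ideal.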
Chapter 2

The Diaconis-Sturmfels algorithm

In the present chapter we present an algorithm for sampling from conditional distributions on finite sample spaces, first introduced by Diaconis & Sturmfels (1998). While in the original paper the attention is restricted to exponential models, in this chapter we show that the algorithm can be applied to a more general class of probability models, namely all probability models which satisfy a linearity property of the sufficient statistic. In Section 2.1 we introduce the multivariate hypergeometric distribution, showing how that distribution is essential in the problems of estimation and hypothesis testing. In Section 2.2 we describe the algorithm and discuss the relevance of the use of toric ideals. The algorithm is a Markov chain Monte Carlo algorithm, based on the key notion of a Markov basis. The links between the theorems proving the validity of the algorithm and the theory of toric ideals presented in Chapter 1 will be pointed out. In Section 2.3 we briefly present some results about the rates of convergence of the Markov chain, and in Section 2.4 we present two simple examples.
2.1 Hypergeometric distribution
This section is a slight generalization of Diaconis & Sturmfels (1998), Section 1, where this theory is presented under the exponential model. Let X be a finite set and let

P[x] = ϕ(T(x)) ,   x ∈ X ,   (2.1)

be a probability model with sufficient statistic T : X −→ N^s and such that the distribution of a sample of independent and identically P-distributed random variables X = (X1, . . . , XN) is of the form

P^{∗N} = ψ(TN)   where   TN = Σ_{k=1}^N T(Xk) .   (2.2)

This means that the sufficient statistic of X is the sum of the sufficient statistics of the one-dimensional random variables Xk, k = 1, . . . , N. This model contains the usual exponential models of the form Pθ[x] = c(θ)e^{θ·T(x)}, but other models are included, such as those of the form Pθ[(x, y)] = c(θ)e^{θx + θ^2 y}. We denote

Yt = { (x1, . . . , xN) : Σ_{k=1}^N T(xk) = t } ,   (2.3)
i.e., the set of all samples with fixed value t of the sufficient statistic TN. It is known that the distribution of X given {TN = t} is uniform on Yt. In fact,

P^{∗N}[X = x | TN = t] = P^{∗N}[x] / Σ_{y : TN(y)=t} P^{∗N}[y] = 1 / #{x : TN(x) = t} = 1 / #Yt

for all x = (x1, . . . , xN) ∈ Yt. As presented in the previous chapter, standard statistical techniques, such as bootstrap methods, need uniformly distributed samples on Yt. Even in the simplest cases, the set Yt is very large and difficult to enumerate, and it is not easy to sample directly from Yt with classical Monte Carlo methods. Refer in particular to the discussion in Chapter 1, Section 1.1.5. As we are in the finite case, the problem can be reduced to a more suitable form, in terms of counting. We then consider the reference set Ft introduced in Chapter 1,

Ft = { f : X −→ N : Σ_{x∈X} f(x) T(x) = t } ,   (2.4)

i.e., the set of all frequency tables obtained from samples with value t of the sufficient statistic TN.
The relation between Yt and Ft is given by the map F : Yt −→ Ft defined by

F(x1, . . . , xN) = Σ_{x∈X} e_x Σ_{k=1}^N I{xk = x} ,   (2.5)

which associates to every sample the corresponding frequency table. Here (e_x)_{x∈X} denotes the canonical basis of R^X. Note that the additivity condition (2.2), together with the conditions imposed in the definitions of Yt and Ft, is essential for the following composition: the map TN on Yt factors through F, since TN(x1, . . . , xN) = Σ_{x∈X} F(x1, . . . , xN)(x) T(x), and both maps take the common value t in R^s.
The image probability of P^{∗N}[ · | TN = t] induced by F is

Ht(f) = P^{∗N}[F^{−1}(f) | TN = t] = #{x : F(x) = f} / #Yt ,   (2.6)

which is by definition the hypergeometric distribution on Ft. Simple computations show that

Ht(f) = (N! / #Yt) ∏_{x∈X} (f(x)!)^{−1} .   (2.7)

For further details on the hypergeometric distribution, see Bishop et al. (1975) and Agresti (2002). However, the MCMC methods presented in this thesis do not need the computation of #Yt, as will be clear in the next section.
2.2 The algorithm
As pointed out in Chapter 1, in order to perform hypothesis testing for contingency tables the usual approach is the asymptotic one, which involves chi-squared distributions; but in many cases, especially when the table is sparse, the chi-squared approximation may not be adequate (for further details on this topic see, for example, Appendix IV of Fienberg (1980)). We can obtain approximations of test statistics via Monte Carlo methods, drawing an iid hypergeometric sample of contingency tables in the reference set Ft. The problem is then reduced to sampling from the hypergeometric distribution on Ft, or equivalently from the uniform distribution on Yt. The literature suggests avoiding the enumeration problem via the use of Markov chain Monte Carlo (MCMC) methods. In particular, we are interested in the Metropolis-Hastings algorithm, which is based on a set of moves for constructing the relevant Markov chain. A review of the Metropolis-Hastings algorithm can be found in Chib & Greenberg (1995). The following is a key definition in that theory.

Definition 2.1 A Markov basis of Ft is a set of functions m1, . . . , mL : X −→ Z, called moves, such that for any 1 ≤ i ≤ L

Σ_{x∈X} mi(x) T(x) = 0 ,   (2.8)

where T is the sufficient statistic, and for any f, f′ ∈ Ft there exist a sequence of moves (m_{i_1}, . . . , m_{i_A}) and a sequence (ε_j)_{j=1}^A with ε_j = ±1 such that

f′ = f + Σ_{j=1}^A ε_j m_{i_j}   (2.9)

and

f + Σ_{j=1}^a ε_j m_{i_j} ≥ 0   (2.10)

for all 1 ≤ a ≤ A.

Condition (2.8) implies that a move is a table with integer (possibly negative) entries such that the value of the sufficient statistic is constant for every table obtained with moves in {m1, . . . , mL}. Note once again the importance of the linearity condition for the sufficient statistic T. In particular, if the components of the sufficient statistic are the margins, then every move is a table with null margins. For example, the following is a move for the 3 × 3 tables with the marginal totals as components of the sufficient statistic:

−1 +2 −1
+1 −1  0
 0 −1 +1

From this definition it is clear that a Markov basis is the main tool to define a random-walk-like Markov chain on Ft. It is well known that a connected, reversible and aperiodic Markov chain converges to its stationary distribution. In particular, we use here the following theorem.
Theorem 2.2 Given a Markov basis {m1, . . . , mL}, generate a Markov chain on Ft by choosing I uniformly in {1, . . . , L}. If the chain is currently at g ∈ Ft, determine the set of j ∈ Z such that g + j mI ∈ Ft, and choose j in this set with probability proportional to

∏_{x∈C_I} {(g(x) + j mI(x))!}^{−1}   (2.11)

with C_I = {x : mI(x) ≠ 0}. This is a connected, reversible, aperiodic Markov chain with stationary distribution Ht.

Proof. It is easy to see that the product in Equation (2.11) is proportional to the stationary distribution Ht, and then the Markov chain is reversible with respect to Ht. As M is a Markov basis, from property (2.9) it follows that the Markov chain is connected. To show that the chain is aperiodic, choose a move and apply it until some cell count is zero, obtaining a strictly positive holding probability.

In practice, to obtain a sample from the distribution of interest σ(f) on Ft, the Markov chain is run as follows:

(a) at time 0 the chain is in f;

(b) choose a move m uniformly in the Markov basis and ε = ±1 with probability 1/2 each, independently of m;

(c) if f + εm ≥ 0, then move the chain from f to f + εm with probability min{σ(f + εm)/σ(f), 1}; in all other cases, stay at f.

The transition matrix of this Markov chain is given by

P(f, f + m) = (1/2L) min{σ(f + m)/σ(f), 1}   if f + m ≥ 0
P(f, f − m) = (1/2L) min{σ(f − m)/σ(f), 1}   if f − m ≥ 0

for all m ∈ {m1, . . . , mL}, with P(f, f) chosen in order to have a stochastic matrix. For example, if σ is the uniform distribution, we have

P(f, f + m) = 1/2L   if f + m ≥ 0
P(f, f − m) = 1/2L   if f − m ≥ 0

and P(f, f) is simply the ratio of the number of inadmissible moves among {±m1, . . . , ±mL} over the total number of moves. Moreover, note that in the hypergeometric case the computation of the ratio σ(f + m)/σ(f) leads to some simplifications. In fact,

Ht(f + m)/Ht(f) = ∏_{x∈X} f(x)! / ∏_{x∈X} (f(x) + m(x))! = ∏_{x : m(x)≠0} f(x)! / (f(x) + m(x))! .   (2.12)
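A single step of the chain just described can be sketched in Python for I × J tables with fixed margins, using the basic ±1 moves on 2 × 2 minors and the simplified hypergeometric ratio of Equation (2.12); the implementation details (function name, table representation) are ours.

```python
import random

def ds_step(table, rng):
    """One Metropolis step of the Diaconis-Sturmfels chain on I x J
    tables with fixed row and column margins, targeting the
    hypergeometric distribution H_t."""
    I, J = len(table), len(table[0])
    i1, i2 = rng.sample(range(I), 2)
    j1, j2 = rng.sample(range(J), 2)
    eps = rng.choice([1, -1])
    # basic move: +-1 on a random 2x2 minor (null margins by construction)
    delta = {(i1, j1): eps, (i1, j2): -eps, (i2, j1): -eps, (i2, j2): eps}
    if any(table[i][j] + d < 0 for (i, j), d in delta.items()):
        return table  # inadmissible move: stay at the current table
    # Equation (2.12): H_t(f+m)/H_t(f) = prod over moved cells of
    # f(x)!/(f(x)+m(x))!, i.e. 1/(f+1) for +1 cells and f for -1 cells.
    ratio = 1.0
    for (i, j), d in delta.items():
        f = table[i][j]
        ratio *= 1.0 / (f + 1) if d == 1 else float(f)
    if rng.random() < min(ratio, 1.0):
        table = [row[:] for row in table]
        for (i, j), d in delta.items():
            table[i][j] += d
    return table
```

Running the chain from the observed table yields, after a burn-in period, draws approximately distributed according to the hypergeometric distribution Ht on Ft; every visited table has the same margins as the starting one.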
Let us recall here the basic convergence theorem for the Metropolis-Hastings algorithm in our framework.

Theorem 2.3 Let σ(f) be a positive function on Ft. Given a Markov basis {m1, . . . , mL}, the Markov chain generated following the described algorithm is connected, reversible and aperiodic on Ft, with stationary distribution proportional to σ(f).

Proof. The proof is the same as for Theorem 2.2, considering a general stationary distribution σ(f) instead of the hypergeometric distribution Ht.

Now, consider an indeterminate for each sample point, that is, q indeterminates ξ1, . . . , ξq, and let K[ξ1, . . . , ξq] be the polynomial ring in the indeterminates ξ1, . . . , ξq with coefficients in the numeric field K. Moreover, let M = {m1, . . . , mL} be a set of moves. We define the ideals

I_T = Ideal(ξ^a − ξ^b : T(a) = T(b))   (2.13)

I_M = Ideal(ξ^{m_i^+} − ξ^{m_i^−} : i = 1, . . . , L)   (2.14)
Note that the ideal I_T is a toric ideal, as T is a semigroup homomorphism. The application of the theory of toric ideals comes from the following result.

Theorem 2.4 The Markov chain with moves in M is connected if and only if I_M = I_T.

Proof. "only if": Notice that I_M ⊆ I_T is obvious. Moreover, if the Markov chain is connected, then for any a, b with T(a) = T(b) the term ξ^a can be written as

ξ^a = ξ^b + Σ_{j=1}^A ε_j (ξ^{m_{i_j}^+} − ξ^{m_{i_j}^−})   (2.15)
35
for a suitable sequence of moves (mi1 , . . . , miA ) ∈ M and a suitable sequence of signs (²1 , . . . , ²A ). This implies that ξ a = ξ b ∈ IM , and then IT ⊆ IM . P “if”: For all g, g 0 such that x∈X (g(x) − g 0 (x))T (x) = 0, there is a representation of the form 0
ξg − ξg =
L X
²j ξ hj (ξ
fi+ j
−ξ
fi− j
)
(2.16)
j=1
with ²j = ±1 for all j. If L = 1, the above Equation reduces to the connectedness condition. If L > 1 the result follows by induction. In fact from +
−
Equation (2.16), it follows that either ξ g = ξrh ξ fir or ξ g = ξrh ξ fir for some +
r = 1, . . . , L. Suppose for example ξ g = ξrh ξ fir . Then g − fi−r is non-negative +
−
and so g + fi+r is non-negative. Subtracting ξ hr (ξ fir − ξ fir ) from both sides and 0
using hr + fi+r = g + fir , we obtain an expression for ξ g+fir − ξ g having length L − 1. By induction, g + fir can be connected to g 0 in L − 1 steps. The proof is now complete. Theorem 2.4 gives also the relation between moves and binomials. If m is +
−
a move, m = m+ − m− , the corresponding binomial is ξ m − ξ m and vice versa. For example, the move presented above for the 3 × 3 tables 0 +1 0
+2 −1 −1
0 0 +1
is represented by the binomial 2 gm = ξ12 ξ21 ξ33 − ξ22 ξ32 .
in the polynomial ring Q[ξ11 , ξ12 , ξ13 , ξ21 , ξ22 , ξ23 , ξ31 , ξ32 , ξ33 ]. Thus, in view of Theorem 2.4, in order to compute the Markov basis, one computes the toric ideals associated to the sufficient statistic and one defines the moves from the binomials following the above rule. Note that the theory of Markov bases does not need the notion of Gr¨obner basis; it is sufficient to consider the ideals and set of generators. Gr¨obner bases will be used below as computational tool. Moreover, the above theory can be applied to the contingency tables observing that: 1) the (reduced) Gr¨obner basis of a toric ideal is formed only by binomials; 2) applying the division algorithm to a monomial with respect to a binomial Gr¨obner basis, the dividends and the Normal Form are all monomials: this means that at every step of the algorithm we have a monomial, or equivalently the algebraic rules transform a contingency table into another contingency table.
2.3 Rates of convergence
In general, the Markov chain described in Section 2.2 needs some running time to reach the stationary distribution. There are no general results for the rates of convergence of the chain, but there are a number of specific results working in special frameworks. Roughly speaking, the rate of convergence depends on the diameter γ of the random walk, i.e., the diameter of the graph with the points of Ft as vertices, where two points f and f′ are connected if and only if f′ can be reached from f in one step. Here is an example of such theorems, taken from Diaconis & Sturmfels (1998). Consider a sample space of the form {1, . . . , I} × {1, . . . , J}, the row sums and the column sums as components of the sufficient statistic, and let U be the uniform distribution on the reference set Ft.

Theorem 2.5 In the previous settings, let K(f, g) be the connected Markov chain on Ft based on the moves of the form

+1 −1
−1 +1

for any 2 × 2 minor of the table, and zero otherwise. Then

‖K^h(x, ·) − U‖_TV ≤ A1 exp(−A2 c)   (2.17)

for h = cγ^2, where ‖ · ‖_TV denotes the total variation distance.

Refer to Diaconis & Sturmfels (1998), Section 2.3, for a general survey of other results and pointers to the literature. However, in the case of sample spaces with a more complex geometrical structure, no results are available.
2.4 Two examples
We conclude this chapter by illustrating two examples.

Example 2.6 Consider a sample space formed by 6 points, say X = {a(1), …, a(6)}, and consider the 5-dimensional sufficient statistic with components

T1 = F(a(1)) + F(a(2))
T2 = F(a(3)) + F(a(4))
T3 = F(a(5)) + F(a(6))
T4 = F(a(1)) + F(a(3)) + F(a(5))
T5 = F(a(2)) + F(a(4)) + F(a(6))

The matrix representation of this sufficient statistic produces the matrix analyzed in Example 1.26 in Chapter 1. The Gröbner basis of the toric ideal is

{ξ4ξ5 − ξ3ξ6, ξ2ξ5 − ξ1ξ6, ξ2ξ3 − ξ1ξ4}

corresponding to the three moves

(0, 0, 1, −1, −1, 1)
(1, −1, 0, 0, −1, 1)
(1, −1, −1, 1, 0, 0)

We will consider this example again in the next chapter in the framework of the independence model.

The above example belongs to a special class where the components of the sufficient statistic are counts over subsets of the sample space. The following example is not in that class; the statistical literature calls such a model a logistic regression model.

Example 2.7 Suppose that the sample space is X = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)} ⊂ N² and consider the 5-dimensional sufficient statistic with components

T1 = F(1, 1) + F(1, 2)
T2 = F(2, 1) + F(2, 2)
T3 = F(3, 1) + F(3, 2)
T4 = F(1, 2) + F(2, 2) + F(3, 2)
T5 = F(1, 2) + 2F(2, 2) + 3F(3, 2)
Using the matrix representation of this sufficient statistic and computing the toric ideal associated to that matrix, we obtain the following Gröbner basis

{ξ11 ξ22² ξ31 − ξ12 ξ21² ξ32}

corresponding to a Markov basis with only one move

+1 −1
−2 +2
+1 −1
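As a quick sanity check (our own sketch, not part of the thesis), one can verify that this move lies in the integer kernel of the matrix of the sufficient statistic, where T4 and T5 are the count and the covariate-weighted count of the second-column cells (1, 2), (2, 2), (3, 2); a Markov move must leave every component of the sufficient statistic unchanged.

```python
# Matrix of the sufficient statistic of Example 2.7: rows T1..T5,
# columns the cells (1,1), (1,2), (2,1), (2,2), (3,1), (3,2).
A = [
    [1, 1, 0, 0, 0, 0],  # T1: first row total
    [0, 0, 1, 1, 0, 0],  # T2: second row total
    [0, 0, 0, 0, 1, 1],  # T3: third row total
    [0, 1, 0, 1, 0, 1],  # T4: F(1,2) + F(2,2) + F(3,2)
    [0, 1, 0, 2, 0, 3],  # T5: F(1,2) + 2 F(2,2) + 3 F(3,2)
]

# The move read off the binomial xi11 * xi22^2 * xi31 - xi12 * xi21^2 * xi32.
move = [1, -1, -2, 2, 1, -1]

# The move lies in the integer kernel of A, i.e. A @ move == 0.
assert all(sum(a * x for a, x in zip(row, move)) == 0 for row in A)
```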
Chapter 3

Two-way contingency tables

In the present chapter we restrict our attention to two-way contingency tables. In particular, we consider the application of the algorithm presented in the previous chapter to a wide class of log-linear models. Throughout this chapter we present some new results which make explicit the computation of Markov bases for some classical log-linear models for two-way contingency tables, such as quasi-independence and quasi-symmetry. For a discussion of the meaning of the different log-linear models in the two-way case, the reader can refer to Fienberg (1980), Fingleton (1984) and Agresti (2002), where a number of applications of log-linear models to problems coming from biology, psychology and econometrics are presented. The notation is the same as in the previous chapters, except for the cell frequencies, where we use the standard notation of the analysis of contingency tables, e.g. we write fij instead of f(i, j). This choice also saves space in the proofs. In Section 3.1 we adapt the notation of Chapters 1 and 2 to the framework of contingency tables, starting with the independence model for complete tables, as it represents an easy prototype, both from the algebraic and the statistical point of view. In Section 3.2 we present new characterizations or computations of the Markov bases for the most widely used models for two-way contingency tables, including models for tables with structural zeros. In Section 3.3 we analyze three examples of goodness of fit tests for the quasi-independence and the quasi-symmetry models. The first one is a standard case where the asymptotics are correct; the second one is a case where the Monte Carlo p-value leads to a different decision with respect to the asymptotic p-value; the third one, from Agresti (1996), shows a case where the asymptotics dramatically fail. Finally, Section 3.4 shows the application of the methodology presented here
to a class of problems which frequently arise in biological and medical research.
3.1 Two-way contingency tables: Independence
In this section we consider I × J contingency tables formed by the counts generated from two categorical variables. Let us denote the observed count of the (i, j) cell by fij, and the totals of the ith row and jth column by fi+ and f+j respectively. Finally, let N = ∑_{i=1}^I fi+ = ∑_{j=1}^J f+j denote the sample size. We emphasize that this notation extends easily to multi-way contingency tables, and the topics of this section can also be extended to the multi-way case. We will see examples of log-linear models for multi-way contingency tables in Chapter 4. Using the same notation as in Chapters 1 and 2, the sample space is X = {(i, j) : 1 ≤ i ≤ I, 1 ≤ j ≤ J} and the vectors r = (f1+, …, fI+) and c = (f+1, …, f+J) are the components of the sufficient statistic T. In the independence case, the probability model can be written as

∏_{i=1}^I pr(i)^{fi+} ∏_{j=1}^J pc(j)^{f+j} ,    (3.1)
where pr(i) and pc(j), with i = 1, …, I and j = 1, …, J, are the parameters of the marginal distributions. Every margin, as a component of the sufficient statistic, is additive, and thus this probability model is coherent with the above theory. If we denote the set of all contingency tables with fixed margins by F(r,c), the following result holds true.

Proposition 3.1 The probability distribution on F(r,c) is the hypergeometric distribution, and its explicit formula is

H_{(r,c)}(f) = \frac{\prod_{j=1}^{J} \binom{f_{+j}}{f_{1j} \cdots f_{Ij}}}{\binom{N}{f_{1+} \cdots f_{I+}}} .    (3.2)

Proof. The probability of (f11, …, f1J) in the first row of the table is

\frac{\binom{f_{+1}}{f_{11}} \cdots \binom{f_{+J}}{f_{1J}}}{\binom{N}{f_{1+}}} = \frac{\prod_{j=1}^{J} \binom{f_{+j}}{f_{1j}}}{\binom{N}{f_{1+}}}

and by straightforward computation the result is proved.

As it is difficult to find the number of tables in F(r,c) and to obtain a complete list of all tables in F(r,c), we approximate the distribution of a generic test statistic C conditional on the sufficient statistic using the algorithm described in Chapter 2. Recalling the definition of Markov basis, see Chapter 2, Definition 2.1, it is clear that the Markov basis for the independence model must be formed by moves with null margins. Moreover, the set of moves must give a connected Markov chain. Then, using this Markov basis, one can apply the algorithm described in the previous chapter in order to obtain a Monte Carlo approximation of the distribution of the test statistic. The likelihood of any I × J contingency table f can be written in monomial form

ξ_{11}^{f_{11}} \cdots ξ_{IJ}^{f_{IJ}} ,    (3.3)
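The hypergeometric distribution of Proposition 3.1 is easy to evaluate numerically. The following sketch (our own illustration; the function names are not from the thesis) computes H_{(r,c)}(f) from formula (3.2) and checks that the probabilities sum to one over a small fiber of 2 × 3 tables with fixed margins.

```python
from math import comb, prod

def multinomial(n, parts):
    # n! / (parts[0]! * parts[1]! * ...), assuming sum(parts) == n
    out, rest = 1, n
    for p in parts:
        out *= comb(rest, p)
        rest -= p
    return out

def hyper_prob(table):
    # H_{(r,c)}(f) as in (3.2): a product of column multinomial
    # coefficients divided by the multinomial coefficient of the row sums.
    rows = [sum(r) for r in table]
    cols = list(zip(*table))
    n = sum(rows)
    num = prod(multinomial(sum(col), col) for col in cols)
    return num / multinomial(n, rows)

# Enumerate all 2 x 3 tables with row sums (3, 2) and column sums (2, 2, 1).
r, c = (3, 2), (2, 2, 1)
tables = []
for a in range(c[0] + 1):
    for b in range(c[1] + 1):
        d = r[0] - a - b
        if 0 <= d <= c[2]:
            tables.append([[a, b, d], [c[0] - a, c[1] - b, c[2] - d]])

total = sum(hyper_prob(t) for t in tables)
assert abs(total - 1.0) < 1e-12   # the probabilities sum to one on the fiber
```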
where the ξij's are functions of the model parameters. The sufficient statistic in the independence case is a map

T : N^{I×J} −→ N^{I+J}    (3.4)
f ↦ (f1+, …, fI+, f+1, …, f+J) .    (3.5)

Then, for I × J contingency tables the ring homomorphism of interest is

π : K[ξ11, …, ξIJ] −→ K[y1+, …, yI+, y+1, …, y+J]    (3.6)
ξij ↦ yi+ y+j .    (3.7)
The following result characterizes a well known object, namely the set of 2 × 2 minors, as the Gröbner basis of the toric ideal I associated to π.

Theorem 3.2 Given any term-ordering τ, the reduced Gröbner basis of I is

Gτ = {ξil ξjk − ξik ξjl : 1 ≤ i < j ≤ I, 1 ≤ k < l ≤ J} .    (3.8)

Proof.
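The basis of Theorem 3.2 is fully explicit, so it can be generated without any symbolic computation. A small sketch (our own; the function name is an assumption) builds each binomial ξil ξjk − ξik ξjl as a table of entries in {−1, 0, +1} and checks the null-margin property.

```python
from itertools import combinations

def minor_moves(I, J):
    # One move per pair of rows (i, j) and pair of columns (k, l):
    # the binomial xi_il * xi_jk - xi_ik * xi_jl written as a table.
    moves = []
    for i, j in combinations(range(I), 2):
        for k, l in combinations(range(J), 2):
            m = [[0] * J for _ in range(I)]
            m[i][l] = m[j][k] = +1
            m[i][k] = m[j][l] = -1
            moves.append(m)
    return moves

moves = minor_moves(3, 2)
assert len(moves) == 3      # C(3,2) * C(2,2): the three moves of Example 3.3
for m in moves:
    assert all(sum(row) == 0 for row in m)        # null row margins
    assert all(sum(col) == 0 for col in zip(*m))  # null column margins
```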
First, we prove that Gτ is a set of generators for ker(π). Every polynomial of the form ξil ξjk − ξik ξjl is in ker(π), so Ideal(Gτ) ⊆ ker(π). For the converse, consider a monic polynomial g ∈ ker(π). Letting ξ^a = LT(g) be the leading term of g and π(ξ^a) = y^u, there exists in g at least another term ξ^b such that π(ξ^b) = y^u. The polynomial ξ^a − ξ^b is such that the exponents verify

∑_{i=1}^I aij = ∑_{i=1}^I bij    for all j = 1, …, J

and

∑_{j=1}^J aij = ∑_{j=1}^J bij    for all i = 1, …, I .
With these constraints, after long but trivial computations one proves that ξ^a − ξ^b ∈ Ideal(Gτ). Now, the polynomial g′ = g − ξ^a + ξ^b is also in ker(π) and LT(g′) ≺τ LT(g). Then, the result is proved by applying the same procedure to g′ and so on a finite number of times. The proof that Gτ is a Gröbner basis is a simple application of the notion of syzygy, which is a standard tool in Commutative Algebra. The reader can find the theory of syzygies in Kreuzer & Robbiano (2000), Section 2.3.

The following example is the continuation of Example 2.6 of Chapter 2.

Example 3.3 The reduced Gröbner basis for the 3 × 2 contingency tables gives the three moves

+1 −1      +1 −1      0  0
−1 +1      0  0       +1 −1
0  0       −1 +1      −1 +1
This example can be used to show that the standard analysis of contingency tables by means of the theory of vector spaces is not sufficient to find a Markov basis, see also Pistone et al. (2001a). The matrix representation of the homomorphism T in Equation (3.5) is

       1 1 0 0 0 0
       0 0 1 1 0 0
S1 =   0 0 0 0 1 1        (3.9)
       1 0 1 0 1 0
       0 1 0 1 0 1
Note that we use the notation S1 in order to preserve here the notation of the statistical literature, but the matrix S1 is just the matrix A introduced in Chapter 1, Section 1.2.3.
We regard S1 as a linear map from R^6 to R^5 and compute its kernel through the system of linear equations

a11 + a12 = 0
a21 + a22 = 0
a31 + a32 = 0        (3.10)
a11 + a21 + a31 = 0
a12 + a22 + a32 = 0
As the system has rank 4, the kernel has dimension 2. Two solutions of this system are

(a11, a12, a21, a22, a31, a32) = (1, −1, 0, 0, −1, 1) ,    (3.11)
(a11, a12, a21, a22, a31, a32) = (0, 0, 1, −1, −1, 1) .    (3.12)

Then, a basis of the orthogonal complement of Z1 = S1^T is given by

Z2 =   1 −1 0  0 −1 1        (3.13)
       0  0 1 −1 −1 1
but the polynomials ξ11ξ32 − ξ12ξ31 and ξ21ξ32 − ξ22ξ31 are not a Markov basis. For example, starting from the table

3 5
6 4        (3.14)
0 0

the Markov chain based on the two moves

+1 −1      0  0
0  0       +1 −1
−1 +1      −1 +1

does not produce any other table. In fact, the polynomial ξ11ξ22 − ξ12ξ21 represents a non-trivial move, but it does not lie in the ideal Ideal(ξ11ξ32 − ξ12ξ31, ξ21ξ32 − ξ22ξ31). Although this table is statistically trivial, this example allows us to write the homomorphism explicitly in a simple case. In other models there are more interesting examples, as we will see in Section 3.3. Using the saturation-based algorithm presented in Chapter 1, Section 1.2.4, the Markov basis can be found through the computation of the Gröbner basis of the ideal

Ideal(ξ11ξ32 − ξ12ξ31, ξ21ξ32 − ξ22ξ31, hu − 1) ∩ K[ξ11, ξ12, ξ21, ξ22, ξ31, ξ32]    (3.15)
where h is the polynomial formed by the product of all the ξ indeterminates. This operation leads to the three moves described above. Refer to Chapter 1 for details on the computation of toric ideals as well as for a discussion of the symbolic software we use.

Finally, note that in the independence case the Markov basis consists of all moves of the form

+1 −1
−1 +1

for any 2 × 2 minor of the table and zero otherwise. Then, the computation of the ratio H(r,c)(f + m)/H(r,c)(f) in the rejection probability of the Metropolis-Hastings algorithm is again simplified. In fact, referring to Equation (2.12), we obtain

H_{(r,c)}(f + m)/H_{(r,c)}(f) = \prod_{i,j : m_{ij} \neq 0} \frac{f_{ij}!}{(f_{ij} + m_{ij})!}    (3.16)
  = \prod_{i,j : m_{ij} = +1} \frac{f_{ij}!}{(f_{ij} + 1)!} \prod_{i,j : m_{ij} = -1} \frac{f_{ij}!}{(f_{ij} - 1)!}    (3.17)
  = \prod_{i,j : m_{ij} = +1} (f_{ij} + 1)^{-1} \prod_{i,j : m_{ij} = -1} f_{ij}    (3.18)

and only four numbers are involved.
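As an illustration (our own sketch; the helper names are not from the thesis), a single step of the resulting Metropolis-Hastings chain can be coded directly from (3.18): propose a uniformly chosen move with a random sign, reject it if it produces a negative entry, and otherwise accept with probability min(1, H(f + m)/H(f)).

```python
import random

def mh_step(f, moves, rng=random):
    # One Metropolis-Hastings step on the set of tables with fixed margins.
    base = rng.choice(moves)
    sign = rng.choice((-1, 1))
    m = [[sign * e for e in row] for row in base]
    g = [[fij + mij for fij, mij in zip(frow, mrow)]
         for frow, mrow in zip(f, m)]
    if any(x < 0 for row in g for x in row):
        return f                      # the move would leave the fiber
    ratio = 1.0                       # H(f + m)/H(f), computed via (3.18)
    for frow, mrow in zip(f, m):
        for fij, mij in zip(frow, mrow):
            if mij == 1:
                ratio /= fij + 1
            elif mij == -1:
                ratio *= fij
    return g if rng.random() < ratio else f

# The three moves of Example 3.3 for 3 x 2 tables.
moves = [[[1, -1], [-1, 1], [0, 0]],
         [[1, -1], [0, 0], [-1, 1]],
         [[0, 0], [1, -1], [-1, 1]]]
f = [[3, 5], [6, 4], [2, 1]]
rows, cols = [sum(r) for r in f], [sum(c) for c in zip(*f)]
for _ in range(1000):
    f = mh_step(f, moves)
assert [sum(r) for r in f] == rows       # the margins are invariant
assert [sum(c) for c in zip(*f)] == cols
```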
3.2 Some models for incomplete and square tables
We considered the independence model in the previous sections as it represents an easy prototype. We now analyze the Markov bases for other models, with special attention to the most widely used models for two-way contingency tables. A survey of these models in the framework of log-linear models can be found in Haberman (1978).
3.2.1 Incomplete tables
Incomplete tables are contingency tables with structural zeroes. In this case the sample space X is a subset of {(i, j) : 1 ≤ i ≤ I, 1 ≤ j ≤ J}. A complete description of a Markov basis for incomplete tables is not easy. For example, we consider a 3×3 table with the missing entry (1, 3), again with the row sums and the column sums as components of the sufficient statistic. Using CoCoA
to compute a Gröbner basis of the corresponding toric ideal, we obtain these five moves in the Markov basis:

+1 −1 •      +1 −1 •      0  0 •
−1 +1 0      0  0  0      +1 −1 0
0  0  0      −1 +1 0      −1 +1 0

0  0  •      0  0  •
+1 0 −1      0 +1 −1
−1 0 +1      0 −1 +1
In this case, the Markov basis seems to be obtained from the corresponding Markov basis of the complete table by deleting the moves which involve the missing entry. But this is false in general, as we show in the following example. If we consider a 3 × 3 table with all diagonal cells as missing entries, the algebraic computation shows that the Markov basis has only one move:

•  −1 +1
+1 •  −1
−1 +1 •
In this example, the ring homomorphism of interest is

π : K[ξ12, ξ13, ξ21, ξ23, ξ31, ξ32] −→ K[y1+, y2+, y3+, y+1, y+2, y+3]    (3.19)
ξij ↦ yi+ y+j    (3.20)

with i ≠ j.
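Two quick checks (our own sketch, not part of the thesis) make the example concrete: the single move above has null row and column margins, and no 2 × 2 minor of a 3 × 3 table avoids the diagonal, which is why none of the minor moves of the complete table survives when the whole diagonal is missing.

```python
from itertools import combinations

# The single move for the 3x3 table with missing diagonal; the structural
# zeros (the bullets) are stored as 0 so that margins can be computed.
move = [[0, -1, +1],
        [+1, 0, -1],
        [-1, +1, 0]]
assert all(sum(row) == 0 for row in move)        # null row margins
assert all(sum(col) == 0 for col in zip(*move))  # null column margins

# Every 2x2 minor of a 3x3 table touches a diagonal cell: two 2-subsets
# of {0, 1, 2} must intersect, so some chosen row index is also a chosen
# column index, i.e. a cell (i, i) belongs to the minor.
for rows in combinations(range(3), 2):
    for cols in combinations(range(3), 2):
        assert set(rows) & set(cols)
```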
3.2.2 The quasi-independence model
The quasi-independence model is often useful for square tables. The log-linear form of this model is

log mij = λ + λi^X + λj^Y + δi I{i=j}    (3.21)

where the λi^X's are the effects of the first variable X with values 1, …, I, the λj^Y's are the effects of the second variable Y with values 1, …, I, and the δi's are the effects of the diagonal cells. Here the indicator I{i=j} equals 1 when i = j and 0 otherwise. The usual constraints of this model are

∑_{i=1}^I λi^X = 0    and    ∑_{j=1}^I λj^Y = 0 .    (3.22)
This model assumes independence except for the diagonal cells, which
are fitted exactly. In this model, the components of the sufficient statistic are the row sums and the column sums together with the diagonal entries d = (f11, …, fII). As the diagonal entries are themselves components of the sufficient statistic, the model fits these cells exactly, and consequently no move of the Markov basis can modify the diagonal. This observation leads us to the following result.

Proposition 3.4 The reduced Gröbner basis of the toric ideal I for the quasi-independence model is the Gröbner basis for the corresponding incomplete table with missing diagonal entries.

Proof. An equivalent minimal sufficient statistic has components d, r̃ = (f̃1+, …, f̃I+) and c̃ = (f̃+1, …, f̃+I), where f̃i+ = fi1 + … + f_{i,i−1} + f_{i,i+1} + … + fiI is the ith row sum with the ith diagonal cell excluded, and the same definition holds for the column sums f̃+j. With this sufficient statistic, the equations defining the kernel of π are

ξij = ỹi+ ỹ+j    for i, j = 1, …, I, i ≠ j    (3.23)
ξii = fii    for i = 1, …, I .    (3.24)
A Gröbner basis G1 of the ideal J1 = Ideal(ξij − ỹi+ ỹ+j, i, j = 1, …, I, i ≠ j) is the Gröbner basis coming from the incomplete table with missing diagonal entries. A Gröbner basis G2 of the ideal J2 = Ideal(ξii − fii, i = 1, …, I) is {ξii − fii, i = 1, …, I}. Now, observe that the first group of equations involves only the indeterminates ξij, i ≠ j, ỹi+ and ỹ+j, i, j = 1, …, I, whereas the second group of equations involves the indeterminates ξii and fii, i = 1, …, I. As a consequence, it is a standard fact in polynomial algebra that a Gröbner basis of the ideal J = J1 + J2 is G1 ∪ G2. As we have previously seen, the Gröbner basis of the toric ideal I = J ∩ K[ξij : i ≠ j] is obtained from G1 ∪ G2 by deleting the polynomials involving the y and f indeterminates. So, the polynomials in G2 are all deleted and the result is proved.

As the computation of a Gröbner basis of a toric ideal is very computer intensive and contingency tables need a great number of indeterminates and equations, the above result is especially useful from a computational point of view, as it reduces the number of equations used in the computation. Moreover, note that the above proposition can be extended to other models where some entries are components of the sufficient statistic.
3.2.3 The symmetry model
Following the notation of log-linear models, the symmetry model can be written in the form

log mij = λ + λi + λj + λij    (3.25)

with λij = λji for all i, j = 1, …, I and with the constraint ∑_{i=1}^I λi = 0. This model implies marginal homogeneity. The sufficient statistic is s = (sij = fij + fji : 1 ≤ i ≤ j ≤ I). The Markov basis for the symmetry model is characterized by the following result.

Theorem 3.5 Given any term-ordering τ, the reduced Gröbner basis of I for the symmetry model is

Gτ = {ξij − ξji : 1 ≤ i < j ≤ I} .    (3.26)

Proof. Writing the kernel of the ring homomorphism explicitly, we obtain a polynomial system which contains the couples of equations

ξij = sij
ξji = sij    (3.27)

for all 1 ≤ i < j ≤ I and the equations

ξii = sii    (3.28)

for all i = 1, …, I. As the corresponding polynomials are a Gröbner basis, straightforward elimination of the sij indeterminates gives the result.
3.2.4 The quasi-symmetry model
The quasi-symmetry model has the log-linear form

log mij = λ + λi^X + λj^Y + λij    (3.29)

with λij = λji for all i, j = 1, …, I and with the constraints

∑_{i=1}^I λi^X = 0    and    ∑_{j=1}^I λj^Y = 0 .    (3.30)
This model is more general than the previous one, as it does not imply marginal homogeneity. The components of the sufficient statistic are the row sums, the column sums and s = (sij = fij + fji : 1 ≤ i ≤ j ≤ I). In this model the Markov basis can be computed with CoCoA. For example, for a 4 × 4 table, the seven moves are

0  0  0  0     0  0 +1 −1     0  0 +1 −1
0  0 +1 −1     0  0 −1 +1    0  0  0  0
0 −1  0 +1    −1 +1  0  0    −1  0  0 +1
0 +1 −1  0    +1 −1  0  0    +1  0 −1  0

0 +1  0 −1     0 +1  0 −1     0 −1 +1  0
−1  0 +1  0   −1  0  0 +1    +1  0  0 −1
0 −1  0 +1     0  0  0  0    −1  0  0 +1
+1  0 −1  0   +1 −1  0  0     0 +1 −1  0

0 +1 −1  0
−1  0 +1  0
+1 −1  0  0
0  0  0  0
It should be noted that the Markov basis is easily computable also in those models where the maximum likelihood equations are not in explicit form and a numerical method, such as the Newton-Raphson method, must be used to calculate the maximum likelihood estimate, see Chapter 1.
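A direct check (our own sketch) confirms that each of the seven moves annihilates every component of the quasi-symmetry sufficient statistic: null row sums, null column sums and, each move being antisymmetric, unchanged symmetric sums sij = fij + fji.

```python
# The seven quasi-symmetry moves for a 4x4 table, as listed in the text.
moves = [
    [[0, 0, 0, 0], [0, 0, 1, -1], [0, -1, 0, 1], [0, 1, -1, 0]],
    [[0, 0, 1, -1], [0, 0, -1, 1], [-1, 1, 0, 0], [1, -1, 0, 0]],
    [[0, 0, 1, -1], [0, 0, 0, 0], [-1, 0, 0, 1], [1, 0, -1, 0]],
    [[0, 1, 0, -1], [-1, 0, 1, 0], [0, -1, 0, 1], [1, 0, -1, 0]],
    [[0, 1, 0, -1], [-1, 0, 0, 1], [0, 0, 0, 0], [1, -1, 0, 0]],
    [[0, -1, 1, 0], [1, 0, 0, -1], [-1, 0, 0, 1], [0, 1, -1, 0]],
    [[0, 1, -1, 0], [-1, 0, 1, 0], [1, -1, 0, 0], [0, 0, 0, 0]],
]
for m in moves:
    assert all(sum(row) == 0 for row in m)        # row sums unchanged
    assert all(sum(col) == 0 for col in zip(*m))  # column sums unchanged
    # m is antisymmetric, so every s_ij = f_ij + f_ji is unchanged
    assert all(m[i][j] + m[j][i] == 0 for i in range(4) for j in range(4))
```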
3.2.5 Computational notes
The application of the above theory leads to two computational problems. First, the computation of the Gröbner bases of toric ideals involves a great number of indeterminates, and this number increases fast when the dimensions of the table increase. For example, the computation of the Gröbner basis of the toric ideal in the independence model for complete I × J tables involves IJ + I + J indeterminates. For other models the number of indeterminates is even greater: for example, the quasi-symmetry model for an I × I table involves I² + 2I + I(I + 1)/2 indeterminates. The most common symbolic software packages, such as CoCoA or Maple, run into computational problems if the number of indeterminates is too high. So, for large tables, the theorems of the sections above are essential, as they provide the Gröbner basis while avoiding the computation. However, we have found in our computations that CoCoA, with its function Toric for the computation of toric ideals, is the package which allows one to work with the greatest number of indeterminates. Second, the number of moves increases fast when I and J increase. Considering once again the independence model, the number of polynomials of the
Gröbner basis, or equivalently the number of moves of the Markov basis, is \binom{I}{2}\binom{J}{2}. In Table 3.1, we report the number of moves of the Markov basis for the independence model.

 I \ J    2    3    4    5    6    7     8     9    10
   2      1    3    6   10   15   21    28    36    45
   3      3    9   18   30   45   63    84   108   135
   4      6   18   36   60   90  126   168   216   270
   5     10   30   60  100  150  210   280   360   450
   6     15   45   90  150  225  315   420   540   675
   7     21   63  126  210  315  441   588   756   945
   8     28   84  168  280  420  588   784  1008  1260
   9     36  108  216  360  540  756  1008  1296  1620
  10     45  135  270  450  675  945  1260  1620  2025

Table 3.1: Number of moves for the independence model.
In this case, the main problem is how to store the results of the symbolic computations in matrix form in order to use them with statistical software.
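The count \binom{I}{2}\binom{J}{2} is immediate to reproduce (a trivial check, our own; the function name is an assumption):

```python
from math import comb

def n_moves(I, J):
    # number of 2x2 minors of an I x J table = size of the Markov basis
    return comb(I, 2) * comb(J, 2)

assert n_moves(3, 2) == 3      # the three moves of Example 3.3
assert n_moves(7, 8) == 588    # as in Table 3.1
assert n_moves(10, 10) == 2025
```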
3.3 Examples
In this section we present three different examples of application of the Diaconis-Sturmfels algorithm to goodness of fit tests for log-linear models.

Example 3.6 Consider the categorical data in Fingleton (1984), page 142, reported in Table 3.2. These data describe the place of residence of a subgroup of British migrants in 1966 and 1971. The data are simplified in order to consider only four places, labelled A, B, C and D. For the quasi-symmetry model, the Pearson statistic is C = 2.5826. If we compare this value with the chi-squared distribution with (I − 1)(I − 2)/2 = 3 degrees of freedom, we obtain an asymptotic approximate p-value of 0.4607. Running the Markov chain based on the seven moves found in Section 3.2 for the quasi-symmetry model, with 50,000 burn-in steps and sampling every 50 steps for a total of 10,000 values, the Monte Carlo approximate p-value is 0.4714. As the sample size is large, the classical asymptotic approximation works well and the two approximations are similar.
              1971
1966      A      B      C      D
  A     118     12      7     23
  B      14   2127     86    130
  C       8     69   2548    107
  D      12    110     88   7712

Table 3.2: British migrants data.
The sample size is chosen in order to have a 95%-confidence interval for the exact p-value shorter than 0.02. In fact, assuming near independence, the width of the 95%-confidence interval is 2 · 1.96 · √(p(1 − p)/B), where p is the estimated p-value and B is the size of the Monte Carlo sample. In the inequality 2 · 1.96 · √(p(1 − p)/B) < 0.02, the left hand side has a maximum at p = 1/2, so the inequality reduces to 2 · 1.96 · √(1/(4B)) < 0.02. From the last inequality we easily obtain B > 9,604.
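The sample size bound is a one-line computation (our own check):

```python
from math import sqrt

z, width = 1.96, 0.02
# worst case p = 1/2: the width 2*z*sqrt(p*(1-p)/B) is maximal there,
# so B must exceed (2*z/width)^2 / 4
bound = (2 * z / width) ** 2 / 4
assert round(bound) == 9604
assert 2 * z * sqrt(0.25 / 10_000) < width   # the B = 10,000 used above suffices
```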
Example 3.7 As a second example, we present a quasi-independence model for complete tables. We analyze the medical data in Agresti (1996), page 242, reported in Table 3.3.

          Y
X      1     2     3     4
1     22     2     2     0
2      5     7    14     0
3      0     2    36     0
4      0     1    17    10

Table 3.3: Carcinoma data.

This table shows ratings by two pathologists, X and Y, who separately classified 118 slides regarding the presence and extent of carcinoma of the uterine cervix. The rating scale has four ordered categories, 1, 2, 3 and 4. Under the quasi-independence model, the Pearson statistic is C = 11.5, leading to an asymptotic approximate p-value of 0.0423. Applying the algebraic MCMC algorithm as in the previous example, the Monte Carlo approximate p-value is 0.0080. If we perform a 1%-level goodness of fit test, the conclusions are different using the two approaches.
Example 3.8 Finally, we consider again the data of the previous example, but under the quasi-symmetry model. The Pearson statistic for this table under the quasi-symmetry model is C = 0.6348. Agresti (1996) compares this value with the chi-squared distribution with 2 degrees of freedom, obtaining an asymptotic approximate p-value of 0.7280. Running the Markov chain as above, the Monte Carlo approximate p-value is exactly 1, and in this case the two approximations are not similar. We used the definition of the p-value as the smallest significance level at which the model is rejected with probability 1. But there is more. Using the seven moves of the Markov basis, we can compute for this table the exact distribution of C under the model. In fact, starting from the original table (denoted here by f1), only the last move

0 −1 +1  0
+1  0 −1  0
−1 +1  0  0
0  0  0  0

with negative sign, can be applied, and this move gives the table f2 below. Again, only the last move can be used: if we choose the positive sign we come back to the table f1, whereas if we choose the negative sign we obtain the table f3.

     22  3  1  0            22  4  0  0
f2 =  4  7 15  0       f3 =  3  7 16  0        (3.31)
      1  1 36  0             2  0 36  0
      0  1 17 10             0  1 17 10
Now, only the last move can be applied, with positive sign, coming back to table f2. There are only three tables with non-negative entries and with the appropriate values of the sufficient statistic, and it is not difficult to compute the corresponding hypergeometric probabilities. Applying formula (3.16), we obtain the equations H(f2)/H(f1) = 4/9 and H(f3)/H(f2) = 1/32, together with the constraint H(f1) + H(f2) + H(f3) = 1. Solving this simple system, we find H(f1) = 72/105, H(f2) = 32/105, H(f3) = 1/105. The Pearson statistic values for these three tables are C(f1) = 0.6348, C(f2) = 1.8405, C(f3) = 12.3206. Moreover, following Example 3.3, the size of the matrix S1 is 16 × 18 and the dimension of its kernel is 3. Then, a basis of the kernel as a vector space is given by three linearly independent moves, for instance the first three. If the last move is excluded, the Markov chain starting from f1 cannot reach the other two tables.
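The little computation above is easy to reproduce in exact rational arithmetic (our own sketch; the helper name is an assumption). The move m below is the Markov basis move taken with the sign that produces f2 from f1.

```python
from fractions import Fraction

f1 = [[22, 2, 2, 0], [5, 7, 14, 0], [0, 2, 36, 0], [0, 1, 17, 10]]
m  = [[0, 1, -1, 0], [-1, 0, 1, 0], [1, -1, 0, 0], [0, 0, 0, 0]]

def h_ratio(f, m):
    # H(f + m)/H(f) via (3.18): 1/(f_ij + 1) on +1 cells, f_ij on -1 cells
    r = Fraction(1)
    for frow, mrow in zip(f, m):
        for fij, mij in zip(frow, mrow):
            if mij == 1:
                r /= fij + 1
            elif mij == -1:
                r *= fij
    return r

f2 = [[a + b for a, b in zip(fr, mr)] for fr, mr in zip(f1, m)]
r21, r32 = h_ratio(f1, m), h_ratio(f2, m)
assert (r21, r32) == (Fraction(4, 9), Fraction(1, 32))

# H1 + H2 + H3 = 1 with H2 = r21*H1 and H3 = r32*H2
h1 = 1 / (1 + r21 + r21 * r32)
assert (h1, r21 * h1, r21 * r32 * h1) == (
    Fraction(72, 105), Fraction(32, 105), Fraction(1, 105))
```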
3.4 An application. The h-sample problem
In this section we present a first application of the theory above, and in particular of the exact test for the independence model. We want to compute exact p-values for tests of equality of h multinomial distributions, through the test of independence on h × k tables. It is worth noting that when the p-value is not computed exactly, classical approximations based on the chi-squared asymptotic distribution of the test statistic are used. As for log-linear models, asymptotic approximations are inaccurate in many situations; indeed, the problem of computing exact p-values has been widely considered in the recent literature. Mehta & Patel (1983) provide a network algorithm to inspect the orbit of h × k contingency tables with fixed margins. But the entire set of tables is often too large, and consequently approximation methods have been proposed. Baglivo et al. (1992) used an FFT algorithm to compute the exact p-value on h × k tables with fixed margins. Hirji & Johnson (1996) compared the two algorithms above and concluded that the first is superior to the second with respect to computing speed and accuracy. The algorithm proposed by Mehta et al. (1988) provides an approximation of the exact p-value, and it is based on a complicated Monte Carlo method which generates random tables providing statistics belonging to the critical region of the test. Recently, Strawderman & Wells (1998) used saddlepoint approximations to the exact distribution of the conditional maximum likelihood estimate (MLE) for several 2 × 2 tables, providing approximate p-values, power calculations and confidence intervals. Booth & Butler (1999) proposed Monte Carlo importance sampling for conducting exact conditional tests in log-linear models. Moreover, Mehta et al. (2000) proposed a Monte Carlo method for approximating the exact p-values in conditional logistic regression and foresee the applicability of algebraic-based methods in this framework. The main problem in Mehta et al. (1988), and in most of the cited works, was how to generate random tables with fixed margins. Here we give a new solution to the h-sample problem applying the technique described in Diaconis & Sturmfels (1998), and
the computer program source takes only a few lines of code. Moreover, we provide a solution not only for 2 × k tables but, more generally, for h × k tables.

The h-sample problem can be stated as follows. Let X1, …, Xh be random variables with values in {1, …, k} defined on h different groups, and consider h samples drawn from these groups: (X_{1,1}, …, X_{1,n1}) from the multinomial distribution D1, …, (X_{h,1}, …, X_{h,nh}) from the multinomial distribution Dh. Let N = ∑_{i=1}^h ni be the total sample size. The distribution Di has parameters p_{i1}, …, p_{ik}, for i = 1, …, h, with the constraints p_{ij} ≥ 0 and ∑_{j=1}^k p_{ij} = 1 for all i = 1, …, h. The usual test in this situation is the test for equality of the proportions, where the null hypothesis is

H0 : p_{1j} = … = p_{hj}    for j = 1, …, k    (3.32)
against the composite alternative hypothesis of different proportions. The components of the sufficient statistic are the sums of the observations in each group and the sums of observations for each possible value of the variables, that is

S = ( ∑_{u,j} I_{X_{1,u}=j}, …, ∑_{u,j} I_{X_{h,u}=j}, ∑_{i,u} I_{X_{i,u}=1}, …, ∑_{i,u} I_{X_{i,u}=k} )    (3.33)

for i = 1, …, h and j = 1, …, k, where I_A is the indicator function of A. Of course, the sample can be summarized in a contingency table with h rows and k columns, which we denote by F. In other words, F_{ij} = ∑_{u=1}^{ni} I_{X_{i,u}=j}. The table F has raw parameters p*_{ij} = p_{ij} ni / N. Moreover, we denote by f_{i+} and f_{+j} the row sums and the column sums, respectively. The sufficient statistic S is represented by the margins of this table, and the maximum likelihood estimate of the parameters is

p̂*_{ij} = f_{i+} f_{+j} / N²    for i = 1, …, h, j = 1, …, k .    (3.34)
For this hypothesis the most commonly used test statistics are the Pearson statistic C = ∑_{i,j} (F_{ij} − N p̂*_{ij})² / (N p̂*_{ij}) and the log-likelihood ratio L = 2 ∑_{i,j} F_{ij} log(F_{ij} / (N p̂*_{ij})), see for example Agresti (1996). There are many nonparametric algorithms, based on Monte Carlo approximations of the p-value, which do not involve limit theorems. The algorithm presented here is still valid when the test statistic is of the form

T(F) = ∑_{i,j} a_{ij}(F_{ij}) ,    (3.35)
where the a_{ij}'s are real valued functions on the set N of the nonnegative integers. As usual in this framework, we consider a problem of conditional inference, where only the tables with the same margins as the original table are relevant. So, we restrict the analysis to the reference set Ft. Using the algorithm described above with the Markov basis for the independence model, the approximate Monte Carlo p-value is simply the ratio of the number of sampled tables whose value of the test statistic is greater than or equal to that of the original table, over the number B of sampled tables. Note that for the two-sample problem there is an algorithm proposed by Mehta et al. (1988), which is an importance sampling algorithm based on the network representation of the reference set Ft. It was also used to approximate the power of the test, but it is difficult to rewrite it for the h-sample problem. Moreover, our method allows us to approximate the p-value as stated before, without having to consider the hypergeometric probability weights of the sampled tables. It is interesting to note that Mehta et al. (2000) also considered the possibility of applying algebraic-based methods: "An investigation is needed to establish conditions under which the probability of connectedness of the Markov chain is high". But the theorems presented in Diaconis & Sturmfels (1998), and described in Chapter 2 of this thesis, state that the algebraic algorithm described above defines a connected Markov chain. As in this kind of application only the independence model is involved, it is worth noting that the algebraic step of the algorithm needs no symbolic computation: the relevant Markov bases are completely characterized in Theorem 3.2. Finally, we emphasize that classical methods, for example Mehta et al. (1988), sample a few well chosen tables with great computational effort. On the other hand, this method easily generates tables from the reference set.
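The whole procedure for the h-sample problem can indeed be written in a few lines. The sketch below (our own illustration; the function names and default chain parameters are assumptions, not the thesis's software) runs the Metropolis chain of Chapter 2 with the 2 × 2-minor Markov basis of Theorem 3.2 and returns the Monte Carlo p-value for the Pearson statistic.

```python
import random

def pearson(f):
    # Pearson statistic C with expected counts f_{i+} f_{+j} / N
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    N = sum(r)
    return sum((f[i][j] - r[i] * c[j] / N) ** 2 / (r[i] * c[j] / N)
               for i in range(len(f)) for j in range(len(f[0]))
               if r[i] and c[j])

def mc_pvalue(table, stat=pearson, burn=1000, gap=10, n=1000, seed=0):
    # Monte Carlo p-value of the h-sample test: walk the reference set
    # F_t with the 2x2-minor moves, accepting with the ratio (3.18),
    # and count how often stat(sampled table) >= stat(observed table).
    rng = random.Random(seed)
    h, k = len(table), len(table[0])
    f = [row[:] for row in table]
    t0, hits, sampled = stat(table), 0, 0
    for step in range(burn + gap * n):
        i, j = rng.sample(range(h), 2)   # random ordered row pair
        a, b = rng.sample(range(k), 2)   # random ordered column pair
        # proposed move: +1 at (i,a) and (j,b), -1 at (i,b) and (j,a)
        if min(f[i][b], f[j][a]) > 0:
            ratio = (f[i][b] * f[j][a]) / ((f[i][a] + 1) * (f[j][b] + 1))
            if rng.random() < ratio:
                f[i][a] += 1; f[j][b] += 1
                f[i][b] -= 1; f[j][a] -= 1
        if step >= burn and (step - burn) % gap == 0:
            sampled += 1
            hits += stat(f) >= t0
    return hits / sampled

p = mc_pvalue([[10, 2, 3], [3, 9, 4]])
assert 0.0 <= p <= 1.0
```

Since the ordered pairs (i, j) and (a, b) are drawn in random order, both signs of each minor move are proposed with equal probability, so the proposal is symmetric as required by the Metropolis scheme.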
Chapter 4

Rater agreement models

In this chapter we apply the algebraic technique for sampling from conditional distributions presented in Chapter 2 in order to define an exact Markov Chain Monte Carlo (MCMC) algorithm for conditional hypothesis testing in problems of rater agreement. Moreover, we extend the results and the computations carried out in Chapter 3 to the multi-dimensional case, i.e. to the case of d-way tables, with d > 2. These new results allow us to compute Markov bases also in non-trivial cases, and some of the most important cases are presented explicitly. In Section 4.1, we present the medical framework where models for rater agreement apply. In Section 4.2 we review the main methods for measuring rater agreement, with particular attention to the multi-observer case. The methods considered here are Cohen's κ and goodness of fit tests for appropriate log-linear models. In Section 4.3, we describe conditional inference for these problems, using the Diaconis-Sturmfels algorithm and the notion of Markov basis, while in Section 4.4 a careful presentation of the conditioning structures arising from complex problems is given. Section 4.5 deals with the computation of some relevant Markov bases coming from agreement problems, through some new theoretical results useful for simplifying the symbolic computations. Finally, in Section 4.6, an example on a real data set shows the practical applicability of the algebraic procedures.
4.1 A medical problem
The material presented in this chapter was inspired by medical research on medical imaging drugs and biological products, in particular contrast agents, see CDER (2000). Contrast agents are drugs that improve the visualization of
tissues, organs or physiologic processes by increasing the relative difference of imaging signal intensities in adjacent parts of the body. From the statistical point of view, the effectiveness of a contrast agent is assessed through quantitative statistical studies of the agent's ability to provide a better reading of the medical images. The validity of a new product over the existing ones is generally assessed by comparing radiographs, computed tomography scans and magnetic resonance images taken with the new product and with the old product, when available. In this medical setting, the most commonly used statistical models are models for agreement among raters. For the case of two raters the proposed methods are Cohen's κ and similar indices, see Landis & Koch (1975) for a survey. More recently, the application of log-linear models, in particular the quasi-independence model and the quasi-symmetry model, has been proposed, see for example Agresti (1992). In some particular clinical settings, Government Agencies can require models for the agreement among many raters, or the analysis of samples stratified with respect to other covariates, for example when the study includes subjects that represent the full spectrum of the population. For example, in a study intended to demonstrate the validity of a contrast agent to assess bronchiectasis, the sample should be stratified with respect to age classes of the patients and to the presence or absence of the main related pulmonary disorders, e.g., chronic bronchitis, pneumonia, asthma, cystic fibrosis, see CDER (2000). In these situations, the use of conditional inference is common practice. In fact, conditional inference seems preferable to an unconditional approach, see Klar et al. (2000) for a discussion and further references.
In addition, different structures of conditioning arise in the inferential procedures, as will become clear in the next sections. For example, it is usual to define a different index of agreement for each pair of observers, e.g. different κ's for each pair of observers and for each stratum. As will be clear from what follows, such a study can be much more informative than the study of an overall index of agreement for the sample.
4.2 Multidimensional rater agreement models
Suppose that d observers rate the same set of N subjects. In this section, we briefly review the main methods for measuring rater
agreement, starting with the case of d = 2 observers, say A and B. We present here a wide range of models for binary, nominal and ordinal rating scales, and we denote the set of possible responses by {1, . . . , I}. We do not consider continuous responses, where parametric or nonparametric ANOVA methods can be applied, see for example Landis & Koch (1975). The data are summarized in a two-way contingency table. We write π_{ij} = P{A = i, B = j} for the probability that observer A classifies the subject in category i and observer B in category j, with i, j = 1, . . . , I. Moreover, we denote by {π_{i+}, i = 1, . . . , I} and {π_{+j}, j = 1, . . . , I} the marginal distributions for observers A and B, respectively. Note that in the previous chapters the cell probabilities were denoted by p, as usual in the analysis of contingency tables, but in the rater agreement literature the notation π for the probabilities is standard. Probably the most widely used index of agreement for nominal and ordinal categories is Cohen's κ, defined as

κ = ( Σ_{i=1}^I π_{ii} − Σ_{i=1}^I π_{i+} π_{+i} ) / ( 1 − Σ_{i=1}^I π_{i+} π_{+i} )    (4.1)

leading to the model

π_{ij} = (1 − κ) π_{i+} π_{+j}   for i ≠ j
π_{ii} = (1 − κ) π_{i+} π_{+i} + κ/I   for i = j .
This index measures the observer agreement beyond chance, in the sense that the index equals zero if the agreement is equal to that expected by chance and computed under the independence assumption, see Landis & Koch (1975) for further details or the original work Cohen (1960). Another version of the κ index assumes marginal homogeneity and takes the form

κ^h = ( Σ_{i=1}^I π_{ii} − Σ_{i=1}^I (π_{i+} + π_{+i})²/4 ) / ( 1 − Σ_{i=1}^I (π_{i+} + π_{+i})²/4 ) .    (4.2)
There are also several versions of weighted κ for ordinal scales, see Landis & Koch (1975) or Fleiss (1981) for details and discussion. As the Markov bases do not change when considering weights, we restrict our presentation to the case of standard weights, as in (4.1) and (4.2). Moreover, some authors define separate κ's for each pair of responses, see for example Agresti (1992).
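The indices (4.1) and (4.2) are easy to compute in a few lines. An illustrative Python sketch (not part of the thesis code, which uses Matlab), taking a square matrix of cell probabilities or counts:

```python
def cohen_kappa(table):
    """Cohen's kappa (4.1) for an I x I table of probabilities or counts."""
    I = len(table)
    total = sum(sum(row) for row in table)
    p = [[table[i][j] / total for j in range(I)] for i in range(I)]
    row = [sum(p[i]) for i in range(I)]                       # pi_{i+}
    col = [sum(p[i][j] for i in range(I)) for j in range(I)]  # pi_{+j}
    po = sum(p[i][i] for i in range(I))                       # observed agreement
    pe = sum(row[i] * col[i] for i in range(I))               # chance agreement
    return (po - pe) / (1 - pe)

def cohen_kappa_h(table):
    """Version (4.2): chance term based on the averaged margins ((pi_{i+}+pi_{+i})/2)^2."""
    I = len(table)
    total = sum(sum(row) for row in table)
    p = [[table[i][j] / total for j in range(I)] for i in range(I)]
    row = [sum(p[i]) for i in range(I)]
    col = [sum(p[i][j] for i in range(I)) for j in range(I)]
    po = sum(p[i][i] for i in range(I))
    pe = sum(((row[i] + col[i]) / 2) ** 2 for i in range(I))
    return (po - pe) / (1 - pe)
```

Perfect agreement (a purely diagonal table) gives κ = 1, while an independent table gives κ = 0, in line with the normalization discussed below.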
An agreement index is a normalized index such that it is equal to 1 in the case of perfect agreement and equal to 0 in the case of independence.
case of perfect agreement and it is equal to 0 in the case of independence. For d > 2 raters, one can find different versions of the Cohen’s κ. Here we use the following formula o ¡d¢−1 nPd PI Pd PI π − π (i)π (i) i ,...,i ,...,i ,...,i k h 1 h k d h 0 for all x ∈ X . Then X log px = (log ζj )Tj (x) + log ζ0 , (5.3) j
where

ζ_0 = ( Σ_{x∈X} L(ζ, x) )^{-1}    (5.4)
is the normalizing constant. The model M>0 is log-linear, hence it is an exponential model with sufficient statistic T and canonical parameters log ζ_j, j = 1, . . . , s. Note that the representation of the toric model in Equation (5.2) corresponds to the notion of multiplicative form of a log-linear model, see for example Goodman (1979a). A first example of toric model is a model with sufficient statistic consisting of the counts over the sets A_1, . . . , A_s ⊆ X, possibly overlapping:

T : x ↦ (I_{A_1}(x), . . . , I_{A_s}(x)) .

In most examples in the literature X ⊆ {1, . . . , I_1} × · · · × {1, . . . , I_d} is a d-way array, possibly incomplete. A second example is a log-linear model of the type

log P_ψ(x) = Σ_{i=1}^s ψ_i T_i(x) − k(ψ)    (5.5)

with integer valued T_i's and parameters ψ = (ψ_1, . . . , ψ_s). In such a case each

T_i(x) = Σ_{a∈X} T_i(a) I_a(x)    (5.6)

is an integer combination of all the counts, see Fienberg (1980). Note that the strict positivity implied by the log-linear model in Equation (5.5) is not assumed in Equation (5.1). Consider now the problem of sampling. As the probabilities can be written as

p_x = ζ_0 ζ_1^{T_1(x)} · · · ζ_s^{T_s(x)} ,    (5.7)

the probability of a sample (x_1, . . . , x_N) is

p_{x_1} · · · p_{x_N} = ζ_0^N ζ_1^{Σ_i T_1(x_i)} · · · ζ_s^{Σ_i T_s(x_i)} ,    (5.8)
i.e., the sufficient statistic for the sample of size N is the sum of the sufficient statistics of the N components of the sample. Note that this result is formally the same as in the positive case where the theory of exponential models applies.
The j-th component of the sufficient statistic can be represented as

Σ_i T_j(x_i) = Σ_{a∈X} T_j(a) F_a(x_1, . . . , x_N)    (5.9)

where

F_a(x_1, . . . , x_N) = Σ_{j=1}^N I_a(x_j)    (5.10)
is the count of the cell a. This result generalizes the corresponding theorem holding in the framework of log-linear models, which states that the sufficient statistics for all log-linear models are linear combinations of the cell counts.
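The representation (5.9)-(5.10) is easy to check numerically. A small Python illustration (toy data chosen for this sketch) for the independence statistic of a 2 × 2 table:

```python
from collections import Counter

cells = [(1, 1), (1, 2), (2, 1), (2, 2)]

def T(x):
    """Row and column indicators: the independence-model sufficient statistic."""
    i, j = x
    return (int(i == 1), int(i == 2), int(j == 1), int(j == 2))

sample = [(1, 1), (1, 1), (2, 2), (1, 2), (2, 2)]  # N = 5 observations

# Left side of (5.9): sum the statistic over the sample.
lhs = [sum(T(x)[k] for x in sample) for k in range(4)]

# Right side of (5.9): combine the cell counts F_a with coefficients T_j(a).
F = Counter(sample)
rhs = [sum(T(a)[k] * F[a] for a in cells) for k in range(4)]

assert lhs == rhs  # the sufficient statistic is linear in the cell counts
```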
5.3 Geometry of toric models
Now, define the matrix A_T in the following way: A_T has k rows and s columns, and its generic element is A_T(i, j) = T_j(x_i) for i = 1, . . . , k and j = 1, . . . , s. If η_j = E[T_j] are the mean parameters and p is the row vector of the probabilities, then

η = p A_T .    (5.11)

The matrix A_T can also be used to describe the geometric structure of the statistical model. First, we can state the following result.

Proposition 5.2 Choose a parameter ζ_j and take the set X′ of points x ∈ X such that T_j(x) > 0. Suppose that X′ ≠ X. If we set ζ_j = 0, we obtain a model on the remaining sample points, and this model is again a toric model.

Proof. Without loss of generality, suppose j = 1 and order the sample points so that T_1(x_i) = 0 for i = 1, . . . , k′ and T_1(x_i) > 0 for i = k′ + 1, . . . , k. The matrix A_T can then be partitioned as

A_T = ( 0  A′_T )
      ( ∗   ··  )    (5.12)

where the first column has zeros in the first k′ rows and ∗ denotes non-zero entries in the last k − k′ rows. Now, the matrix A′_T is non-zero and it is the representation of a toric model on the first k′ sample points.
Thus, geometrically, the toric model consists of an exponential model together with at most s toric models with (s − 1) parameters on appropriate subsets of the sample space X. Moreover, by applying the above result recursively we obtain the following.

Theorem 5.3 Let ζ_{j_1}, . . . , ζ_{j_r} be a set of parameters, and take the set X′ of points x ∈ X such that T_{j_b}(x) > 0 for at least one b ∈ {1, . . . , r}, so that T_{j_b}(x) = 0 for all b = 1, . . . , r and all x ∈ X − X′. Suppose that X′ ≠ X. If we set ζ_{j_b} = 0 for all b = 1, . . . , r, we obtain a toric model on the remaining sample points.

Proof. Apply Proposition 5.2 r times.
These results lead to a geometrical characterization of a toric model.

Theorem 5.4 A toric model is the disjoint union of exponential models.

Proof. The result follows from the application of Theorem 5.3 to all possible sets of parameters ζ_{j_1}, . . . , ζ_{j_r} for which X′ ≠ X.
Example 5.5 Consider the independence model for 3 × 2 tables analyzed in Example 3.3 of Chapter 3. Ordering the cells as (1,1), (1,2), (2,1), (2,2), (3,1), (3,2), the matrix representation of the sufficient statistic is

A_T =
( 1 0 0 1 0 )
( 1 0 0 0 1 )
( 0 1 0 1 0 )
( 0 1 0 0 1 )
( 0 0 1 1 0 )
( 0 0 1 0 1 )

and the toric model is, apart from the normalizing constant,

p_11 = ζ_1 ζ_4
p_12 = ζ_1 ζ_5
p_21 = ζ_2 ζ_4
p_22 = ζ_2 ζ_5
p_31 = ζ_3 ζ_4
p_32 = ζ_3 ζ_5

If ζ > 0 we have the log-linear model. Moreover, we can choose the following sets X′ in Theorem 5.3:
• a) X′ = {(1, 1), (1, 2)}, corresponding to ζ_1 = 0;
  b) X′ = {(2, 1), (2, 2)}, corresponding to ζ_2 = 0;
  c) X′ = {(3, 1), (3, 2)}, corresponding to ζ_3 = 0.
  In these cases we obtain the independence model for the 2 × 2 tables X \ X′;

• d) X′ = {(1, 1), (2, 1), (3, 1)}, corresponding to ζ_4 = 0;
  e) X′ = {(1, 2), (2, 2), (3, 2)}, corresponding to ζ_5 = 0.
  In these cases we obtain the multinomial model for the second and the first column, respectively;

• f) X′ = {(1, 1), (1, 2), (2, 1), (2, 2)}, corresponding to ζ_1 = ζ_2 = 0;
  g) X′ = {(1, 1), (1, 2), (3, 1), (3, 2)}, corresponding to ζ_1 = ζ_3 = 0;
  h) X′ = {(2, 1), (2, 2), (3, 1), (3, 2)}, corresponding to ζ_2 = ζ_3 = 0.
  In these cases we obtain the Bernoulli model for the third, the second and the first row, respectively;

• i) X′ = {(1, 1), (1, 2), (2, 1), (3, 1)}, corresponding to ζ_1 = ζ_4 = 0;
  j) X′ = {(1, 1), (1, 2), (2, 2), (3, 2)}, corresponding to ζ_1 = ζ_5 = 0;
  k) X′ = {(2, 1), (2, 2), (1, 1), (3, 1)}, corresponding to ζ_2 = ζ_4 = 0;
  l) X′ = {(2, 1), (2, 2), (1, 2), (3, 2)}, corresponding to ζ_2 = ζ_5 = 0;
  m) X′ = {(3, 1), (3, 2), (1, 1), (2, 1)}, corresponding to ζ_3 = ζ_4 = 0;
  n) X′ = {(3, 1), (3, 2), (1, 2), (2, 2)}, corresponding to ζ_3 = ζ_5 = 0.
  In these cases we obtain 6 Bernoulli models, namely the 6 models on the columns of the 3 independence models found above;

• finally, corresponding to the six conditions ζ_1 = ζ_2 = ζ_4 = 0, ζ_1 = ζ_2 = ζ_5 = 0, ζ_1 = ζ_3 = ζ_4 = 0, ζ_1 = ζ_3 = ζ_5 = 0, ζ_2 = ζ_3 = ζ_4 = 0 and ζ_2 = ζ_3 = ζ_5 = 0, we obtain 6 trivial distributions on one sample point.

The toric model is then formed by 21 models: the log-linear model on 6 points, 3 models on 4 points, 2 models on 3 points, 9 models on 2 points and 6 trivial models on 1 point. Apart from the trivial models, the remaining 15 models are represented in Table 5.1, where 0 means structural zero and ∗ denotes the cells with non-zero probability. The procedure in Theorem 5.3, applied as in Example 5.5, can also be used to define the admissible sets of structural zeros.
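The bookkeeping of Example 5.5 can be reproduced mechanically: enumerate all subsets of zeroed ζ's and collect the distinct supports. A Python sketch under the same labelling (parameters ζ_1, ζ_2, ζ_3 for the rows, ζ_4, ζ_5 for the columns; index shift is an implementation choice):

```python
from itertools import combinations

rows, cols = 3, 2
cells = [(i, j) for i in range(rows) for j in range(cols)]

def support(zeroed):
    """Cells keeping non-zero probability when the given parameters are 0.
    Parameters 0, 1, 2 index the rows; parameters 3, 4 index the columns."""
    return frozenset((i, j) for (i, j) in cells
                     if i not in zeroed and rows + j not in zeroed)

supports = set()
for r in range(6):
    for subset in combinations(range(5), r):
        s = support(set(subset))
        if s:                       # discard the empty table
            supports.add(s)

print(len(supports))  # prints 21, as in Example 5.5
```

The distinct supports are exactly the products of a non-empty set of rows by a non-empty set of columns, hence (2³ − 1)(2² − 1) = 21 models.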
∗ ∗   0 0   ∗ ∗   ∗ ∗   0 ∗
∗ ∗   ∗ ∗   0 0   ∗ ∗   0 ∗
∗ ∗   ∗ ∗   ∗ ∗   0 0   0 ∗

∗ 0   0 0   0 0   ∗ ∗   0 0
∗ 0   0 0   ∗ ∗   0 0   0 ∗
∗ 0   ∗ ∗   0 0   0 0   0 ∗

0 0   0 ∗   ∗ 0   0 ∗   ∗ 0
∗ 0   0 0   0 0   0 ∗   ∗ 0
∗ 0   0 ∗   ∗ 0   0 0   0 0

Table 5.1: Nontrivial models. Each 3 × 2 block is one model; 0 denotes a structural zero and ∗ a cell with non-zero probability.
Definition 5.6 A subset X′ ⊂ X is an admissible set of structural zeros if there exist parameters ζ_{i_1}, . . . , ζ_{i_r} such that the condition of Theorem 5.3 holds.

Now, we consider the behavior of the η-parameterization in the different exponential models. Starting from the representation of the η parameters as functions of the ζ parameters in Equation (5.11), we can easily prove that the η parameterization is coherent, as stated in the following proposition.

Proposition 5.7 The η parameters for the reduced models are the same as in the exponential case, provided that a component is fixed to zero.

Proof.
For the proof, it is enough to combine the linear relation between η
and ζ given in Equation (5.11) and Theorem 5.3.
Remark 5.8 The definition of a toric model as a disjoint union of exponential models has a counterpart in differential geometry, namely in the definition of differential manifolds with corners. For the construction of a differential manifold with corners, consider the positive quadrant in R^k. It is an open variety and its boundary is formed by k quadrants of dimension k − 1. Add one or more of these quadrants to the variety and iterate the procedure k times. In general, the derivatives at these boundary points differ from the limit of the derivatives computed in R^k. Every real variety diffeomorphic to a variety defined as above is a differential manifold with corners. For details about differential manifolds with corners, see Margalef Roig & Outerelo Dominguez (1992).
Now, we state a differential relationship between the ζ parameters and the η parameters.

Theorem 5.9 In the positive case the following equation holds:

ζ_j ∂ζ_0/∂ζ_j = −η_j ζ_0²    (5.13)

for all j = 1, . . . , s.

Proof. By straightforward computation, we have

∂ζ_0/∂ζ_j = − ( Σ_{x∈X} T_j(x) ζ^{T(x)}/ζ_j ) / ( Σ_{x∈X} ζ^{T(x)} )² = − ζ_0² η_j / ζ_j ,
and the proof is complete. Note that the previous equation holds in all exponential models contained in the toric model. The geometric representation of the structural zeros as presented in the previous definition needs some further discussion. Let us consider a simple example.

Example 5.10 Consider the independence model for 2 × 2 contingency tables. Using the notation of the previous chapters, a first parameterization is
p_11 = ζ_0
p_12 = ζ_0 ζ_1
p_21 = ζ_0 ζ_2
p_22 = ζ_0 ζ_1 ζ_2    (5.14)

This parameterization is used in Pistone et al. (2001a). A second parameterization is the following one, based on the matrix representation of the sufficient statistic as explained in Chapter 3:

p_11 = ζ_1 ζ_3
p_12 = ζ_1 ζ_4
p_21 = ζ_2 ζ_3
p_22 = ζ_2 ζ_4    (5.15)

Eliminating the ζ indeterminates in the above systems of equations, in both cases we obtain only one polynomial, namely the binomial

p_11 p_22 − p_12 p_21 = 0 .    (5.16)
But the toric models in Equation (5.14) and in Equation (5.15) are different. In fact, the second parameterization contains, for example, the Bernoulli model on the two points (1, 1) and (1, 2), while the first parameterization does not. Now, consider the binomial equations obtained from a toric model by elimination of the ζ indeterminates as described in Chapter 1. In this way we obtain a set of binomials which defines a statistical model, but in general the binomial form differs from the parametric form. The binomial form is independent of the parameterization and it can allow some boundaries excluded from the parametric form, as in the previous example. In the following, we will use the terms "parametric toric model" and "binomial toric model". Moreover, we can also restate the definition of a set of structural zeros for a toric model in binomial form.

Definition 5.11 Let B be the set of binomials defined by elimination as described in Chapter 2. A subset X′ ⊂ X is an admissible set of structural zeros independent of the parameterization if the polynomial system B = 0, together with p_x = 0 for all x ∈ X′, has a non-negative normalized solution.

In general, it is difficult to find a parameterization such that the parametric form of the toric model contains all the exponential sub-models.

Definition 5.12 If a parameterization defines a parametric toric model which contains all the exponential sub-models, we say that the parameterization is a full parameterization.

In view of Example 5.10 and Definition 5.12, it follows that not all matrix representations of the sufficient statistic are equivalent. In general there is an infinite number of non-negative integer valued bases of the sub-space spanned by T, but not all of them contain all the exponential sub-models.
This happens because the columns of the matrix AT are defined in the vector space framework, but in the power product representation we need non-negative exponents and then we need linear combinations with non-negative integer coefficients. In general, it is difficult to find a full parameterization, but it is easy to characterize all the parameterizations which lead to a given binomial toric model.
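A quick numeric check of Example 5.10 (Python, illustrative only): both parameterizations satisfy the binomial (5.16) identically, but only the second one reaches the boundary where a whole row of the table vanishes.

```python
def from_first(z1, z2, z0=1.0):
    """Parameterization (5.14): p11 = z0, p12 = z0*z1, p21 = z0*z2, p22 = z0*z1*z2."""
    return (z0, z0 * z1, z0 * z2, z0 * z1 * z2)

def from_second(z1, z2, z3, z4):
    """Parameterization (5.15): p11 = z1*z3, p12 = z1*z4, p21 = z2*z3, p22 = z2*z4."""
    return (z1 * z3, z1 * z4, z2 * z3, z2 * z4)

def binomial(p):
    p11, p12, p21, p22 = p
    return p11 * p22 - p12 * p21

# Both parameterizations satisfy (5.16) identically.
assert abs(binomial(from_first(0.3, 0.7))) < 1e-12
assert abs(binomial(from_second(0.2, 0.4, 0.5, 0.9))) < 1e-12

# Setting z2 = 0 in the second parameterization yields p21 = p22 = 0, the
# Bernoulli model on (1,1) and (1,2); in the first, p11 = z0 never vanishes.
boundary = from_second(1.0, 0.0, 0.4, 0.6)
assert boundary[2] == 0.0 and boundary[3] == 0.0
```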
Theorem 5.13 Let A_T be the matrix of the model. If v_1, . . . , v_s is a non-negative integer system of generators of the range of A_T, then

p_x = ζ_1^{v_1(x)} · · · ζ_s^{v_s(x)}    (5.17)

for x ∈ X is a parameterization, and all parameterizations of this kind lead to the same binomial toric model.

Proof. Let A_v be the matrix A_v = (v_1, . . . , v_s). The kernels of the systems A_v = 0 and A_T = 0 are the same. As a consequence of the construction of the toric ideals presented in Section 2 of Chapter 1, the toric models defined through A_T and A_v are represented by the same binomials.

Among others, one can consider the range of A_T as a lattice and then consider a Hilbert basis of such a lattice.

Definition 5.14 Let S ⊆ N^k be the set of integer solutions of a diophantine system A_T p = 0. A set of integer vectors H = {v_1, . . . , v_h} is a Hilbert basis of S if for all β ∈ S

β = Σ_{v∈H} c_v v    (5.18)

where c_v ∈ N. It is known that such a set H exists and is unique. The number of elements in H in general differs from the dimension of the range of A_T as a vector subspace. For details about the properties of Hilbert bases and the algorithms for their computation, see Sturmfels (1993) and Kreuzer & Robbiano (2003). The application of the theory of Hilbert bases to the geometrical description of toric models is currently in progress.
5.4 Algebraic representation of the sufficient statistics
Let us now introduce the algebraic representation of the sufficient statistic for toric models. Let T = T (X ) be the image of the sufficient statistic T , i.e. the set of all values of the sufficient statistic. As noted in Section 5.2, the set T is a finite subset of Ns .
Proposition 5.15 Let T = (T_1(X), . . . , T_s(X)) ⊆ N^s be the range of the sufficient statistic. As T is a finite subset of Q^s, there exists a list of monomials

{t^α : α ∈ L}    (5.19)

such that all functions φ : T → R can be expressed as

φ(t) = Σ_{α∈L} φ_α t^α .    (5.20)

For the proof, see e.g. Pistone et al. (2001a). Implemented algorithms can be used to derive such a list. The model M has the polynomial representation

p_x = φ(T_1(x), . . . , T_s(x)) = Σ_{α∈L} φ_α T^α(x) .    (5.21)
The functions T^α, with α ∈ L, are a sufficient statistic equivalent to T_1, . . . , T_s. If the φ_α are unrestricted except for the normalizing constraint, the model contains all the probabilities with sufficient statistic T. The advantage of this representation is the linearity of Equation (5.21). This remark prompts the following definition.

Definition 5.16 The set T_1, . . . , T_s is a basis for the statistical model M if

P*_N(x_1, . . . , x_N) = φ_N( Σ_i T_j(x_i), j = 1, . . . , s )    (5.22)
for all N ≥ 2. The T α in Equation (5.21) are a basis for all models with sufficient statistic T.
5.5 Minimum-variance estimator
In this section, we define an unbiased minimum-variance estimator of the cell counts. This estimator is defined as the mean of the hypergeometric distribution, whose parameters depend on the observed value of the sufficient statistic TN . Moreover we show that, under some extra conditions, this estimate is equal to the ML estimate of the vector of parameters N p. In the classical log-linear theory, where the attention is restricted to the positive case, the most commonly used theorem of existence and uniqueness
of the ML estimate is based on the vector space representation of log-linear models, see Chapter 1 for an introduction and Haberman (1974), Theorem 2.2, for a detailed proof. From this point of view, an unbiased minimum-variance estimator has two advantages. It is well defined also in cases where the ML estimate does not exist or its computation is difficult. Moreover, in some applications it could be desirable to use an unbiased estimator instead of a biased one. The difficulty of computing the minimum-variance estimate will be avoided with an algebra-based algorithm presented in the next section. From Definition 5.16 it follows that the basis is the algebraic counterpart of the complete sufficient statistic.

Definition 5.17 Define

F̂_x = E_φ[ F_x(X_1, . . . , X_N) | T_N^α, α ∈ L ] .    (5.23)
F̂_x is the minimum-variance unbiased estimator of the expected cell count N p_x. In fact, E_φ[F̂_x] = E_φ[F_x] = N p_x, and F̂_x has minimum variance as it is a function of the sufficient statistic. Moreover, if the T_N^α are a basis, see Definition 5.16, then the estimator F̂_x is a linear combination of the T_N^α.
Theorem 5.18 If T_1, . . . , T_s are a vector space basis of the space of the sufficient statistic, the estimator F̂_x is the unique unbiased estimator of the count N p_x which is a function of T_1, . . . , T_s.

Proof.
Following the technique used in Lehmann (1983), let F̃_x be another estimator of N p_x which is a function of T_1, . . . , T_s. We have

F̂_x = Σ_α c_α T^α   and   F̃_x = Σ_α d_α T^α .

Thus, we can write

0 = E_φ[F̂_x − F̃_x] = Σ_α (c_α − d_α) E_φ[T^α]
for all φ. As the φ_α are unrestricted, the previous equation leads to

0 = E_φ[F̂_x − F̃_x] = Σ_α φ_α E_0[(F̂_x − F̃_x) T^α] ,

where E_0 is the mean with respect to a dominating measure. As the E_0[(F̂_x − F̃_x) T^α] must all be zero, we conclude that F̃_x = F̂_x.
Remark 5.19 Note that the result above can be easily proved if we restrict to the model M>0, as in the exponential model the sufficient statistic is complete and the result in Theorem 5.18 is a consequence of completeness. The estimator F̂_x exists even when the ML estimate is not well defined. The link between this estimator and the ML estimator is given by the following result.

Proposition 5.20 If the ML estimate exists and the vector F̂/N belongs to the model, then the ML estimate N p̂ is equal to F̂.

Proof.
A classical result from Birch (1963), generalized by Bishop et al.
(1975), states that there is a unique configuration of the cell probabilities which lies in the model and verifies the constraints of the sufficient statistic, and this configuration is the ML estimate of the cell probabilities.
5.6 Algebraic method
From the above theory, together with the algorithm presented in Chapter 2, we can derive a Monte Carlo method for the computation of the estimates F̂_x for all x ∈ X. As in the previous chapters, let Y_t be the set of all samples with fixed value t of the sufficient statistic T_N, and let F_t be the set of all frequency tables obtained from samples with value t of the sufficient statistic T_N. As P*_{N,φ} = φ_N(T_N), the distribution of the sample given {T_N = t} is uniform on Y_t. The distribution of F = (F_x)_{x∈X} is the image of the uniform distribution, i.e. it is the hypergeometric distribution H_t on F_t, see Chapter 2 for details. Given t, we define a Markov chain with sample space F_t and with stationary distribution H_t. The Markov chain is a random walk on F_t with moves in an
appropriate Markov basis M = {m_1, . . . , m_L}, computed starting from the matrix representation of the sufficient statistic T. Thus, as we are interested in the mean value of the cell counts, the Markov chain is constructed following a Metropolis-Hastings-type algorithm and it is performed as follows:

(a) let S = 0;
(b) at time 0 the chain is in f;
(c) choose a move m uniformly in the Markov basis and ε = ±1 with probability 1/2 each, independently of m;
(d) if f + εm ≥ 0, move the chain from f to f + εm with probability

min{ H_t(f + εm)/H_t(f), 1 } ;    (5.24)

in all other cases, stay at f;
(e) let S = S + f;
(f) repeat steps (c)-(e) B times;
(g) the (approximate) mean is S/B.

The convergence of this Markov chain has been proved in Chapter 2. As in many cases we are interested in the computation of the estimates for two-way or multi-way contingency tables, we can use the results for the computation of the Markov bases presented in Chapters 3 and 4.
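Steps (a)-(g) can be sketched as follows in Python (illustrative only; the thesis implementation is the Matlab program of Appendix A). The hypergeometric weights H_t(f) ∝ ∏_x 1/f_x! enter through the factorial ratio in the acceptance probability. The toy example is a 2 × 2 table under independence, whose single basic move is ±(1, −1, −1, 1):

```python
import math
import random

def mh_mean_counts(f0, moves, B, seed=0):
    """Random walk on the tables with the observed value of the sufficient
    statistic; stationary law H_t(f) proportional to prod_x 1/f_x!."""
    rng = random.Random(seed)
    f = list(f0)
    S = [0.0] * len(f)                        # (a) running sum of the tables
    for _ in range(B):                        # (f) repeat B times
        m = rng.choice(moves)                 # (c) pick a move...
        eps = rng.choice((-1, 1))             # ...and a sign, independently
        g = [fi + eps * mi for fi, mi in zip(f, m)]
        if min(g) >= 0:                       # (d) stay inside the reference set
            # H_t(g)/H_t(f) = prod_x f_x! / g_x!, computed via log-gammas
            log_ratio = sum(math.lgamma(fi + 1) - math.lgamma(gi + 1)
                            for fi, gi in zip(f, g))
            if rng.random() < min(math.exp(log_ratio), 1.0):
                f = g
        S = [si + fi for si, fi in zip(S, f)] # (e) accumulate the current table
    return [si / B for si in S]               # (g) approximate mean

f0 = [3, 1, 1, 3]            # 2x2 table, row sums (4, 4), column sums (4, 4)
moves = [(1, -1, -1, 1)]     # Markov basis of the independence model
est = mh_mean_counts(f0, moves, B=20000)  # est[0] should be close to 4*4/8 = 2
```

The estimated means respect the margins exactly, since every table visited by the chain does.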
Appendix A

Programs

In this Appendix we show the programs, both symbolic and numerical, needed for the implementation of the MCMC algorithm for the goodness of fit tests for log-linear models presented in Chapters 3 and 4. We will use the data and the quasi-symmetry model for 4 × 4 tables in Example 3.8, Chapter 3, as main example. The first step of the algorithm is the computation of the relevant Markov basis for the model. In order to do that, we use CoCoA: we set an appropriate polynomial ring, we write the matrix representation of the sufficient statistic and we use the CoCoA function Toric. We use a polynomial ring with an indeterminate for each sample point, i.e., in our example, with 16 indeterminates. Thus, we write:

Use A::=Q[x[1..16]];
T:=Mat[[1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1],
  [1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0],
  [0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0],
  [0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0],
  [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1],
  [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
  [0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0],
  [0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0],
  [0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0],
  [0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0],
  [0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0]];
I:=Toric(T);
G:=GBasis(I);
G;
The first 4 rows of the matrix represent the row sums, the second 4 rows represent the column sums, the next 4 rows represent the diagonal cells and the remaining rows represent the sums of the symmetric off-diagonal pairs. Two remarks are useful for the implementation of this part of the algorithm. First, we use linear indices for the sample points, i.e. we use the indeterminates x_1, . . . , x_16 instead of the two-way representation x_{1,1}, . . . , x_{4,4}. This choice does not present any problems in the multi-way case, as shown by the examples in Chapter 4. Second, if we want to work with a minimal set of generators instead of a Gröbner basis of the ideal, it is sufficient to replace the command G:=GBasis(I) with the command G:=MinGens(I). Now, we have a Gröbner basis of the relevant toric ideal. In our example, the CoCoA output is

[x[2]x[8]x[13] - x[4]x[5]x[14], x[2]x[7]x[9] - x[3]x[5]x[10],
 -x[7]x[12]x[14] + x[8]x[10]x[15], x[3]x[12]x[13] - x[4]x[9]x[15],
 x[2]x[7]x[12]x[13] - x[4]x[5]x[10]x[15],
 x[3]x[8]x[10]x[13] - x[4]x[7]x[9]x[14],
 -x[3]x[5]x[12]x[14] + x[2]x[8]x[9]x[15]]
Now, we use these binomials for the definition of the Markov basis. We transform each binomial into a vector, as described in Chapter 2. We can use the following simple CoCoA function:

Define GensToMat(G);
  -- input: a list of binomials;
  -- output: a matrix containing the vector representation
  -- of the binomials;
  L:=Len(G);
  M:=NewMat(L,NumIndets(),0);
  For I:=1 To L Do
    ListMon:=Monomials(G[I]);
    L1:=Log(ListMon[1]);
    L2:=Log(ListMon[2]);
    M[I]:=L1-L2;
  End;
  Return M;
EndDefine;
Applying this function to the Gröbner basis computed above, we obtain the following matrix
Mat[[0, 1, 0, -1, -1, 0, 0, 1, 0, 0, 0, 0, 1, -1, 0, 0],
 [0, 1, -1, 0, -1, 0, 1, 0, 1, -1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, -1, 0, -1, 0, 1, 0, 1, -1, 0],
 [0, 0, 1, -1, 0, 0, 0, 0, -1, 0, 0, 1, 1, 0, -1, 0],
 [0, 1, 0, -1, -1, 0, 1, 0, 0, -1, 0, 1, 1, 0, -1, 0],
 [0, 0, 1, -1, 0, 0, -1, 1, -1, 1, 0, 0, 1, -1, 0, 0],
 [0, -1, 1, 0, 1, 0, 0, -1, -1, 0, 0, 1, 0, 1, -1, 0]]
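As a sanity check, every row of this matrix must lie in the kernel of the matrix T of the sufficient statistic, so that each move leaves all the conditioning totals unchanged. A short Python verification (illustrative; T is transcribed from the CoCoA session, with its third row taken as the indicator of the third row sum):

```python
# Matrix of the sufficient statistic for the quasi-symmetry model (4 x 4
# tables, cells numbered row-wise 1..16): row sums, column sums, diagonal
# cells and sums of the symmetric off-diagonal pairs.
T = [
    [1,1,1,1, 0,0,0,0, 0,0,0,0, 0,0,0,0],
    [0,0,0,0, 1,1,1,1, 0,0,0,0, 0,0,0,0],
    [0,0,0,0, 0,0,0,0, 1,1,1,1, 0,0,0,0],
    [0,0,0,0, 0,0,0,0, 0,0,0,0, 1,1,1,1],
    [1,0,0,0, 1,0,0,0, 1,0,0,0, 1,0,0,0],
    [0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0],
    [0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0],
    [0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1],
    [1,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0],
    [0,0,0,0, 0,1,0,0, 0,0,0,0, 0,0,0,0],
    [0,0,0,0, 0,0,0,0, 0,0,1,0, 0,0,0,0],
    [0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,1],
    [0,1,0,0, 1,0,0,0, 0,0,0,0, 0,0,0,0],
    [0,0,1,0, 0,0,0,0, 1,0,0,0, 0,0,0,0],
    [0,0,0,1, 0,0,0,0, 0,0,0,0, 1,0,0,0],
    [0,0,0,0, 0,0,1,0, 0,1,0,0, 0,0,0,0],
    [0,0,0,0, 0,0,0,1, 0,0,0,0, 0,1,0,0],
    [0,0,0,0, 0,0,0,0, 0,0,0,1, 0,0,1,0],
]
moves = [
    [0,1,0,-1, -1,0,0,1, 0,0,0,0, 1,-1,0,0],
    [0,1,-1,0, -1,0,1,0, 1,-1,0,0, 0,0,0,0],
    [0,0,0,0, 0,0,1,-1, 0,-1,0,1, 0,1,-1,0],
    [0,0,1,-1, 0,0,0,0, -1,0,0,1, 1,0,-1,0],
    [0,1,0,-1, -1,0,1,0, 0,-1,0,1, 1,0,-1,0],
    [0,0,1,-1, 0,0,-1,1, -1,1,0,0, 1,-1,0,0],
    [0,-1,1,0, 1,0,0,-1, -1,0,0,1, 0,1,-1,0],
]
for m in moves:
    image = [sum(t * x for t, x in zip(row, m)) for row in T]
    assert all(v == 0 for v in image)  # each move preserves the statistic
```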
Here each row represents a move. This matrix is a concise representation of the 7 moves found in Chapter 3, Section 3.2.4. Now, we can use this matrix for the second step of the algorithm, i.e. the simulation step. For this step we use the software Matlab. After the CoCoA computation, we need an easy ASCII manipulation in order to obtain a suitable input format for Matlab.

function simul=simul(tab,mle,m,N,bis,step);
%tab = the observed table;
%mle = maximum likelihood estimate;
%m = the matrix representation of the moves;
%N = number of MCMC replicates;
%bis = number of burn-in steps;
%step = reduction of correlation step;
nc=length(tab);
p=0;
c=zeros(N,1);
chisqref=chisq(tab,mle);
mm=-m;
m=[m; mm];
[nmoves,ncm]=size(m);
numit=N*step+bis;
for i=1:numit
  tabp=zeros(1,nc);
  r=ceil(nmoves*rand(1));
  tabp=tab+m(r,:);
  if tabp>=zeros(1,nc)
    mhr=1;
    for j=1:nc
      if m(r,j)~=0
        mhr=mhr*prod(1:tab(j))/prod(1:tabp(j));
      end;
    end;
    alpha=rand(1);
    if mhr>=alpha
      tab=tabp;
    end;
  end;
  if (rem(i,step)==0) & (i>bis)
    c((i-bis)/step)=chisq(tab,mle);
    if c((i-bis)/step)>=chisqref
      p=p+1;
    end;
  end;
end;
p=p/N;
simul=p;
The value of the Pearson statistic is computed through the following function:

function chisq=chisq(tab,mle);
c=0;
l=length(tab);
for i=1:l
  if mle(i)~=0
    c=c+(tab(i)-mle(i))*(tab(i)-mle(i))/mle(i);
  end;
end;
chisq=c;
Some remarks about the parameters of the simulation Matlab function:
• the maximum likelihood estimate of the cell counts mle is computed with statistical software, for example SAS, see SAS/STAT User's Guide (2000), unless a closed form is available;
• the number of Monte Carlo replicates B is chosen according to the precision we want to reach. In our examples we have used B = 50,000, and the motivation of our choice is discussed in Example 3.6, Chapter 3;
• the number of burn-in steps bis and the correlation-reduction step step should be chosen for each particular problem, based on inspection of the empirical distribution of the random variables and of the traces of the auto-correlation function, respectively.
Finally, with a slight modification of the above program, one can obtain the approximate distribution of the test statistic.
Bibliography

Agresti, A. (1992). Modelling patterns of agreement and disagreement. Statistical Methods in Medical Research 1, 201–218.
Agresti, A. (1996). An Introduction to Categorical Data Analysis. New York: Wiley.
Agresti, A. (2001). Exact inference for categorical data: Recent advances and continuing controversies. Statist. Med. 20, 2709–2722.
Agresti, A. (2002). Categorical Data Analysis. New York: Wiley, 2nd ed.
Agresti, A., Mehta, C. R. & Patel, N. R. (1990). Exact inference for contingency tables with ordered categories. J. Amer. Statist. Assoc. 85, 453–458.
Baglivo, J., Olivier, D. & Pagano, M. (1992). Methods for exact goodness-of-fit tests. J. Amer. Statist. Assoc. 87, 464–469.
Becker, M. P. (1990). Quasisymmetric models for the analysis of square contingency tables. J. R. Statist. Soc. 52, 369–378.
Bergsma, W. P. & Rudas, T. (2002). Marginal models for categorical data. Ann. Statist. 30, 140–159.
Bigatti, A., La Scala, R. & Robbiano, L. (1999). Computing toric ideals. J. Symb. Comput. 27, 351–365.
Bigatti, A. & Robbiano, L. (2001). Toric ideals. Mat. Contemp. 21, 1–25.
Birch, M. W. (1963). Maximum likelihood in three-way contingency tables. J. R. Statist. Soc. 25, 220–233.
Bishop, Y. M., Fienberg, S. & Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge: MIT Press.
Booth, J. G. & Butler, R. W. (1999). An importance sampling algorithm for exact conditional tests in log-linear models. Biometrika 86, 321–331.
Capani, A., Niesi, G. & Robbiano, L. (2000). CoCoA, a system for doing Computations in Commutative Algebra. Available via anonymous ftp from cocoa.dima.unige.it, 4th ed.
CDER (2000). Developing Medical Imaging Drugs and Biological Products. Guidance for Industry. U.S. Department of Health and Human Services, Food and Drug Administration.
Chib, S. & Greenberg, E. (1995). Understanding the Metropolis–Hastings algorithm. Amer. Statist. 49, 327–335.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37–46.
Collombier, D. (1980). Recherches sur l'Analyse des Tables de Contingence. PhD thesis, Université Paul Sabatier de Toulouse.
Cox, D., Little, J. & O'Shea, D. (1992). Ideals, Varieties, and Algorithms. New York: Springer Verlag.
Darroch, J. N. & McCloud, P. I. (1986). Category distinguishability and observer agreement. Australian Journal of Statistics 28, 371–388.
De Loera, J. A., Hemmecke, R., Tauzer, J. & Yoshida, R. (2003). Effective lattice point counting in rational convex polytopes. Preprint.
Diaconis, P. & Sturmfels, B. (1998). Algebraic algorithms for sampling from conditional distributions. Ann. Statist. 26, 363–397.
Dillon, W. R. & Mulani, N. (1984). A probabilistic latent class model for assessing inter-judge reliability. Multivariate Behavioral Research 19, 438–458.
Dobson, A. J. (1990). An Introduction to Generalized Linear Models. London: Chapman & Hall.
Donner, A. & Klar, N. (1996). The statistical analysis of kappa statistics in multiple samples. Journal of Clinical Epidemiology 49, 1053–1058.
Donner, A., Shoukri, M. M., Klar, N. & Bartfay, E. (2000). Testing the equality of two dependent kappa statistics. Statist. Med. 19, 373–387.

Duffy, D. (2002). The gllm package. Available from http://cran.r-project.org, 0th ed.

Fienberg, S. (1980). The Analysis of Cross-Classified Categorical Data. Cambridge: MIT Press.

Fingleton, B. (1984). Models of Category Counts. Cambridge: Cambridge University Press.

Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. New York: Wiley, 2nd ed.

Garcia, L. D., Stillman, M. & Sturmfels, B. (2003). Algebraic geometry of Bayesian networks. Preprint.

Geiger, D., Meek, C. & Sturmfels, B. (2002). On the toric algebra of graphical models. Preprint.

Goodman, L. A. (1979a). Multiplicative models for square contingency tables with ordered categories. Biometrika 66, 413–418.

Goodman, L. A. (1979b). Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74, 537–552.

Goodman, L. A. (1985). The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetry models for contingency tables with or without missing entries. Ann. Statist. 13, 10–59.

Guggenmoos-Holzmann, I. & Vonk, R. (1998). Kappa-like indices of agreement viewed from a latent class perspective. Statist. Med. 17, 797–812.

Haberman, S. J. (1974). The Analysis of Frequency Data. Chicago and London: The University of Chicago Press.

Haberman, S. J. (1978). Analysis of Qualitative Data. New York: Academic Press.
Hanley, J. A. (1998). Receiver Operating Characteristic (ROC) curves. In Encyclopedia of Biostatistics, P. Armitage & T. Colton, eds. Wiley, pp. 3738–3745.

Hirji, K. F. & Johnson, T. D. (1996). A comparison of algorithms for exact analysis of unordered 2 × k contingency tables. Comput. Statist. Data Anal. 21, 419–429.

Klar, N., Lipsitz, S. R. & Ibrahim, J. G. (2000). An estimating equation approach for modelling kappa. Biom. J. 42, 45–58.

Kreuzer, M. & Robbiano, L. (2000). Computational Commutative Algebra 1. New York: Springer.

Kreuzer, M. & Robbiano, L. (2003). Computational Commutative Algebra 2. New York: Springer. In preparation.

Landis, R. J. & Koch, G. G. (1975). A review of statistical methods in the analysis of data arising from observer reliability studies, Parts I and II. Statist. Neerlandica 29, 101–123, 151–161.

Lehmann, E. L. (1983). Theory of Point Estimation. New York: John Wiley & Sons.

Lehmann, E. L. (1987). Testing Statistical Hypotheses. New York: John Wiley & Sons.

Margalef Roig, J. & Outerelo Dominguez, E. (1992). Differential Topology. No. 173 in North-Holland Mathematics Studies. Amsterdam: North-Holland Publishing Co.

McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. London: Chapman & Hall, 2nd ed.

Mehta, C. R. & Patel, N. R. (1983). A network algorithm for performing Fisher's exact test in r × c contingency tables. J. Amer. Statist. Assoc. 78, 427–434.

Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (1988). Importance sampling for estimating exact probabilities in permutational inference. J. Amer. Statist. Assoc. 83, 999–1005.
Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (1991). Exact stratified linear rank tests for binary data. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, E. M. Keramidas, ed. American Mathematical Society, pp. 200–207.

Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (2000). Efficient Monte Carlo methods for conditional logistic regression. J. Amer. Statist. Assoc. 95, 99–108.

Pistone, G., Riccomagno, E. & Wynn, H. P. (2001a). Algebraic Statistics: Computational Commutative Algebra in Statistics. Boca Raton: Chapman & Hall/CRC.

Pistone, G., Riccomagno, E. & Wynn, H. P. (2001b). Computational commutative algebra in discrete statistics. In Algebraic Methods in Statistics and Probability, M. A. G. Viana & D. S. P. Richards, eds., vol. 287 of Contemporary Mathematics. American Mathematical Society, pp. 267–282.

Rudas, T. & Bergsma, W. P. (2002). On generalised symmetry. Tech. rep., Paul Sabatier University, Toulouse, France. Preprint.

SAS/STAT User's Guide (2000). Version 8. Cary, NC: SAS Institute, 1st ed.

Smith, P. W. F., Forster, J. J. & McDonald, J. W. (1996). Monte Carlo exact tests for square contingency tables. J. R. Statist. Soc. 159, 309–321.

Strawderman, R. L. & Wells, M. T. (1998). Approximately exact inference for the common odds ratio in several 2 × 2 tables. J. Amer. Statist. Assoc. 93, 1294–1307.

Sturmfels, B. (1993). Algorithms in Invariant Theory. Texts and Monographs in Symbolic Computation. New York: Springer.

Sturmfels, B. (1996). Gröbner Bases and Convex Polytopes, vol. 8 of University Lecture Series. Providence, RI: American Mathematical Society.

Tanner, M. A. & Young, M. A. (1985). Modelling agreement among raters. J. Amer. Statist. Assoc. 80, 175–180.
Williamson, J. M., Manatunga, A. K. & Lipsitz, S. R. (1994). Modeling kappa for measuring dependent categorical agreement data. Biostatistics 1, 191–202.