Monografías del Seminario Matemático García de Galdeano 33, 301–308 (2006)

PHI-DIVERGENCE TEST STATISTICS IN MULTINOMIAL SAMPLING FOR HIERARCHICAL SEQUENCES OF LOGLINEAR MODELS WITH LINEAR CONSTRAINTS

Nirian Martín and Leandro Pardo

Abstract. We consider nested sequences of hierarchical loglinear models whose expected frequencies are subject to linear constraints, and we study the problem of finding the model in the nested sequence that best explains the given data. This requires a method for estimating the parameters of the loglinear models and a procedure for choosing the best model among those considered in the nested sequence under study. Both problems are solved using φ-divergence measures: we estimate the unknown parameters with the minimum φ-divergence estimator (Martín and Pardo [8]), which can be considered a generalization of the maximum likelihood estimator (Haber and Brown [5]), and, for testing two nested loglinear models, we consider a φ-divergence test statistic (Martín [7]) that generalizes both the likelihood ratio test and the chi-square test statistic.

Keywords: Loglinear models, multinomial sampling, maximum likelihood estimator, minimum phi-divergence estimator, phi-divergence test statistics.

AMS classification: 62H15, 62H17.

§1. Introduction

Loglinear models define a multiplicative structure on the expected cell frequencies of a contingency table. We shall assume that we have k cells (if k = I × J we get a two-way contingency table), which we denote by $C_1,\dots,C_k$. Given a random sample $Y_1, Y_2,\dots,Y_n$ with realizations from $\mathcal{Y} = \{C_1,\dots,C_k\}$, we denote by $\hat{p} = (\hat{p}_1,\dots,\hat{p}_k)^T$ the vector of relative frequencies, with

$$\hat{p}_j = \frac{N_j}{n} \quad\text{and}\quad N_j = \sum_{i=1}^{n} I_{\{C_j\}}(Y_i), \qquad j = 1,\dots,k. \tag{1}$$

Assuming multinomial sampling and denoting by $p_j(\theta_0) = \Pr(C_j)$, $j = 1,\dots,k$, the statistic $(N_1,\dots,N_k)$ is obviously sufficient for the statistical model under consideration and is multinomially distributed with parameters n and $p(\theta_0) = (p_1(\theta_0),\dots,p_k(\theta_0))^T$. We shall denote

$$m_j(\theta_0) \equiv E(N_j) = n\,p_j(\theta_0), \qquad j = 1,\dots,k, \tag{2}$$

and $m(\theta_0) = (m_1(\theta_0),\dots,m_k(\theta_0))^T$.
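To fix ideas, the following short numerical sketch (our illustration, not part of the original paper) simulates multinomial sampling over k cells and computes the quantities in (1) and (2); the cell probabilities used here are arbitrary.

```python
# Illustrative sketch: simulate multinomial sampling over k cells and compute
# the counts N_j and relative frequencies p_hat_j of (1), and the expected
# frequencies m_j of (2). The cell probabilities below are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
p_true = np.array([0.1, 0.2, 0.3, 0.4])   # p(theta_0) for k = 4 cells

N = rng.multinomial(n, p_true)            # sufficient statistic (N_1, ..., N_k)
p_hat = N / n                             # equation (1)
m = n * p_true                            # equation (2)
```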


Given a k × (t + 1) matrix X with rank(X) = t + 1, the set

$$\mathcal{C}(X) = \{\log m(\theta) \in \mathbb{R}^k : \log m(\theta) = X\theta,\ \theta \in \mathbb{R}^{t+1}\} \tag{3}$$

represents the class of loglinear models associated with X. We suppose, in the following, that $J = (1,\dots,1)^T \in \mathcal{C}(X)$. Taking into account (2), the parameter space is defined by

$$\Theta_0 = \{\theta \in \mathbb{R}^{t+1} : \log m(\theta) = X\theta \text{ and } J^T m(\theta) = n\}.$$

Now, in addition to the previous model, we shall assume that we have $r - 1 < t$ linear constraints defined by

$$C^T m(\theta) = d^*, \tag{4}$$

where C and $d^*$ are k × (r − 1) and (r − 1) × 1 matrices, respectively. If we consider the linear constraint $J^T m(\theta) = n$, associated with the multinomial sampling, we can write the parameter space for this new model as

$$\Theta^* = \{\theta \in \mathbb{R}^{t+1} : \log m(\theta) = X\theta \text{ and } L^T m(\theta) = d\},$$

where $L = (J, C)$, $d = (n, (d^*)^T)^T$ and $\operatorname{rank}(L) = \operatorname{rank}(L^T, d) = r$.

The problem that has motivated our research involves a nested sequence of hypotheses

$$H_l : p = p(\theta), \quad \theta \in \Theta_0^{(l)}, \qquad l = 1,\dots,m, \quad m \le t < k-1, \tag{5}$$

where $\Theta_0^{(1)} \supset \Theta_0^{(2)} \supset \cdots \supset \Theta_0^{(m)}$, with $\dim(\Theta_0^{(l)}) = t_l + 1$ and $\operatorname{rank}(L_l) = r_l$, $l = 1,\dots,m$, such that

$$t_{l+1} \le t_l \quad\text{and}\quad r_{l+1} \ge r_l, \qquad l = 1,\dots,m-1, \tag{6}$$

where at least one of the two inequalities is strict. In this framework, there is an integer $m^*$ ($1 \le m^* \le m$) for which $H_{m^*}$ is true but $H_{m^*+1}$ is not. A common strategy for making inference on $m^*$ (e.g., Cressie and Read [2, p. 42]) is to test successively

$$H_{\text{Null}} : H_{l+1} \quad\text{against}\quad H_{\text{Alt}} : H_l, \qquad l = 1,\dots,m-1, \tag{7}$$

where we continue to test as long as the null hypothesis is accepted, and we infer $m^*$ to be the first l for which $H_{l+1}$ is rejected as a null hypothesis. The full operating characteristics of this sequence of tests of nested hypotheses are not known. Our goal in this paper is to present φ-divergence test statistics for testing a sequence of nested hypotheses as given in (5).
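The stepwise strategy (7) translates directly into a loop. The sketch below is our own illustration: `test_statistic` and `critical_value` are hypothetical placeholders for the φ-divergence machinery developed in Section 3.

```python
# Sketch of the sequential strategy (7): keep testing H_{l+1} against H_l
# while the null is accepted; infer m* as the first l at which H_{l+1} is
# rejected. `test_statistic` and `critical_value` are placeholder callables.
def infer_m_star(m, test_statistic, critical_value):
    for l in range(1, m):               # l = 1, ..., m-1
        if test_statistic(l) > critical_value(l):
            return l                    # H_{l+1} rejected: infer m* = l
    return m                            # no rejection: keep the smallest model
```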

§2. Minimum φ-divergence estimator

Since the parameter values in $\{\Theta_0^{(l)} : l = 1,\dots,m\}$ are generally unknown, most tests require their estimation. In this context, the maximum likelihood estimator under the linear constraints given in (4) is defined by

$$\hat{\theta}^{(r)} = \arg\max_{\theta \in \Theta^*} h^T \theta,$$


where $h^T = (n^*)^T X$ and $n^*$ is an observation from $(N_1,\dots,N_k)$. It is a simple exercise to establish that, equivalently, $\hat{\theta}^{(r)}$ can be defined by

$$\hat{\theta}^{(r)} = \arg\min_{\theta \in \Theta^*} D_{\text{Kullback}}(\hat{p}, p(\theta)), \tag{8}$$

where $D_{\text{Kullback}}(\hat{p}, p(\theta))$ is the Kullback-Leibler divergence (see Kullback [6]) between the probability vectors $\hat{p}$ and $p(\theta)$,

$$D_{\text{Kullback}}(\hat{p}, p(\theta)) = \sum_{j=1}^{k} \hat{p}_j \log \frac{\hat{p}_j}{p_j(\theta)}.$$

The definition (8) hints at a much more general inference framework based on divergence measures, which was investigated by Martín and Pardo [8]. In the next several paragraphs we give the essential details of that framework for estimation and hypothesis testing. Consider the φ-divergence defined by Csiszár [3] and Ali and Silvey [1],

$$D_\phi(p, q) \equiv \sum_{j=1}^{k} q_j\,\phi\!\left(\frac{p_j}{q_j}\right), \qquad \phi \in \Phi^*, \tag{9}$$

where $\Phi^*$ is the class of all convex functions $\phi(x)$, $x > 0$, such that at $x = 1$, $\phi(1) = 0$, $\phi''(1) > 0$, and at $x = 0$, $0\,\phi(0/0) = 0$ and $0\,\phi(p/0) = p \lim_{u\to\infty} \phi(u)/u$. For every $\phi \in \Phi^*$ that is differentiable at $x = 1$, the function $\psi(x) \equiv \phi(x) - \phi'(1)(x-1)$ also belongs to $\Phi^*$. Then $D_\psi(p,q) = D_\phi(p,q)$, and ψ has the additional property that $\psi'(1) = 0$. Because the two divergence measures are equivalent, we can consider the set $\Phi^*$ to be equivalent to the set $\Phi \equiv \Phi^* \cap \{\phi : \phi'(1) = 0\}$. In what follows we give our theoretical results for $\phi \in \Phi$, but we often apply them to choices of functions in $\Phi^*$. For more details about φ-divergences see Pardo [10].

Based on (8) and (9), Martín and Pardo [8] introduced the minimum φ-divergence estimator $\hat{\theta}^{(r)}_\phi$ in loglinear models with linear constraints and multinomial sampling, given by

$$\hat{\theta}^{(r)}_\phi = \arg\min_{\theta \in \Theta^*} D_\phi(\hat{p}, p(\theta)). \tag{10}$$

In the next section we shall use this estimator to define a family of test statistics for testing the nested hypotheses $\{H_l : l = 1,\dots,m\}$ given in (5).
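As a concrete numerical illustration of (10), the sketch below minimizes $D_\phi(\hat{p}, p(\theta))$ under the constraint $L^T m(\theta) = d$, with $m(\theta) = \exp(X\theta)$. This is our own minimal sketch using scipy, not the authors' implementation; the function names are ours, and with $\phi(x) = x\log x - x + 1$ it reduces to the constrained maximum likelihood estimator (8).

```python
# Minimal sketch of the minimum phi-divergence estimator (10), assuming a
# concrete design matrix X (k x (t+1)), constraint matrix L (k x r) and
# vector d (r,), with m(theta) = exp(X theta). Illustrative only.
import numpy as np
from scipy.optimize import minimize

def phi_kl(x):
    # phi(x) = x log x - x + 1, with continuous extension phi(0) = 1;
    # this choice recovers the Kullback-Leibler divergence, hence the MLE (8).
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    pos = x > 0
    out[pos] = x[pos] * np.log(x[pos]) - x[pos] + 1.0
    return out

def phi_pearson(x):
    # phi(x) = (1/2)(x - 1)^2 leads to Pearson-type divergences.
    return 0.5 * (np.asarray(x, dtype=float) - 1.0) ** 2

def phi_divergence(p, q, phi):
    # Equation (9): D_phi(p, q) = sum_j q_j * phi(p_j / q_j).
    return float(np.sum(q * phi(p / q)))

def min_phi_estimator(p_hat, X, L, d, phi=phi_kl):
    # Equation (10): minimize D_phi(p_hat, p(theta)) over
    # Theta* = {theta : L^T m(theta) = d}, with p(theta) = m(theta) / n.
    n = d[0]  # the first constraint, J^T m(theta) = n, fixes the total count
    objective = lambda theta: phi_divergence(p_hat, np.exp(X @ theta) / n, phi)
    constraint = {"type": "eq",
                  "fun": lambda theta: L.T @ np.exp(X @ theta) - d}
    # SLSQP (scipy's default for equality constraints) tolerates the
    # infeasible starting point theta = 0.
    res = minimize(objective, np.zeros(X.shape[1]), constraints=[constraint])
    return res.x
```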

§3. Phi-divergence test statistics

In this section, for testing the nested hypotheses $\{H_l : l = 1,\dots,m\}$ given in (5), we test

$$H_{\text{Null}} : H_{l+1} \quad\text{against}\quad H_{\text{Alt}} : H_l, \qquad l = 1,\dots,m-1,$$

using the family of test statistics

$$T^{(l)}_{\phi_1,\phi_2} = \frac{2n}{\phi_1''(1)}\, D_{\phi_1}\!\Big( p\big(\hat{\theta}^{(r),\phi_2}_l\big),\ p\big(\hat{\theta}^{(r),\phi_2}_{l+1}\big) \Big), \tag{11}$$


where $\hat{\theta}^{(r),\phi_2}_l$ and $\hat{\theta}^{(r),\phi_2}_{l+1}$ are defined by (10). When $T^{(l)}_{\phi_1,\phi_2} > c$, we reject $H_{\text{Null}}$ in (7), where c is specified so that the size of the test is α:

$$\Pr\big( T^{(l)}_{\phi_1,\phi_2} > c \,\big|\, H_{l+1} \big) = \alpha, \qquad \alpha \in (0,1). \tag{12}$$

In the next theorem we establish that, under $H_{\text{Null}} : H_{l+1}$, the test statistic $T^{(l)}_{\phi_1,\phi_2}$ converges in distribution to a chi-squared distribution with $t_l - t_{l+1} - r_l + r_{l+1}$ degrees of freedom ($\chi^2_{t_l-t_{l+1}-r_l+r_{l+1}}$), $l = 1,\dots,m-1$. Thus c can be chosen as the $(1-\alpha)$-th quantile of a $\chi^2_{t_l-t_{l+1}-r_l+r_{l+1}}$ distribution,

$$c = \chi^2_{t_l-t_{l+1}-r_l+r_{l+1}}(1-\alpha), \tag{13}$$

where $\Pr(\chi^2_f \le \chi^2_f(p)) = p$. Notice that when $\phi_1(x) = \phi_2(x) = x\log x - x + 1$ we obtain the usual likelihood-ratio test, and that when $\phi_1(x) = \frac{1}{2}(x-1)^2$ and $\phi_2(x) = x\log x - x + 1$ we obtain the Pearson test statistic (e.g., Haber and Brown [5]).
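In practice, the decision rule (12)-(13) is a one-line computation; a sketch (ours, using scipy, with the degrees of freedom justified by Theorem 1 below) is:

```python
# Sketch of the test decision (12)-(13): reject H_{l+1} when the statistic
# exceeds the (1 - alpha) quantile of the chi-squared reference distribution
# with t_l - t_{l+1} - r_l + r_{l+1} degrees of freedom (Theorem 1 below).
from scipy.stats import chi2

def phi_divergence_test(T_l, t_l, t_l1, r_l, r_l1, alpha=0.05):
    df = t_l - t_l1 - r_l + r_l1
    c = chi2.ppf(1.0 - alpha, df)   # critical value, equation (13)
    return T_l > c                  # True means: reject H_Null = H_{l+1}
```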

Theorem 1. Suppose that the data $(N_1,\dots,N_k)$ are multinomially distributed according to the loglinear model (3). Consider the nested sequence of hypotheses given in (5) and (6), and choose functions $\phi_1, \phi_2 \in \Phi$. Then, for testing $H_{\text{Null}} : H_{l+1}$ against $H_{\text{Alt}} : H_l$, the asymptotic null distribution of the φ-divergence test statistic $T^{(l)}_{\phi_1,\phi_2}$ is a chi-squared distribution with $t_l - t_{l+1} - r_l + r_{l+1}$ degrees of freedom.

Proof. Based on Theorem 2 in Martín and Pardo [8], it is not difficult to establish that

$$\sqrt{n}\big( p\big(\hat{\theta}^{(r),\phi_2}_i\big) - p(\theta_0) \big) = R_i\,\sqrt{n}\,(\hat{p} - p(\theta_0)) + o_P(1), \quad\text{with } R_i = D_{p(\theta_0)} X_i H_i(\theta_0) X_i^T, \quad i = l, l+1,$$

where

$$H_i(\theta_0) = (X_i^T D_{p(\theta_0)} X_i)^{-1} - (X_i^T D_{p(\theta_0)} X_i)^{-1} X_i^T D_{p(\theta_0)} L_i \big( L_i^T D_{p(\theta_0)} X_i (X_i^T D_{p(\theta_0)} X_i)^{-1} X_i^T D_{p(\theta_0)} L_i \big)^{-1} L_i^T D_{p(\theta_0)} X_i (X_i^T D_{p(\theta_0)} X_i)^{-1}, \qquad i = l, l+1.$$

Therefore $T^{(l)}_{\phi_1,\phi_2} = Z_l^T Z_l + o_P(1)$, where

$$Z_l = D_{p(\theta_0)}^{-1/2}\,\sqrt{n}\big( p\big(\hat{\theta}^{(r),\phi_2}_l\big) - p\big(\hat{\theta}^{(r),\phi_2}_{l+1}\big) \big) = D_{p(\theta_0)}^{-1/2} (R_l - R_{l+1})\,\sqrt{n}\,(\hat{p} - p(\theta_0)) + o_P(1).$$

The asymptotic distribution of the φ-divergence test statistic (11) will be chi-squared if and only if the matrix $\Sigma_{Z_l} = D_{p(\theta_0)}^{-1/2} (R_l - R_{l+1})\,\Sigma_{p(\theta_0)}\,(R_l - R_{l+1})^T D_{p(\theta_0)}^{-1/2}$, where $\Sigma_{p(\theta_0)} = D_{p(\theta_0)} - p(\theta_0)p(\theta_0)^T$, is idempotent and symmetric.


It is clear that

$$R_i \Sigma_{p(\theta_0)} = D_{p(\theta_0)} X_i H_i(\theta_0) X_i^T \big( D_{p(\theta_0)} - p(\theta_0)p(\theta_0)^T \big) = D_{p(\theta_0)} X_i H_i(\theta_0) X_i^T D_{p(\theta_0)}, \qquad i = l, l+1,$$

and $K_i = D_{p(\theta_0)}^{-1/2} R_i D_{p(\theta_0)}^{1/2} = D_{p(\theta_0)}^{1/2} X_i H_i(\theta_0) X_i^T D_{p(\theta_0)}^{1/2}$ is a symmetric matrix. Therefore, to establish that $\Sigma_{Z_l} = (K_l - K_{l+1})(K_l - K_{l+1})$ is an idempotent matrix, it is enough to see that $K_l - K_{l+1}$ is an idempotent matrix ($\Sigma_{Z_l} = K_l - K_{l+1}$). We establish that

i) $K_i K_i = K_i$, $i = l, l+1$,

ii) $K_l K_{l+1} = K_{l+1}$.

Part i) follows because $H_i(\theta_0) X_i^T D_{p(\theta_0)} X_i H_i(\theta_0) = H_i(\theta_0)$, $i = l, l+1$. Part ii) follows taking into account that $X_{l+1}$ is a submatrix of $X_l$: there exists a matrix B such that $X_{l+1} = X_l B$ and

$$X_l H_l(\theta_0) X_l^T D_{p(\theta_0)} X_{l+1} = X_l B - X_l (X_l^T D_{p(\theta_0)} X_l)^{-1} X_l^T D_{p(\theta_0)} L_l \big( L_l^T D_{p(\theta_0)} X_l (X_l^T D_{p(\theta_0)} X_l)^{-1} X_l^T D_{p(\theta_0)} L_l \big)^{-1} L_l^T D_{p(\theta_0)} X_l B.$$

Multiplying on the right by $H_{l+1}(\theta_0)$, the last term is zero because $L_l$ is a submatrix of $L_{l+1}$ and $H_{l+1}(\theta_0) B_{l+1}(\theta_0) = 0_{(t_{l+1}+1)\times r_{l+1}}$; therefore $L_l^T D_{p(\theta_0)} X_l B H_{l+1}(\theta_0) = 0_{r_l \times (t_{l+1}+1)}$.

The degrees of freedom of $T^{(l)}_{\phi_1,\phi_2}$ coincide with the trace of the matrix $\Sigma_{Z_l}$. It is not difficult to establish that $\operatorname{tr}(K_i) = t_i + 1 - r_i$, $i = l, l+1$; therefore $\operatorname{tr}(K_l - K_{l+1}) = t_l - t_{l+1} - r_l + r_{l+1}$.

To test the nested sequence of hypotheses $\{H_l : l = 1,\dots,m\}$ referred to previously, we need an asymptotic independence result for the sequence of test statistics $T^{(1)}_{\phi_1,\phi_2}, T^{(2)}_{\phi_1,\phi_2}, \dots, T^{(m^*)}_{\phi_1,\phi_2}$, where $m^*$ is the integer $1 \le m^* \le m$ for which $H_{m^*}$ is true but $H_{m^*+1}$ is not. This result is given in the theorem below.

Theorem 2. Suppose that the data $(N_1,\dots,N_k)$ are multinomially distributed according to the loglinear model (3). We first test $H_{\text{Null}} : H_l$ against $H_{\text{Alt}} : H_{l-1}$, followed by $H_{\text{Null}} : H_{l+1}$ against $H_{\text{Alt}} : H_l$. Then, under the hypothesis $H_{l+1}$, the statistics $T^{(l-1)}_{\phi_1,\phi_2}$ and $T^{(l)}_{\phi_1,\phi_2}$ are asymptotically independent.

Proof. The statistic $T^{(l)}_{\phi_1,\phi_2}$ can be written as

$$T^{(l)}_{\phi_1,\phi_2} = \sqrt{n}\,(\hat{p} - p(\theta_0))^T M_l^T M_l\,\sqrt{n}\,(\hat{p} - p(\theta_0)) + o_P(1), \tag{14}$$


where

$$M_l = D_{p(\theta_0)}^{-1/2} (R_l - R_{l+1}) \quad\text{and}\quad R_i = D_{p(\theta_0)} X_i H_i(\theta_0) X_i^T, \qquad i = l, l+1.$$

Similarly,

$$T^{(l-1)}_{\phi_1,\phi_2} = \sqrt{n}\,(\hat{p} - p(\theta_0))^T M_{l-1}^T M_{l-1}\,\sqrt{n}\,(\hat{p} - p(\theta_0)) + o_P(1).$$

By Theorem 4 in Searle [11], the quadratic forms $T^{(l)}_{\phi_1,\phi_2}$ and $T^{(l-1)}_{\phi_1,\phi_2}$ are asymptotically independent if $M_l^T M_l \Sigma_{p(\theta_0)} M_{l-1}^T M_{l-1} = 0_{k\times k}$. We have

$$M_l^T M_l \Sigma_{p(\theta_0)} M_{l-1}^T M_{l-1} = M_l^T \big( D_{p(\theta_0)}^{-1/2} (R_l - R_{l+1})\,\Sigma_{p(\theta_0)}\,(R_{l-1} - R_l)^T D_{p(\theta_0)}^{-1/2} \big) M_{l-1},$$

and since $H_l(\theta_0) B_l(\theta_0) = 0_{(t_l+1)\times r_l}$, we have $M_l^T M_l \Sigma_{p(\theta_0)} M_{l-1}^T M_{l-1} = 0_{k\times k}$, because

$$M_l \Sigma_{p(\theta_0)} M_{l-1}^T = (K_l - K_{l+1})(K_{l-1} - K_l) = K_l - K_l - K_{l+1} + K_{l+1} = 0_{k\times k}.$$

In general, theoretical results for the test statistic $T^{(l)}_{\phi_1,\phi_2}$ under alternative hypotheses are not easy to obtain. An exception is when there is a contiguous sequence of alternatives that approach the null hypothesis $H_{l+1}$ at the rate $O(n^{-1/2})$. Consider the multinomial probability vector

$$p_n \equiv p(\theta_0) + \frac{s}{\sqrt{n}}, \qquad \theta_0 \in \Theta_{l+1} \text{ and } \theta_0 \text{ unknown}, \tag{15}$$

where $s \equiv (s_1,\dots,s_k)^T$ is a fixed k × 1 vector such that $\sum_{j=1}^k s_j = 0$, and n is the total-count parameter of the multinomial distribution. As $n \to \infty$, the sequence of multinomial probabilities $\{p_n\}_{n\in\mathbb{N}}$ converges to a multinomial probability in $H_{l+1}$ at the rate $O(n^{-1/2})$. We call

$$H_{l+1,n} : p_n = p(\theta_0) + \frac{s}{\sqrt{n}}, \qquad \theta_0 \in \Theta_{l+1} \text{ and } \theta_0 \text{ unknown}, \tag{16}$$

a sequence of contiguous alternative hypotheses, here contiguous to the null hypothesis $H_{l+1}$.

Now consider testing $H_{\text{Null}} : H_{l+1}$ against $H_{\text{Alt}} : H_{l+1,n}$, using the test statistic $T^{(l)}_{\phi_1,\phi_2}$ given by (11). The power of this test is $\pi_n^{(l)} \equiv \Pr\big( T^{(l)}_{\phi_1,\phi_2} > c \mid H_{l+1,n} \big)$. In what follows we show that, under the alternative $H_{l+1,n}$ and as $n \to \infty$, $T^{(l)}_{\phi_1,\phi_2}$ converges in distribution to a non-central chi-squared random variable with $t_l - t_{l+1} - r_l + r_{l+1}$ degrees of freedom and non-centrality parameter μ, where μ is given in Theorem 3 ($\chi^2_{t_l-t_{l+1}-r_l+r_{l+1},\mu}$). Consequently, as $n \to \infty$,

$$\pi_n^{(l)} \to \Pr\big( \chi^2_{t_l-t_{l+1}-r_l+r_{l+1},\mu} > c \big). \tag{17}$$

Theorem 3. Suppose that $(N_1,\dots,N_k)$ is multinomially distributed according to the loglinear model (3). The asymptotic distribution of the statistic $T^{(l)}_{\phi_1,\phi_2}$, under the contiguous alternative hypotheses (16), is chi-squared with $t_l - t_{l+1} - r_l + r_{l+1}$ degrees of freedom and non-centrality parameter

$$\mu = s^T \big( X_l H_l(\theta_0) X_l^T - X_{l+1} H_{l+1}(\theta_0) X_{l+1}^T \big) s.$$


Proof. By (14), we have $T^{(l)}_{\phi_1,\phi_2} = Z_l^T Z_l + o_P(1)$, where

$$Z_l = D_{p(\theta_0)}^{-1/2} (R_l - R_{l+1})\,\sqrt{n}\,(\hat{p} - p(\theta_0)) = (K_l - K_{l+1})\, D_{p(\theta_0)}^{-1/2}\,\sqrt{n}\,(\hat{p} - p(\theta_0))$$

and $Z_l \xrightarrow[n\to\infty]{L} N\big(\mu_s^{(l)}, \Sigma_s^{(l)}\big)$, where

$$\mu_s^{(l)} = D_{p(\theta_0)}^{-1/2} (R_l - R_{l+1})\, s = (K_l - K_{l+1})\, D_{p(\theta_0)}^{-1/2}\, s$$

and

$$\Sigma_s^{(l)} = D_{p(\theta_0)}^{-1/2} (R_l - R_{l+1})\,\Sigma_{p(\theta_0)}\,(R_l - R_{l+1})^T D_{p(\theta_0)}^{-1/2} = K_l - K_{l+1}.$$

The matrix $\Sigma_s^{(l)}$ (see i) and ii) in the previous theorem) is idempotent and symmetric, and its trace is $t_l - t_{l+1} - r_l + r_{l+1}$.

We apply a lemma by Ferguson (cf. [4, p. 63]): "Suppose that $Z_l$ is $N(\mu_s^{(l)}, \Sigma_s^{(l)})$. If $\Sigma_s^{(l)}$ is idempotent and $\Sigma_s^{(l)} \mu_s^{(l)} = \mu_s^{(l)}$, the distribution of $Z_l^T Z_l$ is noncentral chi-squared with degrees of freedom equal to the rank of the matrix $\Sigma_s^{(l)}$ and noncentrality parameter $\mu = (\mu_s^{(l)})^T \mu_s^{(l)}$." Therefore the result follows if we establish that $\Sigma_s^{(l)} \mu_s^{(l)} = \mu_s^{(l)}$. Applying that $K_l - K_{l+1}$ is an idempotent matrix, we have

$$\Sigma_s^{(l)} \mu_s^{(l)} = (K_l - K_{l+1})(K_l - K_{l+1})\, D_{p(\theta_0)}^{-1/2}\, s = (K_l - K_{l+1})\, D_{p(\theta_0)}^{-1/2}\, s.$$

Finally, the noncentrality parameter is

$$\mu = (\mu_s^{(l)})^T \mu_s^{(l)} = s^T \big( X_l H_l(\theta_0) X_l^T - X_{l+1} H_{l+1}(\theta_0) X_{l+1}^T \big) s,$$

and the result follows.

Remark 1. Theorem 3 can be used to obtain an approximation to the power function of the test (7), as follows. Write

$$p\big(\hat{\theta}^{(r),\phi_2}_l\big) = p\big(\hat{\theta}^{(r),\phi_2}_{l+1}\big) + \frac{1}{\sqrt{n}}\,\sqrt{n}\Big( p\big(\hat{\theta}^{(r),\phi_2}_l\big) - p\big(\hat{\theta}^{(r),\phi_2}_{l+1}\big) \Big)$$

and define $p_n \equiv p\big(\hat{\theta}^{(r),\phi_2}_{l+1}\big) + \frac{1}{\sqrt{n}}\, s$, where

$$s = \sqrt{n}\Big( p\big(\hat{\theta}^{(r),\phi_2}_l\big) - p\big(\hat{\theta}^{(r),\phi_2}_{l+1}\big) \Big).$$

Then substitute s into the definition of μ and, finally, μ into the right-hand side of (17).
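Numerically, the approximation of Remark 1 amounts to evaluating a noncentral chi-squared tail probability. A sketch (ours, using scipy; the matrix A of Theorem 3 must be supplied by the caller) is:

```python
# Sketch of the power approximation of Remark 1 / equation (17): estimate the
# noncentrality parameter mu from the fitted probability vectors under H_l and
# H_{l+1}, then evaluate the noncentral chi-squared tail probability at c.
import numpy as np
from scipy.stats import chi2, ncx2

def approx_power(p_l, p_l1, n, A, df, alpha=0.05):
    # s = sqrt(n)(p(theta_hat_l) - p(theta_hat_{l+1})) as in Remark 1, and
    # mu = s^T A s with A = X_l H_l X_l^T - X_{l+1} H_{l+1} X_{l+1}^T
    # evaluated at the fitted parameters (Theorem 3).
    s = np.sqrt(n) * (np.asarray(p_l, float) - np.asarray(p_l1, float))
    mu = float(s @ A @ s)
    c = chi2.ppf(1.0 - alpha, df)       # critical value, equation (13)
    return float(ncx2.sf(c, df, mu))    # Pr(chi2_{df, mu} > c), equation (17)
```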

Acknowledgements

This work was partially supported by Grants DGES PB2003-892 and UCM2005-910707.


References

[1] Ali, S. M., and Silvey, S. D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B 28 (1966), 131–142.

[2] Cressie, N., and Read, T. R. C. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B 46 (1984), 440–464.

[3] Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publications of the Mathematical Institute of the Hungarian Academy of Sciences 8 (1963), 84–108.

[4] Ferguson, T. S. A Course in Large Sample Theory. Texts in Statistical Science. Chapman & Hall, London, 1996.

[5] Haber, M., and Brown, M. B. Maximum likelihood methods for log-linear models when expected frequencies are subject to linear constraints. Journal of the American Statistical Association 81 (1986), 477–482.

[6] Kullback, S. Kullback information. In Encyclopedia of Statistical Sciences, vol. 4, S. Kotz and N. L. Johnson, Eds. John Wiley & Sons, New York, 1985.

[7] Martín, N. Phi-divergencias en modelos loglineales con restricciones lineales [Phi-divergences in loglinear models with linear constraints]. Trabajo de Investigación, Departamento de Estadística, Facultad de Matemáticas, Universidad Complutense de Madrid, Madrid, 2005.

[8] Martín, N., and Pardo, L. Minimum phi-divergence estimators for loglinear models with linear constraints and multinomial sampling. Statistical Papers, to appear (2006).

[9] Pardo, L., and Menéndez, M. L. Analysis of divergence in loglinear models when expected frequencies are subject to linear constraints. Metrika 64 (2006), 63–76.

[10] Pardo, L. Statistical Inference Based on Divergence Measures. Statistics: Textbooks and Monographs. Chapman & Hall, New York, 2006.

[11] Searle, S. R. Linear Models. John Wiley & Sons, New York, 1971.

Nirian Martín
Departamento de Estadística e I. O. III
Universidad Complutense de Madrid
Puerta de Hierro s/n
28040 Madrid, Spain
[email protected]

Leandro Pardo
Departamento de Estadística e I. O. I
Universidad Complutense de Madrid
Plaza de Ciencias 3
28040 Madrid, Spain
[email protected]
