Technical Report no. 585

Five NSASAG Problems with Discussion∗

Michael D. Perlman
Department of Statistics, University of Washington, Seattle, WA 98195

1 August 2011

The Department of Statistics has participated in the NSASAG Mathematical Problems Project for over twenty years, originally under the direction of Werner Stuetzle. Each summer a visiting panel of NSA mathematician/statisticians presents a collection of 12-14 problems to the department's participants, including faculty and graduate students, who in turn try to make useful comments. These comments may include simple clarifications of the problem statements, relation to work in the literature including lists of relevant references, ideas for solutions, and sometimes complete solutions including detailed computational algorithms and investigation of their numerical properties. The NSA problems are usually interesting and challenging; indeed, several publications1,2 have arisen from the department's contributions. Hopefully this collaboration will continue far into the future, with the present participants eventually replaced by younger ones. For this reason I have collected together five problems and the discussions that I have prepared as examples of our contributions to the project. Other faculty members such as Steve Gillispie, Peter Hoff, and Marina Meila, as well as our graduate students, also have made substantial contributions, some of which go into greater depth, including computational algorithms and simulation studies. Some of these may be found on the departmental wiki page www.stat.washington.edu/~amg81/nsawg.

∗ Work supported by Department of Defense Contract H98230-10-C-0263.
1 Di, Yanming; Perlman, M. D. (2008). Detecting linear sequences and subsequences. J. Statist. Planning Inference 138 2634-2648.
2 Gillispie, S.; Perlman, M. D. (2011). Efficient selection of binary choice bundles with cost considerations. Submitted to Algorithmic Operations Research.


1 NSASAG 07-04: Correlation of Temporal Sequences

Let E represent an event and T a vector that contains the times of successive occurrences of that event. There are n events, where n can be in the millions. Furthermore, let Ei represent the ith event and Ti = {ti,1, ti,2, . . . , ti,ki} a sequence of times ti,j representing ki successive occurrences of the event Ei. The number of occurrences ki of an event Ei can range from under one hundred to the tens of thousands. Occurrences of an event Ei can be highly irregular. We have not been able to find a good model for the next occurrence of an event. There are certainly seasonal effects that can be seen in the data, but much variability in the data remains unexplained. Over time new events Ei are detected and old events are discarded from further analysis. In addition, new timestamps ti,j are constantly added to current events.

Problem Statements:

1. The first goal of this problem is to develop a measure of correlation between two different temporal sequences Ei and Ej. These sequences can be highly correlated even if ki and kj are much different in magnitude.

2. Can this measure of correlation between two different temporal sequences be used to find strong clustering in the data?

3. As new timestamps of events are added to current events, can the temporal measures of correlation be efficiently updated?

4. Can the "strong clustering" also be updated efficiently?

5. What is a reasonable set of "summary statistics" that captures the behavior of Ti = {ti,1, ti,2, . . . , ti,ki}? Can these summary statistics be updated efficiently as new observations ti,j arrive?

6. Can approximations to the summary statistics be estimated if the earliest observations age off over time and are no longer available? What about the computation of error bounds on the estimates?


This problem statement represents the synthesis of several different applications with similar characteristics. Consequently there are applications for which the efficient updating of the correlation scores and the strong clustering is not necessary. Hence, the proposer is also interested in the development of a temporal correlation measure for just the first subproblem.

Discussion

1.1 Association between two events

Let (a1 < a2 < · · · < anA) and (b1 < b2 < · · · < bnB) be the occurrence times of two randomly occurring events A and B, respectively, in the total observed time interval (0, T]. The first problem is to develop a measure of "correlation" between these two sequences. I shall use the term "association" to avoid confusion with the ordinary use of "correlation" that appears in §2. Throughout the subsequent discussion we condition on nA and nB, i.e., consider them as fixed.

Section 6.3(ii) (pp. 247-8) of Cox and Lewis (1966) contains "a simple test for the presence of coincidences" between the two sequences, which I will now outline but with some modifications and additions.3 Select a tuning parameter (my phrase) ∆ > 0 so that no two occurrences of A are likely to lie within ∆ of each other; in particular, nA∆ < T. A coincidence is deemed to have occurred at time ai if some bj occurs in the interval [ai, ai + ∆].4 The choice of ∆ ensures that these intervals do not overlap. Let NAB denote the observed number of coincidences, i.e., the number of bj that fall within some interval [ai, ai + ∆]. Under the null hypothesis of no association between the occurrences of A and B,

    NAB ∼ Binomial(nB, p ≡ nA∆/T).    (1)

If ∆ is chosen such that T ≫ nA∆ = O(T/nB), then (1) becomes

    NAB ≈ Poisson(λ ≡ nA nB ∆/T).    (2)

3 E.g., the variance-stabilizing transformation (3), the estimator (8), its variance (10)-(13), the test statistic (16), and the discussion of multiple events in §2.
4 Alternatively, in [ai − ∆, ai + ∆], in which case replace ∆ by 2∆ in (1)-(3).

Under the (loosely conceived) hypothesis of positive association, NAB is expected to be stochastically larger than (2). Finally, if ∆ also can be chosen so that λ ≡ nA nB ∆/T is moderately large, then under the null hypothesis we have the variance-stabilized statistic

    RAB(∆) ≡ √(4NAB) − √(4 nA nB ∆/T) ≈ N(0, 1).    (3)
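In code, NAB and RAB(∆) of (1)-(3) amount to a sorted search plus two square roots. A minimal sketch follows; the function name and the NumPy implementation are mine, not the proposers':

```python
import numpy as np

def r_ab(a_times, b_times, delta, T):
    """Coincidence statistic R_AB(delta) of (3): count the b_j that fall in
    some interval [a_i, a_i + delta], then variance-stabilize the count.
    Assumes delta was chosen so the A-intervals do not overlap."""
    a = np.sort(np.asarray(a_times, dtype=float))
    b = np.asarray(b_times, dtype=float)
    # for each b_j, locate the nearest a_i <= b_j; coincidence iff b_j - a_i <= delta
    idx = np.searchsorted(a, b, side="right") - 1
    ok = idx >= 0
    n_ab = int(np.sum(ok & (b - a[np.clip(idx, 0, None)] <= delta)))
    lam = len(a) * len(b) * delta / T  # null mean n_A n_B delta / T
    return float(np.sqrt(4.0 * n_ab) - np.sqrt(4.0 * lam))
```

Large positive values indicate positive association; since only nA, nB, and NAB enter, the statistic is easy to update as T increases.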

A test for positive association of the two sequences rejects the null hypothesis for large values of RAB(∆). Thus we may take RAB(∆) as a numerical indicator of the degree of association of the two sequences. Application of this method to real data should provide guidance as to how to choose the tuning parameter ∆. Small (negative) values of RAB(∆) are indicative of negative association of the two sequences, which may also be of interest.

The quantities NAB, nA, and nB depend on T, hence so does the statistic RAB(∆). Due to its simple form, RAB(∆) can be updated easily as T increases. Cox and Lewis remark, however, that if nA(T) and nB(T) are processes with a common non-homogeneous time trend5 then NAB would be misleadingly high under the null hypothesis. In such a case, they recommend recomputing RAB(∆) periodically, e.g., on a daily basis.6

Cox and Lewis also discuss a related estimation problem. Suppose that the event B can arise in one of two independent ways: (i) given that an A event occurs at time s, B occurs in [s, s + ∆] with probability θAB; (ii) B also occurs uniformly over (0, T]. Thus NAB, the total number of coincidences in [0, T], can be decomposed as

    NAB = U + V,    (4)

where, if T ≫ nA∆ = O(T/nB),

    U ∼ Binomial(nA, θAB),    (5)
    V | U ∼ Poisson( λ ≡ (nB − U) nA ∆/T ).    (6)

5 For example, if both A and B occurrences tend to be more frequent on weekends.
6 Because of the simple form of RAB(∆), it may be possible to obtain a valid sequential sampling scheme (stopping rule) based on it, but I haven't given this much thought yet. This might be worth pursuing if the statistic RAB(∆) is deemed useful.

Thus

    E(NAB) = nA θAB + (nB − nA θAB) nA ∆/T,    (7)

leading to the method-of-moments (MOM) estimator7

    θ̃AB = ( NAB/nA − nB∆/T ) / ( 1 − nA∆/T ).    (8)

Notice that if T ≫ nA∆, nB∆ then θ̃AB reduces to the simple estimate NAB/nA. We approximate Var(θ̃AB) as follows:

    Var(NAB) = Var(U + V) = Var{E[(U + V) | U]} + E{Var[(U + V) | U]}
             = Var{ U + (nB − U) nA∆/T } + E{ (nB − U) nA∆/T }
             = (1 − nA∆/T)² nA θAB(1 − θAB) + nA(nB − nA θAB)∆/T,    (9)

hence

    Var(θ̃AB) = θAB(1 − θAB)/nA + (nB − nA θAB)∆ / [ nA T (1 − nA∆/T)² ]    (10)

and, if T ≫ nA∆,

    Var(θ̃AB) ≈ θAB(1 − θAB)/nA + (nB − nA θAB)∆/(nA T)    (11)
              = (1/nA) [ θAB(1 − θAB) + (nB − nA θAB)∆/T ]    (12)
              ≤ (1/nA) [ 1/4 + nB∆/T ].    (13)

Therefore, if ṼAB denotes an approximation to Var(θ̃AB) obtained by substituting θ̃AB for θAB in one of (10)-(12), the statistic

    ṼAB^(−1/2) (θ̃AB − θAB) ≈ N(0, 1)    (14)

can be used to obtain approximate one-sided or two-sided confidence intervals for θAB, while use of (13) yields conservative confidence intervals. For a fixed θ0 ∈ (0, 1), we can test

    H0 : θAB ≤ θ0  vs.  K : θAB > θ0    (15)

by means of the statistic

    SAB(∆; θ0) ≡ [VAB(θ0)]^(−1/2) (θ̃AB − θ0),    (16)

7 Cox and Lewis (eqn. (3) p.248) obtain a somewhat different MOM estimate by replacing the second nA θAB in (7) by NAB, which seems unjustified to me.

where VAB(θ0) is obtained by substituting θ0 for θAB in one of (10)-(12). In practice, θ0 would be chosen so that the alternative θAB > θ0 is deemed to indicate positive association between occurrences of A and B. (θ0 = 0.5?) As with RAB(∆), the statistic SAB(∆; θ0) can be updated easily as T increases, and it may be good to recompute it periodically, and/or to use it in conjunction with a sequential sampling scheme.

Note: When I went back to Cox's source paper (1955, Section 6.2, pp. 147-8), I found an apparent discrepancy between his formula (6.2) p.147 and Cox and Lewis's formula (2) (1966, p.248) (equivalent to my formula (6) above). I believe that the latter is correct, but I have emailed Cox to check. Incidentally, the notations A and B in Cox (1955) are the reverse of mine: I have followed Cox and Lewis (1966), who use "1" and "2", so I use A and B, whereas Cox (1955) uses B and A. Also, Cox and Lewis (1966) write N21, but I have chosen to write NAB rather than the discordant NBA which would agree with their notation.
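The estimator (8), the approximate variance (11), and the test statistic (16) are all closed-form, so a throwaway implementation is only a few lines. A sketch (the function names are mine, and (11) is the variance version used):

```python
import math

def theta_mom(n_ab, n_a, n_b, delta, T):
    """MOM estimator (8): (N_AB/n_A - n_B*delta/T) / (1 - n_A*delta/T)."""
    return (n_ab / n_a - n_b * delta / T) / (1.0 - n_a * delta / T)

def var_theta(theta, n_a, n_b, delta, T):
    """Approximate variance (11): theta(1-theta)/n_A + (n_B - n_A theta)delta/(n_A T)."""
    return theta * (1.0 - theta) / n_a + (n_b - n_a * theta) * delta / (n_a * T)

def s_ab(n_ab, n_a, n_b, delta, T, theta0):
    """Test statistic (16), approximately N(0,1) under H0: theta_AB = theta0."""
    return (theta_mom(n_ab, n_a, n_b, delta, T) - theta0) / math.sqrt(
        var_theta(theta0, n_a, n_b, delta, T))
```

Replacing `var_theta(theta0, ...)` with the constant bound (13) would give the conservative version mentioned above.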

1.2 Association among multiple events

Notice that the above treatment is not symmetric in A and B: A is treated as a "driver" (i.e., possibly causing B), whereas B is a "slave" influenced by A. This may be advantageous for the clustering problem where a single driver event A is prespecified and one seeks potential slave (≡ associated) events among multiple candidate events B1, . . . , BN. A heavy upper tail in a standard Q-Q plot of the test statistics RAB1(∆), . . . , RABN(∆) or SAB1(∆; θ0), . . . , SABN(∆; θ0) will readily reveal the existence of a cluster of slave events, each positively associated with the driver event. A quantitative test for the significance of such a cluster can be based on the following well-known result:8

8 cf. Galambos (1978, §2.3.2), where a slightly more precise centering constant for (17) is given.

Proposition 1.1. Let U1, . . . , UN be i.i.d. N(0, 1) random variables and set ZN = max(U1, . . . , UN). Then

    √(2 log N) ( ZN − √(2 log N) ) →d Y,    (17)

where Y has the extreme-value (Gumbel) distribution with distribution function

    F(y) = exp(−e^(−y)),  −∞ < y < ∞.    (18)

Under the null hypothesis that B1, . . . , BN are not associated with A, RAB1(∆), . . . , RABN(∆) and SAB1(∆; θ0), . . . , SABN(∆; θ0) are each approximately distributed as U1, . . . , UN, so for large N,

    √(2 log N) ( max_i RABi(∆) − √(2 log N) ) ∼ Y,    (19)
    √(2 log N) ( max_i SABi(∆; θ0) − √(2 log N) ) ∼ Y.    (20)
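The limiting relations (17)-(20) give an approximate p-value for the observed maximum directly. A sketch under the N(0, 1) null approximation (the helper name is mine):

```python
import math

def max_stat_pvalue(max_r, n_events):
    """Approximate p-value for max_i R_ABi via (17)-(20): under H0,
    sqrt(2 log N)(max - sqrt(2 log N)) is approximately Gumbel, so
    P[max > r] ~ 1 - exp(-exp(-y)) with y the standardized value."""
    a = math.sqrt(2.0 * math.log(n_events))
    y = a * (max_r - a)
    return 1.0 - math.exp(-math.exp(-y))
```

A small p-value suggests at least one slave event positively associated with the driver.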

A significantly large value of max_i RABi(∆) or max_i SABi(∆; θ0) based on (18)-(20) suggests the existence of at least one slave event Bi positively associated with A.

What if no single driver event is prespecified? Denote the total set of events under consideration as A1, . . . , AK. In this case one might calculate9 all K(K−1)/2 pairwise statistics RAiAj(∆), i < j. Under the null hypothesis of no association among the events, RAiAj(∆) and RAi′Aj′(∆) are independent if {i, j} ∩ {i′, j′} = ∅. Also, I believe it is straightforward to show that NAiAj(∆) and NAiAj′(∆) are positively correlated, and that this implies that RAiAj(∆) and RAiAj′(∆) are positively correlated under the null hypothesis. Thus, if the joint distribution of the K(K−1)/2 statistics RAiAj(∆) is approximately multivariate normal, Slepian's inequality implies that, to the accuracy of this normal approximation, max . . .

. . . Pr[T > τ + h] = e^(−nλh) as our measure of quality of that estimate, using a value of h considered reasonable for the application.

Question A: We have been using Q-Q plots to assess goodness of fit. Are there any cheap tests – not requiring visual inspection of a plot – that can be done to assess the goodness of fit of this model?

Now, assuming a lack of fit, we wish to model the Xi as gamma(ν, λ) random variables. While it is possible to get a maximum likelihood estimator for (τ, ν, λ), the calculation involves using an iterative algorithm (i.e., Newton's method) to get the estimate for ν.

Question B: Instead of computing the MLE for (τ, ν, λ) and attempting to report a measurement like Pr[T > τ + h], is there a cheap way (even if it be crude) to compute the quality of the estimate T for τ in the case where the Xi are assumed to be gamma?

Question C: Assuming we are forced to compute the MLE for (τ, ν, λ), what would then be a good way to compute the goodness of fit without having to visually inspect a Q-Q plot?

Discussion

3.1 Question A

Let 0 ≡ Z0 < Z1 < · · · < Zn−1 denote the ordered values of Y1 − T, . . . , Yn − T. Thus L = (1/n)(Z1 + · · · + Zn−1) ≡ (1/n)Sn−1. For fixed λ, T is a complete sufficient statistic for τ and Z ≡ (Z1, . . . , Zn−1) is ancillary (since it is location-invariant), hence T ⊥⊥ Z by Basu's Lemma.

We wish to use the vector of observations Z to test the exponential assumption. Under this assumption, Z is distributed as the vector of order statistics from a sample of size n − 1 from the exponential(λ) distribution, with λ unknown. Thus by the memory-free property of the exponential distribution, the normalized spacings

    V1 ≡ (n − 1)Z1,
    V2 ≡ (n − 2)(Z2 − Z1),
    . . .
    Vn−2 ≡ 2(Zn−2 − Zn−3),
    Vn−1 ≡ Zn−1 − Zn−2

are distributed as a sample of size n − 1 from the exponential(λ) distribution. Note that V1 + · · · + Vn−1 = Z1 + · · · + Zn−1 = Sn−1. Now set

    W1 ≡ V1/Sn−1 = (n − 1)Z1/Sn−1,
    W2 ≡ (V1 + V2)/Sn−1 = [Z1 + (n − 2)Z2]/Sn−1,
    W3 ≡ (V1 + V2 + V3)/Sn−1 = [Z1 + Z2 + (n − 3)Z3]/Sn−1,
    . . .
    Wn−2 ≡ (V1 + · · · + Vn−2)/Sn−1 = [Z1 + · · · + Zn−3 + 2Zn−2]/Sn−1,
    Wn−1 ≡ (V1 + · · · + Vn−1)/Sn−1 = 1.

Under the exponential assumption, W ≡ (W1, . . . , Wn−2) is independent of Sn−1, does not depend on λ, and is distributed as the vector of order statistics based on a sample of size n − 2 from the uniform distribution on (0, 1).

Thus to test the exponential assumption, we can either use a simple Q-Q plot to test the uniformity of W1, . . . , Wn−2, or else use any classical nonparametric goodness-of-fit statistic for uniformity on (0, 1), for example the Kolmogorov-Smirnov or Cramér-von Mises statistic, based on W1, . . . , Wn−2. Or, one can divide the interval (0, 1) into k equal subintervals (cells) and use the Pearson chi-square statistic for testing that the cell probabilities are each equal to 1/k. (One might choose k = (n − 2)/10 if n − 2 ≥ 50, for example.)
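The whole Question A recipe (spacings Vi, normalized sums Wi, then a uniformity distance) is cheap to run. A sketch assuming the input is the raw sample Y1, . . . , Yn; the function name is mine, and it returns the Kolmogorov-Smirnov distance rather than a p-value:

```python
import numpy as np

def exp_uniform_ks(y):
    """Transform ordered exceedances over T = min(Y) to approximate uniforms
    via the normalized spacings W_1,...,W_{n-2}, then return the one-sample
    K-S distance from the uniform(0,1) cdf."""
    y = np.sort(np.asarray(y, dtype=float))
    z = y[1:] - y[0]                          # Z_1 < ... < Z_{n-1}
    m = len(z)                                # m = n - 1
    spacings = np.diff(np.concatenate(([0.0], z)))
    v = spacings * np.arange(m, 0, -1)        # V_i = (n - i)(Z_i - Z_{i-1})
    w = np.sort(np.cumsum(v)[:-1] / v.sum())  # drop W_{n-1} = 1
    k = len(w)
    d_plus = np.max(np.arange(1, k + 1) / k - w)
    d_minus = np.max(w - np.arange(0, k) / k)
    return float(max(d_plus, d_minus))
```

The distance can then be compared to standard K-S critical values for sample size n − 2.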

3.2 Question B

Now consider the case where Y1, . . . , Yn have the distribution gamma(τ, ν, λ), where τ is the shift parameter, ν is the shape parameter, and 1/λ is the scale parameter.14 Then Mn ≡ T − τ is distributed as the minimum of Xi = Yi − τ, i = 1, . . . , n, which are n i.i.d. ordinary gamma(ν, λ) random variables. It follows from general extreme value theory (e.g. Galambos (1978)) that, suitably normalized, Mn has an asymptotic Weibull distribution. This can be derived directly when ν ≥ 1 is a positive integer, as follows. For small x > 0,

    P[Mn > x] = P[X1 > x]^n
              = ( e^(−λx) Σ_{i=0}^{ν−1} (λx)^i / i! )^n
              = ( 1 − (λx)^ν/ν! + O(x^(ν+1)) )^n,

using the relation between the gamma and Poisson distributions.15 Thus if we set x = y/n^(1/ν) for fixed y > 0,

    P[ Mn > y/n^(1/ν) ] = ( 1 − λ^ν y^ν/(n ν!) + O(n^(−(1+1/ν))) )^n → exp( −λ^ν y^ν/ν! )  as n → ∞,    (36)

which is a Weibull distribution.

14 This is called the "three-parameter gamma distribution" in the literature, where estimation of the three parameters via maximum likelihood and other methods has been studied (see References below).
15 http://en.wikipedia.org/wiki/Gamma_distribution.

I expect that the approximation (36) is valid for all ν ≥ 1, with ν! replaced by Γ(ν + 1). (I'm not yet sure about the case ν < 1.) Thus, setting s = y^ν, we can rewrite (36) as

    P[ n Mn^ν > s ] → e^( −λ^ν s/Γ(ν+1) )  as n → ∞,    (37)

that is,

    n Mn^ν →d exponential( λ^ν / Γ(ν+1) ).    (38)

To use these approximations to assess the accuracy of T as an estimate of τ, one needs estimates of ν and λ. Since the proposer is looking for a cheap and fast method, one might proceed as follows. Treat Z1, . . . , Zn−1 as an independent sample from the gamma(ν, λ) distribution. The independence is justified if n is moderately large, since T, which is common to all the Zi, will converge to the constant τ hence should have small variance. Since the proposer wishes to avoid the standard iterative method for obtaining the MLE of ν, one might simply use the (inefficient) method of moments to estimate ν and λ: since gamma(ν, λ) has mean ν/λ and variance ν/λ², simply equate these to the sample mean and sample variance of the Zi and solve for ν and λ.16

However, approximations have been developed for the actual MLE ν̂ which do not require iteration. One appears in Johnson and Kotz (1970, p.188), followed by a discussion of how to improve this approximation. Also the following approximation appears in http://en.wikipedia.org/wiki/Gamma_distribution:

    ν̂ ≈ ( 3 − S + √((3 − S)² + 24S) ) / (12S),    (39)
    S = ln(Z̄) − (1/(n−1)) Σ ln Zi.    (40)

Once one has an approximation to ν̂, one can get an approximation to the MLE λ̂ using the relation Z̄ = ν̂/λ̂ (cf. Johnson and Kotz (1970, p.187, eqn. (41.2))).

Remark. See Johnson and Kotz (1970, §7.1) for non-iterative approximations to the MLEs (τ̂, ν̂, λ̂), and for the approximate variances of the MLEs. In particular, the approximation to τ̂ could be useful as an alternative to using T to estimate τ; see their eqn. (35.3), p.185. (However, they mention at the bottom of p.186 that the MLEs "are of doubtful utility" when ν < 2.5.)

16 See Johnson and Kotz (1970, p.189).
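Equations (39)-(40) together with the relation Z̄ = ν̂/λ̂ give a fully non-iterative estimate. A sketch (the function name is mine, and no correction terms are applied):

```python
import math

def gamma_hat(z):
    """Closed-form shape/rate estimates for a gamma(nu, lambda) sample,
    per (39)-(40): S = ln(mean) - mean(ln), then
    nu = (3 - S + sqrt((3-S)^2 + 24S)) / (12S), and lambda = nu / mean."""
    n = len(z)
    zbar = sum(z) / n
    s = math.log(zbar) - sum(math.log(x) for x in z) / n
    nu = (3.0 - s + math.sqrt((3.0 - s) ** 2 + 24.0 * s)) / (12.0 * s)
    lam = nu / zbar
    return nu, lam
```

These values can then be plugged into (38) to report an approximate quality measure for T.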

3.3 Question C

In contrast to the estimation problem, there is relatively little literature about goodness-of-fit tests for the two-parameter or three-parameter gamma family, in fact for most parametric families other than the normal family. Generally, a nonparametric goodness-of-fit statistic, such as Kolmogorov-Smirnov (K-S), Cramér-von Mises (C-vM), or Anderson-Darling (A-D), is applied to measure the distance between the empirical cdf and the hypothesized cdf based on sample estimates of the parameters. However, the distribution theory of such statistics is not known, even asymptotically. In fact, Lehmann and Romano (2005, p.589) state that even if computable, the asymptotic distribution will generally depend on the unknown parameter values. They also state that the critical values are usually approximated by simulation. Since the proposers want a quick and cheap procedure, this may not be practical.

For this purpose, if the sample size n is large enough one might simply ignore these niceties and apply one of the above nonparametric goodness-of-fit statistics to compare the empirical cdf based on Z1, . . . , Zn−1 to the gamma(ν̃, λ̃) cdf, call it F̃, where ν̃, λ̃ are consistent estimates obtained by some method as discussed above. Use the standard asymptotic null distribution of the K-S, C-vM, or A-D statistic to determine the critical value (see references below). This will be crude but not too unreasonable. For example, the A-D statistic can be expressed as

    A² = −m − (1/m) Σ_{i=1}^{m} (2i − 1) [ ln F̃(Zi) + ln(1 − F̃(Zm−i+1)) ],    (41)

where m = n − 1. The function F̃(·) is an incomplete gamma integral which can be obtained numerically. However, my first search for approximations to the critical value of A² did not turn up anything, but I expect there must be some results on this.

Addendum: At our July 28 meeting, Dominique Perrault-Joncas noted that Stephens (1986, Ch. 4) gives a general discussion of goodness-of-fit tests based on the empirical cdf. Section 4.9 deals with the exponential distribution (our Question A) while Section 4.12 deals with the gamma distribution (our Question C). (Note that Stephens' test statistics W², U², and A² are defined on his pp. 100-101.) Unfortunately, Stephens offers no more for Question C than I outline above; he gives no quick and easy techniques. As I suggest above, he also suggests first finding the MLE ν̂, then using the estimated incomplete gamma integral to transform the observations approximately to uniform(0, 1) observations, then using either the Anderson-Darling, Cramér-von Mises, or Watson statistic for testing uniformity. He does present tables of approximate critical values for these statistics for integer values of ν ranging from 1 to 20, plus the limit for ν → ∞. Of course ν must be estimated by ν̂, which will generally be non-integral, so presumably interpolation must be used. Again, this does not seem quick and easy.

References

Galambos, J. (1978). The Asymptotic Theory of Extreme Order Statistics. New York: Wiley.

Johnson, N. L., Kotz, S. (1970). Continuous Univariate Distributions–1. New York: Wiley.

Lehmann, E. L., Romano, J. P. (2005). Testing Statistical Hypotheses, 3rd edition. New York: Wiley.

Shorack, G. R. (1972). The best test of exponentiality against gamma alternatives. J. Amer. Statist. Assoc. 67 213-214.

Stephens, M. A. (1986). Tests based on EDF statistics. In Goodness-of-Fit Techniques, R. B. D'Agostino and M. A. Stephens, eds. New York: Marcel Dekker.

Additional References on Goodness-of-Fit

1. Ozmen, Tamer (1993). A modified Anderson-Darling goodness-of-fit test for the gamma distribution with unknown scale and location parameters. Master's thesis, Air Force Institute of Technology, Wright-Patterson AFB, Ohio. http://handle.dtic.mil/100.2/ADA262486
Abstract: A new modified Anderson-Darling goodness-of-fit test is introduced for the three-parameter Gamma distribution when the location parameter is found by minimum distance estimation and the scale parameter by maximum likelihood estimation. Monte Carlo simulation studies were performed to calculate the critical values for the A-D test when the A-D statistic is minimized. These critical values are then used for testing whether a set of observations follows a Gamma distribution when the scale and location parameters are unspecified and are estimated from the sample. The functional relationship between the critical values of A-D is also examined for each shape parameter by the variables sample size (n) and significance level (α). The power study is performed with the hypothesized Gamma against alternate distributions. Comparison with the previous study, which uses MLEs for location and scale, showed that the modified test is better in most cases.

2. Tadikamalla, P. R. (1990). Kolmogorov-Smirnov type test-statistics for the gamma, Erlang-2 and the inverse Gaussian distributions when the parameters are unknown. Commun. Statist. Simulation 19 305-314.

3. Romantsova, Yu. V. (1996). On an asymptotic goodness-of-fit test for a two-parameter gamma-distribution. Journal of Mathematical Sciences 81(4) 2759-2765.
Abstract: In this paper, a formula for the test statistic is given for a two-parameter gamma-distribution which involves the first four moments.

Additional References on Estimation

4. Koutrouvelis, I. A., Canavos, G. C. (1997). Estimation in the three-parameter gamma distribution based on the empirical moment generating function. Journal of Statistical Computation and Simulation 59(1) 47-62.

5. Bowman, K. O., Shenton, L. R., Karlof, C. (1995). Estimation problems associated with the three-parameter gamma distribution. Communications in Statistics–Theory and Methods 24 1355-1376.

6. Bowman, K. O., Shenton, L. R. (1988). Properties of Estimators for the Gamma Distribution. New York: Marcel Dekker.

7. Bowman, K. O., Shenton, L. R., Lam, H. K. (1987). Simulation and estimation problems associated with the 3-parameter gamma density. Commun. Statist. Simulation and Computation 16 1177-1188.

8. Cheng, R. C. H., Taylor, L. (1995). Non-regular maximum likelihood problems (with discussion). J. Royal Statistical Society 57 3-44.

9. Cohen, A. C., Whitten, B. J. (1982). Modified moment and modified maximum likelihood estimators for parameters of the three-parameter gamma distribution. Comm. Statist. Simulation and Computation 11 197-216.

10. Hirose, H. (1995). Maximum likelihood parameter estimation in the three-parameter gamma distribution. Computational Statistics & Data Analysis 20 343-354.

11. Hirose, H. (1998). Parameter estimation for the 3-parameter gamma distribution using the continuation method. IEEE Trans. Reliability 47(2) 188-196.
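For completeness, the statistic (41) is trivial to evaluate once an estimated cdf F̃ is available as a numerical routine. A sketch with the fitted cdf passed as a callable (the helper and its interface are mine):

```python
import math

def anderson_darling(z, cdf):
    """Anderson-Darling distance (41) between the sample z and a fitted cdf
    F~ (any callable returning values in (0,1)), e.g. an estimated gamma cdf."""
    z = sorted(z)
    m = len(z)
    s = sum((2 * i - 1) * (math.log(cdf(z[i - 1])) + math.log(1.0 - cdf(z[m - i])))
            for i in range(1, m + 1))
    return -m - s / m
```

In the gamma case `cdf` would be the regularized incomplete gamma integral evaluated at the estimated (ν̃, λ̃).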

4 NSASAG 11-02: "Tail Area" in a High-dimensional Gaussian Mixture

A Gaussian (= normal) mixture model of c components has been fit on a set of n points in R^d. The density of the mixture at any point x is

    f(x) = Σ_{i=1}^{c} γi fi(x),    (42)

where

    fi(x) = (2π)^(−d/2) |Σi|^(−1/2) exp( −(1/2)(x − µi)′ Σi^(−1) (x − µi) )    (43)
          ≡ (2π)^(−d/2) |Σi|^(−1/2) exp( −(1/2) ‖x − µi‖²_Σi )    (44)

is the density of the d-dimensional normal distribution N(µi, Σi) with mean vector µi and covariance matrix Σi (assumed positive definite), and where the γi ≥ 0 are the mixture weights (γ1 + · · · + γc = 1). It is assumed that µi, Σi, and γi are all known.

Given an observation x0 ∈ R^d, the proposer wishes to assess how well the point is explained by the model in terms of the "tail area" τf(f(x0)), where for δ > 0,

    τf(δ) = Pf [ f(X) ≤ δ ].    (45)

Here X is generated under the mixture model, i.e., X ∼ f. A small tail area indicates a poor fit. The problem is to compute or approximate τf(f(x0)) (quickly).

Discussion

4.1 A bound for the chi-square distribution

Let Y ∼ Nd(µ, Σ), the d-dimensional Gaussian distribution with mean vector µ and covariance matrix Σ (positive definite), and let

    g(y) ≡ gµ,Σ(y) = (2π)^(−d/2) |Σ|^(−1/2) exp( −(1/2) ‖y − µ‖²_Σ )    (46)

denote its pdf, where ‖y − µ‖²_Σ = (y − µ)′ Σ^(−1) (y − µ). It is well known that ‖Y − µ‖²_Σ ∼ χ²d, where χ²d denotes a (central) chi-square random variable with d degrees of freedom. Let

    Gd(z) = P[ χ²d ≥ z ],  z > 0,    (47)

denote its upper tail probability. It follows from (46) and (47) that for any δ > 0,

    τg(δ) ≡ Pg[ g(Y) ≤ δ ]    (48)
          = Pg[ ‖Y − µ‖²_Σ ≥ −2 log{ (2π)^(d/2) |Σ|^(1/2) δ } ]    (49)
          = Gd( −2 log{ (2π)^(d/2) |Σ|^(1/2) δ } ).    (50)

Laurent and Massart (2000, p.1325, eqn. (4.3)) provide the exponential upper bound

    P[ χ²d ≥ d + 2√(du) + 2u ] ≤ e^(−u),  u > 0.    (51)

To obtain an upper bound for Gd(z) when z > d, we must solve the equation

    z = d + 2√(du) + 2u    (52)

for u ≡ u(z). Set v = √u in (52) to obtain

    z = d + 2√d v + 2v²,    (53)

which has the solutions

    v = ( −√d ± √(2z − d) ) / 2.    (54)

To guarantee that v > 0 we must take the + sign in (54), hence

    u = v² = ( z − √(d(2z − d)) ) / 2.    (55)

(Note that u > 0.) Therefore from (51) and (55),

    Gd(z) ≤ e^( −z/2 + √(d(2z−d))/2 ).    (56)
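The bound (56) is essentially a one-liner. A sketch (the function name is mine; as derived, it is valid only for z > d):

```python
import math

def chi2_tail_bound(z, d):
    """Upper bound (56) on G_d(z) = P[chi2_d >= z], valid for z > d,
    obtained from the Laurent-Massart inequality (51)."""
    assert z > d, "the bound (56) requires z > d"
    return math.exp(-0.5 * z + 0.5 * math.sqrt(d * (2.0 * z - d)))
```

For moderate z the exact chi-square tail should be used instead; the bound is intended for the far tail, where z is large.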

4.2 Approximation to τf(f(x0)) by a normal density

Let ḡ denote the density of the Gaussian distribution N(µ̄, Σ̄) obtained by matching the mean vector and covariance matrix of the mixture pdf f, namely,

    µ̄ = Ef(X) = Σ_{i=1}^{c} γi µi,    (57)
    Σ̄ = Cov(X) = Σ_{i=1}^{c} γi Σi + Σ_{i=1}^{c} γi (µi − µ̄)(µi − µ̄)′.    (58)

An obvious first approximation to τf(f(x0)) is simply

    τḡ(f(x0)) ≡ Pḡ[ ḡ(X) ≤ f(x0) ]    (59)
             = Gd(z0)    (60)

by (50), where z0 = −2 log{ (2π)^(d/2) |Σ̄|^(1/2) f(x0) }. Thus (56) yields the upper bound

    τḡ(f(x0)) ≤ (2π)^(d/2) |Σ̄|^(1/2) f(x0) e^( √(d(2z0−d))/2 ).    (61)

If f(x0) is not too small then z0 is not too large and Gd(z0) can be evaluated directly in terms of the chi-square distribution (e.g., in Mathematica). If f(x0) is very small so that z0 is very large, then the upper bound (61) for τḡ(f(x0)) can be used.

This approximation by a single Gaussian distribution should be reasonably good if the c components of the mixture f are close enough together that f is unimodal, that is, if the collection {µi} of mean vectors cluster together fairly closely and the covariance matrices {Σi} are not too disparate.17 We suspect, however, that this is probably not the case in your applications, that is, the mixture f is probably multimodal. If the modes {µi} are fairly widely dispersed and if the observation x0 lies in a "central valley" among these modes, then x0 may be relatively close to µ̄, the mode of the approximating Gaussian density ḡ. Thus ḡ will assign much greater probability to neighborhoods of x0 than the mixture pdf f does, hence τḡ(f(x0)) will overestimate τf(f(x0)) substantially, thereby failing to detect the poor fit of f to x0.

4.3 Exact upper bounds for τf(f(x0))

This difficulty incurred by a multimodal mixture might be alleviated by the following approach. We can obtain an exact upper bound for τf(f(x0)) as follows:

    τf(f(x0)) = Pf[ f(X) ≤ f(x0) ]    (62)
              = ∫_{f(x) ≤ f(x0)} f(x) dx    (63)
              = ∫_{Σ γj fj(x) ≤ f(x0)} Σ γi fi(x) dx    (64)
              = Σ γi ∫_{Σ γj fj(x) ≤ f(x0)} fi(x) dx    (65)
              ≤ Σ γi ∫_{γi fi(x) ≤ f(x0)} fi(x) dx    (66)
              = Σ γi τfi( f(x0)/γi )    (67)
              = Σ γi Gd( −2 log{ (2π)^(d/2) |Σi|^(1/2) f(x0)/γi } )    (68)
              ≤ (2π)^(d/2) f(x0) Σ |Σi|^(1/2) e^( √(d(2z0,i − d))/2 )    (69)

as in (61), where z0,i = −2 log{ (2π)^(d/2) |Σi|^(1/2) f(x0)/γi } and all summations range over 1, . . . , c.

17 To illustrate, consider the simplest case: the 2-component (c = 2) univariate (d = 1) mixture model with equal weights, equal unit component variances (σ1² = σ2² = 1), and means µ1, µ2 = ±η. Here the mixture pdf f is unimodal iff η ≤ 1 (e.g., Schilling et al. (2002)).

The inequality (66) is obtained by approximating the mixture pdf f by the single component fi in the integral w.r. to the density fi. If the modes are well-dispersed, this approximation should be fairly close. The bound (68) should be used if a standard package is available to evaluate the chi-square tail probabilities; otherwise the bound (69) can be used.

In the special case where all γi = 1/c and all Σi = Σ,

    (2π)^(d/2) |Σ|^(1/2) f(x0) = (1/c) Σ_{i=1}^{c} exp( −(1/2) ‖x0 − µi‖²_Σ ),    (70)

so (68) and (69) become

    τf(f(x0)) ≤ Gd( −2 log{ Σ_{i=1}^{c} exp( −(1/2) ‖x0 − µi‖²_Σ ) } )    (71)
             ≤ e^( √(d(2z0−d))/2 ) Σ_{i=1}^{c} exp( −(1/2) ‖x0 − µi‖²_Σ ),    (72)

where now z0 = −2 log( Σ_{i=1}^{c} exp( −(1/2) ‖x0 − µi‖²_Σ ) ).

For illustrative purposes, suppose in addition that x0 is equidistant from each mode µi, that is, ‖x0 − µi‖²_Σ = ∆² for i = 1, . . . , c. This would occur if the modes {µi} are regularly spaced on a d-dimensional sphere and the observation x0 lies at the center of the sphere. Then (71) and (72) simplify further to

    τf(f(x0)) ≤ Gd( ∆² − 2 log c )    (73)
             ≤ c exp( −(1/2)[ ∆² − √(d(2∆² − d − 4 log c)) ] ).    (74)

If d is moderately large then χ²d ≈ N(d, 2d), so it is convenient to express ∆² as

    ∆² ≈ d + m√(2d)    (75)

for some multiple m. Thus (73) becomes

    τf(f(x0)) ≤ Gd( d + m√(2d) − 2 log c )    (76)
             ≡ Gd( d + rm√(2d) )    (77)
             ≈ 1 − Φ(rm),    (78)

where Φ is the standard normal cdf and the "reduced multiple" rm is given by

    rm = m − (2 log c)/√(2d).    (79)
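The chain (76)-(79) reduces to computing rm and a normal tail. A sketch that reproduces the illustrative numbers below (the helper name is mine; Φ is evaluated via the complementary error function):

```python
import math

def reduced_multiple_bound(m, c, d):
    """Normal-approximation upper bound (78)-(79) on tau_f(f(x0)) in the
    equidistant case: returns (r_m, 1 - Phi(r_m))."""
    r = m - 2.0 * math.log(c) / math.sqrt(2.0 * d)
    p = 0.5 * math.erfc(r / math.sqrt(2.0))  # 1 - Phi(r)
    return r, p
```

For example, `reduced_multiple_bound(2, 10, 10)` gives rm ≈ 0.97 and bound ≈ 0.166, matching the worked case with d = c = 10.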

For additional specificity take d = c = 10. If m = 2, that is, if x0 does not depart too greatly from each fi, then r2 = 0.97, so (78) provides the upper bound τf(f(x0)) ≤ .166, which appropriately does not indicate a significant departure from the mixture density f. If m = 3, 4, or 5 then rm = 1.97, 2.97, or 3.97, respectively, and (78) provides the upper bound .0244, .0015, or .0000359, indicating increasingly significant departures of x0 from values likely under f, as desired. [More numerical investigation needed?]

Remark. The relation (79) suggests that this procedure behaves like a multiple testing procedure in which the observed x0 is compared to each of the c component densities fi. The term log c in (73) and (79) serves as an adjustment for testing c hypotheses simultaneously.

References

Laurent, B., Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist. 28 1302-1338.

Schilling, M. F., Watkins, A. E., Watkins, W. (2002). Is human height bimodal? American Statistician 56(3) 223-229.

Acknowledgement. Warm thanks to Jon Wellner for helpful discussions.


5 NSASAG 11-05: Correlation of Cumulative Distribution Functions

Let F and G be two cumulative distribution functions on R, and let F̂ and Ĝ be empirical cdfs obtained from samples X_1, . . . , X_m and Y_1, . . . , Y_n drawn respectively from F and G. The proposer would like a statistic S which measures the “correlation” of F̂ and Ĝ such that:

1. the correlation of F̂ and Ĝ is large when F = G, or when F(x) = G(x + t) for some real t > 0, or when F(x) = G(x + φ(x)) for some slowly varying positive function φ;

2. the correlation of F̂ and Ĝ has a known distribution under the hypothesis that F = G, or that F(x) = G(x + t) for some real t ≥ 0.

Discussion

Note: We replace the word “correlation” by “affinity” to avoid confusion with the standard usage of “correlation”. Similarly we replace the phrase “slowly varying” by “approximately constant”. In fact, we’ll ignore the appearance of the function φ in 1. because it does not appear in 2. and apparently is not intended to play a major role in the problem.

5.1 Testing for a general shift parameter

First let’s also ignore the positivity restriction t ≥ 0 on the shift parameter. In this case the null hypothesis in 2. can be stated as the hypothesis that the distributions of X_i and Y_j differ only by location shifts. The maximal invariant statistic under these location shifts can be represented by the residuals among {X_i} and the residuals among {Y_j} in one of the following two forms: either by the mean-based residuals

(Ū_1, . . . , Ū_m) ≡ (X_1 − X̄, . . . , X_m − X̄)   (80)

and

(V̄_1, . . . , V̄_n) ≡ (Y_1 − Ȳ, . . . , Y_n − Ȳ)   (81)

if E_F(X_i) ≡ µ_F and E_G(Y_j) ≡ µ_G can be assumed to be finite so the sample means X̄ and Ȳ are stable; or else by the median-based residuals

(Ũ_1, . . . , Ũ_m) ≡ (X_1 − X̃, . . . , X_m − X̃)   (82)

and

(Ṽ_1, . . . , Ṽ_n) ≡ (Y_1 − Ỹ, . . . , Y_n − Ỹ),   (83)
since the sample medians X̃ and Ỹ are stable more generally.

In the first case, the correlation between any pair Ū_k, Ū_l is −1/(m − 1) while that between any pair V̄_k, V̄_l is −1/(n − 1). Therefore, if both m and n are moderately large, we can treat the {Ū_i} and the {V̄_j} as approximately independent observations from the cdfs F̄_0(u) ≡ F(u + µ_F) and Ḡ_0(u) ≡ G(u + µ_G) respectively; we wish to test F̄_0 = Ḡ_0 against F̄_0 ≠ Ḡ_0 based on the {Ū_i} and the {V̄_j}. This resembles the classical non-parametric two-sample testing problem with a general alternative, except that here the alternative F̄_0 ≠ Ḡ_0 is restricted by the fact that the distributions determined by F̄_0 and Ḡ_0 both have mean 0. Thus a standard two-sample test such as the Wilcoxon test is inapplicable here since it is intended to detect a shift in location, or more generally, a stochastically ordered alternative.

Several researchers, noting this limitation of the Wilcoxon test, have proposed omnibus nonparametric two-sample tests intended to detect general alternatives, i.e., those where the two distributions differ either in location or scale or, more generally, in overall shape. Classical examples include the Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling two-sample tests. (See References below for these and other omnibus tests.) Here, however, we wish to detect alternatives that differ in scale or overall shape but not in location. Nonetheless, because it may be difficult [True???] to derive tests geared for this form of alternative, at this point it seems simplest to use one of the aforementioned omnibus tests.

If the median-based residuals {Ũ_i} and {Ṽ_j} are used instead then the above discussion applies with F̄_0 and Ḡ_0 replaced by F̃_0(u) ≡ F(u + m_F) and G̃_0(u) ≡ G(u + m_G), where m_F and m_G are the medians of F and G respectively. Again two-sample tests intended to detect a shift of location are not desired; instead the omnibus tests mentioned above can be applied to {Ũ_i} and {Ṽ_j}.

[Need more discussion? Numerical examples...]
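As a minimal illustration of the mean- and median-based recipes (a sketch assuming numpy and scipy; the function name is ours, and the Kolmogorov-Smirnov statistic stands in for any of the omnibus tests), one can center each sample and apply a two-sample test to the residuals:

```python
import numpy as np
from scipy.stats import ks_2samp

def shape_affinity_test(x, y, center=np.mean):
    """Two-sample KS test applied to location-centered residuals.

    center=np.mean gives the mean-based residuals (80)-(81);
    center=np.median gives the median-based residuals (82)-(83).
    """
    u = x - center(x)
    v = y - center(y)
    return ks_2samp(u, v)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=500)        # sample from F
y_shift = rng.normal(loc=0.0, scale=1.0, size=500)  # same shape as F, shifted
y_scale = rng.normal(loc=0.0, scale=2.0, size=500)  # differs from F in scale

stat_null = shape_affinity_test(x, y_shift).statistic  # small: shapes agree
stat_alt = shape_affinity_test(x, y_scale).statistic   # large: shapes differ
```

Note that centering induces the slight negative correlation −1/(m − 1) among the residuals discussed above, so the nominal KS null distribution is only approximate; for moderate m and n this effect is negligible.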

5.2 Testing for a nonnegative shift parameter

Now impose the nonnegativity restriction t ≥ 0 on the shift parameter under the null hypothesis in 2. A naïve approach is to test this restricted null hypothesis in two steps: first test that the two distributions have a common shape, i.e., test F̄_0 = Ḡ_0 against F̄_0 ≠ Ḡ_0 based on the {Ū_i} and the {V̄_j} as above (or use the median-based versions). Second, if


this hypothesis is accepted, that is, F(x) = G(x + t) for some real t, now use the original observations {X_i} and {Y_j} to test that t ≥ 0 vs. t < 0. For this second step, the two-sample Wilcoxon test for a nonparametric shift alternative will be appropriate. The overall significance level for this two-step test can be approximated by the Bonferroni bound.

[Need more discussion? Numerical examples...]

References

[1] Doksum, Kjell (1974). Empirical probability plots and statistical inference for nonlinear models in the two-sample case. Ann. Statist. 2 267-277.
Abstract: Let X and Y be two random variables with continuous distribution functions F and G and means µ and ξ. In a linear model, the crucial property of the contrast ∆ = ξ − µ is that X + ∆ =_d Y. When the linear model does not hold, there is no real number ∆ such that X + ∆ =_d Y. However, it is shown that if parameters are allowed to be function valued, there is essentially only one function ∆(·) such that X + ∆(X) =_d Y, and this function can be defined by ∆(x) = G^{−1}(F(x)) − x. The estimate ∆̂_N(x) = Ĝ_n^{−1}(F̂_m(x)) − x of ∆(x) is considered, where Ĝ_n and F̂_m are the empirical distribution functions. Confidence bands based on this estimate are given and the asymptotic distribution of ∆̂_N(·) is derived. For general models in analysis of variance, contrasts that can be expressed as sums of differences of means can be replaced by sums of functions of the above kind.

[2] Pettitt, A. N. (1976). A two-sample Anderson-Darling rank statistic. Biometrika 63 161-168.
Abstract: A two-sample Anderson-Darling statistic is introduced and small-sample percentage points are given. An approximation to the distribution is also given. The statistic is related to Wilcoxon and Mood’s rank statistics. Asymptotic power comparisons are made with other two-sample rank statistics for shifts in location and scale.

[3] Scholz, F. W. and Stephens, M. A. (1987). K-sample Anderson-Darling tests. J. Amer. Statist. Assoc. 82 918-924.
Abstract: Two k-sample versions of an Anderson-Darling rank statistic are proposed for testing the homogeneity of samples. Their asymptotic null distributions are derived for the continuous as well as the discrete case. In the continuous case the asymptotic distributions


coincide with the (k − 1)-fold convolution of the asymptotic distribution for the Anderson-Darling one-sample statistic. The quality of this large sample approximation is investigated for small samples through Monte Carlo simulation. This is done for both versions of the statistic under various degrees of data rounding and sample size imbalances. Tables for carrying out these tests are provided, and their usage in combining independent one- or k-sample Anderson-Darling tests is pointed out. The test statistics are essentially based on a doubly weighted sum of integrated squared differences between the empirical distribution functions of the individual samples and that of the pooled sample. One weighting adjusts for the possibly different sample sizes, and the other is inside the integration, placing more weight on tail differences of the compared distributions. The two versions differ mainly in the definition of the empirical distribution function. These tests are consistent against all alternatives. The use of these tests is two-fold: (a) in a one-way analysis of variance to establish differences in the sampled populations without making any restrictive parametric assumptions or (b) to justify the pooling of separate samples for increased sample size and power in further analyses. Exact finite sample mean and variance formulas for one of the two statistics are derived in the continuous case. It appears that the asymptotic standardized percentiles serve well as approximate critical points of the appropriately standardized statistics for individual sample sizes as low as 5. The application of the tests is illustrated with an example. Because of the convolution nature of the asymptotic distribution, a further use of these critical points is possible in combining independent Anderson-Darling tests by simply adding their test statistics.

[4] Podgor, M. J., Gastwirth, J. L. (1994). On non-parametric and generalized tests for the two-sample problem with location and scale change alternatives. Statistics in Medicine 13(5-7) 747-758.
Abstract: Various tests have been proposed for the two-sample problem when the alternative is more general than a simple shift in location: non-parametric tests; O’Brien’s generalized t and rank sum tests; and other tests related to the t. We show that the generalized tests are directly related to non-parametric tests proposed by Lepage. As a result, we obtain a wider, more flexible class of O’Brien-type procedures which inherit the level robustness property of non-parametric tests. We have also computed the tests’ empirical sizes and powers under several models. The non-parametric procedures and the related O’Brien-type tests are valid and yield good power in the settings investigated. They are preferable to the t-test and related procedures whose type I errors differ noticeably from nominal size for skewed and long-tailed distributions.

[5] Baumgartner, W., Weiss, P., Schindler, H. (1998). A nonparametric test for the general two-sample problem. Biometrics 54 1129-1135.
Abstract: For two independently drawn samples of data, a novel statistical test is proposed for the null hypothesis that both samples originate from the same population. The underlying distribution function does not need to be known but must be continuous, i.e., it is a nonparametric test. It is demonstrated for suitable examples that the test is easy to apply and is at least as powerful as the commonly used nonparametric tests, i.e., the Kolmogorov-Smirnov, the Cramér-von Mises, and the Wilcoxon tests.

[6] Büning, Herbert (2001). Kolmogorov-Smirnov and Cramér-von Mises type two-sample tests with various weight functions. Communications in Statistics B 30 847-865.
Abstract: For the general two-sample problem we introduce modifications of the Kolmogorov-Smirnov and Cramér-von Mises tests by using various weight functions. We compare these modified tests with the classical Kolmogorov-Smirnov and Cramér-von Mises tests as well as with the Lepage test for location and scale alternatives, including the same shape and different shapes of the distributions of the X- and Y-variables. The power comparison of the tests is carried out via Monte Carlo simulation assuming short-, medium- and long-tailed distributions as well as distributions skewed to the right. It turns out there is mostly a considerable gain of power by applying these modified versions of the Kolmogorov-Smirnov and Cramér-von Mises tests. On the basis of the power results an adaptive test is proposed which takes into account the given data set.
[7] Büning, Herbert (2002). Robustness and power of modified Lepage, Kolmogorov-Smirnov and Cramér-von Mises two-sample tests. J. Applied Statistics 29(6) 907-924.
Abstract: For the two-sample problem with location and/or scale alternatives, as well as different shapes, several statistical tests are presented, such as of Kolmogorov-Smirnov and Cramér-von Mises type for the general alternative, and such as of Lepage type for location and scale alternatives. We compare these tests with the t-test and other location tests, such as the Welch test, and also the Levene test for scale. It turns out that there is, of course, no clear winner among the tests but, for symmetric distributions with the same shape, tests of Lepage type are the best ones whereas, for different shapes, Cramér-von Mises type tests are preferred. For extremely right-skewed distributions, a modification of the Kolmogorov-Smirnov test should be applied.

[8] Neuhäuser, M. (2005). Exact tests based on the Baumgartner-Weiss-Schindler statistic – a survey. Statistical Papers 46 1-29.
Abstract: It is the purpose of this paper to review recently-proposed exact tests based on the Baumgartner-Weiss-Schindler statistic and its modification. Except for the generalized Behrens-Fisher problem, these tests are broadly applicable, and they can be used to compare two groups irrespective of whether or not ties occur. In addition, a nonparametric trend test and a trend test for binomial proportions are possible. These exact tests are preferable to commonly-applied tests, such as the Wilcoxon rank sum test, in terms of both type I error rate and power.

[9] Cao, Ricardo; Van Keilegom, Ingrid (2006). Empirical likelihood tests for two-sample problems via nonparametric density estimation. Canadian J. Statistics 34(1) 61-77.
Abstract: The authors study the problem of testing whether two populations have the same law by comparing kernel estimators of the two density functions. The proposed test statistic is based on a local empirical likelihood approach. They obtain the asymptotic distribution of the test statistic and propose a bootstrap approximation to calibrate the test. A simulation study is carried out in which the proposed method is compared with two competitors, and a procedure to select the bandwidth parameter is studied. The proposed test can be extended to more than two samples and to multivariate distributions.

[10] Zhang, J. (2006). Powerful two-sample tests based on the likelihood ratio. Technometrics 48(1) 95-103.
Abstract: A new approach to constructing nonparametric tests for the general two-sample problem is proposed. This approach not only generates traditional tests (including the two-sample Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling tests), but also produces new powerful tests based on the likelihood ratio. Although conventional two-sample tests are sensitive to the difference in location, most of them lack power to detect changes in scale and shape. The new tests are location-, scale-, and shape-sensitive, so they are robust against variation in distribution.

[11] Neuhäuser, Markus; Leuchs, Ann-Kristin; Ball, Dorothee (2011). A new location-scale test based on a combination of the ideas of Levene and Lepage. Biometrical Journal 53(3) 525-534.
Abstract: Lepage’s test combines the Wilcoxon rank-sum and the Ansari-Bradley statistics. We propose to replace the latter statistic by a Wilcoxon rank-sum calculated after Levene’s transformation. We use the medians for this transformation, i.e., absolute deviations from sample medians are calculated. The new location-scale test can be carried out as a permutation test based on permutations of the original observations; the Levene transformation has to be applied for each permutation in an intermediate step to calculate the test statistic. Simulations indicate that the new test can be more powerful than an O’Brien-type test and Lepage’s test, the latter being the standard nonparametric location-scale test. The new test is illustrated using real data about colony sizes of yellow-eyed penguins, and an SAS program to perform the test is freely available.

Acknowledgements

Warm thanks to Fritz Scholz, Galen Shorack, and Jon Wellner for helpful discussions.
