Journal of Modern Applied Statistical Methods, Volume 3, Issue 2, Article 12 (November 1, 2004)

On Comparison of Hypothesis Tests in the Bayesian Framework without Loss Function

Vladimir Gercsik, Russian Academy of Science
Mark Kelbert, University of Wales-Swansea, United Kingdom, [email protected]

Recommended Citation: Gercsik, Vladimir and Kelbert, Mark (2004). "On Comparison of Hypothesis Tests in the Bayesian Framework without Loss Function," Journal of Modern Applied Statistical Methods: Vol. 3, Iss. 2, Article 12. DOI: 10.22237/jmasm/1099267920. Available at: http://digitalcommons.wayne.edu/jmasm/vol3/iss2/12


Copyright © 2004 JMASM, Inc. 1538 – 9472/04/$95.00

Journal of Modern Applied Statistical Methods November, 2004, Vol. 3, No. 2, 399-405

On Comparison of Hypothesis Tests in the Bayesian Framework without Loss Function

Vladimir Gercsik
International Institute of Earthquake Prediction Theory & Mathematical Geophysics, Russian Academy of Science

Mark Kelbert
Department of Mathematics, University of Wales-Swansea, United Kingdom

The problem considered is how to compare the quality of different hypothesis tests in a Bayesian framework without introducing a loss function. Three different linear orders on the set of all possible hypothesis tests are studied. The most natural order is given by the Fisher information between the indicator of event and the indicator of decision.

Key words: Bayesian risk, Neyman-Pearson test, Fisher information, tests of independence

Vladimir Gercsik is a Senior Researcher in geophysics and earthquake prediction. Mark Kelbert is a Reader in statistics. Email him at [email protected].

Introduction

It is well known that no universal measure of the quality of a hypothesis test exists in statistics. In the Bayesian framework a linear ordering of tests is possible once a loss function is given: a test is said to be optimal if it minimizes the Bayesian risk. However, the selection of a loss function is often somewhat arbitrary, and it is not always natural to measure the losses from the different types of errors in the same units. For example, the cost of prevention measures in earthquake prediction is naturally expressed in monetary units, but the losses from an earthquake, including psychological trauma, maiming and even the loss of human life, can hardly be expressed in monetary terms. Even if such a valuation is imposed, any estimate of these losses in monetary terms depends a great deal on the changing economic and political situation, so the resulting loss function hardly looks neutral and scientifically unbiased. The subject of interest here is the situation in which the loss function is unknown but the quality of any two hypothesis tests must still be compared quantitatively. It turns out that all hypothesis tests can be linearly ordered in at least three different ways.

Let p and 1 − p be the Bayesian probabilities that a random experiment ω ∈ Ω follows the cumulative distribution function F1 or F2, respectively. A test is defined by a function Φ(ω) that equals 1 on the critical set B and 0 on its complement Bc. If Φ(ω) = 1 the alternative F2 is accepted, and if Φ(ω) = 0 the hypothesis F1 is accepted. Clearly, the problem is symmetric with respect to an interchange of the hypotheses F1 and F2 together with a simultaneous interchange of Φ(ω) and 1 − Φ(ω). Denote the Type I error (the probability of accepting F2 when F1 is true) by α1, and the Type II error (the probability of accepting F1 when F2 is true) by α2. Only unbiased tests (Barra, 1981) are considered, i.e., it is assumed that α1 + α2 ≤ 1. (If this condition is violated, an unbiased test is obtained by selecting Bc instead of B as the critical set.) Consider the random variable X1 = 0 if F1 is true and X1 = 1 if F2 is true, called the indicator of event. In a similar manner define the random variable X2 = 0 if the test accepts F1 and X2 = 1 if the test accepts F2, called the indicator of decision. The joint distribution P(x1, x2) is determined by the relations α1 = P(X2 = 1 | X1 = 0) and α2 = P(X2 = 0 | X1 = 1); in particular,


P(X2 = 0, X1 = 0) = p(1 − α1),  P(X2 = 1, X1 = 0) = pα1,   (1)

P(X2 = 0, X1 = 1) = (1 − p)α2,  P(X2 = 1, X1 = 1) = (1 − p)(1 − α2).   (2)

The marginal one-dimensional probabilities take the form

P(X1 = 0) = p,  P(X1 = 1) = 1 − p,  P(X2 = 0) = p(1 − α1) + α2(1 − p),   (3)

P(X2 = 1) = pα1 + (1 − p)(1 − α2).   (4)

Clearly, the worst possible unbiased test is determined by the condition that the indicator of event X1 and the indicator of decision X2 are independent. At the other extreme, an ideal test (which normally does not exist) is one that provides the correct decision without any errors. Generically, the quality of a test is measured by some non-negative function of X1 and X2 = Φ(ω) which takes the value 0 iff X1 and X2 are independent, and its maximal value iff X1 = X2.

Measuring the quality of a test

Any of the following well-known functions used in tests of independence (Rachev, 1991) is acceptable as a measure of the quality of a test:

β1(Φ) = sup_{x1, x2} | P(x1, x2) − P(x1)P(x2) |,   (5)

β2(Φ) = E_{x1} [ sup_{x2} | P(x2 | x1) − P(x2) | ]   (6)

(E_{x1} stands for the expectation with respect to the distribution of X1). It is easy to check that

β1(Φ) = β2(Φ) = p(1 − p)(1 − α1 − α2).   (7)

Thus the quality of a test is measured by β = 1 − α1 − α2. This measure is quite popular in practice, as the Bayesian risk R = E[w] = α1 + α2 appears for the simplest loss function w(x1, x2): w(0, 0) = w(1, 1) = 0, w(0, 1) = w(1, 0) = 1.
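As a simple illustration, the joint distribution (1)-(2) can be tabulated for given values of p, α1 and α2, and β1(Φ) evaluated directly from definition (5). The short Python sketch below (the helper names are arbitrary) reproduces relation (7) numerically:

```python
# Illustrative sketch: build the joint distribution (1)-(2) and check relation (7).
import itertools

def joint(p, a1, a2):
    """Joint distribution P(x1, x2) of the indicator of event and the indicator of decision."""
    return {(0, 0): p * (1 - a1), (0, 1): p * a1,             # F1 is true
            (1, 0): (1 - p) * a2, (1, 1): (1 - p) * (1 - a2)}  # F2 is true

def beta1(p, a1, a2):
    """beta_1(Phi) = sup_{x1,x2} |P(x1, x2) - P(x1)P(x2)|, definition (5)."""
    P = joint(p, a1, a2)
    Px1 = {x1: P[(x1, 0)] + P[(x1, 1)] for x1 in (0, 1)}
    Px2 = {x2: P[(0, x2)] + P[(1, x2)] for x2 in (0, 1)}
    return max(abs(P[(x1, x2)] - Px1[x1] * Px2[x2])
               for x1, x2 in itertools.product((0, 1), repeat=2))

p, a1, a2 = 1/3, 0.1, 0.2
print(beta1(p, a1, a2))             # 0.1555...
print(p * (1 - p) * (1 - a1 - a2))  # the same value, as claimed in (7)
```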

Another possibility to test the independence is to consider the maximal correlation coefficient

ρ(Φ) = sup E[ φ1(X1) φ2(X2) ],   (8)

where the sup is taken over the set of functions φ1, φ2 such that E[φi(Xi)] = 0 and σ²[φi(Xi)] = E[φi(Xi)²] = 1, i = 1, 2. Clearly, ρ(Φ) = 0 iff X1 and X2 are independent. As each function φi(x), i = 1, 2, can take only two values φi(0) and φi(1), and these values are determined uniquely by the conditions imposed, the following relation holds:

ρ(Φ) = E[ (X1 − E(X1))(X2 − E(X2)) ] / ( σ(X1) σ(X2) ).   (9)

A straightforward computation yields

ρ(Φ) = β √( p(1 − p) / (P(1 − P)) ),   (10)

where P = P(X2 = 0) = α2(1 − p) + (1 − α1)p = α2 + βp. The correlation coefficient ρ(Φ) is non-negative for any unbiased test, and it equals 1 when X1 = X2.

Perhaps the most interesting way to measure the quality of a test is to consider the information I(Φ) contained in the indicator of decision X2 = Φ about the indicator of event X1. To formalize this idea, consider

I(Φ) = Σ_{x1 = 0,1} Σ_{x2 = 0,1} P(x1, x2) log2 [ P(x1, x2) / ( P(x1) P(x2) ) ],   (11)

which equals I(Φ) = S(X1) − E[S(X1 | X2)]. Here S(X1) stands for the entropy of the prior distribution P(x1), S(X1) = −p log2 p − (1 − p) log2(1 − p). Intuitively, this means that in a random trial with n outcomes, in which the hypotheses F1 and F2 appear with probabilities p and 1 − p respectively, there are ≈ 2^{nS} quite probable outcomes, and the rest can be neglected as n → ∞.
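Both ρ(Φ) and I(Φ) are likewise easy to evaluate numerically for given p, α1 and α2. The sketch below (the function names are arbitrary) uses formula (10) for the correlation coefficient and definition (11) for the information:

```python
import math

def rho(p, a1, a2):
    """Maximal correlation coefficient via (10): rho = beta * sqrt(p(1-p) / (P(1-P)))."""
    beta = 1 - a1 - a2
    P = a2 * (1 - p) + (1 - a1) * p                  # P = P(X2 = 0)
    return beta * math.sqrt(p * (1 - p) / (P * (1 - P)))

def information(p, a1, a2):
    """Information I(Phi) from definition (11), in bits."""
    joint = {(0, 0): p * (1 - a1), (0, 1): p * a1,
             (1, 0): (1 - p) * a2, (1, 1): (1 - p) * (1 - a2)}
    Px1 = {0: p, 1: 1 - p}
    Px2 = {x2: joint[(0, x2)] + joint[(1, x2)] for x2 in (0, 1)}
    return sum(q * math.log2(q / (Px1[x1] * Px2[x2]))
               for (x1, x2), q in joint.items() if q > 0)

print(rho(1/3, 0.1, 0.2), information(1/3, 0.1, 0.2))
# For a nearly ideal test (a1 = a2 ~ 0) the information approaches the prior entropy S(X1).
print(information(1/3, 1e-12, 1e-12), -(1/3) * math.log2(1/3) - (2/3) * math.log2(2/3))
```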


Next, S(X1 | X2) is the conditional entropy of the conditional distribution P(x1 | x2), given that the decision x2 is taken. Therefore, the information I(Φ) equals the mean reduction of uncertainty obtained by using the test Φ. Clearly, for an ideal test without any errors I(Φ) takes its maximal value S(X1), and I(Φ) = 0 iff X1 and X2 are independent. It is interesting to note that a sequence of random events X1(n) and decisions X2(n), n = 1, 2, …, can be treated as a message transmitted over a channel without noise. In this interpretation the observation x1(n)(ω) can be treated as the coding of the random outcome Fi, i = 1, 2, and the decision x2(n)(ω) as its decoding. The maximization of I(Φ) then means that the optimal decoding is applied. The following is now proved.

Lemma 1. Fix the Type I error probability α1. Then the Neyman-Pearson test Φ* minimizing the Type II error probability α2 also maximizes the information I(Φ*) among all unbiased tests.

Proof.

A straightforward computation yields

∂I/∂α2 = (1 − p) [ log2( 1 − p + pα1/(1 − α2) ) − log2( 1 − p + p(1 − α1)/α2 ) ].   (12)

This derivative is non-positive for an unbiased test with α1 + α2 ≤ 1. Hence, the information I(Φ) is maximal for the minimal possible value of α2, i.e., for the test Φ*. A symmetric statement with α1 and α2 interchanged is also true. •

The same property holds also for β(Φ) and ρ(Φ). It is immediate for β(Φ), and for ρ(Φ) it follows from the equality

ρ(Φ) = (1 − α1 − α2) √( p(1 − p) / (P(1 − P)) ),  P = α2 + (1 − α1 − α2)p,   (13)

which yields

∂ρ/∂α2 = − [ α1 P + (1 − α1)(1 − P) ] / ( 2P(1 − P) ) · √( p(1 − p) / (P(1 − P)) ) ≤ 0   (14)

in the case of ρ(Φ). Next, observe that the Fisher information I(Φ) is a convex function of α1 and α2 for all 0 < α1 < 1, 0 < α2 < 1, 0 < p < 1. To prove this, it suffices to compute

∂²I/∂α1² = p [ P(1 − P) − pα1(1 − α1) ] / ( ln 2 · α1(1 − α1) P(1 − P) ),   (15a)

∂²I/∂α2² = (1 − p) [ P(1 − P) − (1 − p)α2(1 − α2) ] / ( ln 2 · α2(1 − α2) P(1 − P) ),   (15b)

∂²I/∂α1∂α2 = p(1 − p) / ( ln 2 · P(1 − P) ),   (15c)

det( ∂²I/∂αi∂αj ) = p²(1 − p)²(1 − α1 − α2)² / ( ln²2 · P(1 − P) α1(1 − α1) α2(1 − α2) ) > 0.   (15d)

This convexity property implies that a randomized test Φ̄ = ∫_{Φ∈A} Φ P(dΦ), where P(dΦ) is a probability measure on a suitable set A of tests with P(A) = 1, cannot be optimal in the sense of the Fisher information. Indeed, Jensen's inequality yields

I(Φ̄) = I( ∫_{Φ∈A} Φ P(dΦ) ) ≤ ∫_{Φ∈A} I(Φ) P(dΦ) ≤ sup_{Φ∈A} I(Φ).   (16)

Hence, there always exists a non-randomized test Φ′ such that I(Φ̄) ≤ I(Φ′). Clearly, a similar statement holds for β(Φ) as well. However, it is not true generically for ρ(Φ).
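These monotonicity and convexity properties of I(Φ) are also easy to check numerically. The sketch below is only an illustration (the test point and the step size are chosen arbitrarily); it uses the symmetric entropy form I(Φ) = S(X2) − E[S(X2 | X1)], which equals (11), and finite differences:

```python
import math

def info(p, a1, a2):
    """I(Phi) as a function of the error probabilities, written as S(X2) - E[S(X2|X1)], in bits."""
    P = p * (1 - a1) + (1 - p) * a2                                  # P(X2 = 0)
    h = lambda q: -q * math.log2(q) - (1 - q) * math.log2(1 - q)     # binary entropy
    return h(P) - p * h(a1) - (1 - p) * h(a2)

p, a1, a2, eps = 1/3, 0.1, 0.2, 1e-4
dI_da2  = (info(p, a1, a2 + eps) - info(p, a1, a2 - eps)) / (2 * eps)
d2I_da2 = (info(p, a1, a2 + eps) - 2 * info(p, a1, a2) + info(p, a1, a2 - eps)) / eps**2
print(dI_da2 <= 0, d2I_da2 >= 0)   # True True for this unbiased test (a1 + a2 <= 1)
```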

Fig. 1 illustrates this failure for ρ(Φ) by presenting the surface ρ(α1, α2) for p = 1/3, and Fig. 2 presents the surface I(α1, α2) for p = 1/3. Each of the characteristics β(Φ), ρ(Φ) and I(Φ) divides the set of all tests into equivalence classes with the same value of the characteristic inside each class, and defines a linear order between the different equivalence classes.

Numerical Examples

Figure 3 presents the results of the computation of optimal tests with respect to the three criteria above. As the hypotheses F1 and F2, select the normal distributions with pdfs

f1(x) = (1/(σ√(2π))) exp( −(x + 0.5)² / (2σ²) ),  f2(x) = (1/(σ√(2π))) exp( −(x − 0.5)² / (2σ²) ),   (17)

respectively. The prior probability of F1 appearing is p = 1/3. Fig. 3 presents the Type II error α2 = g(α1) for the Neyman-Pearson problem in the cases σ² = 0.2, σ² = 0.6 and σ² = 1. Each of these curves represents all the tests of the form

Φ(x) = 1 if f2(x) ≥ λ f1(x),  Φ(x) = 0 if f2(x) < λ f1(x),   (18)

with different λ; these tests are optimal for the Neyman-Pearson problem, i.e., they minimize α2 for a given value of α1. The error curves serve as the boundary of the convex domain of errors for all possible tests. The points (α1, α2) on the boundary where each of the three characteristics of quality achieves its maximum are indicated; clearly, the position of the maximum is distinct in all three cases.

Checking Hypothesis of Uselessness of a Test

It is desirable to use empirical data for checking the uselessness of a test Φ. In the case of independence of the indicator of event and the indicator of decision, each of the three characteristics of quality above equals 0. To reject the hypothesis of uselessness of the test Φ with some level of confidence, it is sufficient to demonstrate that α1 + α2 < 1 with this level of confidence.

Consider a series of n independent trials in which the number M of appearances of the distribution F2 equals m, the number L of selections of F2 by the test Φ equals l, and the number K of correct choices of F2 by the test Φ equals k. The hypothesis of uselessness is formalized as H0: α1 + α2 = 1. If H0 is true, then (with unknown probability p)

P(X1 = 0) = p,  P(X2 = 0) = α2(1 − p) + (1 − α1)p = α2,   (19)

and X1 and X2 are independent. The probabilities P(X2 = 0, X1 = 0), P(X2 = 1, X1 = 0), P(X2 = 0, X1 = 1) and P(X2 = 1, X1 = 1) are estimated by the empirical frequencies (n − m − l + k)/n, (l − k)/n, (m − k)/n and k/n, respectively. Hence, the estimates of the conditional probabilities α1 = P(X2 = 1 | X1 = 0) and α2 = P(X2 = 0 | X1 = 1) are (l − k)/(n − m) and (m − k)/m, respectively. Clearly, the inequality k > ml/n corresponds to the alternative H1: α1 + α2 < 1. This means that the critical sets have the form {K > K*} for given values M = m and L = l. If the hypothesis H0 is true, then the independence of X1 and X2 and the independence of the different trials imply an explicit expression for the test size (here C(n, m) denotes the binomial coefficient):

P(K > K*, M = m, L = l)
= C(n, m) p^m (1 − p)^{n−m} Σ_{k = K*}^{min{m, l}} C(m, k) q^k (1 − q)^{m−k} C(n − m, l − k) q^{l−k} (1 − q)^{n−m−l+k}
= C(n, m) p^m (1 − p)^{n−m} q^l (1 − q)^{n−l} Σ_{k = K*}^{min{m, l}} C(n − m, l − k) C(m, k).   (20)


Figure 1. Dependence of correlation coefficient on errors.

Figure 2. Dependence of Fisher information coefficient on errors.


Figure 3. Location of maximum points for different characteristics of test efficiency.

Figure 4. Dependence of maxima locations for different characteristics of test efficiency on variance.

Fig. 4 presents the dependence of the locations of the maximum points αi(β), αi(ρ) and αi(I), i = 1, 2, for the three characteristics of quality above, as functions of σ². Observe that the Type II error tends to 0 as σ² → 0, since the pdf f2 tends to a delta-function located at the point x = 0.5.
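The curves in Figure 3 and the locations of the maxima can be reproduced along the following lines. The sketch below is only an illustration under stated assumptions: for the densities (17) the likelihood-ratio test (18) reduces to the threshold test x ≥ σ² ln λ, the grid of thresholds is arbitrary, and only the case σ² = 1 is treated:

```python
import math
from statistics import NormalDist

p, sigma = 1/3, 1.0                                # prior of F1 and common standard deviation
F1, F2 = NormalDist(-0.5, sigma), NormalDist(0.5, sigma)

def errors(t):
    """Errors of the test 'accept F2 iff x >= t', i.e. (18) with t = sigma**2 * ln(lambda)."""
    return 1 - F1.cdf(t), F2.cdf(t)                # (Type I error, Type II error)

def beta(a1, a2):
    return 1 - a1 - a2

def rho(a1, a2):
    P = a2 * (1 - p) + (1 - a1) * p
    return beta(a1, a2) * math.sqrt(p * (1 - p) / (P * (1 - P)))

def info(a1, a2):
    h = lambda q: 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)
    P = p * (1 - a1) + (1 - p) * a2
    return h(P) - p * h(a1) - (1 - p) * h(a2)

thresholds = [i / 1000 - 3 for i in range(6001)]   # t in [-3, 3]
for name, crit in (("beta", beta), ("rho", rho), ("I", info)):
    t_best = max(thresholds, key=lambda t: crit(*errors(t)))
    print(name, t_best, errors(t_best))            # the three maxima sit at distinct (a1, a2)
```

Repeating the loop for other values of σ² (e.g., 0.2 and 0.6) and recording the maximizing (α1, α2) yields the kind of dependence on the variance shown in Figure 4.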

Here, as in (20), q = α2. Using the equalities

P(M = m) = C(n, m) p^m (1 − p)^{n−m},  P(L = l) = C(n, l) q^l (1 − q)^{n−l},   (21)

one obtains

P(K ≥ K* | M = m, L = l) = Σ_{k = K*}^{min{m, l}} C(m, k) C(n − m, l − k) / C(n, l).   (22)

This conditional probability represents the level of confidence for the critical region {K > K*} for given values M = m and L = l.
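For example, the conditional probability (22) can be evaluated directly for observed counts. In the sketch below the counts n, m, l and the critical value K* are arbitrary illustrative numbers:

```python
from math import comb

def confidence(n, m, l, k_star):
    """P(K >= K* | M = m, L = l) under H0: the hypergeometric tail in (22)."""
    return sum(comb(m, k) * comb(n - m, l - k)
               for k in range(k_star, min(m, l) + 1)) / comb(n, l)

# Example: n = 50 trials, F2 appeared m = 20 times, the test selected F2 l = 18 times,
# and the critical value is K* = 12.
print(confidence(50, 20, 18, 12))
```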

References

Barra, J. R. (1981). Mathematical basis of statistics. New York: Academic Press.

Rachev, S. (1991). Probability metrics and the stability of stochastic models. Chichester: John Wiley.
