Finite Sample Inference for the Mean of an Unknown Bounded Random Variable without Assumptions [1]

Karl H. Schlag [2]

May 25, 2007

[1] The author would like to thank Dean Foster, Fortunato Pesarin, Richard Spady and David Thesmar for comments. The author would also like to thank Dr. Manu Sabeti for providing data.

[2] Economics Department, European University Institute, Via della Piazzuola 43, 50133 Florence, Italy, Tel: 0039-055-4685951, email: [email protected]

Abstract

Consider a finite sample of iid observations of a random variable that generates outcomes belonging to a known bounded set. No further assumptions are made on the underlying distributions or on how large the sample has to be; the setting can be nonparametric. We provide tight lower bounds for the type II error of any test with a given size provided both the null and the alternative hypothesis can be described in terms of the mean only. We also derive for given coverage a tight lower bound on the inaccuracy of any family of confidence intervals, where inaccuracy is measured by the maximal expected width. We then design a nonrandomized exact test that can be implemented in five simple steps. Explicit formulae for the upper bound of the type II error and for the inaccuracy of the associated family of confidence intervals are provided. We illustrate our methods in examples involving between 5 and 51 observations.

Keywords: exact, distribution-free, nonparametric inference, finite sample theory, hypothesis testing, accuracy.

JEL classification: C12, C14.

1 Introduction

We are interested in exact hypothesis tests for the mean of an unknown underlying random variable. Inference is based on a finite sample of independent observations. The specific feature of our analysis is that we focus on environments in which the support of the underlying data generating process is contained in a known bounded set. This bounded set can be a finite set consisting of three or more elements (to add distributions other than the binomial) or it can be an infinite set such as an interval. With this special focus, we do not have to make any further assumptions. In particular our approach is distribution-free, and when this bounded set contains infinitely many different outcomes it is also nonparametric.

We are keen to investigate inference that is only based on assumptions that can be verified (with probability one) and hence make no assumptions on the underlying distribution apart from its support. We find verifiability to be valuable in particular when there is a lot at stake, such as in policy evaluation or drug testing. Existence of a known bounded set containing the support allows us to derive results in terms of power and accuracy for a given sample size. Environments that generate such bounds are plentiful if not typical; they arise in practically any randomized laboratory or field experiment. Without such bounds, as first pointed out by Bahadur and Savage (1956), it is impossible to make nontrivial inference without adding nonverifiable elements such as parametric restrictions or other restrictions on distributions (shifts in a location parameter are typically unverifiable).

As an illustrative example we evaluate the effectiveness of a medical treatment after 25 patients have been treated and pain after the treatment has been measured on a scale from 0 to 100. In a second example we consider indices, by definition contained in [0, 1], that were constructed to measure the protection of minority shareholders against expropriation in 72 countries. We consider the data of this second example as a pseudo random experiment by assuming that indices belonging to countries with the same origin of law are independent. In both of these examples the setting reveals a bounded set containing any possible outcome.

Tests in terms of the mean represent a cornerstone of any statistical package. Tests in terms of means are typically a first step to investigate differences in distributions. They are the basis of regression analysis. Means play a central role, as a first evaluation of a data generating process is often in terms of the mean, possibly due to the central role that means play both in the description of a random variable and in the theory of decision making. Quantifying objectives in terms of means can simplify the analysis. Yet little is known about exact hypothesis tests for means outside of the Bernoulli setting.

There are some exact tests in the literature (see Diouf and Dufour, 2006, and Romano and Wolf, 2000, for overviews). However, theoretical results for finite samples other than size or coverage are not available. Romano and Wolf (2000) present asymptotic results, yet ever since Lehmann and Loh (1990) we know that caution is needed when using such results to gain insights for finite sample inference, even when the support is known to be bounded. When the environment is nonparametric one cannot simulate all possible underlying distributions. Consequently, comparisons based on simulations are only of little value if one is not willing to make parametric assumptions. Yet it is important to understand how existing tests compare before applying them to data, as otherwise one is tempted to try different methods on the same data set and then to select the best results without mentioning the other attempts. (There is a clear incentive for the practitioner not to mention failed attempts, as such multiple testing increases size and reduces coverage.) It is equally important to understand the restraints on inference imposed by small sample sizes.

How should and how can tests be compared? The answer is straightforward when one test is uniformly more powerful than the other, as this means that the type II error as well as the expected width of the confidence interval is smaller. However, tests can only rarely be uniformly ranked in terms of power, as the set of possible underlying distributions is too rich. Uniformly most powerful tests only exist in very special cases; in particular they do not exist for our setting. To ease comparability we consider only hypotheses that can be described in terms of the underlying mean only. It is as if we collect all distributions with the same mean in the same set (an equivalence class) and then consider the minimal power of a test within this set associated to a single mean. We thus construct particular maximin tests and consider inference that is only based on the parameter of interest, the mean. This enables us to identify a test we call parameter most powerful that makes the most powerful inference for all such hypotheses. For a given size this test minimizes the type II error whenever the alternative hypothesis can be described in terms of the mean only. Specifically, for a given coverage this test induces a family of confidence intervals that minimizes inaccuracy among all families of confidence intervals, where inaccuracy is measured in terms of maximal expected width. However, as this test (and its induced family of confidence intervals) is randomized, it is only of little practical use for making recommendations based on data. Its value lies in its characteristic of being the first benchmark for evaluating properties of a test for a given finite number of observations on an absolute scale. It uncovers the limits to inference when the null hypothesis cannot be rejected.
Unfortunately most existing exact tests are too intricate for us to compare their performance to this absolute finite sample benchmark. An exception is the test of Bickel et al. (1989), which however is so crude that it does not perform very well (see appendix). The challenge is to find a test that is simple enough that it can be compared to the benchmark yet sufficiently sophisticated to perform well in moderate sample sizes. We present a nonrandomized test that turns out to perform better than that of Bickel et al. (1989).

Our test relies on five simple steps. The crucial and novel feature is that it first adds noise to the data in order to simplify inference and then eliminates randomness to produce a recommendation that is not randomized. The specific way of adding noise was independently discovered by Cucconi (1968) for designing a nonparametric sequential probability ratio test, by Gupta and Hande (1992) for designing statistical procedures for making the best selection and by Schlag (2003) for learning without priors in a two armed bandit. After linearly normalizing the data so that outcomes are contained in [0, 1], the idea is to randomly transform each observation y_j ∈ [0, 1] into {0, 1} using a mean preserving transformation. Conditional on performing this transformation it is as if the statistician faces a Bernoulli distributed random variable with the same mean as the original data generating process. Consequently inference is simple: one can apply the well known uniformly most powerful one-sided test for Bernoulli distributed random variables. Use of the UMP test in this construction implies that we obtain a nonparametric test that minimizes inaccuracy and that attains minimal type II error whenever the alternative hypothesis can be described in terms of the mean only. In fact, this test is unbiased. However, as the recommendation of the UMP test is made conditional on the transformed outcomes, ex-ante this procedure will lead to a randomized recommendation and thus to randomized confidence intervals.

Randomization is then eliminated by using a simple cutoff strategy at the cost of increasing size and type II error and reducing coverage and accuracy. Our test relies on a parameter θ (to be specified) such that the final recommendation is to reject the null hypothesis if and only if the randomized test rejects the null with probability greater than θ. The case of θ = 0.5 has previously been used by Gupta and Hande (1992) for statistical decision making. We derive the properties of the test conditional on θ. It turns out that lower (higher) values of θ are better for reducing the type II error when the true mean is close to (further away from) the mean specified in the null hypothesis. We show how to select θ; in particular we suggest choosing the value of θ that minimizes the inaccuracy of the associated family of equi-tailed confidence intervals. Once θ has been selected, a recommendation can be derived for a given data set.

We illustrate the procedure in our two examples. We derive an upper confidence bound on the expected pain after the treatment and a confidence interval for the mean index of countries that have law systems with similar origins. In summary, this is the first analysis of power and/or accuracy when making nonparametric inference (in terms of hypothesis testing) involving means.

We proceed as follows. We introduce the five steps of our test in Section 2 and then present its properties for given threshold θ in Section 3. Section 4 contains the proofs, and in Section 5 we show how to select the threshold θ. In Section 6 we use our test to derive inferences in the two examples and then conclude in Section 7. In Appendix A we show how to extend our results to two-sided tests and to confidence intervals; in Appendix B we briefly compare our test to that of Bickel et al. (1989).

2 The Test

We introduce a nonparametric distribution-free test of the mean of a single sample and then present and prove some of its properties. Consider a random variable Y and let P be the underlying distribution, where it is only known that Y ∈ [ω₁, ω₂] for given ω₁, ω₂ ∈ ℝ with ω₁ < ω₂. No further assumptions are made on the distribution of outcomes. Let μ = μ(P) denote the unknown mean of Y. Fix some μ₀ ∈ (ω₁, ω₂). In the following we design an exact test for testing the null hypothesis H₀ : μ = μ₀ against the one-sided alternative hypothesis H₁ : μ > μ₀ based on N independent observations y₁, ..., y_N of Y. Let α be the desired size of this test.

We present a test that is indexed by a parameter θ ∈ (0, 1). Later we will show how to determine θ. The test can be described using five steps. (i) First normalize the outcomes so that we know that they are contained in [0, 1]. (ii) Next independently and randomly transform each normalized observation, belonging to [0, 1], into a binary valued outcome in {0, 1}. (iii) On this resulting binary valued sample evaluate the uniformly most powerful test based on size θα. (iv) Now repeat steps (ii) and (iii) infinitely often to generate a randomized recommendation. (v) Finally, reject the null hypothesis if and only if at the end of step (iv) the null hypothesis is rejected with probability greater than θ.

We now explain each step in more detail.

1. Normalization

Take each observation y_j and replace it with ŷ_j := (y_j − ω₁)/(ω₂ − ω₁), j = 1, ..., N. Thereafter it is as if we have a sample drawn from a random variable Ŷ := (Y − ω₁)/(ω₂ − ω₁) that has mean μ̂ := (μ − ω₁)/(ω₂ − ω₁). Set μ̂₀ := (μ₀ − ω₁)/(ω₂ − ω₁). Thus we are testing H₀ : μ̂ = μ̂₀ against H₁ : μ̂ > μ̂₀ where μ̂ = E[Ŷ] and Ŷ ∈ [0, 1].

2. Randomization Trick

In this step we transform {ŷ_j}_{j=1}^N into a binary valued sequence {ỹ_j}_{j=1}^N using the following random mean preserving transformation. Independently for each j = 1, ..., N set ỹ_j = 1 with probability ŷ_j and set ỹ_j = 0 with probability 1 − ŷ_j.

Note that ŷ_j = ỹ_j if and only if ŷ_j ∈ {0, 1}.

3. Evaluating the UMP Test

At first recall the uniformly most powerful (UMP) test for Bernoulli random variables for testing the null hypothesis that the mean is equal to μ̂₀ against the one-sided alternative that the mean is greater than μ̂₀, here formulated for size θα (see Rohatgi, 1976, Example 2, p. 415). Find γ ∈ [0, 1) and k ∈ {0, 1, ..., N} such that

    γ C(N, k) μ̂₀^k (1 − μ̂₀)^(N−k) + Σ_{j=k+1}^N C(N, j) μ̂₀^j (1 − μ̂₀)^(N−j) = θα.    (1)

The UMP test specifies to reject the null hypothesis if there are k + 1 or more successes (i.e. outcomes equal to 1) in the data, to reject the null with probability γ if there are k successes in the data, and not to reject the null otherwise.
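The computation of k and γ from (1) can be sketched as follows. This is an illustrative Python sketch; the function name and interface are our own, not taken from the paper:

```python
from math import comb

def ump_threshold(n, mu0, alpha):
    """Solve (1): find k in {0,...,n} and gamma in [0,1) with
    gamma*P(S = k) + P(S >= k+1) = alpha for S ~ Binomial(n, mu0)."""
    pmf = [comb(n, j) * mu0**j * (1 - mu0)**(n - j) for j in range(n + 1)]
    tail = 0.0  # P(S >= k+1), starting from P(S >= n+1) = 0
    for k in range(n, -1, -1):
        if tail + pmf[k] > alpha:
            return k, (alpha - tail) / pmf[k]  # gamma
        tail += pmf[k]
    raise ValueError("alpha must be smaller than 1")
```

Walking down from k = N, the loop stops at the largest k whose upper tail probability would exceed the size, and γ spends the remaining size on randomizing at exactly k successes, so (1) holds with equality.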

Apply this UMP test to the binary valued data ỹ = (ỹ_j)_{j=1}^N, so Σ_{j=1}^N ỹ_j is the number of successes, and record the probability of recommending to reject the null hypothesis, denoted by φ(ỹ), so φ(ỹ) ∈ [0, 1].

4. Repetition

Repeat steps (ii) and (iii) infinitely often, first applying the randomization trick and then determining the random recommendation φ(ỹ) for that realization of the transformation ỹ. Let ρ = ρ(y) denote the average recommendation based on this iteration. One can also avoid this loop by deriving the recommendation formally as a linear function of the (y_j)_j; however, the resulting formula is only of little practical use as it typically involves an extremely large number of summands.

5. Rounding Trick

The recommendation of our test procedure is to reject the null hypothesis if ρ > θ and not to reject the null hypothesis if ρ ≤ θ.

Above we described a one-sided test for the mean of a single sample against alternatives with higher mean. Using the standard methodology one can use it to construct lower con…dence bounds for the mean. Analogously one can also construct a one-sided test against alternatives with a lower mean to thus obtain upper con…dence bounds. Combining the two tests one then can generate equi-tailed two-sided tests and equi-tailed con…dence intervals for the mean.
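For concreteness, the five steps can be sketched in Python. The finite Monte Carlo approximation of the infinite repetition in step 4, the function names, and the interface are all our own assumptions, not part of the paper:

```python
import random
from math import comb

def ump_threshold(n, mu0, alpha):
    """Solve (1): find k and gamma with
    gamma*P(S = k) + P(S >= k+1) = alpha for S ~ Binomial(n, mu0)."""
    pmf = [comb(n, j) * mu0**j * (1 - mu0)**(n - j) for j in range(n + 1)]
    tail = 0.0  # P(S >= k+1), starting from P(S >= n+1) = 0
    for k in range(n, -1, -1):
        if tail + pmf[k] > alpha:
            return k, (alpha - tail) / pmf[k]
        tail += pmf[k]
    raise ValueError("alpha must be smaller than 1")

def schlag_test(y, omega1, omega2, mu0, alpha, theta, reps=20000, seed=0):
    """One-sided test of H0: mu = mu0 against H1: mu > mu0; True = reject."""
    rng = random.Random(seed)
    n = len(y)
    yhat = [(v - omega1) / (omega2 - omega1) for v in y]   # step (i): normalize to [0,1]
    mu0hat = (mu0 - omega1) / (omega2 - omega1)
    k, gamma = ump_threshold(n, mu0hat, theta * alpha)     # UMP test of size theta*alpha
    total = 0.0
    for _ in range(reps):                                  # step (iv): repeat (ii) and (iii)
        s = sum(rng.random() < v for v in yhat)            # step (ii): mean preserving binarization
        if s >= k + 1:                                     # step (iii): UMP rejection probability
            total += 1.0
        elif s == k:
            total += gamma
    rho = total / reps                                     # average recommendation
    return rho > theta                                     # step (v): rounding trick
```

When the data are already binary, the randomization in step (ii) is degenerate, so the sketch reproduces the UMP test with the rounding rule applied to a deterministic ρ.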

3 Properties

We present some properties of our test and prove these in the next section.

3.1 Hypothesis Testing

The test described in Section 2 is nonrandomized and has size α: it wrongly rejects the null hypothesis with probability at most α. It is exact, as this result holds for the given number N of observations and there is no constraint on how large N has to be.

The properties of this test in terms of power can be exhibited when considering the following nested set of alternative hypotheses indexed by μ₁ > μ₀. The null hypothesis remains unchanged, so H₀ : μ = μ₀, while the alternative hypothesis is given by H₁^{μ₁} : μ ≥ μ₁ with μ₁ > μ₀. The type II error of our test when testing against H₁^{μ₁} is bounded above by

    min{ 1, (1 − Λ(μ̂₀, μ̂₁, θα)) / (1 − θ) }    (2)

where Λ(μ̂₀, μ̂₁, α) is the power of the UMP test with size α for testing μ̂ = μ̂₀ against μ̂ ≥ μ̂₁ for Bernoulli distributed P. So

    Λ = Λ(μ̂₀, μ̂₁, α) = γ C(N, k) μ̂₁^k (1 − μ̂₁)^(N−k) + Σ_{j=k+1}^N C(N, j) μ̂₁^j (1 − μ̂₁)^(N−j)    (3)

where γ = γ(α) and k = k(α) are defined in (1).

We also derive the lower bound on the type II error of any test of size α based on N independent observations. This lower bound is equal to

    1 − Λ(μ̂₀, μ̂₁, α)    (4)

and is tight, as it is attained by the randomized test generated by stopping after step 4 and setting θ = 1 (so the UMP test with size α is evaluated in step 3). (In fact, this test is unbiased.) More generally, the randomized test defined by the first four steps and θ = 1 minimizes the type II error among all tests with size α whenever H₀ : μ = μ₀ or H₀ : μ ≤ μ₀ and the alternative hypothesis is defined in terms of means only, so H₁ : P ∈ {P : μ(P) ∈ A} where A is a closed subset of (μ₀, ω₂]. It is as if we collect all distributions that have the same mean into an equivalence class and then consider inference on these classes. In fact, this randomized test also minimizes the type II error when one relaxes the assumption that any outcome in [ω₁, ω₂] is possible and only requires that Y ∈ Ω for given Ω with {ω₁, ω₂} ⊆ Ω ⊆ [ω₁, ω₂].

As this test cannot be outperformed in terms of type II error when hypotheses are specified in terms of means only, and as the mean is the parameter of interest, we call it parameter most powerful (PMP) for size α. Comparing (2) and (4) we find that our nonrandomized test generates at most 1/θ the size and 1/(1 − θ) the type II error as compared to the PMP test that minimizes type II error for size θα.

Of course the upper bound on the type II error of our nonrandomized test presented in (2) is not tight. If the type II error of the UMP test used in step 3 is greater than 1 − θ then the upper bound presented for our nonrandomized test is set equal to 1. However, our nonrandomized test will correctly reject the null with a strictly positive probability; thus its true type II error is strictly below 1 for all μ₁ > μ₀.

We illustrate the performance of our test in Table 1. For θ = 0.25 and α = 0.05 we evaluate its type II error when μ₀ = 0.5 and μ₁ = 0.7 for different values of N. We contrast this to the minimal number of observations n needed to guarantee this type II error by evaluating the PMP test. Relative parameter efficiency of our test is then defined by n/N. So for instance when N = 50 then the type II error of our test is 0.35. If N ≤ 24 then 1 − Λ(0.5, 0.7, 0.05) > 0.35 while if N = 25 then 1 − Λ(0.5, 0.7, 0.05) = 0.34 ≤ 0.35. Thus n = 25 when N = 50 and relative parameter efficiency is equal to 50%.

Table 1: Upper Bound on Type II Error for μ₀ = 0.5, μ₁ = 0.7, θ = 0.25 and α = 0.05

    N                                       20     30     40     50     60     70     80     95
    Upper Bound on Type II Error            0.91   0.69   0.50   0.35   0.23   0.15   0.10   0.05
    n                                       2      9      17     25     34     42     51     64
    Relative Parameter Efficiency (= n/N)   10%    30%    43%    50%    57%    60%    64%    67%

It is quite plausible that there exists a nonrandomized test that performs better than ours. However, the PMP test gives insights into the limits of inference in small samples across all tests. From the above table it follows that at least 51 observations are needed for any test with size 0.05 for testing μ = 0.5 against μ ≥ 0.7 in order to obtain a type II error below 0.1. If instead N = 20 then we derive from (4) that μ₁ ≥ 0.8 is necessary for the type II error of a test to be below 0.1. Regardless of which test is used, not rejecting the null hypothesis in this case reveals only little information about the underlying data generating process unless the mean is very large. In the appendix we briefly present the corresponding equi-tailed two-sided tests.
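The quantities behind Table 1 follow directly from the formulas numbered (1) through (4); the Python sketch below illustrates the computation (helper names are our own, and the bounds are as numbered in this section):

```python
from math import comb

def ump_threshold(n, mu0, alpha):
    # k and gamma solving (1) for S ~ Binomial(n, mu0)
    pmf = [comb(n, j) * mu0**j * (1 - mu0)**(n - j) for j in range(n + 1)]
    tail = 0.0
    for k in range(n, -1, -1):
        if tail + pmf[k] > alpha:
            return k, (alpha - tail) / pmf[k]
        tail += pmf[k]
    raise ValueError("alpha must be smaller than 1")

def power_lambda(mu0, mu1, alpha, n):
    """Power Lambda(mu0, mu1, alpha) of the UMP binomial test, as in (3)."""
    k, gamma = ump_threshold(n, mu0, alpha)
    pmf = lambda j, p: comb(n, j) * p**j * (1 - p)**(n - j)
    return gamma * pmf(k, mu1) + sum(pmf(j, mu1) for j in range(k + 1, n + 1))

def type2_upper(mu0, mu1, alpha, theta, n):
    """Upper bound (2) on the type II error of the nonrandomized test."""
    return min(1.0, (1 - power_lambda(mu0, mu1, theta * alpha, n)) / (1 - theta))

def type2_lower(mu0, mu1, alpha, n):
    """Tight lower bound (4), attained by the PMP test."""
    return 1 - power_lambda(mu0, mu1, alpha, n)
```

By construction of (1) the power at μ₁ = μ₀ equals the size, and the lower bound (4) never exceeds the upper bound (2), since reducing the size of the UMP test from α to θα only reduces its power.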

3.2 Confidence Bounds

Consider now a family of lower confidence bounds with coverage 1 − α, formally given by {L(y)}_y such that Pr_P(μ ≥ L(y)) ≥ 1 − α holds for all P. It is standard to measure the performance of such a lower confidence bound in terms of inaccuracy, as measured by the maximal expected value of underestimating the true mean. For true mean μ₁ this value is given by sup_{P : μ(P) = μ₁} E_P[μ₁ − L(y)]₊ where [x]₊ = x if x > 0 and [x]₊ = 0 if x ≤ 0.

Let L^r be the family of (randomized) lower confidence bounds derived from the PMP test with size α. We obtain for any μ₁ that L^r minimizes inaccuracy when μ = μ₁ among all families of lower confidence bounds that have coverage 1 − α. Formally,

    inf_{L : Pr_P(μ ≥ L) ≥ 1−α for all P}  sup_{P : μ(P) = μ₁} E_P[μ₁ − L]₊  =  sup_{P : μ(P) = μ₁} E_P[μ₁ − L^r]₊.

This value of minimal inaccuracy when μ = μ₁ can be derived from the type II error of the PMP test as follows:

    sup_{P : μ(P) = μ₁} E_P[μ₁ − L^r]₊ = ∫₀^{μ̂₁} (1 − Λ(x, μ̂₁, α)) dx.    (5)

Accordingly, the family of lower confidence bounds L^r derived from our PMP test for size α is called parameter most accurate for coverage 1 − α. In particular, L^r minimizes the maximum inaccuracy across all true means.

A family of nonrandomized lower confidence bounds can be derived from our nonrandomized test by letting L(y) be equal to the lowest value of μ₀ such that the null hypothesis μ = μ₀ cannot be rejected in favor of μ > μ₀ based on y. If the true mean is μ₁ then the associated inaccuracy is bounded above by

    ∫₀^{μ̂₁} min{ 1, (1 − Λ(x, μ̂₁, θα)) / (1 − θ) } dx.    (6)

In Table 2 we compare the minimal inaccuracy as computed in (5) to the upper bound on inaccuracy given in (6) for some values of N. The ratio of the value of minimal inaccuracy to the bound on inaccuracy of our nonrandomized test is approximately equal to 63% in this table.

Table 2: Inaccuracy for μ̂₁ = 0.5, θ = 0.25 and α = 0.05

    N                      20      30      40      50      60      70       80
    Upper Bound            0.273   0.23    0.202   0.182   0.167   0.156    0.146
    Absolute Lower Bound   0.177   0.147   0.129   0.116   0.106   0.0984   0.0923
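The two inaccuracy bounds compared in Table 2 can be evaluated by numerically integrating (5) and (6). A Python sketch follows; the midpoint rule, the grid size, and the function names are our own choices:

```python
from math import comb

def ump_threshold(n, mu0, alpha):
    # k and gamma solving (1) for S ~ Binomial(n, mu0)
    pmf = [comb(n, j) * mu0**j * (1 - mu0)**(n - j) for j in range(n + 1)]
    tail = 0.0
    for k in range(n, -1, -1):
        if tail + pmf[k] > alpha:
            return k, (alpha - tail) / pmf[k]
        tail += pmf[k]
    raise ValueError("alpha must be smaller than 1")

def power_lambda(mu0, mu1, alpha, n):
    # power of the UMP binomial test, as in (3)
    k, gamma = ump_threshold(n, mu0, alpha)
    pmf = lambda j, p: comb(n, j) * p**j * (1 - p)**(n - j)
    return gamma * pmf(k, mu1) + sum(pmf(j, mu1) for j in range(k + 1, n + 1))

def inaccuracy_lower(mu1hat, alpha, n, grid=200):
    """Minimal inaccuracy (5): integrate the PMP type II error over nulls x in (0, mu1hat)."""
    h = mu1hat / grid
    return sum((1 - power_lambda((i + 0.5) * h, mu1hat, alpha, n)) * h for i in range(grid))

def inaccuracy_upper(mu1hat, alpha, theta, n, grid=200):
    """Upper bound (6) on the inaccuracy of the nonrandomized lower confidence bound."""
    h = mu1hat / grid
    return sum(min(1.0, (1 - power_lambda((i + 0.5) * h, mu1hat, theta * alpha, n)) / (1 - theta)) * h
               for i in range(grid))
```

Since the integrand of (6) dominates that of (5) pointwise, the computed upper bound always weakly exceeds the lower bound, mirroring the two rows of Table 2.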

In the appendix we present confidence intervals derived from the corresponding equi-tailed two-sided tests.

3.3 Proofs

In this section we prove the claims made in Sections 3.1 and 3.2. The underlying distribution P is known to have support in [ω₁, ω₂], so P ∈ 𝒫 := Δ[ω₁, ω₂] where Δ_A denotes the set of all distributions with support in A. Let P_N be the distribution of the N independent draws from P. Fix μ₁ ∈ (μ₀, ω₂) and consider tests of the null hypothesis H₀ : μ(P) = μ₀ against the alternative hypothesis H₁^{μ₁} : μ(P) ≥ μ₁.

A test is described by a function f such that f(y) determines the probability of rejecting the null hypothesis conditional on observing y = (y₁, ..., y_N). The expected probability of a rejection, denoted by E_P(f), is given by ∫ f(y) dP_N(y). A test f has size α if E_P(f) ≤ α when μ(P) = μ₀. The type II error of the test is given by sup_{P ∈ 𝒫 : μ(P) ≥ μ₁} (1 − E_P(f)).

Let f^r be the randomized test described by steps 1 to 4 for θ = 1, so α is the size of the UMP test used in step 3. We now show that f^r has size α and minimizes the type II error among all tests with size α. Let Ŷ ∈ [0, 1] be the outcome after normalizing Y as in step 1 and let Ỹ ∈ {0, 1} be the Bernoulli distributed outcome generated after applying the random transformation in step 2 to Ŷ. Let P̂ and P̃ be the respective distributions. Then μ̂ = μ(P̂) = Pr(Ỹ = 1).

Assume that Y only attains the extreme outcomes ω₁ and ω₂, so P ∈ Δ{ω₁, ω₂}. Then Ŷ is a Bernoulli distributed random variable with mean μ̂ and Ŷ = Ỹ. By construction, E_P(f^r) is identical to the rejection probability under the UMP test with size α for testing μ = μ̂₀ against μ > μ̂₀ for Bernoulli distributed random variables. Hence

    E_P(f^r) = Λ(μ̂₀, μ̂, α) if P ∈ Δ{ω₁, ω₂}    (7)

where Λ is the power function of the UMP test as defined in (3). Using the property of a UMP test we find that 1 − E_P(f^r) ≤ 1 − E_P(f) holds for any alternative test f that has size α. Now consider a more general distribution P that is only constrained by its support belonging to [ω₁, ω₂]. Since the random transformation used in step 2 is mean preserving, it is as if the data is drawn from a distribution with support in {ω₁, ω₂} that has the same mean. Hence

    E_P(f^r) = E_{P̃}(f^r).    (8)

Combining (7) and (8) we obtain that E_P(f^r) = Λ(μ̂₀, μ̂, α) holds for all P ∈ Δ[ω₁, ω₂]. Hence E_P(f^r) = α if μ̂ = μ̂₀, so f^r has size α. In fact, f^r is unbiased.

Concerning the type II error of f^r,

    sup_{P ∈ 𝒫 : μ(P) ≥ μ₁} (1 − E_P(f^r)) = 1 − Λ(μ̂₀, μ̂₁, α).    (9)

If f is some alternative test that also has size α then

    sup_{P ∈ 𝒫 : μ(P) ≥ μ₁} (1 − E_P(f)) ≥ sup_{P₁ ∈ Δ{ω₁, ω₂} : μ(P₁) ≥ μ₁} (1 − E_{P₁}(f)) ≥ 1 − Λ(μ̂₀, μ̂₁, α).

This shows that f^r attains the minimum type II error among all tests of size α and hence that it is PMP.

Consider now a family of tests f = {f_{μ₀,α}} of size α for testing μ = μ₀ against μ > μ₀, one test for each null μ₀, and let L be the associated family of lower confidence bounds with coverage 1 − α. Following standard arguments going back to Pratt (1961),

    E_P[μ₁ − L]₊ = ∫_{ω₁}^{μ₁} Pr_P(x ∈ (L(y), μ₁)) dx = ∫_{ω₁}^{μ₁} (1 − E_P(f_{x,α})) dx.    (10)

Now consider the family of lower confidence bounds L^r derived from the randomized test f^r. Since it is as if the statistician is facing a Bernoulli distribution P̃,

    sup_{P : μ(P) = μ₁} E_P[μ₁ − L^r]₊ = ∫₀^{μ̂₁} (1 − Λ(x, μ̂₁, α)) dx    (11)

which proves (5). As f^r is PMP we obtain from (10) and (11) that

    sup_{P : μ(P) = μ₁} E_P[μ₁ − L^r]₊ = ∫₀^{μ̂₁} (1 − Λ(x, μ̂₁, α)) dx ≤ sup_{P : μ(P) = μ₁} ∫_{ω₁}^{μ₁} (1 − E_P(f_{x,α})) dx = sup_{P : μ(P) = μ₁} E_P[μ₁ − L]₊.

Thus, L^r is parameter most accurate.

Finally consider the nonrandomized test as defined in Section 2 for given θ, denoted here by f_θ, where f_θ = 1{f^r > θ} and f^r now denotes the randomized test in which the UMP test in step 3 has size θα. Consider first P such that μ(P) = μ₀. Then above we showed that E_P(f^r) = θα. Hence,

    θα = E_P(f^r) = ∫ f^r(y) 1{f^r > θ} dP_N(y) + ∫ f^r(y) 1{f^r ≤ θ} dP_N(y) ≥ ∫ θ 1{f^r > θ} dP_N(y) = θ E_P(f_θ)

which implies that E_P(f_θ) ≤ α, so f_θ has size α.

Consider now more general P ∈ Δ[ω₁, ω₂]. Then

    E_P(f^r) ≤ ∫ 1{f^r > θ} dP_N(y) + θ ∫ 1{f^r ≤ θ} dP_N(y) = θ + (1 − θ) ∫ 1{f^r > θ} dP_N(y) = θ + (1 − θ) E_P(f_θ)

so

    (1 − θ) E_P(f_θ) ≥ E_P(f^r) − θ.    (12)

Hence, if μ(P) ≥ μ₁ then following (9) and (12),

    sup_{P ∈ 𝒫 : μ(P) ≥ μ₁} (1 − E_P(f_θ)) ≤ (1/(1 − θ)) sup_{P ∈ 𝒫 : μ(P) ≥ μ₁} (1 − E_P(f^r)) = (1/(1 − θ)) (1 − Λ(μ̂₀, μ̂₁, θα)).

This proves the upper bound on the type II error of our nonrandomized test given in (2). Combining this with (10) then yields the upper bound on the inaccuracy of our nonrandomized test given in (6). The necessary derivations for equi-tailed tests and families of confidence intervals are now straightforward, as presented in the appendix.

3.4 Selecting θ

We now show how one can select the parameter θ. To illustrate, consider N = 30, μ̂₀ = 0.5 and α = 0.05. In Figure 1 below we graph our upper bound on the type II error for θ ∈ {0.05, 0.205, 0.5} together with the tight lower bound on the type II error, both as functions of the normalized true mean μ̂.

[Figure 1: The upper bound on the type II error of our test for θ = 0.05 (dashed line), θ = 0.205 (solid line) and θ = 0.5 (dotted line), together with the tight lower bound on the type II error of any test (dot-dashed line), as functions of μ̂ ∈ [0.5, 1].]

The intuitive choice for θ could be θ = 0.5, as this means that the test recommends a rejection if and only if the underlying randomized test is more likely to reject the null than not to reject it. Following Figure 1 we see that a lower θ reduces the type II error for low μ̂ and increases it for high μ̂. As none of these values of θ is uniformly better than the others, the question arises how to select θ. We present three approaches.

One may choose to select θ by finding the value that can be outperformed the least in terms of reducing the upper bound on the type II error for some value of μ̂. To illustrate, we plot the differences between the upper bounds for θ = 0.05 and θ = 0.205 as well as between θ = 0.5 and θ = 0.205.

[Figure 2: Difference in upper bounds on the type II error between θ = 0.05 and θ = 0.205 (dashed line) and between θ = 0.5 and θ = 0.205 (dotted line), as functions of μ̂.]

Compared to θ = 0.205 we find that θ = 0.05 and θ = 0.5 perform up to 0.048 better. The value 0.205 was chosen to minimize this possibility of outperformance. More specifically, setting N = 30, α = 0.05 and μ̂₀ = 0.5, we verify numerically for any θ₀ ≠ 0.205 that there exist θ₁ and some value of μ̂₁ > μ̂₀ such that the upper bound on the type II error under θ₀ exceeds that under θ₁ by at least 0.0475 when μ̂(P) = μ̂₁.

An alternative way to select θ is to assign some loss function to wrongly not rejecting the null and then to choose θ in order to minimize maximum loss. For instance one can consider loss in terms of regret (Savage, 1951), in which case the loss of wrongly not rejecting the null is equal to μ − μ₀, and hence the expected loss or risk is given by (1 − E_P(f))(μ − μ₀) for μ > μ₀. One then chooses the θ that minimizes, over all θ, the maximum regret or risk over all P such that μ(P) > μ₀. In this example one obtains numerically θ = 0.16 with maximal regret equal to 0.142. For comparison, maximal regret is equal to 0.153 when θ = 0.05 and is equal to 0.177 when θ = 0.5. Note that we can derive a tight lower bound on maximal regret using the PMP test based on size α = 0.05 and obtain a value of maximal regret equal to 0.076.

A third alternative is to select the θ that minimizes the upper bound on the inaccuracy of the lower confidence bound presented in (6). For N = 30, α = 0.05 and μ̂₀ = 0.5 we find that θ = 0.25 minimizes this upper bound provided the true mean is equal to 0.5. Inaccuracy when μ̂₁ = 0.5 can thereby be bounded above by 0.22. If instead θ ∈ {0.05, 0.5} then the upper bound is equal to 0.23. The tight lower bound on inaccuracy, equal to 0.148, is attained by the PMP test. A less restrictive approach is to consider an upper bound on inaccuracy across all true means and to then choose the θ that minimizes this upper bound. For N = 30 and α = 0.05 it turns out that θ = 0.27 minimizes the upper bound across all true means, thereby bounding maximal inaccuracy above by 0.24. The lowest upper bound on inaccuracy, attained by the PMP test, is equal to 0.151. One may say here that our nonrandomized test achieves 63% relative accuracy as 0.151/0.24 ≈ 0.63.

Similarly one can select θ to minimize the inaccuracy of the equi-tailed confidence interval as constructed in Appendix A. For N = 30 and α = 0.05 we find that θ = 0.22 minimizes inaccuracy across all true means, the value of inaccuracy being equal to 0.5. The corresponding tight lower bound is equal to 0.34.
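The third approach amounts to a one-dimensional search. A coarse illustrative Python sketch follows; the grids, function names, and midpoint-rule integration are our own assumptions, not the paper's exact numerical procedure:

```python
from math import comb

def ump_threshold(n, mu0, alpha):
    # k and gamma solving (1) for S ~ Binomial(n, mu0)
    pmf = [comb(n, j) * mu0**j * (1 - mu0)**(n - j) for j in range(n + 1)]
    tail = 0.0
    for k in range(n, -1, -1):
        if tail + pmf[k] > alpha:
            return k, (alpha - tail) / pmf[k]
        tail += pmf[k]
    raise ValueError("alpha must be smaller than 1")

def power_lambda(mu0, mu1, alpha, n):
    # power of the UMP binomial test, as in (3)
    k, gamma = ump_threshold(n, mu0, alpha)
    pmf = lambda j, p: comb(n, j) * p**j * (1 - p)**(n - j)
    return gamma * pmf(k, mu1) + sum(pmf(j, mu1) for j in range(k + 1, n + 1))

def inaccuracy_upper(mu1hat, alpha, theta, n, grid):
    # upper bound (6), midpoint-rule integration over nulls x in (0, mu1hat)
    h = mu1hat / grid
    return sum(min(1.0, (1 - power_lambda((i + 0.5) * h, mu1hat, theta * alpha, n)) / (1 - theta)) * h
               for i in range(grid))

def select_theta(n, alpha, thetas, mu1_grid, x_grid=100):
    """Pick the candidate theta minimizing the worst-case (over true means)
    upper bound (6) on inaccuracy."""
    def worst(theta):
        return max(inaccuracy_upper(m, alpha, theta, n, x_grid) for m in mu1_grid)
    return min(thetas, key=worst)
```

Refining the grids over θ and over the true means trades computation time for precision; the same skeleton applies to the first approach by replacing the inaccuracy bound with the type II error bound (2).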

4 Two Examples

4.1 Shock-Wave Therapy

We consider the randomized medical experiment of Sabeti-Ashraf et al. (2005) designed to evaluate the outcome of two alternative diagnosis methods. 50 patients suffering from shoulder tendinitis received shock wave therapy. Prior to the intervention, patients were randomly assigned to one of two treatments. For 25 of the patients the location of the therapy was determined manually. For the other 25 patients a computer assisted in determining the location. After the intervention the level of pain of each patient was measured on the visual analog scale (VAS), a scale that takes values in [0, 100]. The descriptive statistics are presented in Table 3a.

Table 3a: Shock-Wave Therapy (Descriptive Statistics)

    Assistance       Manual   Computer
    N                25       25
    Empirical mean   18       33

We wish to determine for each of the two treatment methods a 95% upper confidence bound on the mean level of pain after the intervention. For this we first determine the value of θ that minimizes the maximum inaccuracy over all possible means. We find numerically that θ = 0.28 when α = 0.05. The associated inaccuracy is equal to 0.263. The lower bound on maximal inaccuracy is 0.165, so the relative accuracy of our test is approximately 63%. We use this value of θ to determine the upper confidence bounds and present these in Table 3b.

Table 3b: Shock-Wave Therapy (Upper Confidence Bounds)

    Assistance                                  Manual   Computer
    95% Upper Confidence Bound for θ = 0.28     34       50

Recall that the UMP test evaluated in step 3 is based on size 0.05 · 0.28 = 0.014. The upper confidence bound U is then determined such that in step 4 the null hypothesis μ = U is rejected with probability 0.28 when testing against the alternative that the mean is below U.