ADAPTIVE SIGNIFICANCE LEVEL: BLENDING BAYESIAN AND CLASSICAL CONCEPTS

CARLOS ALBERTO DE BRAGANÇA PEREIRA
UNIVERSITY OF SÃO PAULO

SUMMARY

The main objective of this paper is to establish a close link between the adaptive level of significance, presented here, and the sample size. We statisticians know of the inconsistency, or paradox, in the current tests of significance, which compare a p-value statistic with the canonical significance levels (10%, 5% and 1%): "raise the sample size to reject the null hypothesis" is the recommendation of some ill-advised scientists! This paper shows that it is possible to eliminate this problem of significance tests. Lindley's paradox, "increase the sample size to accept the hypothesis", also disappears (Lindley, 1957). Obviously, we present here only the beginning of a possibly interesting line of research. The intention is to extend its use to more complex applications such as survival analysis, reliability tests and other areas. The main tools used here are the Bayes factor and the extended Neyman-Pearson lemma, discussed by DeGroot (1986).

Key words: Significance level, Sample size, Bayes ratio, Likelihood function, Optimal decision, Significance test

1. INTRODUCTION

Recently, the use of p-values in tests of significance has been criticized. The question posed by Johnson (2013), and discussed by Gaudart et al. (2014), Gelman & Robert (2014), Pericchi et al. (2014) and Johnson (2014), concerns the misuse of the canonical values of significance level (0.10 *, 0.05 **, 0.01 *** and 0.001 ****). More recently, a publication of the American Statistical Association, Wasserstein & Lazar (2016), makes recommendations for scientists to be concerned with choosing the appropriate level of significance. Pericchi & Pereira (2016) consider the calculation of adaptive levels of significance, an apparently successful solution for the correction of significance-level choices. Their suggestion eliminates the risk of a breach of the likelihood principle. However, that article deals only with simple null hypotheses, although the alternative may be composite. Another constraint is the dependence on the dimension of the parametric space: it covered only one-dimensional parametric spaces. More recent is the article by Benjamin et al. (2017), commented on in Nature News (2017).

In a genuinely Bayesian context, Pereira & Stern (1999) introduced the e-value (e for evidence) as an alternative to the classical p-value. A correction to make the test of the null hypothesis invariant under transformations was presented by Madruga et al. (2002), and a more theoretical review is the article of Pereira et al. (2008). The e-value was the basis of the solution of an astrophysical problem described by Chakrabarty (2017). The relationship between p-values and e-values is discussed by Diniz et al. (2012). However, while the e-value works independently of dimensions, setting its significance level is not an easy task. This has made us look for a way to obtain a modified p-value that allows us to better understand how to obtain the optimal significance level of a problem of any finite dimension. The project developed here is based on three of our articles: Pereira & Wechsler (1993), Irony & Pereira (1995) and Montoya-Delgado et al. (2001). It has taken a long time to see the possibility of using them in combination and with reasonable adjustments: the Bayes factor takes the place of the likelihood ratio, and the average value of the likelihood function replaces its maximum value. The mean of the likelihood function under the null hypothesis is the density used in the calculation of the new p-value, the P-value. The basis of all our work is the generalized Neyman-Pearson lemma in its Bayesian form; see DeGroot (1986), sections Optimal Tests (Theorem 1) and Bayes Test Procedures (pages 451 and 452).

2. BLENDING BAYESIAN AND CLASSICAL CONCEPTS

2a. Statistical Model

As usual, let x and θ be random vectors (possibly scalars) such that x ∈ X ⊂ R^n, X being the sample space, and θ ∈ Θ ⊂ R^m, Θ being the parametric space. To state the relation between the two random vectors, the statistician considers the following: a family of probability density functions indexed by the conditioning parameter θ, {f(x|θ) : θ ∈ Θ}; a prior probability density function g(θ); and the posterior density function g(θ|x). In order to be appropriate, the family of likelihood functions indexed by x, {L(θ|x) = f(x|θ) : x ∈ X}, must be measurable in the prior σ-algebra. With the statistical model defined, a partition of the parametric space is induced by the consideration of a null hypothesis that is confronted with its alternative:

H: θ ∈ Θ_H and A: θ ∈ Θ_A, where Θ_H ∪ Θ_A = Θ and Θ_H ∩ Θ_A = ∅.

In the case of composite hypotheses whose partition elements have the same dimension, the model would be complete; such cases do not involve partitions with components of Lebesgue measure zero. In the case of precise hypotheses, where the partition components have different dimensions, we must add two other elements:

i. positive probabilities of the hypotheses, π(H) > 0 and π(A) = 1 − π(H) > 0; and
ii. a density over the set of smaller dimension. The choice of this density should be coherent with the original prior density over the global parameter space.


Consider the common case in which the null hypothesis is defined by the subset of smallest dimension. In this case we use a surface integral to normalize the values of the prior density on the null set, so that the sum or volume of these values equals unity. Figure 1 illustrates how this procedure is carried out. Recall that a prior density can be viewed as a preference system over the parametric space, and preference systems must be kept even within the null hypothesis: coherence in the access to prior distributions is crucial. Further details on this procedure can be found in Pereira (1985) and Irony & Pereira (1995).

2b. Significance Index

By significance index we mean a real function over the sample space that is used for decision-making with respect to accepting or rejecting the null hypothesis, H. We begin this section by stating the extended Neyman-Pearson lemma presented by DeGroot (1986). Let f_H(x) and f_A(x) be probability density functions over the sample space X. The decision problem is to choose one of these densities as the true generator of the observed data. Consider now a binary function δ(x) used to define the decision procedure. Defining a partition of the sample space, X_H ∪ X_A = X with X_H ∩ X_A = ∅, the test function is

δ(x) = 0 if x ∈ X_H, and δ(x) = 1 if x ∈ X_A.

To define the relevance of a hypothesis relative to its alternative, one chooses two positive real numbers, say a and b: a > b, a = b and a < b mean, respectively, preference for the null hypothesis, indifference, and preference for the alternative. The decision rule is to reject the null hypothesis H whenever the test function equals one, and not to reject it otherwise. The optimal test is obtained by the following theorem, for which the probabilities of the two types of errors, type I and type II, are

α(δ) = Pr{reject H | H is true} = Pr{δ(x) = 1 | f_H}
β(δ) = Pr{accept H | H is false} = Pr{δ(x) = 0 | f_A}

Theorem (Neyman-Pearson-DeGroot): Let δ* be the test that rejects H in favor of A if a f_H(x) < b f_A(x), does not reject H if a f_H(x) > b f_A(x), and is indifferent if a f_H(x) = b f_A(x). Then, for any other test δ,

a α(δ) + b β(δ) ≥ a α(δ*) + b β(δ*).

The proof of this important result, left to the reader, is very simple, as can be seen in DeGroot (1986).
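As a minimal numerical sketch of the theorem, the following applies the rule on a finite sample space; the four-point space, the densities f_H and f_A, and the weights a = b = 1 are assumptions chosen only for illustration.

```python
import numpy as np

# hypothetical densities on a four-point sample space (illustrative values)
fH = np.array([0.50, 0.30, 0.15, 0.05])
fA = np.array([0.10, 0.20, 0.30, 0.40])
a, b = 1.0, 1.0                           # relevance weights for H and A

delta = (a * fH < b * fA).astype(int)     # 1 = reject H, 0 = do not reject
alpha = fH[delta == 1].sum()              # type I error: Pr{delta = 1 | fH}
beta = fA[delta == 0].sum()               # type II error: Pr{delta = 0 | fA}
# the theorem guarantees a*alpha + b*beta is minimal over all tests delta
print(delta, round(alpha, 2), round(beta, 2))   # [0 0 1 1] 0.2 0.3
```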


To obtain the Bayesian version of the theorem, consider a loss function that is zero if the decision is correct and equals w_A (w_H) if the decision favors A (H) when H (A) is the true state of nature. In addition, if π is the prior probability of H and δ is the test function, the risk function is

r(δ) = w_A π α(δ) + w_H (1 − π) β(δ).

Consequently, to obtain the Bayesian version of the theorem it is enough to replace a and b by (π w_A) and (1 − π) w_H, respectively. Both the classical and the Bayesian versions of the theorem in fact compare the ratio f_H / f_A with the constant

K = b/a = (1 − π) w_H / (π w_A).

For instance, with π = 1/2 and w_H = w_A, the constant is K = 1, the case used in the examples below.

It is important to remember that this general version of the Neyman-Pearson lemma, from the classical point of view, applies only to simple versus simple hypotheses: there is no definition of a density function conditional on a composite hypothesis. However, some classical methods use optimization, taking the maximum of the likelihood function both under H and under A; recall that the likelihood function can be represented as L_x = {L(θ|x) = f(x|θ) : θ ∈ Θ}. Under the Bayesian paradigm the likelihood function L (L for likelihood) also plays an important role, as it could not be otherwise, since it is the only objective function that links the sample x to the parameter θ. However, instead of optimization, integration is the Bayesian tool. With the prior densities, the following conditional expectations are calculated:

f_H(x) = E{L(θ|x) | x, θ ∈ Θ_H} and f_A(x) = E{L(θ|x) | x, θ ∈ Θ_A}.

These functions are the Bayesian predictive densities under the respective hypotheses; both are density functions over the sample space X. The ratio of the two functions is known as the Bayes factor,

BR(x) = f_H(x) / f_A(x).
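As a minimal sketch of this integration step, assume a binomial likelihood with n = 8 trials, a sharp null H: θ = 0.5, and a uniform prior under A; the averaging is done by numerical integration.

```python
import numpy as np
from math import comb

n, x = 8, 4                                           # assumed sample
theta = np.linspace(0.0, 1.0, 10001)
like = comb(n, x) * theta**x * (1 - theta)**(n - x)   # likelihood L(theta | x)
fH = comb(n, x) * 0.5**n                              # E[L | H]: mass at theta = 0.5
fA = np.trapz(like, theta)                            # E[L | A]: uniform-prior average
print(fH, fA, fH / fA)                                # fA = 1/9; fH/fA is the Bayes factor
```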

To define a significance index, alternative to the usual p-value, it is necessary to establish an order over the sample space. Montoya-Delgado et al. (2001) suggest the use of the Bayes ratio values of all sample points to induce the necessary order. The steps to perform a significance test are as follows:

1. Access a prior density for the parameter of interest, g(θ);
2. Clearly define the hypotheses H and A;

3. Obtain the predictive functions under the two hypotheses. In the case where the parametric subspaces defined by the hypotheses are of different dimensions, the prior density under the hypothesis of smaller dimension, say H, is obtained as follows:

g(θ|H) = 0 if θ ∈ Θ_A, and g(θ|H) = g(θ) / ∮_{Θ_H} g(y) dy if θ ∈ Θ_H.

The denominator is the surface integral over the subspace Θ_H. In addition to this density, and only in the case of distinct dimensions of Θ_H and Θ_A, consider a positive probability π of H being the true hypothesis. Figure 1 illustrates the choice of g(θ|H).

FIGURE 1: Bivariate Normal density cut in the subspace of equal marginal means, showing the prior density in that subspace.

4. Define the loss function, considering mainly the differences in importance (social, for example) between the hypotheses.
5. Use the Bayes ratio to order the sample space: {BR(x) : x ∈ X} establishes the order of each x ∈ X. This ordering can be used regardless of the dimensions of the two spaces, X and Θ.
6. Using the above theorem, compute the optimal errors and use the value of α(δ*) as the adaptive level of significance; it depends on the loss function, on the probability densities (sampling and prior), and especially on the sample size.
7. Calculate the significance index, the P-value, which takes the following form: with x0 the observed value of the statistic and C0 = {x : BR(x) < BR(x0)} the observed tail in the order induced by the Bayes ratio, the P-value is

P0 = ∫_{C0} f_H(x) dx.

Clearly, this may be either a single or a multiple integral.

8. Compare the value P0 with the value of α(δ*). Reject H if P0 < α(δ*) and do not reject if P0 > α(δ*). In the case of equality, take either decision without prejudice to optimality.
9. Finally, if α(δ*) is fixed a priori, calculate the sample size needed to make this fixed value optimal according to the Neyman-Pearson-DeGroot theorem.

3. ILLUSTRATIVE EXAMPLES

This section introduces two simple examples to illustrate the appropriateness of the new P-value and how the adaptive level of significance relates to sample sizes.

Example 1: A doctor wants to show that the incorporation of a new technology into a treatment can produce better results than the conventional one. He planned a clinical trial with two arms, case/control, each with eight patients. The cases arm used the new treatment and the controls arm the conventional one. Details of a similar clinical trial are shown by Lopes et al. (2014). The observed results were that only one of the patients in the controls arm responded positively, while in the cases arm there were four positive responders. The most common classical significance tests result in the following p-values: the Pearson chi-squared p-value was 0.106, which changed to 0.281 when the Yates continuity correction was applied, and Fisher's exact p-value was 0.282. Traditional analysts would conclude that there is no statistically significant difference between the two treatments under any of the canonical significance levels. Note that these procedures test a sharp hypothesis against a composite one, H: θ0 = θ1 vs A: θ0 ≠ θ1, comparing the success proportions of the two treatments. In the sequel we calculate the proposed P-value and use the optimal significance level α(δ*) to make the decision of choosing one of the hypotheses. To be fair in our comparisons we consider independent uniform (noninformative) prior distributions for θ0 and θ1. We also consider the two hypotheses to be of equal importance, π = 0.5 and w_H = w_A = 1. With these suppositions, and the likelihoods being binomial with sample sizes n = 8, the predictive probability functions under the two hypotheses are

f_H(x, y) = C(8, x) C(8, y) / [17 C(16, x + y)] and f_A(x, y) = 1/81, for (x, y) ∈ {0, 1, …, 8} × {0, 1, …, 8},

where C(n, k) denotes the binomial coefficient.

The variables x and y represent the possible observed numbers of positive responders in the two arms. To obtain the optimal solution we minimize the sum of the errors, α(δ) + β(δ). This optimal solution results from comparing the Bayes ratio BR = f_H / f_A with K = 1 and making the choice according to the Neyman-Pearson-DeGroot theorem. Table 1 presents the Bayes ratio value for all possible results. Recalling the observed result of the clinical trial, (x, y) = (1, 4), the observed likelihood is Pr{x = 1 ∧ y = 4} = 0.611 ÷ 81 = 7.54E-03. Hence the resulting P-value is the sum of all probabilities smaller than 7.54E-03, i.e., the sum of all cells in Table 1 that are smaller than 0.611, divided by 81. The significance index, the P-value, is P = 0.0923. Summing all the red numbers and dividing by 81 gives the optimal adaptive level of significance, α(δ*) = 0.1245. The probability of the second kind of error is the number of cells with bold black numbers divided by 81, β(δ*) = 39/81 = 0.4815. A high probability of error of the second kind is expected whenever the sample sizes are small. Contrary to the classical results, the conclusion now is the intuitive one: the null hypothesis is rejected, since P < α(δ*).
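A minimal computational sketch, assuming the uniform priors and K = 1 stated above, reproduces these quantities and the entries of Table 1.

```python
from math import comb

n = 8
fA = 1.0 / (n + 1) ** 2                     # predictive under A: uniform, 1/81

def fH(x, y):                               # predictive under H: theta0 = theta1
    return comb(n, x) * comb(n, y) / ((2 * n + 1) * comb(2 * n, x + y))

K = 1.0                                     # pi = 0.5 and wH = wA = 1 give K = 1
cells = [(x, y) for x in range(n + 1) for y in range(n + 1)]
alpha = sum(fH(x, y) for x, y in cells if fH(x, y) < K * fA)   # type I error
beta = sum(fA for x, y in cells if fH(x, y) >= K * fA)         # type II error
P = sum(fH(x, y) for x, y in cells if fH(x, y) < fH(1, 4))     # P-value at (1, 4)
print(round(alpha, 4), round(beta, 4), round(P, 4))            # 0.1245 0.4815 0.0923
```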

Table 1: Bayes ratio of all possible results in a clinical trial with arms of size n = 8. Cells with red numbers in the original (Bayes ratio smaller than K = 1) form the critical region.

x\y      0      1      2      3      4      5      6      7      8    Sum
0    4.765  2.382  1.112  0.476  0.183  0.061  0.017  0.003  4E-04      9
1    2.382  2.541  1.906  1.173  0.611  0.267  0.093  0.024  0.003      9
2    1.112  1.906  2.052  1.710  1.166  0.653  0.290  0.093  0.017      9
3    0.476  1.173  1.710  1.866  1.633  1.161  0.653  0.267  0.061      9
4    0.183  0.611  1.166  1.633  1.814  1.633  1.166  0.611  0.183      9
5    0.061  0.267  0.653  1.161  1.633  1.866  1.710  1.173  0.476      9
6    0.017  0.093  0.290  0.653  1.166  1.710  2.052  1.906  1.112      9
7    0.003  0.024  0.093  0.267  0.611  1.173  1.906  2.541  2.382      9
8    4E-04  0.003  0.017  0.061  0.183  0.476  1.112  2.382  4.765      9
Sum      9      9      9      9      9      9      9      9      9     81

The physician, owner of the data in Example 1, looking at our analysis, asked about the sample size needed for our procedure to reach a level of significance of at most 10%. The answer can be obtained from the next example, which considers two arms with 20 patients each.

Example 2: Consider a clinical trial as in Example 1 but with arms of size n = 20. The observed result is now (x, y) = (4, 10). We leave to the reader the simple exercise of repeating the calculations of Example 1 with the new sample sizes. Table 2 presents the corresponding Bayes ratio for all possible results. The observed likelihood is Pr{x = 4 ∧ y = 10} = 0.415 ÷ 441 = 9.41E-04.


Hence the resulting P-value is the sum of all probabilities smaller than 9.41E-04, i.e., the sum of all cells in Table 2 that are smaller than 0.415, divided by 441. The significance index, the P-value, is P = 0.0290. Summing all the red numbers and dividing by 441 gives the optimal adaptive level of significance, α(δ*) = 0.0995. The probability of the second kind of error is the number of cells with bold black numbers divided by 441, β(δ*) = 164/441 = 0.3719. The classical chi-squared p-value is p = 0.0467, which indicates rejection of the null hypothesis at the canonical 5% level of significance. This agrees with our decision of also rejecting the null hypothesis, since again P < α(δ*). It is interesting to look at the relative distance between the index and the level of significance: for the chi-squared test we have 1 − 0.0467/0.1 = 0.53, while the adaptive case gives 1 − 0.029/0.0995 = 0.71.

Table 2: Bayes ratio for all possible results in a clinical trial with arms of size n = 20 (rows x = 0, …, 20; columns y, split into two panels). Cells with red numbers in the original (Bayes ratio smaller than K = 1) form the critical region.

Columns y = 0 to 11:

x\y      0      1      2      3      4      5      6      7      8      9     10     11
0    10.76  5.378  2.620  1.241  0.570  0.253  0.109  0.045  0.018  0.007  0.002  0.001
1    5.378  5.516  4.137  2.683  1.584  0.869  0.447  0.217  0.099  0.043  0.017  0.006
2    2.620  4.137  4.249  3.541  2.580  1.700  1.030  0.579  0.304  0.148  0.068  0.029
3    1.241  2.683  3.541  3.642  3.187  2.472  1.738  1.121  0.668  0.369  0.188  0.089
4    0.570  1.584  2.580  3.187  3.283  2.955  2.383  1.747  1.175  0.727  0.415  0.218
5    0.253  0.869  1.700  2.472  2.955  3.050  2.796  2.314  1.746  1.207  0.766  0.446
6    0.109  0.447  1.030  1.738  2.383  2.796  2.892  2.686  2.263  1.741  1.226  0.789
7    0.045  0.217  0.579  1.121  1.747  2.314  2.686  2.785  2.611  2.228  1.736  1.235
8    0.018  0.099  0.304  0.668  1.175  1.746  2.263  2.611  2.716  2.565  2.208  1.733
9    0.007  0.043  0.148  0.369  0.727  1.207  1.741  2.228  2.565  2.676  2.542  2.201
10   0.002  0.017  0.068  0.188  0.415  0.766  1.226  1.736  2.208  2.542  2.664  2.542
11   0      0.01   0.03   0.09   0.22   0.45   0.79   1.235  1.733  2.2    2.54   2.68
12   0      0      0.01   0.04   0.1    0.24   0.46   0.800  1.238  1.73   2.21   2.56
13   0      0      0      0.02   0.05   0.11   0.25   0.469  0.800  1.24   1.74   2.23
14   0      0      0      0.01   0.02   0.05   0.12   0.246  0.463  0.79   1.23   1.74
15   0      0      0      0      0.01   0.02   0.05   0.114  0.237  0.45   0.77   1.21
16   0      0      0      0      0      0.01   0.02   0.046  0.104  0.22   0.41   0.73
17   0      0      0      0      0      0      0.01   0.015  0.038  0.09   0.19   0.37
18   0      0      0      0      0      0      0      0      0.01   0.03   0.07   0.15
19   0      0      0      0      0      0      0      0      0      0.01   0.02   0.04
20   0      0      0      0      0      0      0      0      0      0      0      0.01
Sum  21     21     21     21     21     21     21     21     21     21     21     21

Columns y = 12 to 20 (with row sums):

x\y     12     13     14     15     16     17     18     19     20    Sum
0    0      0      0      0      0      0      0      0      0       21
1    0      0      0      0      0      0      0      0      0       21
2    0.011  0.004  0      0      0      0      0      0      0       21
3    0.038  0.015  0.005  0.002  0      0      0      0      0       21
4    0.104  0.046  0.018  0.006  0      0      0      0      0       21
5    0.237  0.114  0.049  0.019  0.01   0      0      0      0       21
6    0.463  0.246  0.117  0.049  0.02   0.01   0      0      0       21
7    0.800  0.469  0.246  0.114  0.05   0.02   0      0      0       21
8    1.238  0.800  0.463  0.237  0.1    0.04   0.01   0      0       21
9    1.733  1.235  0.789  0.446  0.22   0.09   0.03   0.01   0       21
10   2.208  1.736  1.226  0.766  0.41   0.19   0.07   0.02   0       21
11   2.565  2.228  1.741  1.207  0.73   0.37   0.15   0.04   0.01    21
12   2.716  2.611  2.263  1.746  1.18   0.67   0.3    0.1    0.02    21
13   2.611  2.785  2.686  2.314  1.75   1.12   0.58   0.22   0.04    21
14   2.263  2.686  2.892  2.796  2.38   1.74   1.03   0.45   0.11    21
15   1.746  2.314  2.796  3.050  2.955  2.472  1.700  0.869  0.253   21
16   1.175  1.747  2.383  2.955  3.283  3.187  2.580  1.584  0.570   21
17   0.668  1.121  1.738  2.472  3.187  3.642  3.541  2.683  1.241   21
18   0.304  0.579  1.030  1.700  2.580  3.541  4.249  4.137  2.620   21
19   0.099  0.217  0.447  0.869  1.584  2.683  4.137  5.516  5.378   21
20   0.018  0.045  0.109  0.253  0.570  1.241  2.620  5.378  10.76   21
Sum  21     21     21     21     21     21     21     21     21     441
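A minimal sketch of how the physician's question can be answered computationally, under the same uniform priors and K = 1 as before: the loop recomputes the adaptive level for a few arm sizes (only the values quoted in the comment, for n = 8 and n = 20, are the ones reported in the examples).

```python
from math import comb

def alpha_opt(n, K=1.0):
    """Optimal adaptive significance level for two arms of size n (uniform priors)."""
    fA = 1.0 / (n + 1) ** 2
    level = 0.0
    for x in range(n + 1):
        for y in range(n + 1):
            fH = comb(n, x) * comb(n, y) / ((2 * n + 1) * comb(2 * n, x + y))
            if fH < K * fA:                 # cell belongs to the critical region
                level += fH
    return level

for n in (8, 12, 16, 20):
    print(n, round(alpha_opt(n), 4))        # 0.1245 at n = 8 and 0.0995 at n = 20
```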

In response to the question about the sample size needed to obtain a significance level of at most 10%, the answer is n = 20 in each arm. Pericchi and Pereira (2016) presented a closed asymptotic formula that relates sample size and level of significance in the simple case of testing H: θ = θ0 vs A: θ ≠ θ0 for a binomial with parameters θ and n. A natural future project is to find this type of relation in other, more complex statistical problems such as the one presented in the above examples.

The following example is an attempt to show that our P-value does not violate the likelihood principle. Recall that violation of this principle produced the main criticisms by the Bayesian community of classical p-values.

Example 3: The main example of violation of the likelihood principle is the case of positive binomials in comparison with negative binomials. For the same value of x, the number of successes in n independent Bernoulli trials, the two distributions produce different p-values, which can lead to different decisions if compared with the same level of significance. The present example shows that the method introduced here produces the same decisions if the sample size and the number of successes are the same. The reason is that, although the P-values differ, they are compared with different levels of significance: the decisions about the null hypothesis are the same and there is no violation of the likelihood principle. Changing notation, let the sample vector be composed of the number of successes and the number of failures, (x, y), and the corresponding vector of probabilities be (θ0, θ1) with θ0 = 1 − θ1. Consider the hypotheses H: θ = 0.5 vs A: θ ≠ 0.5. The predictive densities used to build the significance tests are as follows:

1. For the positive binomial: f_H(x) = C(x + y, x) (1/2)^(x+y) and f_A(x) = (x + y + 1)^(-1);
2. For the negative binomial: f_H(x) = C(x + y − 1, x) (1/2)^(x+y) and f_A(x) = y [(x + y)(x + y + 1)]^(-1).
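That the two sampling models share the same Bayes ratio, the key fact used in the discussion that follows, can be checked numerically; a minimal sketch:

```python
from math import comb

def br_positive(x, y):                    # fixed n = x + y binomial model
    fH = comb(x + y, x) * 0.5 ** (x + y)
    fA = 1.0 / (x + y + 1)
    return fH / fA

def br_negative(x, y):                    # inverse-sampling (negative binomial) model
    fH = comb(x + y - 1, x) * 0.5 ** (x + y)
    fA = y / ((x + y) * (x + y + 1.0))
    return fH / fA

print(br_positive(3, 10), br_negative(3, 10))   # both 0.4888 (approximately)
print(br_positive(10, 3), br_negative(10, 3))   # both 0.4888 (approximately)
```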

Indeed, the Bayes ratios f_H / f_A are equal for the two models and, since by the theorem they are compared with the same constant, the decisions about the null hypothesis are the same. On the other hand, both the P-value and the significance level differ between the two models. For instance, for the observations (x, y) = (3, 10) and (x, y) = (10, 3), the positive binomial gives the same results for both samples: α = 0.09, β = 0.43 and P = 0.02. For the negative binomial the two observed points correspond to different models, the first (second) with the rule of stopping observation when the number of successes reaches 3 (10). For the first result we have α = 0.18, β = 0.48 and P = 0.01, and for the second α = 0.12, β = 0.33 and P = 0.01. The decisions based on the positive binomial are the same as those based on the negative binomial.

The next and last simple example shows that the critical region is not always formed by the tails of the null distribution; it can be a union of disjoint intervals.

Example 4: This is an example of Pereira and Wechsler (1993). Let x be a normal random variable with zero mean and unknown variance σ². The interest is to test H: σ² = 2 vs A: σ² ≠ 2. A χ²(1) distribution (chi-squared with one degree of freedom) is taken as the prior density for σ². After some integration exercise we can establish the predictive densities for our significance test as

f_A(x) = {π(1 + x²)}^(-1) and f_H(x) = (2√π)^(-1) exp(−x²/4).

These are, respectively, the Cauchy density and the normal density with zero mean and variance equal to two. Figure 2 shows the Bayes ratio for all sample points, confronted with the constant K = 1.1 to indicate the decision about the null hypothesis. The sample points that do not favor the null hypothesis are a central area together with the heavy tails of the Cauchy density; the set that favors H excludes the central area:

X_H = {x : x ∈ (−2.8; −0.6) ∪ (0.6; 2.8)}.

The critical region, on the other hand, includes the interval (−0.6; 0.6), a considerable central area, as well as the tails.
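The boundary points ±0.6 and ±2.8 can be recovered numerically from the two predictive densities; a minimal sketch:

```python
import numpy as np

K = 1.1
x = np.linspace(-4.0, 4.0, 80001)
fH = np.exp(-x**2 / 4) / (2 * np.sqrt(np.pi))    # N(0, 2) density
fA = 1.0 / (np.pi * (1 + x**2))                  # Cauchy density
sign = np.sign(fH / fA - K)
crossings = x[np.where(sign[:-1] != sign[1:])]   # points where BR(x) crosses K
print(np.round(crossings, 2))                    # approx. [-2.8 -0.6 0.6 2.8]
```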

Figure 2: Bayes ratio BR(x) for N(0; 2) vs Cauchy, plotted for x in (−4, 4) against the constant K = 1.1; null set = (−2.8; −0.6) ∪ (0.6; 2.8).

4. FINAL REMARKS

Many users of statistics ask about the logic of using the canonical significance levels. I believe there is a logic of optimization behind the adaptive significance level presented in this paper. It is important to understand that there are complex problems for which these indexes will be difficult to compute. Although my colleagues and I already have some additional work on testing different kinds of hypotheses, a lot remains to be done to make our P-value popular. Even for the simple cases presented here, small changes in the hypotheses comparing two proportions can bring difficulties: testing π ≤ θ against π > θ gives us more work than expected. Imagine now working on large-sample problems. Much work remains to be done, mainly in multivariate problems, and there will be problems that leave no room for improper priors. I hope this is the starting point of a new area of statistical significance testing.

REFERENCES
1. DV Lindley (1957). A statistical paradox. Biometrika 44(1–2):187–92

2. MH DeGroot (1986). Probability and Statistics. Addison-Wesley
3. VE Johnson (2013). Revised standards for statistical evidence. PNAS 110(48):19313–17
4. VE Johnson (2014). Reply to Gelman, Gaudart, Pericchi: More reasons to revise standards for statistical evidence. PNAS 111(19):E1937
5. J Gaudart; L Huiart; PJ Milligan; R Thiebaut; R Giorgi (2014). Reproducibility issues in science, is P value really the only answer? PNAS 111(19):E1934
6. A Gelman; CP Robert (2014). Revised evidence for statistical standards. PNAS 111(19):E1933
7. L Pericchi; CAB Pereira; M-E Pérez (2014). Adaptive revised evidence for statistical standards. PNAS 111(19):E1935
8. RL Wasserstein; NA Lazar (2016). The ASA's statement on p-values: context, process, and purpose. TAS 70(2):129–33
9. LR Pericchi; CAB Pereira (2016). Adaptive significance levels using optimal decision rules: balancing by weighting the error probabilities. BJPS 30(1):70–90
10. DJ Benjamin; J Berger; M Johannesson; BA Nosek; E-J Wagenmakers; R Berk; … V Johnson (2017, July 22). Redefine statistical significance. PsyArXiv Preprints. Retrieved from osf.io/preprints/psyarxiv/mky9j
11. Nature News (27 July 2017). Big names in statistics want to shake up much-maligned P value. https://www.nature.com/articles/d41586-017-02190-5
12. CAB Pereira; JM Stern (1999). Evidence and credibility: a full Bayesian test of precise hypothesis. Entropy 1:104–15
13. MR Madruga; CAB Pereira; JM Stern (2002). Bayesian evidence test for precise hypotheses. J Statistical Planning & Inference 117:185–98
14. CAB Pereira; JM Stern; S Wechsler (2008). Can a significance test be genuinely Bayesian? Bayesian Analysis 3(1):79–100
15. D Chakrabarty (2017). A new Bayesian test to test for the intractability-countering hypothesis. JASA 112(518):561–77
16. MA Diniz; CAB Pereira; A Polpo; JM Stern; S Wechsler (2012). Relationship between Bayesian and frequentist significance indices. Int J for Uncertainty Quantification 2(2):161–72
17. CAB Pereira; S Wechsler (1993). On the concept of p-value. BJPS 7:159–77
18. TZ Irony; CAB Pereira (1995). Bayesian hypothesis test: using surface integrals to distribute prior information among the hypotheses. Resenhas 2(1):27–46
19. CAB Pereira (1985). Testing hypotheses of different dimensions: Bayesian view and classical interpretation (in Portuguese). Professor Thesis, Institute of Mathematics and Statistics, University of São Paulo
20. LE Montoya-Delgado; TZ Irony; CAB Pereira; MR Whittle (2001). An unconditional exact test for the Hardy-Weinberg equilibrium law: sample-space ordering using the Bayes factor. Genetics 158(2):875–83
21. AC Lopes; BD Greenberg; MM Canteras; MC Batistuzzo; MQ Hoexter; AF Gentil; CAB Pereira; MA Joaquim; ME de Mathis; CC D'Alcante; A Taub; DG de Castro; L Tokeshi; LA Sampaio; CC Leite; RG Shavitt; JB Diniz; G Busatto; G Norén; SA Rasmussen; EC Miguel (2014). Gamma ventral capsulotomy for obsessive-compulsive disorder: a randomized clinical trial. JAMA Psychiatry 71(9):1066–76
22. L Varuzza; CAB Pereira (2010). Significance test for comparing digital gene expression profiles: partial likelihood application. Chilean J Stat 1:91–102

