Psychology Science, Volume 46, 2004 (2), p. 227-242
The misuse of asterisks in hypothesis testing

Dieter Rasch¹, Klaus D. Kubinger², Jörg Schmidtke¹, Joachim Häusler²

Abstract
This paper serves to demonstrate that the practice of using one, two, or three asterisks (according to a type-I-risk α of either 0.05, 0.01, or 0.001) in significance testing, as given particularly with regard to empirical research in psychology, is in no way in accordance with the Neyman-Pearson theory of statistical hypothesis testing. Claiming a-posteriori that even a low type-I-risk α leads to significance merely discloses a researcher’s self-deception. Furthermore it will be emphasised that by using sequential sampling procedures instead of fixed sample sizes the „practice of asterisks“ would not arise. Besides this, a simulation study will show that sequential sampling procedures are not only efficient concerning a lower sample size but are also robust and nevertheless powerful in the case of non-normal distributions.

Key words: type-I-risk, Neyman-Pearson theory, sequential sampling, robustness
¹ BioMath Company for Applied Mathematical Statistics in Biology and Medicine, Ltd.
² Department of Psychology, University of Vienna, Division for Assessment and Applied Psychometrics
1. Introduction

In various publications on the results of using statistical tests in psychological and medical research work, we still often find the observed statistics accompanied by one of the symbols *, ** and ***. In agricultural research this practice was once quite common but, after long discussions, is now applied very seldom. The meaning of the symbols is that the corresponding test statistic for a one-sided alternative hypothesis exceeds the 95%- (*), the 99%- (**) or the 99.9%-quantile (***) respectively. For a two-sided alternative hypothesis, the corresponding numbers are the 97.5%-, 99.5%- and 99.95%-quantiles.

However, if a researcher makes use of this „asterisks practice convention“, it merely discloses that he/she did not really design his/her experiment or survey in advance; or that he/she might not even understand his/her design. This will be explained in the following by the simple case of comparing two means of normally distributed variables being independently sampled, assuming equal variances in both of the underlying populations; that is, the pertinent case of using Student’s two-sample t-test – and we will refer to its sequential counterpart.

In the following we will establish our objections to the „practice of asterisks“ concerning the Neyman-Pearson theory of statistical hypothesis testing. We will also introduce a sequential two-sample (t-based) testing procedure which terminates at the latest after a fixed maximum number of observations. Above all, such a procedure establishes the need for fixing the probabilities of type-I-error and type-II-error of the test in advance; therefore no room for the use of asterisks is left.
2. Basics of statistical hypothesis testing

As indicated, we will restrict the problem to testing the hypothesized equality of two population means µ1 and µ2. For instance, the considered variable is the score of a certain psychological test, and the two populations are men and women. All the subsequent definitions will be given for this specific situation only, in order to keep the presentation as simple and understandable as possible. Bear in mind that all the following considerations also apply accordingly to other parameters, for instance the correlation coefficient of two variables.

The Neyman-Pearson theory of statistical hypothesis testing involves, of course, the fact that even if there is a very small difference between the means of any two populations, we can be sure of detecting it simply by getting a sufficiently large number of observations. In other words, significance is just a question of, so to say, „being busy enough“! Consequently, such an analysis is not really worth our while, since the researcher looks only for differences of any kind, regardless of whether these differences are of any practical magnitude. As a matter of fact, not even the means of screws produced on different days are exactly the same – a commonly used example in introductory books to statistics.

From a didactic point of view, the often cited study of Belmont and Marolla (1973) serves as an impressive example of the artificial use of significance-based interpretation of empirical results. This study deals with differences concerning the intelligence of testees without and with siblings, up to the number of eight siblings. The authors sampled intelligence test data from altogether n = 386 114 Dutch recruits, which established a continuously descending IQ: Apart from recruits without any siblings – these being similar to recruits with
two siblings – the IQ becomes smaller and smaller, starting with recruits with exactly one sibling. Zajonc (1976) speculated on a certain socio-economic model as an explanation of this phenomenon; this model would have severe consequences if taken seriously by society. In actual fact, he established his model with reference to several more studies also based on very large samples: 800 000 17-year-old scholarship candidates in the USA, 120 000 6- to 14-year-old testees in France, and 70 000 11-year-old testees in Scotland. However, neither he nor many readers of his papers realized that all the significant differences are of almost no relevance. Take, for example, recruits with either none, one, two, or three siblings. Then the 386 114 recruits have to be reduced to about 3/4, which is about 290 000 – still a very large sample; however, the largest difference in means is then hardly larger than a tenth of the standard deviation (cf. Kubinger & Gittler, 1983). That is, for an IQ with σ = 15 a mean difference of 1.5 IQ points results, which no serious psychologist would ever interpret as worthwhile: Pertinent intelligence tests with reliabilities up to 0.95 would lead to a 95-percent confidence interval with a length of at least (twice) 6.6 IQ points!

This serves to complete our argument that significance does not qualify a study per se; it is rather the content relevance that does. Therefore we should not ask for any difference at all but rather for a certain relevant difference, say δ = µ1 - µ2. And this should be fixed beforehand. A researcher should not start with any kind of data sampling and analysis just for interest’s sake but rather based only on deliberate considerations: what extent of difference δ would cause practical or theoretical consequences given that these results have been empirically established.
3. The power function of statistical hypothesis testing

In the case of a one-sided test, a well known graphical representation of the two probabilities of errors for a fixed sample size is shown in Figure 1. As usual we call the probability of a type-I-error (rejecting a null-hypothesis which is true) the type-I-risk α and correspondingly the probability of a type-II-error (accepting a null-hypothesis which is wrong) the type-II-risk β. If the difference of practical interest δ = µ1 - µ2 is standardised as

δ* = (µ1 − µ2)/σ

(this can also be called the non-centrality parameter), and δ* equals 3, the two risks can be found in Figure 1 under the corresponding density curve. That means α is shown below the left symmetric curve of the central t-distribution (if the null-hypothesis is true) and β is shown below the density curve of the non-central t-distribution (if the alternative hypothesis δ* = 3 is true). If for a fixed sample size we make α smaller (shift the quantile line t(df, 1 - α) to the right), β becomes larger. To make both risks smaller, we have to increase the sample size, because then the shape of both curves becomes steeper and steeper.

The power function is defined as the probability of rejecting the null-hypothesis – in our case this is the equality of two means – as a function of the difference between the two means. If the two population means under discussion are actually equal, this probability quantifies the type-I-error, that being the risk of rejecting the null-hypothesis although it is true. On the other hand, if the two means differ from each other, the probability of rejecting the null-hypothesis for the t-test increases with the difference of these means (cf. Fig. 2: the
given standardised difference (µ1 − µ2)/σ = δ* > 0 and the resulting probability 1–β for rejecting the null-hypothesis).
Figure 1: The density function of the test statistic (here of the t-test; f(t)) under the null-hypothesis (left curve) and – just for simplification – under a simple one-sided alternative hypothesis µ1 - µ2 = δ > 0 (right curve, for δ* = δ/σ = 3)
Figure 2: The power function of the t-test for a two-sided alternative hypothesis µ1 ≠ µ2 and a type-I-risk of α = 0.05, for n = 5 and n = 20
Of course, any researcher aims to get a relatively high probability 1–β, given the mean difference is at least the δ which corresponds to the relevance. Now, as denoted in Figure 2, the power function amounts to 1–β for the standardised difference δ* = δ/σ, as β is the probability of a type-II-error (that being the risk of confirming the null-hypothesis although it is wrong). Consequently, for any δ0 > δ the probability of rejecting the null-hypothesis becomes larger; the factual power is therefore larger than 1–β and the factual type-II-risk β0 is smaller than β.
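As a numerical illustration (our own sketch, not part of the original paper), the power function plotted in Figure 2 can be reproduced from the non-central t-distribution; the `scipy` library and the function name `t_test_power` are our assumptions:

```python
# Power of the two-sided two-sample t-test, computed from the non-central
# t-distribution (a sketch illustrating Figure 2; assumes scipy is available).
from scipy import stats

def t_test_power(delta_star, n, alpha=0.05):
    """Power for the standardised difference delta_star = (mu1 - mu2)/sigma
    with n observations per group and a two-sided type-I-risk alpha."""
    df = 2 * n - 2                              # degrees of freedom
    nc = delta_star * (n / 2) ** 0.5            # non-centrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-sided critical value
    # probability of rejecting H0 under the non-central t-distribution
    return (1 - stats.nct.cdf(t_crit, df, nc)
            + stats.nct.cdf(-t_crit, df, nc))

# For delta_star = 1 the power is clearly larger for n = 20 than for n = 5,
# matching the tendency of the two curves in Figure 2.
print(t_test_power(1.0, 5), t_test_power(1.0, 20))
```

At δ* = 0 the function returns α itself, as the text above requires of a power function.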
4. Statistical hypothesis testing-designed experiments and surveys

It has already been indicated that sampling data and analysing it without pre-determining the type-I-risk, the type-II-risk, and the extent of the relevant effect (i.e. here the relevant difference δ of two means) would lead to results as one pleases but without any scientifically-based merits. In such cases, the results are results by chance and could, with the same justification, just as well be obtained by tossing a coin. Whether the researcher gains „important“ results – important because of the fact that they are significant – depends merely on the sample size he/she was able to recruit or has decided on for convenience’s sake (see the two curves in Fig. 2, which disclose the tendency of the power function as n becomes larger).

As already stated, designing an empirical study, an experiment or survey, in order to investigate whether any given relevant (minimum) difference δ between two populations exists, requires the fixation of the type-I-risk α and the type-II-risk β beforehand. Either the sample size should be appropriately calculated or a sequential testing procedure should be carried out. The latter is especially advisable if the elements can be sequentially sampled quite easily, as is often the case with patients in clinical research work or the like. In doing so – either calculating the adequate sample size or testing sequentially – it is self-evident that one should use that particular α that was determined at the very beginning for making the decision on „significant vs. non-significant“. Therefore, it would be totally senseless to use any other α at the very end of the analysis.
Either the sample size n has been calculated for a certain and fixed α, which must not be altered afterwards because then every conclusion would be based on an altered value of β, too; or sequential testing has terminated because of significance at level α – further effort would not pay off just to try to prove significance at a level α0 < α. Instead of asking for significance at ever smaller levels, a researcher might, for instance, test the null-hypothesis that a correlation coefficient ρ amounts to at least some relevant value ρ0. In this case, a very important result would then be found if the null-hypothesis is not rejected, because then the extent of the association is (at least) ρ0 – given a certain type-I-risk. This approach is hardly implemented anywhere in psychology research work. This is probably because only a few researchers dare to fix the effect that is of relevance before they know the results of the analysis.

Another problem within psychological research work is that using *, **, or *** is supported by statistical software packages, in particular SPSS and SAS. However, the presentation of the respective p-value is only a matter of a traditional technical simplification; in other words, it is more convenient to programme p-values for the output than to ask the researcher at the very beginning for a fixed α-level in order to compare the resulting p-value
with the critical p-value at α at the end. So, statistically unsure researchers are misled into interpreting a lower p-value as more important, that is, as though caused by a greater effect.
5. Formalization of statistical hypothesis testing – the conventional approach

In order to prove our argument more formally we will start with the following. We have two populations 1 and 2 with the normally distributed variables y1 and y2, which supposedly have the means µ1, µ2 and the variances σ1², σ2². We consider two independent random samples (y11, ..., y1n1) and (y21, ..., y2n2) of sizes n1 and n2 from the two populations under consideration in order to test the null-hypothesis H0 : µ1 = µ2 against one of the following one- or two-sided alternative hypotheses:

a) HA : µ1 > µ2,
b) HA : µ1 < µ2³, or
c) HA : µ1 ≠ µ2.

The sample sizes n1 and n2 should be determined in such a way that for a given type-I-risk α, the type-II-risk β0, as a function of δ = µ1 - µ2, does not exceed a predetermined upper bound β as long as, for the respective alternative hypothesis, we have either a) µ1 - µ2 > δ, b) µ1 - µ2 < -δ, or c) |µ1 - µ2| ≥ δ. Once more, the minimal difference of practical interest, respectively the (relevant) effect
δ between µ1 and µ2 has to be fixed by the researcher at the very beginning of the design of his/her experiment or survey.

As presupposed in our case with equal variances, the hypotheses above are best tested either with the one-sided Student’s t-test – cases a) and b) – or the two-sided Student’s t-test – case c). As is already well-known, the test is based on the pooled estimator of the common variance σ² given by

s_p² = [ Σ_{i=1}^{n1} (y1i - ȳ1.)² + Σ_{i=1}^{n2} (y2i - ȳ2.)² ] / (n1 + n2 - 2).   (1)

³ The correct formulation of the null-hypothesis is then H0 : µ1 ≥ µ2.
The test statistic is

t = (ȳ1 - ȳ2)/s_p · √( n1·n2 / (n1 + n2) )   (2)

and H0 is rejected in case a) if t > t(n1+n2–2; 1–α), in case b) if t < –t(n1+n2–2; 1–α), and in case c) if |t| > t(n1+n2–2; 1–α/2); otherwise it should be accepted.
As the variances are assumed to be equal, this test leads to the smallest possible sample sizes for a certain α and β. Given the total number of observations n1 + n2, the variance of all possible differences of the means of both the populations under consideration is minimised with n1 = n2 = n, i.e. when the two sample sizes are equal. The value of n can be determined iteratively from (cf. Rasch, Verdooren & Gowers, 1999, p. 83):
n = CEIL( (2σ²/d²) · ( t(2n - 2; P) + t(2n - 2; 1 - β) )² )   (3)

where d is the relevant difference δ, P = 1–α for a one-sided alternative hypothesis and P = 1–α/2 for the two-sided alternative hypothesis – here CEIL is an abbreviation of ceiling and means: „take the argument as it is, if it is an integer, otherwise take the next larger integer“.

If no a-priori information is available concerning the common variance σ², we can use an algorithm already described by Rasch (2003). However, it is very easy to calculate the sample size by the module „MEANS“ of the CADEMO software (Rasch, Herrendörfer, Bock, Victor & Guiard, 1996, 1998; or Rasch, Verdooren & Gowers, 1999).
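The iteration behind formula (3) is easy to program oneself; the following sketch (ours, not CADEMO’s MEANS module) assumes `scipy` for the t-quantiles:

```python
# Iterative solution of formula (3): n appears on both sides via the
# degrees of freedom 2n - 2, so we iterate until the value stabilises.
import math
from scipy import stats

def sample_size_per_group(sigma2, d, alpha=0.05, beta=0.2, two_sided=True):
    """Smallest n per group so that the t-test with type-I-risk alpha has
    type-II-risk at most beta for the relevant difference d."""
    p = 1 - alpha / 2 if two_sided else 1 - alpha
    n = 2                                        # starting value
    for _ in range(100):                         # converges after a few steps
        df = 2 * n - 2
        q = stats.t.ppf(p, df) + stats.t.ppf(1 - beta, df)
        n_new = math.ceil(2 * sigma2 / d ** 2 * q ** 2)
        if n_new == n:
            break
        n = n_new
    return n

# The numerical example of paragraph 7: alpha = 0.05, beta = 0.2,
# sigma^2 = 25 and d = 5 yield 17 observations per group.
print(sample_size_per_group(25, 5))   # 17
```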
6. Formalization of statistical hypothesis testing – the approach of sequential testing

Sequential testing means a multiple-staged sequencing approach to data sampling and analysing. If we consider in the following, for reasons of generality, the symbol θ for some parameter in question – in our problem above θ = δ – then we may have to test

H0 : θ = θ0 against H1 : θ = θ1 with θ0 ≠ θ1.   (4)
Contrary to a conventional test with only two possible decisions (accepting or rejecting H0), a sequential test leads, after each step of data sampling, to one of the following three (!) decisions: rejecting H0, rejecting H1, or continuing with data sampling. The random „walk“ during a sequential test can be illustrated graphically by a sequential path in a coordinate system: The sample size, or any monotonic function of it, is shown on the abscissa and the values of the test statistic calculated from the cumulating data are shown on the ordinate. In this coordinate system a „continuation region“ is defined, indicating that data sampling should not be stopped. As long as the sequential path is inside this region, the procedure continues with collecting the next observation. If one of the two boundaries is reached or passed by this path, one of the two decisions „rejecting H1“ or „rejecting H0“ is established.

The first sequential test was the sequential probability ratio test of Wald (1947). However, this test and also other sequential probability ratio tests based on Wald’s test (such as Hajnal’s test described in Rasch & Guiard, 2004; we call these tests Wald-type tests) have a serious disadvantage: They end with probability 1 after a finite number of steps, but nobody knows the upper bound for the number of steps. These tests are therefore called „open“ tests. In the case of testing a mean against some hypothesized constant, the continuation region of the Wald test is given by two parallel lines. As long as the path is between the two lines, the procedure continues. If the upper line is passed, then the alternative hypothesis is to be rejected, that is to say, the null-hypothesis has stood the test (or, to say it sloppily: the null-hypothesis is to be „accepted“); if the lower line is passed, then the null-hypothesis is to be rejected. Among all sequential tests with the same risks, Wald’s test requires the smallest expected sample size.
On the other hand, so-called closed tests have been developed which do end after a maximum number of steps. As a consequence, there is a small increase in the expected sample size in comparison to that of the Wald test, but this expected sample size is often much smaller than the sample size of a conventional test with a sample size fixed in advance. One of these closed tests is the „triangular sequential test“, the name of which was chosen because the continuation region is a triangle. The triangular sequential test was described by Whitehead (1983) and has been slightly improved by Schneider (1992).

We consider an observed sample (y1, y2, …, yn) from n independently and identically distributed random variables yi with probability density f(yi, θ) for a given but unknown parameter θ. We call this function for fixed yi the likelihood function. The log-likelihood of a single observation yi is:
l(yi, θ) = ln f(yi, θ).   (5)
Assuming appropriate continuity conditions for f with respect to θ, a Taylor series expansion – neglecting third and higher order terms – gives approximately

l(yi, θ) = const. + θ·Zi - (1/2)·θ²·Vi,   (6)

where
Zi = l_θ(yi, 0) and Vi = -l_θθ(yi, 0)   (7)

– the subscripts θ and θθ indicate the first and second order derivatives with respect to θ. The Z-values are called the „efficient score statistic“ and the V-values „Fisher’s information statistic“.
We consider the sample statistic Z = Σ_{i=1}^{n} Zi and correspondingly the sample statistic V.
It was shown by Wald (1947) that the random variable Z as a function of a random sample is asymptotically normally distributed with expectation θV and variance V. Hence, the continuation region of the triangular sequential test for α = β is defined by the inequality

-a + 3cV < Z < a + cV   (8)

with

a = (2/θ1)·ln( 1/(2α) ) and c = θ1/4.   (9)

The test then terminates at the latest at

Vmax = a/c = 8·ln( 1/(2α) )/θ1² and Zmax = 2a = 4·ln( 1/(2α) )/θ1.   (10)
How to proceed if α ≠ β is shown by Schneider (1992). The application of the triangular sequential test can be carried out very easily with the program PEST or with TRIQ of CADEMO, even for cases of α ≠ β. Note that sequential testing must be carried out on the data in exactly the order in which they are observed. This is automatically guaranteed if an analysis is done after each observation. That is to say, the data must not be rearranged in order to suit the theory of sequential testing. The same holds for continuing sequential testing after a final decision (rejection of one of the hypotheses) has been made: this is not admissible either.
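For the symmetric case α = β, the boundaries (8)-(10) are simple enough to sketch directly; the following fragment is our illustration (the names are not those of PEST or TRIQ):

```python
# Triangular sequential test for alpha = beta: boundary constants from (9)
# and (10), and the three-way decision based on the continuation region (8).
import math

def triangular_boundaries(theta1, alpha):
    """Constants a and c of (9), plus the termination point (10)."""
    a = 2 / theta1 * math.log(1 / (2 * alpha))
    c = theta1 / 4
    return a, c, a / c, 2 * a          # a, c, V_max, Z_max

def decide(z, v, theta1, alpha):
    """One step: reject H0, reject H1 (i.e. 'accept' H0), or continue."""
    a, c, v_max, z_max = triangular_boundaries(theta1, alpha)
    if z >= a + c * v:                 # upper boundary passed
        return "reject H0"
    if z <= -a + 3 * c * v:            # lower boundary passed
        return "reject H1"
    return "continue"                  # still inside the triangle
```

At V = V_max the two lines meet at Z_max = 2a, so the procedure necessarily terminates: the test is closed.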
7. A numerical example

Aiming for a conventional Student’s t-test, we wish to determine the minimal sample size for testing the null-hypothesis of the equality of two means from independent normally distributed variables with common variance; the alternative hypothesis is a two-sided one. We choose α = 0.05, β = 0.2 and δ = 5. A rough idea about σ² from earlier experiments might be σ̂² = 25, as a consequence of which the difference of practical interest is about σ. Then
using formula (3) we obtain 17 observations that have to be sampled in each sample. As a matter of fact, this result is always the same as long as α = 0.05, β = 0.2 and the ratio δ/σ = 1.

After determining the sample size in this way we have to use α = 0.05, and we reject the null-hypothesis if after data sampling the t-value in (2) with n1 = n2 = 17 exceeds the 97.5% quantile of the t-distribution with 32 d.f., i.e. if t ≥ 2.037. If the t-value calculated from (2) also exceeds the 99.5% quantile of the t-distribution with 32 d.f., this is of no use and must not be emphasized by using the symbol **.

Let us apply a sequential triangular test and again choose α = 0.05, β = 0.2, and δ = 5. Let us assume that the first four observations are as given in Table 1 above the bold line.
Table 1: The data analysed by TRIQ – the first four observations above the bold line; all observations in the order of their occurrence

Step number   Group   Observed value   n1 resp. n2
 1            1       4.6              1
 2            2       5.2              1
 3            1       5.1              2
 4            2       6.2              2
------------------------------------------------
 5            1       4.9              3
 6            2       5.9              3
 7            1       4.8              4
 8            2       5.2              4
 9            1       4.4              5
10            2       6.7              5
11            1       5.3              6
12            2       6.9              6
13            1       4.8              7
14            2       5.9              7
In Figure 3 the path of the Z-values from (7) along the sample size (or the V-values from (7); the curve starting from point zero on the ordinate) at first remains inside the grey area (continuation region), so we have to continue. After 10 further data are sampled, we terminate (see Tab. 1 again). Because the curve of Z-values finally leaves the grey area, we have to reject the null-hypothesis. Of course, this decision is possible exclusively at the given α-level of 0.05; no other α-level is feasible for interpretation. In order to emphasize the senselessness of the „asterisks practice convention“, we performed the conventional Student’s t-test for these 14 observations and obtained by means of SPSS the result t = -4.207 and „Sig. (two-tailed)“ = .001, which many researchers would be encouraged to promote with ***!
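As a quick check (ours, not part of the paper’s software), the SPSS result can be reproduced by computing (1) and (2) directly for the 14 observations of Table 1:

```python
# Conventional two-sample t statistic for the Table 1 data: group 1 are the
# odd steps, group 2 the even steps; reproduces t = -4.207.
import math

group1 = [4.6, 5.1, 4.9, 4.8, 4.4, 5.3, 4.8]
group2 = [5.2, 6.2, 5.9, 5.2, 6.7, 6.9, 5.9]

def two_sample_t(y1, y2):
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    ss1 = sum((y - m1) ** 2 for y in y1)           # squared deviations
    ss2 = sum((y - m2) ** 2 for y in y2)
    s_p = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))   # pooled s_p from (1)
    return (m1 - m2) / s_p * math.sqrt(n1 * n2 / (n1 + n2))

print(round(two_sample_t(group1, group2), 3))   # -4.207
```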
Figure 3: Graph for the triangular sequential design for α = 0.05; β = 0.2 and δ = σ. Remark: The fixed sample size needed for the conventional t-test was approximately calculated by TRIQ by using the normal quantiles in place of the quantiles of the t-distribution and is thus in the figure given by 32 (N = n1 + n2) in place of 34 calculated from (3)
The realised sample size N = n1 + n2 = 14 was smaller than that (N = n1 + n2 = 34) needed for the t-test but also smaller than the expected sample size of TRIQ (22 and 16 respectively).
8. A simulation study for TRIQ

We have seen that TRIQ is a very good alternative to the conventional t-test if sequential sampling is reasonable. But questions still remain:

• The risks α and β applied in TRIQ – as in any sequential test – are known to be just approximations of those values fixed in advance; how accurate are these approximations?
• The range of the finally realized sample sizes is also of interest.
• And then there is also the question of how robust TRIQ is with regard to non-normality (the fact that the t-test is quite robust was shown by Rasch & Guiard, 2004).
In the following, therefore, the average sample size of the triangular sequential t-test will be estimated by a simulation study. Because probability estimates of α and β are aimed for, the statistical test should be repeated 10 000 times (cf. for instance Rasch & Guiard, 2004). That is, TRIQ has been used 10 000 times for each of the following situations – the hypotheses were H0 : µ1 = µ2 and HA : µ1 - µ2 = δ = 5, the standard deviation was σ = 5, and α and β were fixed by the pairs:

Case 1: α = 0.05 and β = 0.2
Case 2: α = 0.05 and β = 0.1
Case 3: α = 0.01 and β = 0.1
Case 4: α = 0.01 and β = 0.05
For each of these cases a normally distributed variable was simulated in situation 1. In situations 2, 3, and 4 the Fleishman distribution was simulated (see Rasch & Guiard, 2004):

Situation 1: normal distribution
Situation 2: distribution with skewness 0 and kurtosis 4
Situation 3: distribution with skewness 1 and kurtosis 4
Situation 4: distribution with skewness 2 and kurtosis 7
– the last one is very extreme and hardly occurs in practical experimentation, but the three non-normal situations serve to give an idea of just how robust the test is with regard to non-normality. The normally distributed random variables were generated by the standard generator of SPSS. The distributions in the three other situations were generated from the Fleishman system (Fleishman, 1978) with coefficients taken from Table A3 in Nürnberg (1982). A Fleishman random number y with mean 0, variance 1, skewness γ1 and kurtosis γ2 can be obtained from a normal random number u with mean 0 and variance 1 by the transformation
y = -c + bu + cu² + du³,

where the values of b, c, and d are:

Situation   b                 c                 d
2           0.737381231250    0                 0.080925091736
3           0.776585015190    0.116977231170    0.065499503056
4           0.761585274860    0.260022598940    0.053072273491
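The transformation is straightforward to implement; the sketch below is ours, using the coefficients of situation 3 and Python’s standard normal generator instead of that of SPSS:

```python
# Fleishman transformation y = -c + b*u + c*u^2 + d*u^3 of a standard
# normal number u; coefficients for situation 3 (skewness 1, kurtosis 4).
import random

B, C, D = 0.776585015190, 0.116977231170, 0.065499503056

def fleishman(u):
    """Non-normal random number with mean 0 and variance 1."""
    return -C + B * u + C * u ** 2 + D * u ** 3

def observation(mu, sigma=5.0, rng=random):
    """One simulated observation as in the study: scaled by sigma = 5
    and shifted by the respective population mean mu."""
    return mu + sigma * fleishman(rng.gauss(0.0, 1.0))
```

Averaging a large number of such draws confirms that the transformed variable keeps mean 0 and variance 1 before scaling.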
The transformed values were finally multiplied by the standard deviation σ = 5, and the means µ1 and µ2 respectively were added. For each of the 10 000 runs in each (case × situation)-combination the sample size needed was recorded, and it was then decided whether the
null-hypothesis should be accepted or rejected. From the 10 000 runs we calculated the average, the minimum and the maximum sample size as well as the relative frequencies of rejecting (frej) and accepting (facc) the null-hypothesis. Two values of the non-centrality parameter, analogous to the example in paragraph 7, were simulated:

A: µ1 - µ2 = 0 and
B: δ = µ1 - µ2 = σ = 5.

In Table 2, frej is given for case A, thus estimating the nominal risk α (0.05; 0.01), and facc is given for case B, thus estimating the nominal risk β (0.2; 0.1; 0.05) at δ = µ1 - µ2 = σ = 5. The average total size N = n1 + n2 is given for case A as N(A) and for case B as N(B).
Table 2: Relative frequencies of accepting (facc) and rejecting (frej) the null-hypothesis and empirical 0.025- and 0.975-quantiles (in []) of these frequencies from 10 000 simulations for each (case × situation)-combination; the sums N(A) and N(B) of the average of both sample sizes needed for a decision and empirical 0.025- and 0.975-quantiles (in []). The sum N of the two sample sizes needed for the respective non-sequential fixed sample size t-test is given in the first column (calculated by CADEMO from (3))
Case 1 (α = 0.05, β = 0.2; N = 34):
  Situation 1: frej .054 [.050; .058], N(A) 22.9 [7; 38.8]; facc .182 [.175; .189], N(B) 20.5 [4.6; 36.4]
  Situation 2: frej .053 [.049; .057], N(A) 21.4 [7.3; 35.5]; facc .153 [.146; .160], N(B) 19.7 [4.4; 35]
  Situation 3: frej .047 [.043; .051], N(A) 22.3 [6.6; 38]; facc .158 [.151; .165], N(B) 20 [4.5; 35.5]
  Situation 4: frej .045 [.041; .049], N(A) 22.9 [6.7; 39.1]; facc .142 [.136; .148], N(B) 19.5 [4.2; 34.8]

Case 2 (α = 0.05, β = 0.1; N = 46):
  Situation 1: frej .049 [.045; .054], N(A) 29.8 [11.6; 48]; facc .089 [.084; .094], N(B) 27.6 [6.5; 48.7]
  Situation 2: frej .047 [.043; .051], N(A) 28 [9.3; 46.7]; facc .083 [.078; .088], N(B) 26.2 [6.1; 46.3]
  Situation 3: frej .049 [.045; .053], N(A) 29.1 [7.7; 50.5]; facc .082 [.077; .087], N(B) 27.5 [7.3; 47.7]
  Situation 4: frej .045 [.041; .049], N(A) 30.3 [10.3; 50.3]; facc .074 [.069; .079], N(B) 26.5 [6.3; 46.7]

Case 3 (α = 0.01, β = 0.1; N = 64):
  Situation 1: frej .009 [.008; .01], N(A) 53.8 [27.1; 80.5]; facc .086 [.081; .091], N(B) 43.4 [15.7; 71.1]
  Situation 2: frej .009 [.008; .01], N(A) 46.4 [11; 81.8]; facc .087 [.082; .092], N(B) 43.8 [16.3; 71.3]
  Situation 3: frej .008 [.007; .009], N(A) 47.1 [21.5; 72.7]; facc .089 [.084; .094], N(B) 43.5 [16.2; 70.8]
  Situation 4: frej .008 [.007; .009], N(A) 48 [19; 77]; facc .081 [.076; .086], N(B) 42.8 [15.7; 69.9]

Case 4 (α = 0.01, β = 0.05; N = 72):
  Situation 1: frej .009 [.008; .01], N(A) 52.2 [18.9; 85.5]; facc .046 [.042; .050], N(B) 52.5 [19.2; 85.8]
  Situation 2: frej .010 [.009; .011], N(A) 46.9 [26.4; 67.4]; facc .042 [.039; .045], N(B) 49.9 [18.2; 81.6]
  Situation 3: frej .008 [.007; .009], N(A) 55.4 [21.8; 89]; facc .050 [.046; .054], N(B) 49.6 [17.2; 82]
  Situation 4: frej .008 [.007; .009], N(A) 56.5 [25.4; 87.6]; facc .043 [.040; .046], N(B) 50.7 [18.5; 82.9]
Let us discuss the entry in case 1 × situation 1. This is the case of the assumption of normality (cf. paragraph 7). We found frej: .054 [.050; .058]. This means the actual frequency of rejecting a null-hypothesis which is true is on average 0.054 and was in 9500 of the runs between 0.050 and 0.058 (for a nominal α = 0.05). facc: .182 [.175; .189] analogously means that the actual frequency of accepting the null-hypothesis if it is wrong (for µ1 - µ2 = σ = 5) is on average 0.182 (or the power is 0.818) and was in 9500 of the runs between 0.175 and 0.189 (for a nominal β = 0.2, i.e. a nominal power of 0.8). The sample sizes N(A) varied in
9500 of the runs between 7 and 38.8 with a mean of 22.9 and the sample sizes N(B) varied in 9500 of the runs between 4.6 and 36.4 with a mean of 20.5. It is not surprising that a test developed under normal assumptions behaves better in some of the non-normal cases than in normal cases: This could already be found in the robustness research reported in Rasch and Guiard (2004).
9. Conclusion

In all simulations, even those for extreme non-normal distributions, the triangular sequential test is more powerful than the conventional t-test (the actual type-II-risk is always below the value fixed in advance). Only in case 1 × situation 1 is the actual type-I-risk above the value that was fixed in advance; however, things improve with smaller type-II- and type-I-risks and also in some cases of non-normality. In other words, we should even be glad if we have to deal with non-normality, and there is no need to use non-parametric tests – that non-parametric tests are sensitive to violations of the assumed continuity of the underlying distribution is shown in Rasch, Häusler, Guiard, Herrendörfer and Kubinger (in prep.). Furthermore, the average sample size does not increase with increasing skewness or kurtosis of the distribution. In fact, triangular sequential t-tests prove to be completely robust against any violation of the normal distribution.

Hence, testing sequentially can be advised for the following reasons:

• There is no opportunity to use asterisks.
• Time and money can be saved because fewer observations are required.
• The test is extremely robust against non-normality.
References

1. Belmont, L. & Marolla, F.A. (1973): Birth order, family size, and intelligence. Science, 182, 1096-1101.
2. Fleishman, A.J. (1978): A method for simulating non-normal distributions. Psychometrika, 43, 521-532.
3. Kubinger, K.D. & Gittler, G. (1983): Der Zusammenhang von Geschwisterkonstellation und Intelligenz in kritischer Sicht [Some critical remarks: The relationship between birth order or family size and intelligence]. Sprache und Kognition, 2, 254-263.
4. Nürnberg, G. (1982): Beiträge zur Versuchsplanung für die Schätzung von Varianzkomponenten und Robustheitsuntersuchungen zum Vergleich zweier Varianzen [On the estimation of variance components and the robustness of comparisons of two variances]. Probleme der angewandten Statistik, 6, Dummerstorf-Rostock.
5. Rasch, D. (2003): Determining the Optimal Size of Experiments and Surveys in Empirical Research. Psychology Science, 45, 3-47.
6. Rasch, D., Herrendörfer, G., Bock, J., Victor, N. & Guiard, V. (1996): Verfahrensbibliothek Versuchsplanung und -auswertung, Band I [Collection of procedures for designing and analysing experiments, vol. I]. Munich: Oldenbourg.
7. Rasch, D., Herrendörfer, G., Bock, J., Victor, N. & Guiard, V. (1998): Verfahrensbibliothek Versuchsplanung und -auswertung, Band II [Collection of procedures for designing and analysing experiments, vol. II]. Munich: Oldenbourg.
8. Rasch, D. & Guiard, V. (2004): The Robustness of Parametric Statistical Methods. Psychology Science, 46, 175-208.
9. Rasch, D., Verdooren, L.R. & Gowers, J.I. (1999): Fundamentals in the Design and Analysis of Experiments and Surveys – Grundlagen der Planung und Auswertung von Versuchen und Erhebungen. Munich: Oldenbourg.
10. Rasch, D., Häusler, J., Guiard, V., Herrendörfer, G. & Kubinger, K.D. (in prep.): How robust are non-parametric significance tests in cases of non-continuous distributions?
11. Schneider, B. (1992): An Interactive Computer Program for Design and Monitoring of Sequential Clinical Trials. Proceedings of the XVIth International Biometric Conference 1992, Hamilton, New Zealand (pp. 237-250).
12. Wald, A. (1947): Sequential Analysis. New York: Wiley.
13. Whitehead, J. (1983): The design and analysis of sequential clinical trials. Chichester: Ellis Horwood.
14. Zajonc, R.B. (1976): Family configuration and intelligence. Science, 192, 227-236.