Endodontic leakage studies reconsidered. Part II. Statistical aspects

International Endodontic Journal (1993) 26, 44-52

A. H. B. SCHUURS, M.-K. WU, P. R. WESSELINK & H. J. DUIVENVOORDEN*
Department of Cariology and Endodontology, Academic Centre for Dentistry Amsterdam (ACTA), and *Department of Psychology and Psychotherapy, Erasmus University, Rotterdam, The Netherlands

Summary

The aim of many endodontic studies is to compare two or more treatment methods, techniques or materials, for example, to detect differences in mean leakage scores. As it is not feasible to study large populations, samples are taken. The important question then arises as to how large the sample sizes have to be in order to establish the 'true' (= populations') mean scores. First, it must be determined which magnitude of the difference (= v) between the mean scores is of endodontic interest. Based upon v and a few related statistical parameters, one may calculate how large the sample must be in order that a statistical test yields a significant result for a difference that is of endodontic importance. In other words, the 'power' of a test, depending on the sample size among other factors, must be large enough to detect the 'true' a priori determined difference between the populations. The use of small sample sizes may imply that a rather large difference between two mean leakage scores is not found to be significant, thereby leading to incorrect conclusions. This article describes the power and the related statistical factors that determine the adequate size of samples. Examples of power calculation are presented. Next, the power of published endodontic leakage studies was evaluated. Almost two-thirds of the sample sizes were 10 or less, and about 90% were 20 or less. Less than one-half of the tests had an adequate power (conventionally ≥ 0.80). It is necessary to be cautious when extrapolating the results of such studies, because of the limited power of the statistical tests. The power may be increased by using larger sample sizes or, alternatively, by enlarging the 'effect size', either by taking an interest in a larger difference between the mean scores or by minimizing the variability of the data.

Keywords: gutta-percha, lateral condensation, leakage, root canal obturation, statistical power, statistics.

Correspondence: Dr A. H. B. Schuurs, Department of Cariology and Endodontology, Academic Centre for Dentistry Amsterdam (ACTA), Louwesweg 1, 1066 EA Amsterdam, The Netherlands.

Introduction

In recent years almost one in every four publications in the endodontic journals has dealt with in-vitro leakage after root canal treatment (Wu & Wesselink 1992). The studies concern various treatment methods, techniques and materials, which are usually compared with cold lateral condensation of gutta-percha as the standard control. Comparative in-vitro studies are carried out to assess whether various treatments differ in outcome, and to decide which treatment is optimal (the latter to be confirmed in clinical trials). Thus a thorough statistical analysis is essential. The present article deals with statistical aspects of published endodontic leakage studies. The data are commonly analysed with parametric tests, i.e. t-tests and analysis of variance, which are preferable to the non-parametric approach, under the condition that certain assumptions (considered later) are met. The tests are used to compare the mean scores of different treatments. For the convenience of the reader not so well acquainted with statistics, introductory comments are made concerning testing. The t-test and analysis of variance are then considered. Finally, the power of the statistical tests of endodontic leakage studies is evaluated.


Power and sample sizes in testing equality of means

One is seldom able to study a whole population. Fortunately, it is sufficient to investigate samples, which must, however, reflect the population from which they are drawn to make the inferences worthwhile. Even random samples rarely represent the population exactly, yet they are recommended, as the magnitude of sampling errors is reduced. In order to test whether a research hypothesis (H1) is valid, the researcher formulates a null hypothesis (H0) in which H1 is negated. Next, the researcher tries to reject H0 in favour of H1, and he does so if an adequately chosen statistical test of the sample data reveals that the chances of obtaining results as extreme as those observed, supposing H0 is true, are very low. A major determinant of either rejecting or accepting H0 is the sample size, but other parameters, which are all statistically functionally interrelated, are also important (Cohen 1969). As well as the sample size, these parameters are the Type I and Type II error, and the effect size d, all summarized here.

Scheme 1. Differences between the population and the test results of the data of a sample; Type I and Type II errors (Willemsen 1974)

Decision based upon the sample data | In reality, H0 is true | In reality, H0 is false
Reject H0* | True H0 incorrectly rejected = Type I error (probability = α) | False H0 correctly rejected = no error (probability = 1 − β)
Accept H0 | True H0 correctly accepted = no error (probability = 1 − α) | False H0 incorrectly accepted = Type II error (probability = β)

*Implies acceptance of H1.

Sample sizes

Sample sizes (n) that are too small may prevent a real difference between two mean scores from being statistically significant. Thus, a non-significant result of a test is informative only when the sample sizes are sufficiently large. However, excessively large sample sizes incur unnecessarily high costs, take much time, expose patients unnecessarily to experimental treatments, and may reveal a small, even trivial difference to be significant. Therefore, it is necessary to determine optimal sample sizes prior to an experiment.

Type I error

In testing, for instance, whether leakage differs in straight and curved root canals (H1), it may be stated under H0 that the two mean leakage scores do not differ. H0 is rejected if a statistical test of the sample data yields a value whose associated probability of occurrence under H0 is very small. This probability, designated α, is set a priori, usually at 0.05. α represents the maximum risk, here 5%, one is willing to take in mistakenly rejecting H0, i.e. the statement in H0 in fact holds true in the population, but is nevertheless rejected.

Thus, small as that probability may be, the researcher may erroneously reject an actual 'true' H0. If one sets a more stringent standard, for instance α = 0.01, the probability that the sample data fulfil the criterion is lower, i.e. H0 is less easily rejected falsely. An analogy exists in law. One is 'not guilty' (H0) of a crime unless otherwise proven beyond reasonable doubt (H1). A judge or jury often does not know for sure whether a defendant is guilty or not. No problems exist either when an innocent individual is judged 'not guilty' or, alternatively, when a criminal is convicted. However, there are two kinds of potential error: (1) an innocent person is incorrectly condemned (which is comparable to the Type I error) and (2) a guilty person is incorrectly found 'not guilty'. A judge or jury who discharges the majority of the defendants in order to minimize the probability (α) that innocent individuals are falsely convicted, of course increases the risk of making the second error, i.e. releasing actual criminals. Clearly, the testimony of many unanimous and independent witnesses instead of one witness (an increase of n) may enlarge the chance of a correct judgement.

Type II error and power

As in the court example, as well as committing a Type I error, a researcher may make a Type II error, namely an actual 'false' H0 is not rejected, while it should be (Scheme 1). The probability of a Type II error is designated β. The Type II error, which often receives less attention than the Type I error, depends on the actual difference between two mean scores of the populations. In social sciences β is conventionally set at 0.20, but to avoid the serious consequences of not rejecting a 'false' H0, a value of 0.10 and even lower may be chosen. The probability of committing a Type II error decreases with increasing sample sizes, ceteris paribus.
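The dependence of β on the sample size can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: a two-sided comparison of two means with equal group sizes, a known common standard deviation, and a normal approximation in place of the t distribution. The function name `type_ii_error` and the value d = 1.0 are hypothetical and not taken from the studies discussed here.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def type_ii_error(n_per_group: int, d: float, alpha: float = 0.05) -> float:
    """Approximate beta for a two-sided, two-sample comparison of means.

    Assumptions (illustrative only): equal group sizes, a known common
    standard deviation, normal approximation. d is the standardized
    difference between the two population means.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = d * sqrt(n_per_group / 2)            # mean of the test statistic when H0 is false
    # H0 is retained when the statistic falls between the two critical values.
    return _STD_NORMAL.cdf(z_crit - shift) - _STD_NORMAL.cdf(-z_crit - shift)


# beta shrinks as n grows, with alpha and d held equal (ceteris paribus)
for n in (5, 10, 20, 40):
    print(n, round(type_ii_error(n, d=1.0), 3))
```

With d = 0 (no difference in the populations) the function returns 1 − α, which corresponds to the 'no error' cell for a true H0 in Scheme 1.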

The power of a statistical test is the probability that it will yield a significant result (Cohen 1969). A low power casts doubt on the conclusions drawn. Presentation of the value of the power enables the readers to assess whether (non-significant) findings are worthwhile. The power of a statistical test is defined as the probability of correctly rejecting an actually false H0 on the basis of the sample data (Cohen 1969), and is therefore equal to (1 − β). The larger n, the larger the power (holding α equal), but large sample sizes may be, as already stated, disadvantageous. For a given sample size, a decrease in α results in an increase in β and, by definition, a decrease in power.
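The interplay of n, α and power described above may be sketched numerically. The following is a rough illustration, not the procedure used in this article: it assumes a two-sided comparison of two means with equal group sizes and a known common standard deviation, and it uses the normal approximation, which tends to give slightly smaller sample sizes than an exact t-test calculation. The function names and the example effect sizes are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def power_two_sample(n_per_group: int, d: float, alpha: float = 0.05) -> float:
    """Approximate power (1 - beta) of a two-sided comparison of two means."""
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability that the statistic falls outside the two critical values.
    return _Z.cdf(shift - z_crit) + _Z.cdf(-shift - z_crit)


def required_n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate group size needed to reach the requested power."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_beta = _Z.inv_cdf(power)
    # Standard normal-approximation formula; the negligible far tail is ignored.
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


# Halving the standardized difference roughly quadruples the required group size.
print(required_n_per_group(1.0), required_n_per_group(0.5))

# For a fixed group size, a stricter alpha lowers the power.
print(round(power_two_sample(16, 1.0, alpha=0.05), 2),
      round(power_two_sample(16, 1.0, alpha=0.01), 2))
```

Under these assumptions, a group size of 10 yields a power well below the conventional 0.80 even for a standardized difference as large as 1.0.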

Effect size index

The power also depends on the effect size index d. The term 'effect size' points to the degree to which a phenomenon, a characteristic, or a difference is present in the population(s). A small difference (= effect) between two mean scores is not likely to be significant if the samples are very small. d is the ratio of the difference between two mean scores (v) to σ (the population's standard deviation) and as such is a standardized unit of variability. For instance, if the leakage in endodontically treated straight root canals is 5 mm and in curved canals it is 8 mm, the difference v = 3 mm. Now let us suppose that [...] = 0.54 [...]

In case of repetition of the t-test, one has to set α at a lower level. Furthermore, besides the pre-set level of α, it is preferable that the exact probability of the test is presented. With regard to ANOVA it is noted that significant results were in some instances due to the design of the studies. For instance, in comparing leakage scores of lateral condensation with different sealers, the mean score of the control group without a sealer appeared to be an 'outlier', which was the single cause for a significant F-ratio. In other words, comparison with a 'gold standard', i.e. a proven and generally accepted method, is needed. ANOVA provides information as to whether significant differences between the mean scores exist, but does not indicate which means differ. Some researchers applied repeated t-tests to test for differences in means, by which the risk of incorrectly rejecting H0 increased from α to α'. Other tests, such as the Student-Newman-Keuls test and Duncan's multiple range test, are available. In eight of the 35 publications, non-parametric tests had been performed. Not many investigators had recognized non-normality.
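The inflation of the Type I error risk by repeated t-tests can be illustrated with a short calculation. The sketch below rests on assumptions that are not taken from the text: it treats the pairwise tests as independent (pairwise comparisons that share groups are not, so the figure is only indicative) and it uses the Bonferroni adjustment as one common way of setting α at a lower level, which is not necessarily the adjustment the authors had in mind. The function names are hypothetical.

```python
from math import comb


def familywise_alpha(alpha: float, n_groups: int) -> float:
    """Risk of at least one false rejection over all pairwise t-tests.

    Assumes independent tests, which is an idealization.
    """
    k = comb(n_groups, 2)  # number of pairwise comparisons
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha: float, n_groups: int) -> float:
    """Per-test level that keeps the overall risk at or below alpha."""
    return alpha / comb(n_groups, 2)


# Four treatment groups give six pairwise comparisons.
print(round(familywise_alpha(0.05, 4), 3))   # overall risk with unadjusted tests
print(round(bonferroni_alpha(0.05, 4), 4))   # stricter level for each single test
```

Under the independence assumption, applying the stricter per-test level brings the overall risk back to just under the nominal α.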

Conclusions

In view of the size and difference between the mean scores and standard deviations, the sample sizes in endodontic leakage studies are often unjustifiably small, and it seems likely that many investigators do not perform a priori power calculations. Other dental and medical journals had and still have the same problem (Kingman 1977, Altman & Doré 1990, Dao et al. 1991). Prior to an experiment the researcher has to choose which statistical test(s) will be applied, and he must consequently determine the desired sample sizes. These are based upon α, β and d, and thus upon an estimate of σ, in order to guarantee a test with a sufficient power. In planning an endodontic leakage study one has to decide, if possible, which difference between the means is clinically important. Furthermore, one may influence the size of σ by controlling as many variables as possible, for instance by using similar types of teeth, root canals of similar size, standardized root canals, careful treatment methods, and so on. However, given the various uncertainties, it may be advisable to consider different scenarios, in which one varies both v and σ, and to take into account the costs of each scenario. A low power may imply that substantial differences are concluded to be non-significant. Performance of a parametric test demands that certain assumptions are met, which may be tested. If one has reason to doubt whether the assumptions are met, non-parametric tests (such as the non-parametric one-way analysis of variance by ranks (Kruskal-Wallis) test, followed by, for instance, the Terpstra test (Jonge de 1963) for determining a trend in the mean scores) are to be preferred, if applicable. It seems reasonable to conclude that the value of many endodontic leakage studies is limited because of a low power of the statistical tests applied, due to sample sizes that are too small.

References

ALTMAN D.G. & DORÉ C.J. (1990) Randomisation and baseline comparisons in clinical trials. Lancet 335, 149-153.
BEATTY R.G. (1987) The effect of standard or serial preparation on single cone obturations. International Endodontic Journal 20, 276-281.
BEATTY R.G. & ZAKARIASEN K.L. (1984) Apical leakage associated with three obturation techniques in large and small root canals. International Endodontic Journal 17, 67-72.
BEATTY R.G., VERTUCCI F.J. & ZAKARIASEN K.L. (1986) Apical sealing efficacy of endodontic obturation techniques. International Endodontic Journal 19, 237-241.
COHEN J. (1969) Statistical Power Analysis for the Behavioral Sciences. Academic Press, New York and London.
DAO T.T.T., LAVIGNE G.J., FEINE J.S., TANGUAY R. & LUND J.P. (1991) Power and sample size calculations for clinical trials of myofascial pain of jaw muscles. Journal of Dental Research 70, 118-122.
GARDNER J.M. & ALTMAN D.G. (1989) Statistics with Confidence, pp. 20-22. British Medical Journal, London.
GUILFORD J.P. & FRUCHTER B. (1978) Fundamental Statistics in Psychology and Education, 6th edn, pp. 155, 165-166. McGraw-Hill Kogakusha, Tokyo.
HAYS W.L. (1972) Statistics, p. 227. Holt, Rinehart & Winston, London.
JONGE DE H. (1964) Inleiding tot de Medische Statistiek. Part 2, 2nd edn, pp. 342-343, 480-482, 675-694. Wolters-Noordhoff, Groningen, The Netherlands.
JONGE DE H. & WIELENGA G. (1973) Statistische Methoden voor Psychologen en Sociologen, 4th edn, pp. 179, 200. Tjeenk Willink, Groningen, The Netherlands.
KINGMAN A. (1977) Adequate cohort sizes for caries clinical trials. Community Dentistry and Oral Epidemiology 6, 30-35.
LORTON L. & RETHMAN M.P. (1990) Statistics: curse of the writing class. Journal of Endodontics 16, 13-18.
MADISON S. & KRELL K.V. (1984) Comparison of ethylenediamine tetraacetic acid and sodium hypochlorite on the apical seal of endodontically treated teeth. Journal of Endodontics 10, 499-503.
MADISON S. & ZAKARIASEN K.L. (1984) Linear and volumetric analysis of apical leakage in teeth prepared for posts. Journal of Endodontics 10, 422-427.
PORKAEW P., RETIEF H., BARFIELD R.D., LACEFIELD W.R. & SOONG S.-J. (1990) Effects of calcium hydroxide paste as an intracanal medicament on apical seal. Journal of Endodontics 16, 369-374.
SIEGEL S. & CASTELLAN N.J. (1988) Nonparametric Statistics, 2nd edn, pp. 19-20, 215. McGraw-Hill Book Company, New York.
SNEDECOR G.W. & COCHRAN W.G. (1976) Statistical Methods, 6th edn, pp. 111-114. The Iowa State University Press, Ames, Iowa.
WILLEMSEN E.W. (1974) Understanding Statistical Reasoning, pp. 56-60. Freeman & Company, San Francisco.
WU M.-K. & WESSELINK P.R. (1992) Endodontic leakage studies reconsidered. Part I. Methodology, application and relevance. International Endodontic Journal 26, 37-43.
ZAKARIASEN K.L. & STADEM P.S. (1982) Microleakage associated with modified eucapercha and chloropercha root-canal-filling techniques. International Endodontic Journal 15, 67-70.