EFSA Journal 2011;9(9):2372
SCIENTIFIC OPINION
Statistical Significance and Biological Relevance¹
EFSA Scientific Committee²,³
European Food Safety Authority (EFSA), Parma, Italy
ABSTRACT

The Scientific Committee (SC) developed an opinion addressing the issue of statistical significance and biological relevance. The objective of the document is to help EFSA Scientific Panels and Committee in the assessment of biologically relevant effects. The SC considered the distinction between the concepts of biological relevance and statistical significance and produced descriptions of the terms. It is suggested that EFSA Experts and Staff members should use the terminology of biological relevance and statistical significance as interpreted by the SC in their considerations. The SC recommends that the nature and size of biological changes or differences seen in studies that would be considered relevant should be defined before studies are initiated. The size of such changes should be used to design studies with sufficient statistical power to be able to detect effects of such size if they truly occurred. Statistical significance is considered as just one part of an appropriate statistical analysis of a well designed experiment or study. Identifying statistical significance should not be the primary objective of a statistical analysis. The relationship of statistical significance to the concept of hypothesis testing was considered and the limitations on the use of hypothesis testing in the risk assessment process when interpreting data were noted. The SC therefore recommended that less emphasis should be placed upon the reporting of statistical significance and more on statistical point estimation and associated interval estimations (e.g. Confidence Interval) as more information can be presented using the latter. In addition, the SC recommends that a complete description of the methods used, the programming code and the raw data are made available to the assessors so that alternative analyses could be conducted to test the robustness of any conclusions drawn.

© European Food Safety Authority, 2011
KEY WORDS Statistical significance, biological relevance
1 On request from EFSA, Question No EFSA-Q-2010-00710, adopted on 8 September 2011.
2 Scientific Committee members: Boris Antunović, Sue Barlow, Andrew Chesson, Albert Flynn, Anthony Hardy, Michael Jeger, Ada Knaap, Harry Kuiper, David Lovell, Birgit Nørrung, Iona Pratt, Ivonne Rietjens, Josef Schlatter, Vittorio Silano, Frans Smulders and Philippe Vannier. Correspondence: [email protected]
3 Acknowledgement: The Panel wishes to thank the members of the Working Group on Statistical Guidance: Fernando Aguilar, Jean Louis Bresson, Lutz Edler, Gabor Lövei, David Lovell (Chair), Joe Perry, Guido Rychen, Mo Salman and Ivar Vågsholm for the preparatory work on this scientific opinion, and EFSA staff: Saghir Bashir, Frank Boelaert, Bernard Bottex, Laura Ciccolallo, José Cortiñas Abrahantes, Olaf Mosbach-Schulz, Didier Verloo and Gabriele Zancanaro for the support provided to this scientific opinion.
Suggested citation: EFSA Scientific Committee; Statistical Significance and Biological Relevance. EFSA Journal 2011;9(9):2372. [17 pp.] doi:10.2903/j.efsa.2011.2372. Available online: www.efsa.europa.eu/efsajournal
SUMMARY

The European Food Safety Authority (EFSA) asked its Scientific Committee (SC) to develop an opinion addressing the issue of statistical significance and biological relevance. For this purpose, a Working Group consisting of experts who are members of EFSA Scientific Panels and staff members of the EFSA Scientific Assessment Support unit (SAS) was convened. The objective of the document is to help EFSA Scientific Panels and Committee in the assessment of biologically relevant effects.

An assessment of the relevance of scientific studies for the work of EFSA is based upon a critical assessment of the evidence provided by these studies. Statistical analysis of the data is central to this assessment. Confusion in the use of words such as “significance”, “relevance” and “importance” can therefore hinder the assessment. The distinction between the concepts of biological relevance and statistical significance should be acknowledged when developing scientific opinions, with the words significance/significant being related to statistical concepts while relevance/relevant should be related to biological considerations. EFSA Experts and Staff should be encouraged to use the meanings of biological relevance and statistical significance as specified in this document.

The concept of biological relevance was explored. It is recommended that the nature and size of relevant biological changes or differences should be defined before studies are initiated. A pre-defined relevant biological effect should be used to design studies with sufficient statistical power to be able to detect such effects if they truly occurred.

It is considered a critical point that the statistical analyses commonly used make implicit assumptions about the data, including information on the design and conduct of a study. Although the mathematical calculations underlying the analyses can still be carried out independently of such information, doing so can lead to biased or unreliable results. No amount of statistical sophistication can rescue a badly designed study.

Statistical significance is just one part of an appropriate statistical analysis of a well designed experiment or study. Identifying statistical significance should not be the primary objective of a statistical analysis. The relationship of statistical significance to the concept of hypothesis testing was considered and the limitations on the use of hypothesis testing as a tool for decision making were noted. In particular, the practice of dichotomising experimental findings into significant and not significant results was considered of limited value and potentially misleading.

The SC concludes that less emphasis should be placed upon the reporting of statistical significance and more on statistical (point) estimation (e.g. of an effect) and associated interval estimation (e.g. Confidence Interval). It is considered that appreciably more information can be presented in the estimate of the size of an effect and its uncertainty, when described by a confidence interval, than when expressed solely by the results of significance tests.

There may be alternative approaches to the analysis of data. It is important that a complete description of the methods used, the programming code and the raw data are made available to the assessors if required, so that alternative analyses can be conducted to test the robustness of any conclusions drawn.
TABLE OF CONTENTS

Abstract
Summary
Table of contents
Background as provided by EFSA
Terms of reference as provided by EFSA
Assessment
1. Introduction
2. Context and definition of the problem
2.1. What is “biological relevance”?
2.2. What is “statistical significance”?
2.2.1. The Hypothesis Testing Framework
2.3. Relationship between Biological Relevance and Statistical Significance
2.4. Confirmatory studies vs. Exploratory studies
3. Errors in interpretation of Statistical Significance
3.1. “Absence of Evidence is not Evidence of Absence”
3.2. Statistical analysis is not only reporting Statistical Significance
3.3. Multiple Testing
4. Guidance on data analysis
4.1. Biological Relevance first
4.2. Using Interval Estimates
4.3. Robustness
Conclusions and recommendations
References
BACKGROUND AS PROVIDED BY EFSA

Standard toxicology and safety evaluation tests have been used for around 50 years for the identification of hazards and the subsequent assessment of risk associated with chemicals and agents in food and feed. A series of standard methods with specific guidelines for their conduct and interpretation has been built up over this time through the activities of international and national regulatory organisations (OECD 2007, EFSA 2009a). Considerable experience and expertise have developed in applying these approaches through, for instance, the work of the EFSA Panels. Nevertheless, despite the considerable experience in what is effectively a mature field of research, there is still debate at both the scientific and public level about the methods for conducting, analysing and interpreting such studies.

Although a battery of standard tests is available, there is increasing awareness that sophisticated risk assessment of individual chemicals and agents on a case-by-case basis may require specifically designed studies to identify mechanisms and modes of action. These raise important experimental design and statistical analysis issues, including ensuring that any bioinformatics method used is validated by conventional statistical approaches. Guidelines on how to interpret the results from these studies, highlighting key points, should be of value to EFSA Panels in their work.

A number of misunderstandings derive from the interpretation of the results of statistical tests and the reliance on the use of specific probability levels as criteria for either a positive or negative effect. Increasingly, many statisticians argue for much less reliance on significance testing and instead put more emphasis on the use of estimation, such as the use of confidence intervals (Gardner & Altman, 1986; Lee & Lovell, 2009; Sterne & Davey Smith, 2001). In the pharmaceutical industry, the concepts of bioequivalence / non-inferiority / superiority testing approaches have been developed. Such approaches applied to toxicology studies could help address the relationship between statistical significance and biological importance and help assess the amount of variability and uncertainty in the risk / benefit assessment process.
TERMS OF REFERENCE AS PROVIDED BY EFSA

The Scientific Committee is requested by EFSA to develop a series of short documents addressing statistical issues, in order to help EFSA Scientific Panels and Committee in the assessment of biologically relevant effects. When preparing this series of guidance, the Scientific Committee will take account of the specific interests of some of the Panels, e.g. statistical issues associated with animal health and welfare assessment, or in the assessment of a bio-agent which needs to enter, establish and spread in a food.

The Scientific Committee is initially requested to address by the end of March 2011 the issue of statistical significance vs. biological relevance: the definitions and interpretation of significance levels used in the assessment of study results, including aspects such as the use of multiple comparison methods.

Subsequently, the Scientific Committee will consider addressing statistical issues on further topics such as:

• Evaluation of the validity, quality and representativity of data used for EFSA assessments. Impact of the study design on the uncertainty associated with the assessments;
• Absence of evidence for an adverse effect vs. evidence of the absence of an adverse effect: consequences for the assessment;
• Freedom from disease concept: how to assess evidence for targeted prevalence of an infectious agent in regard to an estimated prevalence and the size of sampling;
• Suitability of equivalence testing for EFSA risk assessments (EFSA 2009b);
• Scientific evidence for interpreting uncertainty levels;
• Strengths and weaknesses of new molecular techniques capable of investigating ‘perturbations’ at low doses (NRC 2007);
• Strengths and weaknesses of the statistical, bioinformatics and computational biology tools used in the analysis, interpretation and assessment of complex endpoints, pathways and multivariate data.

In developing its guidance documents, the Scientific Committee is requested to take into account the experience gained in the field by EFSA, the three non-food Scientific Committees of the Commission (SCCP, SCHER and SCENIHR), together with that gained by other agencies and international organisations/associations including: EMEA, ECDC, US FDA, FAO/WHO JECFA, WHO/IPCS.
ASSESSMENT

1. Introduction
An assessment of the relevance of any scientific study for the work of EFSA is based upon a critical assessment of the evidence provided by such studies. Statistical analysis of the data is central to this assessment. Confusion in the use of words such as “significance”, “relevance” and “importance” can hinder the assessment. Scientific concepts and methods are thought in general to be objective and robust against misinterpretation. However, they need to be used with extreme care because the wrong wording, or inappropriate use of words in specific contexts, can lead to errors in the understanding, interpretation and communication of scientific results for decision making.

The scientific work of EFSA encompasses a wide and diverse range of disciplines, ranging, for instance, from the assessment of toxicological studies of additives and contaminants to epidemiological studies of zoonotic agents, the evaluation of genetically modified organisms, and surveys of dietary intakes and concentrations of residues in foodstuffs. Each of these areas has its own specific statistical issues, but the purpose of this document is to concentrate on relevance and significance, as these represent a basic underlying issue.

The objective of this document is to reinforce the correct use and understanding of the concepts of biological relevance and statistical significance when conducting risk assessments in EFSA. The main aim is to help non-statisticians with the interpretation of statistical results by addressing the limitations and weaknesses in the use of the term statistical significance and also to highlight potential misuses, particularly as this term is often confused with biological relevance. It should be stressed that statistical analysis is not limited to evaluating statistical significance but comprises other important statistical techniques including, among others, point estimation and the use of confidence intervals, which will also be addressed in this document.

The concepts of “significance” and “relevance” of an empirical finding are an example of this problem in scientific assessment, where one term may be erroneously used in place of the other. The concept of biological relevance implies a biological effect of interest that is considered important based on expert judgement. Its use refers to an effect of interest, or to the size of an effect, that is considered important and biologically meaningful and which, in risk assessment, may have consequences for human health. The objective of carrying out an empirical study is usually to identify the existence of relevant biological effects at the population level, using statistical tools to detect them. Therefore the identification of statistical significance is only part of the evaluation of biological relevance.

In contrast, the concept of “significance” has a specific and distinctive meaning when used in the context of statistical hypothesis testing. This meaning relates to where, for instance, a difference in concentrations (e.g. of a hazardous substance in two exposed populations) or a difference in proportions (e.g. of tumour-bearing individuals) is called statistically significant because it is unlikely to have occurred by chance alone. Significant, as in “statistically significant”, does not therefore necessarily mean “important” or “meaningful”, as it is sometimes misinterpreted, but is a statistical statement on the property and information content of the observed data.
It is considered a critical point that the statistical analyses commonly used make implicit assumptions about the data, including information on the design and conduct of a study. Although the mathematical calculations underlying the analyses can still be carried out independently of such information, doing so can lead to biased or unreliable results. No amount of statistical sophistication can rescue a badly designed study. It is recommended that statisticians are involved at an early stage of a study, e.g. at its design.
2. Context and definition of the problem

2.1. What is “biological relevance”?
A problem can arise in the interpretation of data from experimental or observational studies because standards to define the quantitative changes which designate biological relevance in a general context do not exist. This has resulted in the use of arbitrary or subjective criteria to define whether effects observed in a particular experiment or trial are considered important in biological terms. For example, in the field of toxicology, any type of histopathological change (e.g. tumours, adenomas, infiltrations, cellular modifications …) can be considered as biologically relevant. The specific definitions of biological relevance quoted in the literature differ based upon the particular biological field being considered and vary in the size of the effect, the context of the measurement used in the evaluation and the biological importance attributed to the findings (Lindgren et al., 1993). In the medical literature, the following definitions of “clinical relevance” illustrate the various ways in which biologically relevant effects can be defined:

• Hollon and Flick (1988) suggested that “the minimal unit of clinical relevance should be defined in terms of the smallest of reliable changes of interest to some, but not necessarily to all interested parties”;
• Lindgren et al. (1993) indicated that “when two treatment methods are compared, the smallest difference between therapies with respect to an important outcome variable that would result in a decision to modify treatment denotes clinical relevance”;
• LeFort (1993) mentioned that the term “clinical relevance” reflected “the extent of change, whether the change makes a real difference to subject lives, how long the effects last, consumer acceptability, cost-effectiveness and ease of implementation”;
• Killoy (2002) indicated that “clinical relevance is a subjective evaluation of relevance by the clinician and that before a finding can be clinically relevant, it must have achieved statistical significance”.
The following meaning of biological relevance is proposed for use by EFSA: a biologically relevant effect can be defined as an effect considered by expert judgement as important and meaningful for human, animal, plant or environmental health. It therefore implies a change that may alter how decisions for a specific problem are taken. This assumes that a “normal” biological state has been defined.

This is a subjective but expert judgement which depends on the specific situation as well as on the expert opinion of the investigator and other involved parties. This can be a difficult task, as scientists from different disciplines may differ in their definitions. It is important to emphasize that the size of an effect that would be considered biologically relevant should ideally be specified at the design stage and, certainly, before the decision making process starts. It is appreciated, though, that it may be difficult to define the size of effect that will be considered biologically relevant or important in every situation.
2.2. What is “statistical significance”?
Statistical significance is a measure of how likely an observed result could have occurred on the basis of a set of assumptions (Reese, 2004). Finding an unlikely result should therefore lead to a questioning of these assumptions. Results of statistical tests are evaluated with a view to what should be considered “significant” and how “non-significant” findings should be interpreted. Note that statistical significance is, thus, a technical term defined within the framework of statistical hypothesis testing. This framework depends upon concepts of statistical inference. The commonest concepts are the Neyman-Pearson, the Fisherian and the Bayesian paradigms. These concepts differ in their basic underlying assumptions. The widely used hypothesis testing paradigm described below is based upon a combination of the Neyman-Pearson and the Fisherian approaches and has not been without controversy in the statistical community since its introduction about 80 years ago. This document does not specifically address Bayesian methodologies.
2.2.1. The Hypothesis Testing Framework
Hypothesis testing is the rational framework for applying statistical tests. In general, hypothesis testing in experimental or observational studies is based on samples drawn from the population under investigation. For instance, individuals who are either exposed or not exposed to a hazardous chemical, for example through intake of a particular food, could be sampled, and the two groups of subjects could then be compared with respect to the proportions suffering adverse health effects after consumption. The comparison involves issues which need to be taken into account, such as the population variability of the health effect that is being measured, including any measurement errors. A conclusion is made about the difference between the two groups based upon what is observed in the samples and from what can be concluded using methods of statistical inference, taking the variability into account.

Table 1 shows the standard scheme for statistical testing, in which two kinds of errors may arise: (i) false positive errors, i.e. concluding that an effect has occurred when there really is none (Type I errors), and (ii) false negative errors, i.e. concluding that there is no effect when there really is one (Type II errors). These incorrect conclusions, which can arise when results from a study are analysed using hypothesis testing, are linked with the concept of statistical significance.

The hypothesis testing framework consists of testing a null hypothesis and an alternative hypothesis. Both hypotheses are statements that are either rejected or not rejected according to the outcome of a statistical test procedure applied to observed data.
• The null hypothesis in this framework usually corresponds to a general or default position. For example, the null hypothesis might be that there is no relationship between two measures on a series of individuals or that a potential treatment has no effect;
• The alternative hypothesis, in general, represents a statement that contradicts the null hypothesis.

As a consequence, the errors that can occur are:

• Type I error (α): rejecting the null hypothesis in favour of the alternative hypothesis when the null hypothesis is, in fact, true;
• Type II error (β): accepting the null hypothesis, as opposed to the alternative hypothesis, when the null hypothesis is, in fact, false.
In general, statistical tests aim to disprove or reject the null hypothesis. If an effect is found, the statement on its existence or its size is reported with a P-value, which is interpreted as the probability that an effect as large as, or larger than, the one observed would have occurred by chance alone when there is, in fact, no effect at all, i.e. when the null hypothesis is true.

Table 1: Summary table of the results obtained when a statistical hypothesis test (null hypothesis versus alternative hypothesis) is carried out. Two types of errors (Type I and Type II) can occur when making inference from a sample to the population.

                        Truth (population)
Decision (sample)       H0 True                   H0 False
Accept H0               No error (1-α) (c)        Type II error (β) (b)
Reject H0               Type I error (α) (a)      No error (1-β) = power (d)

(a): α = probability of rejecting the null hypothesis when it is true. The wrong conclusion is that the effect has occurred;
(b): β = probability of accepting the null hypothesis when it is false. The wrong conclusion is that the effect has not occurred;
(c): 1-α = probability of accepting the null hypothesis when it is true. The right conclusion is that the effect has not occurred;
(d): 1-β = probability of rejecting the null hypothesis when it is false. The right conclusion is that the effect has occurred.
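To make the scheme in Table 1 concrete, the sketch below is a purely hypothetical illustration (not part of the opinion): simulated data for an “exposed” and a “control” group are compared with a two-sample t-test in Python using SciPy, and the outcome is related to the decisions and error types above. The data, group sizes and significance level are assumed values.

```python
# Hypothetical illustration of the hypothesis-testing scheme in Table 1
# (simulated data; not taken from any study assessed by EFSA).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Simulated measurements for a control and an exposed group
control = rng.normal(loc=10.0, scale=2.0, size=30)
exposed = rng.normal(loc=11.0, scale=2.0, size=30)

alpha = 0.05  # significance level, chosen before the test is carried out

# H0: no difference between the population means; H1: a difference exists
t_stat, p_value = stats.ttest_ind(exposed, control)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")

if p_value < alpha:
    # Rejecting H0: the risk of this being a Type I error is bounded by alpha
    print("Statistically significant at the 5% level; H0 is rejected.")
else:
    # Not rejecting H0: a Type II error is possible if a true effect exists
    print("Not statistically significant; H0 is not rejected (but not thereby proved).")
```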
“Statistical power” is defined as the probability of finding the predefined effect (or an effect larger than it) statistically significant, if such an effect actually exists. It is calculated as 1-β. The power of a test is often set to 80%, which is generally considered to be an acceptable level and corresponds to a Type II error (β) of 0.2 (20%). A compromise between the sample size required for high power and the experimental resources available may be necessary.

Retrospective power analyses are analyses carried out after a study has finished. They calculate what the power of the study would have been if, given the sample sizes used, the effect size detected in the sample had been the true effect size in the population. Such methods are controversial, are not considered an acceptable methodology by many statisticians, and are therefore not recommended by the Scientific Committee.

The size of the effect should be defined during the design of the study and has been called the relevant effect size (Luus et al., 1989) (see Section 2.1). Note that the word “relevant” here relates to a size of effect which has been defined as biologically relevant. The larger the sample size of the study, the more likely it is to detect the biologically defined relevant effect as statistically significant: statistical power increases with the sample size, while all other parameters of statistical testing remain the same.

Statistical significance, when expressed by a P-value, relates to the probability of having obtained results as extreme as (or more extreme than) those observed, given that the null hypothesis H0 is true. Since the Type I error α denotes the probability of rejecting the null hypothesis when it is true, this level can be used as the cut-off point for significance testing (the threshold of significance), and this α value therefore indicates the significance level of the test. As mentioned previously, the significance level should be chosen before the statistical test is performed, and preferably at the design stage of the study. When the resulting P-value from the test is lower than the chosen α value, the test result may be termed significant, because the pattern of results obtained would be considered “a rare event” if the null hypothesis were, in fact, true.
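As a hypothetical illustration of how a pre-defined biologically relevant effect size drives the design, the sketch below applies the standard normal-approximation formula for a two-sided, two-sample comparison to estimate the number of subjects per group needed for 80% power. The relevant difference, standard deviation and error rates are assumed values for the example, not EFSA recommendations.

```python
# Hypothetical sample-size calculation for a pre-defined relevant effect size,
# using n per group = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2.
import math
from scipy.stats import norm

delta = 1.0   # biologically relevant difference, defined before the study (assumed value)
sigma = 2.0   # assumed standard deviation of the endpoint
alpha = 0.05  # Type I error rate (significance level)
power = 0.80  # desired power, i.e. Type II error beta = 0.20

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

n_per_group = 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2
print(math.ceil(n_per_group))  # about 63 subjects per group under these assumptions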
2.3. Relationship between Biological Relevance and Statistical Significance
Biological relevance and statistical significance are not necessarily linked. The usual definition of the word “significant” outside the specific field of statistics implies large size or great relevance. In the field of statistics, the meaning of the word “significant” does not necessarily imply large size or relevance (see Section 2.2.1). However, many scientific publications interpret “statistically significant” as a statement about the size, importance or biological relevance of the effect. Many researchers incorrectly conclude that any statistically significant effect is biologically relevant because it is supported by mathematics. This is clearly misleading.

In the interpretation of statistical analyses it should always be kept in mind that the biological relevance of any change found should be of primary importance in the assessment, rather than the specific level of statistical significance. In practice, the biological effect is estimated using a (statistical) point estimate, and its uncertainty is expressed via an interval estimate (e.g. a confidence interval). Using such point and interval estimates can help to focus the discussion on the biological relevance of the results.
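The sketch below (simulated data; the pre-defined relevant difference of 1.0 is an assumed value) illustrates this emphasis on estimation: the estimated difference and its 95% confidence interval can be compared directly with the biologically relevant difference, rather than reporting only whether P < 0.05.

```python
# Hypothetical illustration: report a point estimate and a 95% confidence
# interval for the effect, and compare them with a pre-defined relevant difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
control = rng.normal(loc=10.0, scale=2.0, size=30)
exposed = rng.normal(loc=11.5, scale=2.0, size=30)

relevant_difference = 1.0               # defined on biological grounds beforehand (assumed)
diff = exposed.mean() - control.mean()  # point estimate of the effect

# Standard error of the difference and a 95% CI (simple approximation for the
# degrees of freedom; a Welch correction could be used instead)
se = np.sqrt(exposed.var(ddof=1) / exposed.size + control.var(ddof=1) / control.size)
t_crit = stats.t.ppf(0.975, df=exposed.size + control.size - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"Estimated difference: {diff:.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Pre-defined relevant difference: {relevant_difference}")
```

The interval shows the range of effect sizes compatible with the data and whether that range includes or excludes the pre-defined biologically relevant difference, which is considerably more informative than a bare significance statement.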
2.4. Confirmatory studies vs. Exploratory studies
In some circumstances it may be difficult to define precisely what constitutes a biologically relevant effect. In such cases, exploratory studies may be designed. Exploratory studies may be restricted in their objective to the generation of data that suggest hypotheses for more intensive and targeted confirmatory experiments in the future. Whatever the type of study, its objectives should always be clearly and transparently stated. This scientific opinion, however, does not specifically address study design.
3. Errors in interpretation of Statistical Significance
Statistical significance is an often misused term when characterising an empirical effect, leading to errors in the interpretation of results. This section will discuss some of the more common errors.
3.1. “Absence of Evidence is not Evidence of Absence”
In Section 2 it was highlighted that an outcome of a statistical evaluation could be statistically significant but without any biological relevance. Moreover, “a far greater problem arises from misinterpretation of non-significant findings” (Altman and Bland, 1995; Alderson, 2004). Statistical significance in a statistical test may be absent for two reasons:

• there is, in fact, no relationship to detect (the null hypothesis is actually true); this situation is controlled by the error probability α (significance level);
• there is a relationship, but the study was not capable of detecting it, e.g. because the study design was poor or had low power, for instance because the sample size was not large enough to detect the effect.
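As a hypothetical illustration of the second reason, the simulation below (all numbers assumed) shows that when a real effect exists but the sample size is small, most replicate studies return a non-significant result; reading such results as evidence of absence would be wrong.

```python
# Hypothetical simulation: a true effect exists, but with only 10 subjects per
# group most replicate studies fail to reach statistical significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
alpha = 0.05
true_difference = 1.0   # a real effect is present in the population (assumed)
sigma = 2.0             # assumed standard deviation
n_per_group = 10        # deliberately small sample size, giving low power

n_sim = 10_000
n_significant = 0
for _ in range(n_sim):
    control = rng.normal(0.0, sigma, n_per_group)
    exposed = rng.normal(true_difference, sigma, n_per_group)
    p_value = stats.ttest_ind(exposed, control).pvalue
    n_significant += p_value < alpha

print(f"Proportion of significant results (empirical power): {n_significant / n_sim:.2f}")
# Roughly 80% of such studies would be non-significant despite the true effect:
# absence of evidence, not evidence of absence.
```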
“Absence of evidence” relates to the output of a statistical test where the null hypothesis is defined as the absence of a given effect. In this case the output is not significant, i.e. the P-value is above a prespecified threshold, conventionally 5% (P>0.05) or 1% (P>0.01). The null hypothesis cannot be rejected as the data fail to provide enough evidence to demonstrate that there is an effect. “Evidence of absence” relates to the output of a statistical test where the null hypothesis states that the effect is present. In this case, if the output of the statistical test is significant (e.g. P