Statistical Significance Testing in Science and in the Courts after Matrixx

David A. Gulley[1]
Introduction

Tests of the statistical significance of research results are widespread. They have been adopted worldwide in a variety of cultures and disciplines that often seem to have little else in common. These tests also make their way into courtrooms: no fewer than 1,140 Federal court rulings refer to “statistical significance,” and 3,040 refer to “statistically significant.” An overlapping 2,220 Federal court rulings refer to “95 percent.”[2] In fact, some of the earliest work in this intellectual vein was undertaken specifically with the justice system in mind.[3] Popular television portrays “forensic scientists” examining patterns and tell-tale signs in crime-scene evidence, much of which has been subjected to statistical tests in real life.

[1] Columbia University in the City of New York. The author can be contacted at [email protected]. This article originally appeared in a newsletter of the American Bar Association, February 2013.
[2] Data are results from a Google Scholar search of Federal case law, exact matches only, conducted in February 2012. Search results can vary from one trial to another.
[3] Jacob Bernoulli considered mathematical odds and the reliability of evidence in the context of legal inquiries in or around 1700.
[4] D. A. Gulley, 2012, “The Adoption of Significance Tests by the Scientific Community: An Empirical Analysis,” working paper available at: http://ssrn.com/abstract=2012659.

In a companion academic article, I studied the level of acceptance and rate of adoption of basic tests of statistical significance by the scientific community.[4] By approaching the topic empirically, I hoped to bring objective evidence to a question that is usually treated more subjectively. The results may be useful to attorneys in Frye, Daubert, and related hearings on the admissibility of expert evidence, and the Supreme Court’s ruling in Matrixx Initiatives makes the subject of statistical significance worth revisiting more generally. Of course there is more to science and to legal investigations than significance testing, and this gives rise to some confusion that we will turn to in a moment. If anything, however, the evaluation of the statistical reliability of empirical results (and especially projections) remains underemployed in many financially oriented lawsuits. When an accountant or appraiser divides one number by another to form a ratio, or “averages” several disparate value estimates, the statistical properties of the resulting estimate are rarely inspected with care. For example, if the inherent mathematical, error-propagation properties of DCF analysis were better understood, it might transform the evaluation of valuations.
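The error-propagation point about DCF analysis can be made concrete with a short sketch. Everything in it, including the cash flow, the discount and growth rates, and the assumed half-percentage-point input uncertainty, is hypothetical and purely for illustration:

```python
import random

def dcf_value(cash_flow, discount, growth):
    # Gordon growth perpetuity: V = CF / (r - g)
    return cash_flow / (discount - growth)

# Point estimate with illustrative inputs: $100 cash flow, 10% discount, 3% growth.
point = dcf_value(100.0, 0.10, 0.03)

# Propagate modest input uncertainty by Monte Carlo: jitter r and g by an
# assumed 0.5 percentage points each and look at the spread of valuations.
random.seed(0)
draws = [dcf_value(100.0, random.gauss(0.10, 0.005), random.gauss(0.03, 0.005))
         for _ in range(10_000)]
mean = sum(draws) / len(draws)
sd = (sum((v - mean) ** 2 for v in draws) / len(draws)) ** 0.5
print(round(point, 1), round(sd, 1))
```

Because the valuation depends on the small difference r - g, modest uncertainty in the inputs produces a disproportionately wide band of defensible values; a statistical treatment makes that band explicit, whereas a single point estimate conceals it.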
Significance Testing and Expert Testimony

Since my academic research is available elsewhere, I will merely summarize the conclusions here.[5] There are several important points for attorneys working with experts. By way of background, the basic modern statistical tests of reliability consist of “significance tests” on one hand and “hypothesis tests” or “confidence tests” on the other. These methods were developed in the 1920s and 1930s by two rival groups of scholars. My data show that significance testing (as a collective term for both methods) has been adopted by a majority of scientific practitioners for over 50 years, and today is virtually universal.[6] For a decade or more following the introduction of confidence tests, a point of view persisted that one or the other of these must be superior in principle and/or in practice, a divide sharpened by the personal animosities of the leading proponents of each view.[7] By the 1950s, however, the data show that both methods were widely used, often together. Prior to the 1980s, a researcher’s disposition to adopt one method had little bearing on whether results from the other were also reported; in more recent decades, a researcher who adopted either one would often report the other set of measurements as well. Neither side of this once-spirited debate “won” the argument, and it is largely a non-issue for practical purposes today.[8]

[5] For reasons discussed in the research paper, the study focused on the publications of natural scientists in the oldest and arguably most prestigious English-language scientific journal, the Philosophical Transactions of the Royal Society.
[6] Some academics worry that the methods are misunderstood, misinterpreted, and misapplied, but no one disputes that significance testing is widely used and a standard scientific method.
[7] R. A. Fisher for significance testing; Jerzy Neyman and Egon Pearson for confidence intervals. The unfortunate antagonism originated with Egon Pearson’s father’s high-handed treatment of Prof. Fisher.
[8] I would like to thank Prof. George Easton and Justin Regus for their kind assistance in running the tests necessary to reach these conclusions. Discussions of the conceptual advantages of one method over another continue today, but this somewhat philosophical debate does not appear to influence practicing empirical researchers.

The importance of this research result for attorneys is, first, that there is no real question whether significance testing is accepted by the community of scientists and employed in peer review wherever possible. It is. Second, technical debates between opposing experts about the superiority of one method over another would only rarely be germane to the issues being tested.

A related issue concerns the numerical value, the threshold or critical level of the test statistic, used to conclude that results are “significant.” As the reader may know, this is ordinarily set at 95% (or p = 0.05). Where does this come from? Is it really the standard? The economist F. Y. Edgeworth used a 95% threshold, which he termed “significant,” in his lectures to the Royal Society in 1885, though his methods predated modern significance testing. A p value of 0.05 was first recommended by the founder of significance testing, R. A. Fisher, in correspondence with a researcher, and it soon made its way into print in his various writings. All this occurred in the 1920s, but there were antecedents, and there are at least two intuitively appealing arguments in favor of that number. The oldest, present in the work of Sir Isaac Newton at the Royal Mint beginning around 1696, is that one chance in twenty is about the lowest sensible threshold for tolerating an error in measurement.[9] The more modern is that 95/5% corresponds to roughly two standard deviations (a reference enshrined in case law, by the way) and therefore has natural appeal to statistically trained readers.[10] However, Fisher, who invented statistical significance testing, also warned that no single number could constitute an absolute bright line to be used in all times and places (a sentiment SCOTUS would surely share). Textbooks and reference books are seldom very helpful in this regard, maintaining that it is a choice for the investigator to make. Some scientists simply report p-levels and standard errors numerically and leave the conclusion to the reader, but this is often a bit too ambiguous. In science, it is common to see stricter levels of 99% and even 99.9% highlighted by researchers.
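The relationship between the two families of tests, and the role of the 1.96 cutoff, can be seen in a few lines of Python. The measurements below are made up for illustration, and the normal approximation is used in place of the more careful small-sample t distribution:

```python
import math

data = [2.1, 2.4, 1.9, 2.6, 2.3, 2.0, 2.5, 2.2, 2.4, 2.1]  # illustrative measurements
n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
se = sd / math.sqrt(n)  # standard error of the mean

# Significance-test style (Fisher): a p-value against a hypothesized mean of 2.0.
z = (mean - 2.0) / se
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Confidence-interval style (Neyman-Pearson): mean plus or minus 1.96 standard errors.
ci = (mean - 1.96 * se, mean + 1.96 * se)
```

The two summaries agree by construction: the p-value falls below 0.05 exactly when the 95% interval excludes the hypothesized value, which is one reason practicing researchers so often report both without taking sides.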
In courts, the question is sometimes asked whether the number could be lower, perhaps 90% or even less. It is with regard to this issue that my research is perhaps most interesting. In over 500 published articles there was exactly one reference to a test threshold lower than 95% (namely 90%), and when 99% could be highlighted, it often was. The 95% level may be the offspring of an obscure birth and childhood, but empirically it is without doubt the established and, in the literature examined in my study, virtually universal practice.
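For readers who want to see where these percentages sit on the normal curve, the mapping between confidence levels and test-statistic cutoffs takes only a few lines (the z values are standard normal quantiles; nothing here is drawn from the study itself):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sided_p(z):
    # Two-sided p-value for a z statistic.
    return 2.0 * (1.0 - normal_cdf(abs(z)))

# The 90 / 95 / 99 / 99.9 percent ladder and its familiar cutoffs.
for z in (1.645, 1.960, 2.576, 3.291):
    print(f"z = {z:.3f}  ->  confidence ~ {1 - two_sided_p(z):.3f}")
```

A court asking whether 90% could suffice is, in effect, asking whether a cutoff of about 1.645 standard errors should replace 1.96; the empirical answer from the published literature is that practice overwhelmingly stops at 1.96 or goes higher.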
[9] One-in-ten seems unacceptably risky to risk-averse people for important subjects. The next higher round number is arguably one-in-fifty, which was actually advocated by Jacob Bernoulli (mentioned earlier) as meeting the standard of “moral certainty” of being correct. This level, equivalent to a 98/2% standard, never caught on, and even Bernoulli found it impractically high in some of his empirical work. Other levels were suggested in the years before modern testing, but the point is that 95/5% resonated and other levels did not.
[10] More formally, 1.96 times the standard error of the estimate.

Significance Testing and Materiality

Materiality is an important subject in securities fraud and other cases, and over the years there has been regular progress in analyzing this issue scientifically and introducing the results in courtrooms. The SCOTUS decision in Matrixx did not repudiate significance testing, but it appears to have caused some confusion that should be cleared up. The confusion arises from the difference between the potential materiality of corporate announcements whose content involves scientifically inconclusive data, and the use of scientific methods to analyze, in retrospect, whether a given announcement really was material to the stock market. These two issues are distinctly different, yet both involve the terminology of statistical significance and materiality. SCOTUS was concerned with the former. Ironically enough, the Matrixx decision confirms the Court’s abiding interest in stock market reactions to news, which is the more common application of significance testing.

When public companies and their auditors consider the materiality of potential disclosures, they do so with no hard evidence of how the market will react to the news, because that event still lies in the future. Accounting principles, much like the legal doctrine established in Basic and elsewhere, and SEC practice, eschew bright lines and may consider, but do not rely upon, rules of thumb. The situation changes in the courtroom, because by the time of trial the market’s reaction to the news is a matter of record. That reaction is still ambiguous, because there can be many influences on share price movements, and the market does not necessarily react to news the way the company, auditors, and regulators might have anticipated ex ante. The impact of the news on the market’s valuation of the company can nevertheless be evaluated using statistical and econometric methods. The oldest and best known of these is the so-called “event study.” As the reader may well know, the usefulness of event studies has been recognized by the courts in many cases.
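To show the mechanics rather than merely name the method, here is a minimal market-model event study. Every return figure is invented for illustration (the event-day drop merely echoes the magnitude of decline discussed in Matrixx), and real litigation work uses far longer estimation windows and additional adjustments:

```python
import math

# Estimation window: hypothetical daily returns for the market and the stock.
market = [0.010, -0.004, 0.006, 0.002, -0.008, 0.012, 0.000, -0.006, 0.004, 0.008]
stock  = [0.014, -0.0048, 0.0082, 0.0054, -0.0106, 0.0164, 0.000, -0.0062, 0.0068, 0.0096]

# Fit the market model stock_t = alpha + beta * market_t by ordinary least squares.
n = len(market)
mx, my = sum(market) / n, sum(stock) / n
beta = (sum((x - mx) * (y - my) for x, y in zip(market, stock))
        / sum((x - mx) ** 2 for x in market))
alpha = my - beta * mx

# Residual standard deviation: the stock's "normal" firm-specific volatility.
resid = [y - (alpha + beta * x) for x, y in zip(market, stock)]
sigma = math.sqrt(sum(e * e for e in resid) / (n - 2))

# Event day: the market is flat but the stock falls sharply (made-up numbers).
event_market, event_stock = 0.002, -0.116
abnormal = event_stock - (alpha + beta * event_market)  # abnormal return
t_stat = abnormal / sigma  # ignoring the small out-of-sample forecast correction

# |t| far beyond 1.96 means the drop cannot plausibly be ordinary volatility.
print(abs(t_stat) > 1.96)
```

The logic mirrors what courts have accepted: the market model predicts the return the stock "should" have had given the market's move, and the significance test asks whether the shortfall is too large to attribute to noise.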
More recently, a body of empirical literature referred to as “value relevance” research has become established, and for certain types of investigations it offers additional, powerful analytical results. I have introduced “value relevance” results (and other less common scientific methods) in several securities fraud and white-collar crime cases, where they appear to have been helpful to the triers of fact. Since these methods are scientific, they are evaluated scientifically, and that means they are amenable to significance testing. The SCOTUS ruling in Matrixx seems clear enough: “This is not to say that statistical significance (or the lack thereof) is irrelevant -- only that it is not
dispositive of every case.”[11] In point of fact, the majority opinion carefully notes the share price movements when news and corporate announcements were made. It notes the decline from $13.55 to $11.97, a drop of 11.6%, following a January 20, 2004 press report about an FDA inquiry and product liability lawsuits. It notes the price recovery following a corporate press release, and the subsequent “plummet” of the share price to $9.94 following a national television news story. To what extent, if any, were these carefully noted price movements due to the natural, underlying volatility of the share price, or to unrelated shifts of stock market mood? Such questions are answered through event studies, which establish the statistical significance of the price movement. The majority ruling thus shows the Court’s continuing interest in such price movements as a guide to materiality. In Matrixx itself, by contrast, the majority was commenting on statistical significance in a quite different application: whether the frequency of adverse event reports had to be statistically significant to be material. Since virtually all therapies are associated with a nonzero number of such reports, physicians, regulators, and companies seek evidence (statistical and otherwise) as to whether the true causes of these adverse events might lie in unobserved factors rather than in the therapeutic regime. A number of specialists in the health sciences are disappointed that the ruling did not clarify the disclosure requirements for their industry, but that is a separate issue from whether SCOTUS was taking on significance testing more generally. It clearly was not. Finally, there is the issue of semantics.
In Matrixx, SCOTUS quotes a famous passage from Basic: “In Basic, we held that this materiality requirement is satisfied when there is ‘a substantial likelihood that the disclosure of the omitted fact would have been viewed by the reasonable investor as having significantly altered the “total mix” of information made available.’”[12] It later also refers to information that “would otherwise be considered significant to the trading decision of a reasonable investor.”[13] Significance testing is a powerful tool in evaluating the “total mix” of information, but that does not mean that tests of statistical significance use the word “significant” with exactly the meaning it carries in these passages. When the phrase “statistical significance” was coined, the intended meaning was that the statistical measure signified a given result, i.e., it conveyed a conclusion.

[11] Matrixx Initiatives, Inc., et al. v. James Siracusano, et al., Westlaw imprint, 2011 WL 977060 (U.S.), p. 13.
[12] 2011 WL 977060 (U.S.), p. 11, quoting 485 U.S., at 231-32, 108 S.Ct. 978, emphasis added.
[13] 2011 WL 977060 (U.S.), p. 11, quoting 485 U.S., at 236, 108 S.Ct. 978, emphasis added.

The more common meaning of the word today, that something is significant if it is important, was only the secondary connotation in the late 1800s.[14] Ordinarily, of course, a scientist is interested in statistically significant results that are also important to the body of scientific knowledge, but there can be some confusion as to whether statistical significance is a direct test of the importance of the phenomenon tested. Of course, attorneys must often link the two. This confusion lies at the root of the seeming ambiguity of Matrixx: statistically insignificant adverse events can be significant to the stock market, but that does not mean that significance tests should not be applied to the stock price movements themselves. Significance testing is here to stay in the sciences and, I believe, in the courts.

[14] The 1888 and contemporary editions of the Oxford English Dictionary reverse the order of these two meanings. For more on this, see David Salsburg, 2001, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, New York: Henry Holt & Co.