Psychology, Public Policy, and Law 1996, Vol. 2, No. 2, 324-347
Copyright 1996 by the American Psychological Association, Inc. 1076-8971/96/$3.00
CUMULATIVE RESEARCH KNOWLEDGE AND SOCIAL POLICY FORMULATION: The Critical Role of Meta-Analysis

John E. Hunter, Michigan State University
Frank L. Schmidt, University of Iowa
For many years, policymakers expressed increasing frustration with social science research. On every issue there were studies arguing for diametrically opposed conclusions. Methods of meta-analysis that correct for the effects of sampling error have shown that almost all such conflicting results were caused by sampling error. Furthermore, the effects of sampling error are greatly exaggerated by using significance test methodology. In many areas, meta-analysis has now provided dependable answers to the original research questions. Meta-analysis is now increasingly being used by policymakers, by textbook writers, and by theorists to provide the basic facts needed to draw both practical and explanatory conclusions. Sophisticated meta-analysis procedures are now used to correct for the effects of other study imperfections, such as measurement error, range restriction, and artificial dichotomization. In domains where the data on artifacts are available, the effect sizes in necessarily imperfect studies have been found to be considerably understated. Path analysis can be applied to the findings from meta-analysis to yield improved causal analyses that result in both explanation of results and improved generalization of results to new settings.

John E. Hunter, Department of Psychology, Michigan State University; Frank L. Schmidt, Department of Management and Organizations, University of Iowa. Correspondence concerning this article should be addressed to Frank L. Schmidt, College of Business, University of Iowa, Iowa City, Iowa 52242. Electronic mail may be sent via Internet to [email protected].

The purpose of this article is to show that meta-analysis is critical to the formulation of effective social policy. Many public policy questions and issues are obviously related to the subject matter of the behavioral and social sciences. Examples include racial integration in schools, crime, juvenile delinquency, employment discrimination, and drug abuse. Policy recommendations should be based on facts and knowledge, and it is the purpose of empirical scientific research to establish facts and knowledge. So it has been quite natural that those responsible for analyzing public policy options and making policy recommendations have turned to behavioral and social science research for the factual foundations they need for policy.

Over the years, however, many policymakers have been disappointed with the assistance they have obtained from behavioral and social science research. The major problem has been that different research studies on the same question have reported conflicting and contradictory findings and conclusions, leaving policymakers puzzled as to which findings were correct and which should be chosen as the foundation for policy recommendations. For example, in an invited address at the 78th Annual Convention of the American Psychological Association, then-Senator Fritz Mondale made the following statement:

What I have not learned is what we should do about these problems. I had hoped to find research to support or to conclusively oppose my belief that quality integrated
education is the most promising approach. But I have found very little conclusive evidence. For every study, statistical or theoretical, that contains a proposed solution or recommendation, there is always another, equally well documented, challenging the assumptions or conclusions of the first. No one seems to agree with anyone else's approach. But more distressing: no one seems to know what works. As a result I must confess, I stand with my colleagues confused and often disheartened. (Mondale, 1970, p. 8)

Mondale gave his address in 1970. Things were to get worse before they got better. By the middle or late 1970s the behavioral and social sciences were in serious trouble in the United States. Large numbers of studies had accumulated on many questions that were important not only to theory development but also to social policy decisions. Results of different studies on the same question typically were conflicting. For example, are workers more productive when they are satisfied with their jobs? The studies did not agree. Do students learn more when class sizes are smaller? The studies did not agree. Does participative decision making in management increase productivity? Does job enlargement increase job satisfaction and output? Does psychotherapy really help people? The studies did not agree.

As a consequence, the public and government officials became increasingly disillusioned with the behavioral and social sciences, and it became more and more difficult to obtain funding for research. Finally, in 1981 then-Director of the federal Office of Management and Budget David Stockman proposed an 80% reduction in federal funding for research in the behavioral and social sciences. Such proposed cuts are typically trial balloons sent up to see how much political opposition they arouse. Even when proposed cuts are much smaller than a draconian 80%, constituencies can be counted on to come forward and dramatically protest the proposed cuts. This usually happens, and many behavioral and social scientists sat back and waited for it to happen. Nothing happened. The behavioral and social sciences, it turned out, had no constituency among the public; the public did not care (see "Cuts Raise New Social Science Query," 1981). Finally, out of desperation, the American Psychological Association took the lead in forming the Consortium of Social Science Associations to lobby against the proposed cuts. Although this super-association had some success in getting these cuts reduced (and, even in some areas, getting small increases in research funding in subsequent years), these developments should make psychologists look carefully at how such a thing could happen.

The sequence of events that led to this state of affairs has been much the same in one research area after another. First, there is initial optimism about using social science research to answer socially important questions. Do government-sponsored job training programs work? We will do studies to find out. Does Head Start really help disadvantaged kids? The studies will tell us. Does integration increase the school achievement of Black children? Research will provide the answer. Next, several studies on the question are conducted, but the results are conflicting. There is some disappointment that the question has not been answered, but policymakers--and people in general--are still optimistic.
They, along with the researchers, conclude that more research is needed to identify the supposed interactions (moderators) that have caused the conflicting findings. For example, perhaps whether job training works depends on the age and education of the trainees. Maybe smaller classes in the schools are beneficial only for lower IQ
children. It is hypothesized that psychotherapy works for middle-class but not lower-class patients. That is, the conclusion at this point is that a search for moderator variables is needed.

In the third phase, a large number of research studies are funded and conducted to test these moderator hypotheses. When they are completed, there is now a large body of studies, but instead of the conflicts being resolved, their number increases. The moderator hypotheses from the initial studies are not borne out. No one can make much sense out of the conflicting findings. Researchers conclude that the phenomenon that was selected for study in this particular case has turned out to be hopelessly complex. They then turn to the investigation of another question, hoping that this time the question will turn out to be more tractable. Research sponsors, government officials, and the public become disenchanted and cynical. Research funding agencies cut money for research in this area and in related areas. After this cycle has been repeated enough times, social and behavioral scientists themselves become cynical about the value of their own work, and they publish articles expressing doubts about whether behavioral and social science research is capable in principle of developing cumulative knowledge and providing general answers to socially important questions (e.g., see Cronbach, 1975; Gergen, 1982; Meehl, 1978).

Clearly, at this point there is a critical need for some means of making sense of the vast number of accumulated study findings. Fortunately, starting in the late 1970s new methods of combining findings across studies on the same subject were developed. These methods were referred to collectively as meta-analysis. Applications of meta-analysis to accumulated research literatures have now shown that research findings are not nearly as conflicting as had been thought and that useful and sound general conclusions can in fact be drawn from existing research. Cumulative knowledge is possible in the behavioral and social sciences, and socially important questions can be answered in reasonably definitive ways. As a result, the gloom and cynicism that have enveloped many in the behavioral and social sciences is lifting.

This article is not an extended presentation of meta-analysis methods. Such presentations can be found elsewhere (e.g., see Hedges & Olkin, 1985; Hunter & Schmidt, 1990b; Hunter, Schmidt, & Jackson, 1982; Schmidt, 1992). This article provides an overview of the logic of meta-analysis and examples of the relevance of meta-analysis findings for policy formulation.

A key point in understanding the effect that meta-analysis has had is that the illusion of conflicting findings in research literatures resulted mostly from the traditional reliance of researchers on statistical significance testing in analyzing and interpreting data in their individual studies (Cohen, 1994). These statistical significance tests typically had low power to detect existing relationships. Yet the prevailing decision rule has been that if the finding was statistically significant, then a relationship existed; and if it was not statistically significant, then there was no relationship (Oakes, 1986; Schmidt, 1996). For example, suppose that the population correlation between a certain familial condition and juvenile delinquency is .30. That is, the relationship in the population of interest is ρ = .30.
Now suppose 50 studies are conducted to look for this relationship, and each has statistical power of .50 to detect this relationship if it exists. (This level of statistical power is typical of many research literatures.) Then approximately 50%
of the studies (25 studies) would find a statistically significant relationship; the other 25 studies would report no significant relationship, and this would be interpreted as indicating that no relationship existed. The researchers in these 25 studies would most likely incorrectly state that because the observed relationship did not reach statistical significance, it probably occurred merely by chance. Thus, half the studies report that the familial factor was related to delinquency and half report that it had no relationship to delinquency--a condition of maximal apparent conflicting results in the literature. Of course, the 25 studies that report that there is no relationship are completely wrong: The relationship exists and is always ρ = .30.

Traditionally, however, researchers did not understand that a statistical power problem such as this was even a possibility, because they did not understand the concept of statistical power. In fact, they believed that their error rate was no more than 5% because they used an alpha level (significance level) of .05. But the 5% is just the Type I error rate (the alpha error rate)--the error rate that would exist if the null hypothesis were true and in fact there was no relationship. They overlooked the fact that if a relationship did exist, then the error rate would be 1.00 minus the statistical power (which here is .50). This is the Type II error rate: the probability of failing to detect the relationship that exists. If the relationship does exist, then it is impossible to make a Type I error; when a relationship is really there, concluding that it exists cannot be a false conclusion. Only Type II errors can occur.

Now suppose these 50 studies were analyzed by means of meta-analysis. Meta-analysis would first compute the average r across the 50 studies; all rs would be used in computing this average regardless of whether they were statistically significant or not. This average should be very close to the correct value of .30, because sampling errors on either side of .30 would average out. So meta-analysis would lead to the correct conclusion that the relationship is on the average ρ = .30.

Meta-analysis can go beyond this; it can estimate the real variability of the relationship across studies. To do this, one first computes the variance of the 50 observed rs, using the ordinary formula for the variance of a set of scores. One next computes the amount of variance expected solely from sampling error, using the formula for the sampling error variance of the correlation coefficient. This sampling error variance is then subtracted from the observed variance of the rs; after this subtraction, the remaining variance should be approximately zero. Thus the clear conclusion is that all of the observed variability of the rs across the 50 studies is due merely to sampling error and does not reflect any real variability in the true relationship. Thus one would conclude correctly that the real relationship is always .30--and not merely .30 on the average.

This simple example illustrates two critical points. First, the traditional reliance on statistical significance tests in interpreting studies leads to false conclusions about what the study results mean; in fact, the traditional approach to data analysis makes it impossible to reach correct conclusions (Hunter & Schmidt, 1990b; Schmidt, 1996). Second, by contrast, meta-analysis leads to the correct conclusions about the real meaning of research literatures.
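The arithmetic of this hypothetical example is easy to reproduce. The following sketch is our own illustration, not code from any of the studies discussed: it assumes bivariate normal data and a per-study sample size of 43, which yields power of roughly .50 for ρ = .30 at the .05 level, and it uses the sampling error variance formula (1 - r̄²)²/(N - 1) referred to above. It first counts how many of 50 simulated studies reach significance and then performs the meta-analytic computations just described.

    # Illustrative sketch only: 50 studies, true rho = .30, power of about .50 per study.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    rho, k, n = 0.30, 50, 43
    cov = [[1.0, rho], [rho, 1.0]]

    rs, ps = [], []
    for _ in range(k):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        r, p = stats.pearsonr(x, y)
        rs.append(r)
        ps.append(p)
    rs = np.array(rs)

    # Traditional interpretation: about half the studies come out "significant"
    print(sum(p < 0.05 for p in ps), "of", k, "studies reach p < .05")

    # Meta-analytic interpretation: average the rs, then subtract sampling error variance
    mean_r = rs.mean()
    var_obs = rs.var()                            # observed variance of the 50 rs
    var_samp = (1 - mean_r**2) ** 2 / (n - 1)     # expected sampling error variance
    print("mean r =", round(mean_r, 2), "; residual variance =", round(var_obs - var_samp, 3))

With these assumptions the mean r comes out near .30 and the residual variance near zero, which is the pattern described in the text.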
These principles are illustrated and explained in much more detail in Hunter and Schmidt (1990b); for a shorter treatment, see Schmidt (1996). The reader might reasonably ask what statistical methods researchers should use in analyzing and interpreting the data in their individual studies. If reliance on
statistical significance testing leads to false conclusions, what methods should researchers use? The answer is point estimates of effect sizes (correlations and d values) and confidence intervals. The many advantages of point estimates and confidence intervals are discussed in Hunter and Schmidt (1990b) and Schmidt (1996).

Our example here has examined only the effects of sampling error variance and low statistical power. There are other statistical and measurement artifacts that cause artifactual variation in effect sizes and correlations across studies, for example, differences between studies in measurement error, range restriction, and dichotomization of measures. Also, in meta-analysis, mean correlations (and effect sizes) must be corrected for downward bias due to such artifacts as measurement error and dichotomization of measures. There are also artifacts such as coding or transcriptional errors in the original data that are difficult or impossible to correct for. These artifacts and the complexities involved in correcting for them are beyond the scope of this article but are covered in detail in Hunter and Schmidt (1990a, 1990b) and Schmidt and Hunter (1996). The purpose here is only to explain why traditional data analysis and interpretation methods must logically lead to erroneous conclusions and to demonstrate that meta-analysis can solve this problem and provide correct conclusions.

A very common reaction to the above critique of traditional reliance on significance testing goes something like this: "Your explanation is clear, but what I don't understand is this: How could so many researchers (and even some methodologists) have been so wrong for so long on a matter as important as the correct way to analyze data? How could psychologists and others have failed to see the pitfalls of significance testing?" Over the years, a number of methodologists have attempted to address this question (Carver, 1978; Cohen, 1994; Guttman, 1985; Meehl, 1967; Oakes, 1986; Rozeboom, 1960). For one thing, in their statistics classes young researchers have typically been taught a lot about Type I error and very little about Type II error and statistical power. Thus they are unaware that the error rate is very large in the typical study; they tend to believe the error rate is the alpha level used (typically .05 or .01).

Empirical research has suggested, however, that most researchers believe that the use of significance tests provides them with many nonexistent benefits in understanding their data. For example, most researchers believe that a statistically significant finding is a "reliable" finding in the sense that it will replicate if a new study is conducted. Specifically, they believe that if a result is significant at the .05 level, then the probability of replication in subsequent studies (if conducted) is .95. This belief is completely false. The probability of replication is the statistical power of the study and is almost invariably much lower than .95 (e.g., typically .50 or less). Most researchers also falsely believe that if a result is nonsignificant, one can conclude that it is probably just due to chance, and they regard this as valuable information. There are other widespread but completely false beliefs about the usefulness of information provided by significance tests (Carver, 1978). A recent discussion of these beliefs can be found in Schmidt (1996).
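To make the recommended alternative concrete, the sketch below (our illustration; the particular r and N are hypothetical) reports a point estimate of a correlation together with a 95% confidence interval computed through the standard Fisher z transformation, rather than a significant/nonsignificant verdict.

    import numpy as np
    from scipy import stats

    def r_confidence_interval(r, n, level=0.95):
        """Confidence interval for an observed correlation via the Fisher z transformation."""
        z = np.arctanh(r)                        # Fisher z of the observed r
        se = 1.0 / np.sqrt(n - 3)                # standard error of z
        z_crit = stats.norm.ppf(0.5 + level / 2.0)
        return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

    # A hypothetical single study: observed r = .30 with N = 43
    lo, hi = r_confidence_interval(0.30, 43)
    print(f"r = .30, 95% CI = ({lo:.2f}, {hi:.2f})")   # roughly (.00, .55)

The wide interval conveys exactly what a significance test hides: a single study of this size estimates the relationship very imprecisely, and only the accumulation of studies, combined through meta-analysis, can narrow the answer.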
During the 1980s and accelerating up to the present, the use of meta-analysis to make sense of research literatures has increased dramatically, as anyone who reads psychology research journals knows. For example, Lipsey and Wilson (1993) found more than 350 meta-analyses of experimental studies of treatment
effects alone; the total number is many times larger, because most meta-analyses are conducted on correlational data (as was our hypothetical example above). The overarching meta-conclusion from all these efforts is that cumulative, generalizable knowledge in the behavioral and social sciences is not only possible but is increasingly a reality. In fact, meta-analysis has even produced evidence that cumulativeness of research findings in the behavioral sciences is probably as great as in the physical sciences. Psychologists have long assumed that their research studies are less replicable than those in the physical sciences. Hedges (1987) used meta-analysis methods to examine variability of findings across studies in 13 research areas in particle physics and 13 research areas in psychology. Contrary to common belief, his findings showed that there was as much variability across studies in physics as in psychology. Furthermore, he found that the physical sciences used methods to combine findings across studies that were "essentially identical" to meta-analysis. The research literature in both areas--psychology and physics--yielded cumulative knowledge when meta-analysis was properly applied. Hedges's major finding is that the frequency of conflicting research findings is probably no greater in the behavioral and social sciences than in the physical sciences. The fact that this finding has been so surprising to many social scientists points up two facts. First, psychologists' reliance on significance tests has caused the research literatures to appear much more inconsistent than they are. Second, researchers have long overestimated the consistency of research findings in the physical sciences. In the physical sciences also, no research question can be answered by a single study, and physical scientists must use meta-analysis to make sense of their research literature, just as psychologists do. Another fact is very relevant at this point: The hard sciences, such as physics and chemistry, do not use statistical significance testing in interpreting their data. It is no accident, then, that these sciences have not experienced the debilitating problems described earlier that are inevitable when researchers rely on significance tests. Given that the hard sciences regard reliance on significance testing as unscientific, it is ironic that so many psychologists defend the use of significance tests on grounds that such tests are the objective and scientifically correct approach to data analysis and interpretation. In fact, it has been our experience that psychologists and other behavioral scientists who attempt to defend significance testing usually equate null hypothesis statistical significance testing with scientific hypothesis testing in general. They argue that hypothesis testing is central to science and that the abandonment of significance testing would amount to an attempt to have a science without hypothesis testing. The fact that they believe that null hypothesis significance testing and hypothesis testing in science are in general one and the same thing is a tribute to the persuasive impact of R. A. Fisher's writings (R. A. Fisher, 1932, 1935, 1973). This belief is tantamount to stating that physics, chemistry, and the other hard sciences are not really legitimate sciences at all because they are not built on hypothesis testing. 
It also equates to a statement that before the introduction of null hypothesis significance testing by Fisher (1932) in the 1930s, no legitimate scientific research was possible. The fact is, of course, that there are many ways to test scientific hypotheses--and significance testing is one of the least effective methods of doing this. Lawyers, journalists, and other non-science professionals are even more likely to equate statistical significance testing with "real" science. For example, many
law review journal articles on employment testing and civil rights laws reflect this orientation.

Perhaps the area in which the use of meta-analysis is expanding most rapidly is medical research (Baum et al., 1981; Halvorsen, 1986). For example, in 1986 the Clinical Trials Branch of the U.S. National Heart, Lung, and Blood Institute, along with the National Cancer Institute, sponsored a conference on ways of developing new meta-analysis methods and having them applied more widely in medical research (Yusuf, Simon, & Ellenberg, 1987). Sacks, Berrier, Reitman, Ancona-Berk, and Chalmers (1987), in the New England Journal of Medicine, examined the meta-analyses published up to 1987 in the medical research literature (86 at that time) and discussed the future of meta-analysis in medical research. Since then meta-analysis has played an increasingly important role in medical research. From 1989 to the present, more than 100 meta-analyses per year have been reported in the medical literature.

The increasing use of meta-analysis has produced important changes in research and publication patterns in psychology. The relative status of reviews has changed dramatically. Journals that formerly published only primary studies and refused to publish reviews are now publishing numerous meta-analytic reviews. In the past, research reviews were based on the narrative-subjective method; they had limited status and earned their authors little credit in academic raises or promotions. Perhaps this was appropriate because such reviews rarely contributed to cumulative knowledge. The information integration task involved in using the narrative method was too complex for the unaided human mind. Furthermore, many of these reviews were highly subjective. Relying on various personal and subjective theories and beliefs about methodological quality, reviewers often excluded all but a small number of studies as "methodologically inadequate" and then based their reviews on only the remaining few studies. Perhaps it is understandable, then, that the rewards went not to the reviewers but to those who conducted the individual empirical studies. The same is true in medical research (Baum et al., 1981; Halvorsen, 1986). Not only is this no longer the case, but there has been a far more important development: A behavioral or social scientist today with the needed training and skills can make major original discoveries and contributions by mining the rich untapped veins of information in accumulated research literatures. This process is well under way today. The industrial-organizational psychology and organizational behavior research literatures--the ones with which we are most familiar--are rapidly being mined. The same is true in education, social psychology, and other areas.

Role of Meta-Analysis in Theory Development

The major task in the behavioral and social sciences, as in other sciences, is the development of theory. A good theory is simply a good explanation of the processes that actually take place in a phenomenon. For example, what actually happens when employees develop a high level of organizational commitment? Does job satisfaction develop first and then cause the development of commitment? If so, what causes job satisfaction to develop, and how does it have an effect on commitment? How do higher levels of mental ability cause higher levels of job performance? Only by increasing job knowledge? Or also by directly improving problem solving on the job? The social scientist is essentially a detective; his or her
job is to find out why and how things happen the way they do. To construct theories, however, researchers must first know some of the basic facts, such as the empirical relations among variables. These relations are the building blocks of theory. For example, if researchers know there is a high and consistent population correlation between job satisfaction and organizational commitment, this will send them in particular directions in developing their theories. If the correlation between these variables is very low and consistent, theory development will branch in different directions. If the relation is highly variable across organizations and settings, researchers will be encouraged to advance interactive or moderator-based theories. Meta-analysis provides these empirical building blocks for theory. Meta-analytic findings tell researchers what it is that needs to be explained by the theory.

Meta-analysis has been criticized because it does not directly generate or develop theory (Guzzo, Jackson, & Katzell, 1987). This is like criticizing typewriters or word processors because they do not generate novels on their own. The results of meta-analysis are indispensable for theory construction, but theory construction itself is a creative process distinct from meta-analysis.

As implied in the language used in our discussion, theories are causal explanations. The goal in every science is explanation, and explanation is always causal. In the behavioral and social sciences, the methods of path analysis (e.g., see Hunter & Gerbing, 1982) can be used to test causal theories when the data meet the assumptions of the method. The relationships revealed by meta-analysis--the empirical building blocks for theory--can be used in path analysis to test causal theories even when all the delineated relationships are observational rather than experimental. Experimentally determined relationships can also be entered into path analyses along with observationally based relations; it is only necessary to transform d values to correlations. Path analysis can be a very powerful tool for reducing the number of theories that could possibly be consistent with the data, sometimes to a very small number, and sometimes to only one theory (Hunter, 1988). For examples, see Hunter (1983) and Schmidt (1992). Every such reduction in the number of possible theories is an advance in understanding.

It has now become a virtual necessity that every researcher have some understanding of meta-analysis. This is also becoming the case for path analysis and structural equation modeling (Jöreskog & Sörbom, 1979). The explosion of meta-analysis applications is not limited to psychology, education, and the social sciences. As discussed earlier, the fastest growing area of application is medical research. Meta-analysis is also used in finance (Coggin, Fabozzi, & Rahman, 1993; Coggin & Hunter, 1983, 1987, 1993; Dimson & Marsh, 1984; Ramamurti, 1989), economics, marketing, and other areas. The rapid growth of meta-analysis is likely to continue.

Meta-Analysis in Industrial-Organizational Psychology

Meta-analytic methods are general and can be applied to research literatures in any area. There have been numerous applications in industrial-organizational (I/O) psychology. The most extensive and detailed application of meta-analysis in I/O psychology has been the study of the generalizability of the validities of employment selection procedures (Schmidt, 1988; Schmidt & Hunter, 1981).
Most of the applications have been to employment tests of ability and aptitude, but other procedures, such as employment interviews (McDaniel, Whetzel, Schmidt, &
Maurer, 1994), assessment centers (Gaugler, Rosenthal, Thornton, & Bentson, 1987), and ratings of training and experience (McDaniel, Schmidt, & Hunter, 1988) have also been studied. The findings have resulted in major changes in the field of personnel selection. Validity generalization research is described in more detail later in this article.

The meta-analysis methods presented in Hunter and Schmidt (1990b) have also been applied in other areas of I/O psychology and organizational behavior. Between 1978 and 1995, there were approximately 70 such published applications. The following are some examples: (a) correlates of role conflict and role ambiguity (C. D. Fisher & Gittelson, 1983; Jackson & Schuler, 1985); (b) relation of job satisfaction to absenteeism (Hackett & Guion, 1985; Terborg & Lee, 1982); (c) relation between job performance and turnover (McEvoy & Cascio, 1987); (d) relation between job satisfaction and job performance (Iaffaldano & Muchinsky, 1985; Petty, McGee, & Cavender, 1984); (e) effects of nonselection organizational interventions on employee output and productivity (Guzzo, Jette, & Katzell, 1985); (f) effects of realistic job previews on employee turnover, performance, and satisfaction (McEvoy & Cascio, 1985; Premack & Wanous, 1985); (g) evaluation of Fiedler's theory of leadership (Peters, Hartke, & Pohlmann, 1985); and (h) accuracy of self-ratings of ability and skill (Mabe & West, 1982). The applications have been to both correlational and experimental literatures.

By the mid-1980s, enough meta-analyses had accumulated in I/O psychology that a review of meta-analytic studies in this area was published. This lengthy review (Hunter & Hirsh, 1987) reflected the fact that this literature had become quite large. It is noteworthy that the review devoted considerable space to the development and presentation of theoretical propositions; this reflects the fact that the clarification of research literatures produced by meta-analysis provides a basis for theory development that did not previously exist. It is also noteworthy that the findings in one meta-analysis were often found to be theoretically relevant to the interpretation of the findings in other meta-analyses. A second review of meta-analytic studies in I/O psychology was recently published (Tett, Meyer, & Roese, 1994).

These are examples of the application of meta-analysis to research programs, the results of which can be sought out and used by policy analysts as a foundation for policy recommendations. Meta-analysis can, however, be applied more directly in the public policy arena. We now describe one such example. The Federation of Behavioral, Psychological and Cognitive Sciences sponsors regular Science and Public Policy Seminars for members of Congress and their staffs. A recent speaker was Eleanor Chelimsky, who spent many years as the director of the General Accounting Office's Division of Program Evaluation and Methodology. In that position she pioneered the use of meta-analysis as a tool for providing program evaluation and other legislatively significant advice to Congress. Chelimsky (1994) stated that meta-analysis has proven to be an excellent way to provide Congress with the widest variety of research results that can hold up under close scrutiny under the time pressures imposed by Congress.
She stated that the General Accounting Office has found that meta-analysis reveals both what is known and what is not known in a given topic area and distinguishes between fact and opinion without being confrontational. One application she cited as an example was a meta-analysis of studies on the merits of producing binary chemical
weapons (nerve gas in which the two key ingredients are kept separate for safety until the gas is to be used). The meta-analysis did not support the production of such weapons. This was not what officials in the Department of Defense wanted to hear, and the Department of Defense disputed the methodology and the results. The methodology held up under close scrutiny, however, and in the end Congress cut off all funds for binary weapons. By law it is the responsibility of the General Accounting Office to provide policy-relevant research information to Congress. So the adoption of meta-analysis by the General Accounting Office provides a clear and even dramatic example of the impact that meta-analysis can have on public policy.

Application of Meta-Analysis to Personnel Selection

One major application of meta-analysis to date has been the examination of the validity of tests and other methods used in personnel selection. Meta-analysis has been used to test the hypothesis of situation-specific validity. In personnel selection it had long been believed that validity was specific to situations; that is, it was believed that the validity of the same test for what appeared to be the same job varied from employer to employer, region to region, across time periods, and so forth. In fact, it was believed that the same test could have high validity (i.e., a high correlation with job performance) in one location or organization and be completely invalid (i.e., have zero validity) in another. This belief was based on the observation that obtained validity coefficients for similar tests and jobs varied substantially across different studies. In some such studies there was a statistically significant relationship, and in others there was no significant relationship--which, as noted earlier, was falsely taken to indicate no relationship at all. This puzzling variability of findings was explained by postulating that jobs that appeared to be the same actually differed in important but subtle ways in what was required to perform them. This belief led to a requirement for local or situational validity studies. It was held that validity had to be estimated separately for each situation by a study conducted in that setting; that is, validity findings could not be generalized across settings, situations, employers, and the like (Schmidt & Hunter, 1981).

In the late 1970s, meta-analyses of validity coefficients began to be conducted to test whether validity might not in fact be generalizable (Schmidt & Hunter, 1977; Schmidt, Hunter, Pearlman, & Shane, 1979); these meta-analyses were therefore called validity generalization studies. If all or most of the study-to-study variability in observed validities was due to sampling error and other artifacts, then the traditional belief in situational specificity of validity would be seen to be erroneous, and the conclusion would be that validity did generalize.

To date, meta-analysis has been applied to more than 500 research literatures in employment selection, each one representing a predictor-job performance combination. These predictors have included nontest procedures, such as evaluations of education and experience, employment interviews, and biographical data scales, as well as ability and aptitude tests. In many cases, artifacts accounted for all variance across studies. As an example, consider the relation between quantitative ability and overall job performance in clerical jobs (Pearlman, Schmidt, & Hunter, 1980).
This substudy was based on 453 correlations computed on a total of 39,584 people. Seventy-seven percent of the variance of the observed validities was traceable to artifacts, leaving a negligible variance of .019. The
mean validity was .47. Thus, integration of this massive amount of data leads to the general and generalizable principle that the correlation between quantitative ability and clerical performance is .47, with very little (if any) true variation around this value. Like other similar findings, this finding shows that the old belief that validities are situationally specific is false (Schmidt & Hunter, 1981).

Today many organizations--including the federal government, the U.S. Employment Service, and some large corporations--use validity generalization findings as the basis of their selection-testing programs. Validity generalization has been included in standard texts (e.g., Anastasi, 1988) and in the Standards for Educational and Psychological Testing (1985). Proposals have been made to include validity generalization in the federal government's Uniform Guidelines on Employee Selection Procedures (1978) when this document is next revised. In litigation in Canada, the use of validity generalization findings as the basis for the use of a group intelligence test in selecting tax collectors was upheld (Maloley et al. v. Department of National Revenue, 1986). A report by the National Academy of Sciences (Hartigan & Wigdor, 1989) devotes an entire chapter (chapter 6) to validity generalization and endorses its methods and assumptions.

In recent years, we have continued to refine our meta-analysis procedures to make them more accurate. Many of these refinements are described in Hunter and Schmidt (1990b). Others are described and tested in a series of studies conducted and published since the appearance of that meta-analysis book. The findings from the research on refinements indicate that in general our earlier methods overestimated the amount of real study-to-study variability in results remaining after correction for the effects of sampling error variance and other variance-producing statistical and measurement artifacts. This real variability is symbolized as SDρ, the estimated standard deviation of the true validities.

In the first study in this series (Law, Schmidt, & Hunter, 1994a), we found through computer simulation that a new procedure that better takes into account the nonlinearity of the range restriction correction increased the accuracy of estimates of SDρ for our most frequently used procedure (the interactive procedure; Hunter & Schmidt, 1990b, pp. 185-186). The next article (Hunter & Schmidt, 1994) showed analytically that use of the mean observed correlation in computing the sampling error variance of (Pearson) correlation coefficients increases the accuracy of those estimates in all cases likely to occur in real data. However, we were able to show this only for the homogeneous case (the case in which there is no real variance in population correlations across studies); the answer for the heterogeneous case is much more difficult to obtain analytically. We therefore conducted a third study in which we used computer simulation to examine the use of the mean r in the heterogeneous case and examined the accuracy-enhancing effects of the new range restriction procedure with realistic sample sizes. Both refinements were found to increase accuracy under these circumstances (Law, Schmidt, & Hunter, 1994b).
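The refinement involving the mean observed correlation can be stated concretely. In the sketch below (our illustration, with hypothetical observed validities and sample sizes), the sampling error variance contributed by each study is estimated with the formula (1 - r²)²/(N - 1); the older practice inserted each study's own, error-laden r into that formula, whereas the refined procedure inserts the mean observed r.

    import numpy as np

    rs = np.array([0.18, 0.42, 0.27, 0.35, 0.23])   # hypothetical observed validities
    ns = np.array([68, 54, 120, 75, 90])            # hypothetical study sample sizes

    r_bar = np.average(rs, weights=ns)              # sample-size-weighted mean r
    var_obs = np.average((rs - r_bar) ** 2, weights=ns)

    # Older practice: each study's own r in the sampling error variance formula
    var_e_old = np.average((1 - rs ** 2) ** 2 / (ns - 1), weights=ns)
    # Refinement: the mean observed r in the formula for every study
    var_e_new = np.average((1 - r_bar ** 2) ** 2 / (ns - 1), weights=ns)

    print("residual variance, older rule:  ", round(var_obs - var_e_old, 4))
    print("residual variance, refined rule:", round(var_obs - var_e_new, 4))

The difference between the two residual variances is what drives the more accurate (and generally smaller) estimates of SDρ reported below.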
Finally, we applied the interactive meta-analysis procedure with these accuracy-enhancing refinements to a large body of data previously analyzed by using the older methods (data from Pearlman et al., 1980) to determine the practical and theoretical implications of these improvements for conclusions drawn from real data (Schmidt et al., 1993). The database in that study consisted of observed validities for tests used in clerical hiring against supervisory ratings of overall job performance. Studies in this large database (see Pearlman et al., 1980) are highly heterogeneous, spanning
the time period from the 1920s through the 1970s and including both published and unpublished data (68% unpublished). Studies were conducted in all parts of the United States and in many different types of organizations, both public and private. Clerical jobs were classified into five job families; meta-analyses were conducted separately by family, and the results were then averaged across job families. (Further details can be found in Schmidt et al., 1993.) Despite this heterogeneity, findings with the accuracy-enhanced meta-analysis procedures showed that there was very little real variability in findings across studies.

The results most relevant to our present purpose of illustrating meta-analysis findings are shown in Table 1. The first column of numbers is the total sample size for each test type, and the second column shows the number of studies. The third column shows the average percentage across the different clerical job families of the observed variance of validity coefficients across studies that is explained by statistical and measurement artifacts. Note that these figures are quite large; in many cases all or almost all of the observed variability is explained. The (geometric) average for all 10 test types is 82%. The first 5 test types in Table 1 are "classic" homogeneous psychological constructs; for these 5, the average percentage of variance explained is 87%. Given the general heterogeneity of these studies, these figures are quite remarkable.

The fourth column of numbers shows the estimated mean true validities for the 10 types of tests. These values are substantial and are indicative of high levels of practical (economic) value in terms of enhanced workforce productivity resulting from use of such measures in hiring clerical workers (Hunter, Schmidt, & Judiesch, 1990; Schmidt, Hunter, McKenzie, & Muldrow, 1979). However, mean validities are changed very little by adoption of the more accurate meta-analysis procedures. The original, approximate meta-analysis methods provided very accurate estimates of mean values. (Original estimates are presented in Pearlman et al., 1980.)

The use of more accurate meta-analysis procedures has its major impact on estimates of SDρ, the standard deviation of the true validities. The fifth column of numbers shows the mean estimates of SDρ. (Note that because the values in Table 1 are averages, it is possible to have nonzero values for the average SDρ in rows in which the average amount of variance accounted for is 100%.) Values of SDρ are critical; they estimate the level of real variability in true validities.

Table 1
Some Validity Generalization Results for Employment Tests Used for Clerical Workers Obtained With Accuracy-Enhanced Meta-Analysis Procedures

Test                           N        No. of studies   % Variance    ρ     SDρ    90% CV
General mental ability         7,014          88              91      .54    .10     .41
Verbal ability                18,859         232              68      .38    .14     .20
Quantitative ability          18,919         223             107      .50    .05     .44
Reasoning ability              4,367          64              84      .40    .11     .26
Perceptual speed              35,134         446              89      .40    .09     .29
Memory                         5,588          69              94      .40    .08     .30
Spatial mechanical ability     5,281          49             101      .38    .08     .28
Motor ability                 14,518         156              79      .25    .10     .12
Performance tests              3,422          40              33      .39    .32    -.02
Clerical aptitude tests        5,196          76              67      .53    .17     .31

Note. CV = credibility value.
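The last column of Table 1 follows directly from the two columns before it: under the standard assumption that true validities are normally distributed, the 90% credibility value is the 10th percentile of that distribution, or approximately the mean true validity minus 1.28 SDρ. The sketch below (our illustration) reproduces the tabled values for the first three rows.

    from scipy import stats

    def credibility_value_90(mean_rho, sd_rho):
        """10th percentile of an assumed normal distribution of true validities."""
        return mean_rho + stats.norm.ppf(0.10) * sd_rho   # ppf(0.10) is about -1.28

    for test, rho, sd in [("General mental ability", 0.54, 0.10),
                          ("Verbal ability",         0.38, 0.14),
                          ("Quantitative ability",   0.50, 0.05)]:
        print(test, round(credibility_value_90(rho, sd), 2))
    # .41, .20, and .44, matching the 90% CV column of Table 1 to rounding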
These values are not only quite small, they are considerably smaller than those obtained by using earlier, less accurate meta-analysis procedures. The reduction in size averages 29% for all the predictor types, and for the five classic psychological constructs, it is 32% (Schmidt et al., 1993). Yet as small as they are, these values of SDρ must be considered overestimates because, as explained in Schmidt et al., there are at least six sources of artifactual variation in these validity estimates that even these refined meta-analysis methods are unable to correct for (e.g., errors in the data due to transcriptional, typographical, and similar errors). The average value of SDρ in Table 1 is .106; the true average is smaller, perhaps substantially smaller, and perhaps zero. Thus these findings undercut again whatever is left of the original situational specificity hypothesis that we discussed earlier.

The last column in Table 1 presents the average value at the 10th percentile of the true validity distribution. This value is called the 90% credibility value; 90% of all values in the distribution of true validities lie above this value. Therefore an employer or other test user can be certain at the 90% credibility level that the validity in a new setting will be at least this large. For example, in the case of a general mental ability test, the expected and most probable value for the operational validity is .54. However, its "worst case value" is .41; that is, in only 10% of such applications would the value be less than .41. However, as noted above, all figures for SDρ have to be viewed as overestimates to some unknown extent; therefore, the true frequency of values below .41 will be less than 10% (although we do not know how much less).

On the basis of the currently most accurate meta-analysis methods, this example illustrates two points. First, meta-analysis produces cumulative, generalizable knowledge about relationships that exist in the real world--knowledge that decision makers and policy formulators can rely on with confidence. Second, not only is meta-analysis capable of precisely estimating the average level of relationship that exists, but it can also show that this average level varies very little if at all across situations and settings, if this is indeed the case in reality. This second point is very important: Knowing that you can count on a relationship not to depart much from its average level is critical to confidence in decision making about actions to be taken in particular situations.

Approaches to Meta-Analysis

In our book on meta-analysis, we systematically reviewed 10 different methods for integrating research findings across studies, starting with the traditional narrative-subjective review and ending with modern quantitative methods of meta-analysis (Hunter & Schmidt, 1990b, chapter 11). Here we present only a quick overview of the most important systematic approaches to meta-analysis.
Descriptive Meta-Analysis Methods

Glass (1976) advanced the first such set of procedures and coined the term meta-analysis to designate the analysis of analyses (studies). For Glass, the purpose of meta-analysis is descriptive; the goal is to paint a very general, broad, and inclusive picture of a particular research literature (Glass, 1977; Glass,
McGaw, & Smith, 1981). The questions to be answered are very general; for example, does psychotherapy--regardless of type--have an impact on the kinds of outcomes that therapy researchers consider important enough to measure, regardless of the nature of these outcomes (e.g., self-reported anxiety, count of emotional outbursts, etc.)? Thus Glassian meta-analysis often combines studies with somewhat different independent variables (e.g., different kinds of therapy) and different dependent variables. As a result, some of Glass's critics have accused him of combining apples and oranges.

Methodologically, the primary properties of Glassian meta-analysis are the following:

1. A strong emphasis on effect sizes rather than significance levels. Glass believed the purpose of research integration is more descriptive than inferential, and he felt that the most important descriptive statistics are those that indicate most clearly the magnitude of effects. His meta-analysis typically uses estimates of the Pearson r or of the effect size (ES) d, where d = (X̄E - X̄C)/SD and X̄E and X̄C are the means of the experimental and control groups, respectively. The SD is the pooled standard deviation or the standard deviation of the control group. Glass (1977) presented a number of useful formulas for converting statistics in studies to estimates of d or r. The initial product of a Glassian meta-analysis is the mean and standard deviation of effect sizes across studies (a computational sketch is given at the end of this subsection).

2. Acceptance of the variance of effect sizes (S²ES) at face value. Glassian meta-analysis implicitly assumes that S²ES is real and should have some substantive explanation. These explanations are sought in the varying characteristics of the studies (e.g., sex or mean age of subjects, length of treatment, and more). Study characteristics that correlate with study effects are examined for their explanatory power. The general finding has been that few study characteristics correlate significantly with study outcomes. Problems of capitalization on chance and low statistical power associated with this step in meta-analysis are discussed in Hunter and Schmidt (1990b, chapter 2).

3. A strongly empirical approach to determining which aspects of studies should be coded and tested for possible association with study outcomes. Glass (1976, 1977) felt that all such questions are empirical questions, and he de-emphasized the role of theory or logic in determining which variables should be tested as potential moderators of study outcome (see also Glass, 1972).

One variation of Glass's (1976, 1977) methods has been labeled study effects meta-analysis by Bangert-Drowns (1986). It differs from Glass's procedures in several ways. First, only one effect size from each study is included in the meta-analysis, thus ensuring statistical independence within the meta-analysis. If a study has multiple dependent measures, those that assess the same construct are combined (usually averaged) and those that assess different constructs are assigned to different meta-analyses. These steps are similar to those we have followed in our methods and research. Second, study effects meta-analysis calls for the meta-analyst to make at least some judgments about study methodological quality and to exclude studies with deficiencies judged serious enough to distort study outcomes. In reviewing experimental studies, for example, the experimental treatment must be at least similar to those judged by experts in the research area to be appropriate, or the study will be excluded.
This procedure seeks to determine the effect of a particular treatment on a particular outcome (construct) rather than to paint a broad Glassian picture of a research area. Some of those instrumental in developing and
using this procedure are Mansfield and Busse (1977), Kulik and his associates (Bangert-Drowns, Kulik, & Kulik, 1983; Kulik & Bangert-Drowns, 1983-1984), Landman and Dawes (1982), and Wortman and Bryant (1985).
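As a computational companion to the Glassian procedure just described, the sketch below (ours, with hypothetical study summary statistics) computes d from each study's group means and a pooled standard deviation, summarizes the mean and standard deviation of the effect sizes, and applies a conversion between d and r of the kind Glass (1977) provided; the pooled-SD and equal-group-size variants are assumptions of this illustration.

    import numpy as np

    def cohen_d(mean_e, mean_c, sd_e, sd_c, n_e, n_c):
        """d = (mean of experimental group - mean of control group) / pooled SD."""
        sd_pooled = np.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / (n_e + n_c - 2))
        return (mean_e - mean_c) / sd_pooled

    def d_to_r(d):
        """Approximate conversion of d to r, assuming equal group sizes."""
        return d / np.sqrt(d**2 + 4)

    # Hypothetical studies: (mean_E, mean_C, SD_E, SD_C, n_E, n_C)
    studies = [(24.1, 21.0, 6.2, 5.9, 40, 40),
               (18.3, 17.5, 4.8, 5.1, 25, 28),
               (30.2, 26.4, 7.5, 7.0, 60, 55)]

    ds = np.array([cohen_d(*s) for s in studies])
    print("mean d =", round(ds.mean(), 2), "; SD of d =", round(ds.std(ddof=1), 2))
    print("as correlations:", np.round(d_to_r(ds), 2))

The mean and standard deviation of the ds are the descriptive end product of a Glassian analysis; the methods discussed next go further by asking how much of that standard deviation is artifactual.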
Meta-Analysis Methods That Focus on Sampling Error

As noted earlier, numerous artifacts produce the deceptive appearance of variability in results across studies. The artifact that typically produces more false variability than any other is sampling error variance. Glassian meta-analysis implicitly accepts variability produced by sampling error variance as real variability. There are two types of meta-analyses that move beyond Glassian methods in that they attempt to control for sampling error variance.

The first of these methods is homogeneity test-based meta-analysis. This approach was advocated independently by Hedges (1982b) and his associates and by Rosenthal and Rubin (1982) and their associates. Hedges (1982a) and Rosenthal and Rubin (1982) proposed that chi-square statistical tests be used as an aid in deciding whether study outcomes are more variable than would be expected from sampling error alone. If these chi-square tests of homogeneity are not statistically significant, then the population correlation or effect size is accepted as constant across studies, and there is no search for moderators. Use of chi-square tests of homogeneity to determine whether findings in a set of studies differ more than would be expected from sampling error variance is not new; such tests were developed by Snedecor (1946) almost 50 years ago. Hedges (1982b) extended the concept of homogeneity tests to develop a more general procedure for moderator analysis based on significance testing. It calls for breaking the overall chi-square statistic down into the sum of within- and between-group chi-squares. The original set of effect sizes in the meta-analysis is divided into successively smaller subgroups until the chi-square statistics within the subgroups are nonsignificant, indicating that sampling error can explain all the variation within the last set of subgroups.

Homogeneity test-based meta-analysis represents a return to the practice that originally led to the great difficulties in making sense out of research literatures: reliance on statistical significance tests. The first problem is that the significance tests advocated allow only for sampling error; there are other purely artifactual sources of variance between studies in effect sizes. These include computational, transcriptional, and other data errors, differences between studies in reliability of measurement and in levels of range restriction, and many others (Hunter & Schmidt, 1990b). Thus, even when true study effect sizes are actually the same across studies, these sources of artifactual variance will create variance beyond sampling error, causing the chi-square test to be significant and hence to falsely indicate heterogeneity of effect sizes. This is especially likely to be true when the number of studies is large, creating high "power" to detect even small amounts of such artifactual variance. Second, even when the variance beyond sampling error is not artifactual, it often will be small in magnitude and of little or no theoretical or practical significance. Hedges and Olkin (1985) recognized this fact and cautioned that researchers should evaluate the actual size of the variance; unfortunately, however, once researchers are caught up in significance tests, the usual practice is to assume that if it is statistically significant it is important and to act (or interpret)
accordingly. Once the major focus is on the results of significance tests, effect sizes are usually ignored. The second approach to meta-analysis that attempts to control only for the artifact of sampling error is what we have referred to as bare bones meta-analysis (Hunter & Schmidt, 1990b; Hunter et al., 1982; Pearlman et al., 1980). This approach can be applied to correlations, d values, or any other effect size statistic for which the standard error is known. For example, if the statistic is correlations, the mean r is first computed. Then the variance of the set of correlations is computed. Next the amount of sampling error variance is computed and subtracted from this observed variance. If the resUlt is zero, then sampling error accounts for all the observed variance, and the mean r value accurately summarizes all the studies in the meta-analysis. If not, then the square root of the remaining variance is the index of variability remaining around the mean r after sampling error variance is removed. The hypothetical example we presented earlier in this article was a bare bones meta-analysis. Because there are always other artifacts (such as measurement error) that should be corrected for, we have consistently stated in our writings that the bare bones meta-analysis method is incomplete and unsatisfactory. It is useful primarily as the first step in explaining and teaching meta-analysis to novices. However, some studies using bare bones methods have been published; the authors have invariably claimed that the information needed to correct for artifacts beyond sampling error was unavailable to them. This is in fact rarely the case, as estimates of artifact values (e.g., reliabilities of scales) are usually available from the literature, from test manuals, or from other sources. The third type of meta-analysis is psychometric meta-analysis. These methods correct not only for sampling error (an unsystematic artifact) but also for other, systematic artifacts, such as measurement error, range restriction or enhancement, dichotomization of measures, and so forth (Hunter & Schmidt, 1990b). These other artifacts are said to be systematic because, in addition to creating artifactual variation across studies, they also create systematic downward biases in the results of all studies. For example, measurement error systematically biases all correlations downward. Psychometric meta-analysis corrects not only for the artifactual variation across studies, but also for the downward biases. The results in Table 1, which we discussed earlier, are an example of full psychometric meta-analysis. Psychometric meta-analysis is the only meta-analysis method that takes into account both statistical and measurement artifacts. These procedures were initially developed concurrently with Glass's (1976; 1977) work for application to validity studies of employment tests (Pearlman et al., 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt, Hunter, Pearlman, & Shane, 1979); however, the statistical and measurement principles are general and were quickly extended to other research areas (Hunter, 1979). Even though we were unaware of Glass's work at the time, this procedure can be regarded as an extension of Glassian meta-analysis to deal with the problems created by artifacts such as sampling error, unreliability (measurement error), restriction in range, and the like. The primary properties of this procedure are as follows: 1. Primary focus is on effect sizes. 
The third type of meta-analysis is psychometric meta-analysis. These methods correct not only for sampling error (an unsystematic artifact) but also for other, systematic artifacts, such as measurement error, range restriction or enhancement, dichotomization of measures, and so forth (Hunter & Schmidt, 1990b). These other artifacts are said to be systematic because, in addition to creating artifactual variation across studies, they also create systematic downward biases in the results of all studies. For example, measurement error systematically biases all correlations downward. Psychometric meta-analysis corrects not only for the artifactual variation across studies but also for the downward biases. The results in Table 1, which we discussed earlier, are an example of full psychometric meta-analysis. Psychometric meta-analysis is the only meta-analysis method that takes into account both statistical and measurement artifacts. These procedures were initially developed concurrently with Glass's (1976, 1977) work for application to validity studies of employment tests (Pearlman et al., 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt, Hunter, Pearlman, & Shane, 1979); however, the statistical and measurement principles are general and were quickly extended to other research areas (Hunter, 1979). Even though we were unaware of Glass's work at the time, this procedure can be regarded as an extension of Glassian meta-analysis to deal with the problems created by artifacts such as sampling error, unreliability (measurement error), restriction in range, and the like.
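As an illustration of the direction of these corrections, the sketch below disattenuates a single observed correlation for unreliability in both measures and for direct range restriction. The full psychometric meta-analysis corrects the mean and variance of the entire distribution of effect sizes, often using distributions of artifact values (Hunter & Schmidt, 1990b); this is only a simplified example with names of our own choosing.

```python
# Illustrative corrections for two of the systematic artifacts discussed above,
# applied to a single observed correlation.
from math import sqrt

def correct_for_unreliability(r_obs, rxx, ryy):
    """Classical disattenuation: divide by the square root of the two reliabilities."""
    return r_obs / sqrt(rxx * ryy)

def correct_for_range_restriction(r_obs, u):
    """Thorndike Case II correction; u = unrestricted SD / restricted SD."""
    return (u * r_obs) / sqrt((u ** 2 - 1) * r_obs ** 2 + 1)
```

Both corrections move the observed correlation toward its construct-level value, which is why uncorrected studies understate effect sizes.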
The primary properties of this procedure are as follows:

1. The primary focus is on effect sizes. All effect sizes (ESs) are corrected for statistical and measurement artifacts (e.g., instrument unreliability) that bias them downward from their true score values.

2. After estimating the mean true effect size, the hypothesis that the observed variance of effect sizes (S²_ES) is due to statistical artifacts is tested by subtracting the variance due to artifacts from the observed variance. The alternative hypothesis is that the variance of actual (true) effect sizes is greater than zero, that is, S²_ES_T > 0. If this hypothesis is rejected, the reviewer concludes that the true effect size is constant across the many factors varying in the studies reviewed. The estimated mean true effect size (mean ES_T) is then the final and only product of the meta-analysis.

3. Only if the hypothesis that S²_ES_T > 0 cannot be rejected are selected properties that vary across studies coded and correlated with study effect sizes. The meta-analyst relies on theoretical, logical, statistical, and psychometric considerations in deciding what study characteristics to code and how to code them. (Study properties coded should not be the artifacts [or products of the artifacts] controlled for in Step 2; otherwise, the effects of a given artifact may be partialed out twice.) Alternatively, one can subgroup studies based on study characteristics and conduct separate meta-analyses; in fact, this is the most commonly used method of moderator detection.

4. Correlations between study characteristics and effect sizes are corrected for sampling error in the effect size scale by using the procedure developed by Hunter (1979). Corrections are also made for unreliability in the measure of the study characteristic. If all these correlations are trivial in magnitude, go to Step 8.

5. Correlations among study characteristics are computed and corrected for unreliability in the study characteristics.

6. Using the resulting true score correlation matrix, the regression of effect sizes on study characteristics is computed, yielding the true score regression equation. The resulting beta weights should be interpreted as indicating potential causal effects of true study characteristics on true study effect size.

7. The resulting true score R should be corrected for shrinkage by using the appropriate shrinkage formula (Cattin, 1980). This shrunken R² then gives the percentage of the variance S²_ES_T (from Step 2) accounted for by variation in study characteristics. (Unfortunately, however, this procedure will often not fully correct for capitalization on chance. See the discussion of capitalization on chance below and in chapter 2 of Hunter and Schmidt, 1990b.)

8. Three different kinds of distributions of effect sizes can then be derived. These constitute the final products of the meta-analysis.

a. M = mean ES_T, with the standard deviation corrected for the effects of statistical artifacts only. This distribution describes the true effect sizes to be expected when study characteristics are allowed to vary, as in the studies reviewed. This would be the information needed if, for example, a given educational program were to be implemented in somewhat different ways in different parts of the country. Credibility intervals are derived for this distribution.

b. M = mean ES_T, with the standard deviation corrected for the effects of statistical artifacts and for the effects of deviations of study characteristics from their mean values. This distribution describes the true effect sizes to be expected when study characteristics are held constant at their mean values. Credibility intervals are derived for this distribution.
c. The value of M (the mean ES_T) and the standard deviation can be found for a
distribution in which study characteristics are held constant at values other than their means. Under the usual assumption of homoscedasticity, all such alternative sets of study characteristic values should leave SD_ES_T unchanged from distribution b. However, the mean ES_T would change. Thus, a decision maker could specify in advance the conditions of implementation and compute the expected effect size under the specific set of conditions. He or she would then construct a credibility interval for the estimated mean true effect size. This procedure tailors effect size predictions to the specific set of circumstances under which an intervention or program is to be implemented.

In cases in which study characteristics do not correlate with the effect size, the distributions in 8a, 8b, and 8c will be identical.
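The credibility intervals referred to in Step 8 are straightforward once the mean true effect size and SD_ES_T have been estimated. Assuming an approximately normal distribution of true effect sizes, a minimal sketch (with names of our own choosing) is as follows.

```python
# Illustrative 80% credibility interval around the estimated mean true effect size.
def credibility_interval(mean_es_t, sd_es_t, z=1.28):
    """z = 1.28 brackets the middle 80% of the distribution of true effect sizes."""
    return mean_es_t - z * sd_es_t, mean_es_t + z * sd_es_t
```

For distribution 8b or 8c, the same computation is applied with the mean ES_T implied by the chosen study characteristic values and the (unchanged) SD_ES_T.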
In Glass's (1976, 1977) meta-analysis, study effects meta-analysis, and psychometric meta-analysis, there are unresolvable problems in the search for moderators. First, when effect size estimates are regressed on multiple study characteristics, capitalization on chance operates to increase the apparent number of significant associations for those study characteristics that have no actual associations with study outcomes. Because the sample size is the number of studies and many study properties may be coded, this problem is potentially severe (Hunter & Schmidt, 1990b, chapter 2). There is no purely statistical solution to this problem. The problem can be mitigated, however, by basing the choice of study characteristics and the final conclusions not only on the statistics at hand but also on other theoretically relevant empirical findings (which may be the result of other meta-analyses) and on theoretical considerations. Results should be examined closely for substantive and theoretical meaning. Capitalization on chance is a threat whenever the (unknown) correlation or regression weight is actually zero or near zero. Second, when there is in fact a relationship, there is another problem: Power to detect the relation is often low (Hunter & Schmidt, 1990b, chapter 2). Thus, true moderators of study outcomes (to the extent that such exist) may have only a low probability of showing up as statistically significant. In short, this step in meta-analysis is often plagued with all the problems of small-sample studies. For a discussion of these problems, see Schmidt, Hunter, and Urry (1976) and Schmidt and Hunter (1978). Other things being equal, conducting separate meta-analyses on subsets of studies to identify a moderator does not avoid these problems and may lead to additional problems of confounding (Hunter & Schmidt, 1990b, chapter 13).

Other Changes Being Produced by Meta-Analysis

Meta-analysis is much more than just a new method for conducting reviews. The realities revealed about data and research findings by the principles of meta-analysis require major changes in our views of the individual empirical study, the nature of cumulative research knowledge, and the reward structure in the research enterprise. Meta-analysis has explicated the critical role of sampling error, measurement error, and other artifacts in determining the observed findings and statistical power of individual studies. In doing so, it has revealed how little information there is in a single study. It has shown that, contrary to widespread belief, no single primary study can provide more than tentative evidence on any issue. Because of sampling error, multiple studies are required to draw solid conclusions. The first study done in an area may be revered for its creativity, but sampling error in that study will
often produce a fully or partially wrong answer to the study question. The quantitative estimate of effect size will almost always be wrong. The shift from tentative to solid conclusions will require the accumulation of studies and the application of meta-analysis to those study results. Furthermore, adequate handling of other study imperfections such as measurement error--especially imperfect construct validity--may also require separate studies and more advanced meta-analysis. Because of the effects of artifacts such as sampling error and measurement error, data come to researchers encrypted, and to understand their meaning researchers must first break the code. Doing this requires meta-analysis. Therefore any individual study must be considered only a single data point to be contributed to a future meta-analysis. Thus the scientific status and value of the individual study is necessarily reduced. Ironically, however, the value of individual studies in the aggregate is increased.

Because multiple studies are needed to solve the problem of sampling error, it is critical to ensure the availability of all studies on each topic. The problem is that many good replication articles are rejected by the primary research journals. Journals currently put too much weight on creativity in evaluating studies and fail to consider either sampling error or other technical problems such as measurement error. Many journals will not even consider "mere replication studies" or "mere measurement studies." Many persistent authors eventually publish such studies in journals with lower prestige, but they must endure many letters of rejection--many of which are phrased in unjustifiably condemnatory language--and publication is delayed for a long period.

Consider the data presented in Table 1. Sixty-eight percent of the 698 studies represented in this table were unpublished. More than 3 years of effort were required to compile these data; countless letters were written and telephone calls made. (As an aside, our analysis showed no differences in the findings of the published and unpublished studies in this database; for a fuller discussion of the topic of "source bias," see Hunter and Schmidt, 1990b, chapter 13.) Situations like this are the rule rather than the exception for meta-analysts. To us this clearly indicates that we need a new type of journal--whether hard copy based or electronic--that systematically archives all studies that will be needed for later meta-analyses. The American Psychological Association's Experimental Publication System in the early 1970s was an attempt in this direction. However, at that time the need subsequently created by meta-analysis did not yet exist; the system apparently met no real need at that time and hence was discontinued. Today, the need is so great that failure to have such a journal system in place is retarding our efforts to reach our full potential in creating cumulative knowledge in psychology and the social sciences. This is clearly a problem that should be brought to the attention of the American Psychological Association's Publications Office and its Publications and Communications Board.

Conclusions

For decades policymakers seeking factual foundations for policy have looked to psychological and social science research. Until recently, they have been disappointed to find research literatures that were conflicting and contradictory.
As the number of studies on each particular question became larger and larger, this situation became increasingly frustrating and intolerable. These problems stemmed
from reliance on defective procedures for achieving cumulative knowledge: the statistical significance test in individual primary studies in combination with the narrative subjective review of research literatures. Meta-analysis principles have now correctly diagnosed this problem and, more important, have provided the solution. In area after area, meta-analytic findings have shown that there is much less conflict between different studies than had been believed, that coherent and useful conclusions can be drawn from research literatures, and that cumulative knowledge is possible in psychology and the social sciences. These methods have proven so successful that they have been adopted in other areas, such as medical research. A prominent medical researcher, Thomas Chalmers (as cited in Mann, 1990), has stated, "[Meta-analysis] is going to revolutionize how the sciences, especially medicine, handle data. And it is going to be the way many arguments will be ended" (p. 478). In concluding his oft-cited review of meta-analysis methods, Bangert-Drowns (1986, p. 398) stated,

Meta-analysis is not a fad. It is rooted in the fundamental values of the scientific enterprise: replicability, quantification, causal and correlational analysis. Valuable information is needlessly scattered in individual studies. The ability of social scientists to deliver generalizable answers to basic questions of policy is too serious a concern to allow us to treat research integration lightly. The potential benefits of meta-analysis method seem enormous.

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Anastasi, A. (1988). Psychological testing (7th ed.). New York: Macmillan.
Bangert-Drowns, R. L. (1986). Review of developments in meta-analytic method. Psychological Bulletin, 99, 388-399.
Bangert-Drowns, R. L., Kulik, J. A., & Kulik, C.-L. C. (1983). Effects of coaching programs on achievement test performance. Review of Educational Research, 53, 571-585.
Baum, M. L., Anish, D. S., Chalmers, T. C., Sacks, H. S., Smith, H., & Fagerstrom, R. M. (1981). A survey of clinical trials of antibiotic prophylaxis in colon surgery: Evidence against further use of no-treatment controls. New England Journal of Medicine, 305, 795-799.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Cattin, P. (1980). The estimation of the predictive power of a regression model. Journal of Applied Psychology, 65, 407-415.
Chelimsky, E. (1994, October 14). Use of meta-analysis in the General Accounting Office. Paper presented at the Science and Public Policy Seminars, Federation of Behavioral, Psychological and Cognitive Sciences, Washington, DC.
Coggin, T. D., Fabozzi, F. J., & Rahman, S. (1993). The investment performance of U.S. equity pension fund managers: An empirical investigation. The Journal of Finance, 48, 1039-1055.
Coggin, T. D., & Hunter, J. E. (1983). Problems in measuring the quality of investment information: The perils of the information coefficient. Financial Analysts Journal, May/June, 25-33.
Coggin, T. D., & Hunter, J. E. (1987). A meta-analysis of pricing of "risk" factors in APT. The Journal of Portfolio Management, 14, 35-38.
Coggin, T. D., & Hunter, J. E. (1993). A meta-analysis of mutual fund performance. Review of Quantitative Finance and Accounting, 3, 189-201.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116-127.
Cuts raise new social science query: Does anyone appreciate social science? (1981, March 27). Wall Street Journal, p. 54.
Dimson, E., & Marsh, P. (1984). An analysis of brokers' and analysts' unpublished forecasts of UK stock returns. Journal of Finance, 39(5), 1257-1292.
Fisher, C. D., & Gittelson, R. (1983). A meta-analysis of the correlates of role conflict and ambiguity. Journal of Applied Psychology, 68, 320-333.
Fisher, R. A. (1932). Statistical methods for research workers (4th ed.). Edinburgh, Scotland: Oliver & Boyd.
Fisher, R. A. (1935). The design of experiments. London: Oliver & Boyd.
Fisher, R. A. (1973). Statistical methods and scientific inference (3rd ed.). Edinburgh, Scotland: Oliver & Boyd.
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72, 493-511.
Gergen, K. J. (1982). Toward transformation in social knowledge. New York: Springer-Verlag.
Glass, G. V. (1972). The wisdom of scientific inquiry on education. Journal of Research in Science Teaching, 9, 3-18.
Glass, G. V. (1976). Primary, secondary and meta-analysis of research. Educational Researcher, 5, 3-8.
Glass, G. V. (1977). Integrating findings: The meta-analysis of research. Review of Research in Education, 5, 351-379.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.
Guzzo, R. A., Jackson, S. E., & Katzell, R. A. (1987). Meta-analysis analysis. In L. L. Cummings & B. M. Staw (Eds.), Research in organizational behavior (Vol. 9, pp. 407-442). Greenwich, CT: JAI Press.
Guzzo, R. A., Jette, R. D., & Katzell, R. A. (1985). The effects of psychologically based intervention programs on worker productivity: A meta-analysis. Personnel Psychology, 38, 275-292.
Hackett, R. D., & Guion, R. M. (1985). A re-evaluation of the absenteeism-job satisfaction relationship. Organizational Behavior and Human Decision Processes, 35, 340-381.
Halvorsen, K. T. (1986). Combining results from independent investigations: Meta-analysis in medical research. In J. C. Bailar & F. Mosteller (Eds.), Medical uses of statistics (pp. 392-416). Waltham, MA: New England Journal of Medical Books.
Hartigan, J. A., & Wigdor, A. K. (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.
Hedges, L. V. (1982a). Estimation of effect size from a series of independent experiments. Psychological Bulletin, 92, 490-499.
Hedges, L. V. (1982b). Fitting categorical models to effect sizes from a series of experiments. Journal of Educational Statistics, 7, 119-137.
Hedges, L. V. (1987). How hard is hard science, how soft is soft science: The empirical cumulativeness of research. American Psychologist, 42, 443-455.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Hunter, J. E. (1979, September). Cumulating results across studies: A critique of factor analysis, canonical correlation, MANOVA, and statistical significance testing. Invited address presented at the 86th Annual Convention of the American Psychological Association, New York, New York.
Hunter, J. E. (1983). A causal analysis of cognitive ability, job knowledge, job performance, and supervisory ratings. In F. Landy, S. Zedeck, & J. Cleveland (Eds.), Performance measurement and theory (pp. 257-266). Hillsdale, NJ: Erlbaum.
Hunter, J. E. (1988). A path analytic approach to analysis of covariance. Unpublished manuscript, Michigan State University, East Lansing.
Hunter, J. E., & Gerbing, D. W. (1982). Unidimensional measurement, second order factor analysis and causal models. In B. M. Staw & L. L. Cummings (Eds.), Research in organizational behavior (Vol. 4, pp. 267-320). Greenwich, CT: JAI Press.
Hunter, J. E., & Hirsh, H. R. (1987). Applications of meta-analysis. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology 1987 (pp. 321-357). London: Wiley.
Hunter, J. E., & Schmidt, F. L. (1990a). Dichotomization of continuous variables: The implications for meta-analysis. Journal of Applied Psychology, 75, 334-349.
Hunter, J. E., & Schmidt, F. L. (1990b). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
Hunter, J. E., & Schmidt, F. L. (1994). The estimation of sampling error variance in meta-analysis of correlations: The homogenous case. Journal of Applied Psychology, 79, 171-177.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Judiesch, M. K. (1990). Individual differences in output variability as a function of job complexity. Journal of Applied Psychology, 75, 28-42.
Iaffaldano, M. T., & Muchinsky, P. M. (1985). Job satisfaction and job performance: A meta-analysis. Psychological Bulletin, 97, 251-273.
Jackson, S. E., & Schuler, R. S. (1985). A meta-analysis and conceptual critique of research on role ambiguity and role conflict in work settings. Organizational Behavior and Human Decision Processes, 36, 16-78.
Jöreskog, K. G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: Abt Books.
Kulik, J. A., & Bangert-Drowns, R. L. (1983-1984). Effectiveness of technology in precollege mathematics and science teaching. Journal of Educational Technology Systems, 12, 137-158.
Landman, J. T., & Dawes, R. M. (1982). Psychotherapy outcome: Smith and Glass' conclusions stand up under scrutiny. American Psychologist, 37, 504-516.
Law, K. S., Schmidt, F. L., & Hunter, J. E. (1994a). Nonlinearity of range corrections in meta-analysis: A test of an improved procedure. Journal of Applied Psychology, 79, 425-438.
Law, K. S., Schmidt, F. L., & Hunter, J. E. (1994b). A test of two refinements in meta-analysis procedures. Journal of Applied Psychology, 79, 978-986.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181-1209.
Mabe, P. A., III, & West, S. G. (1982). Validity of self evaluations of ability: A review and meta-analysis. Journal of Applied Psychology, 67, 280-296.
Maloley et al. v. Department of National Revenue. (1986, February). Canadian Civil Service Appeals Board, Ottawa, Ontario, Canada.
Mann, C. (1990, August 3). Meta-analysis in the breech. Science, 249, 476-480.
Mansfield, R. S., & Busse, T. V. (1977). Meta-analysis of research: A rejoinder to Glass. Educational Researcher, 6, 3.
McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988). A meta-analysis of training and experience ratings in personnel selection. Personnel Psychology, 41, 283-314.
McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599-616.
McEvoy, G. M., & Cascio, W. F. (1985). Strategies for reducing employee turnover: A meta-analysis. Journal of Applied Psychology, 70, 342-353.
McEvoy, G. M., & Cascio, W. F. (1987). Do poor performers leave? A meta-analysis of the relation between performance and turnover. Academy of Management Journal, 30, 744-762.
Meehl, P. E. (1967). Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow process of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Mondale, W. F. (1970, August). Psychological research and public policy. Invited address presented at the 78th Annual Convention of the American Psychological Association, Washington, DC.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.
Peters, L. H., Hartke, D., & Pohlmann, J. (1985). Fiedler's contingency theory of leadership: An application of the meta-analysis procedures of Schmidt and Hunter. Psychological Bulletin, 97, 274-285.
Petty, M. M., McGee, G. W., & Cavender, J. W. (1984). A meta-analysis of the relationship between individual job satisfaction and individual performance. Academy of Management Review, 9, 712-721.
Premack, S., & Wanous, J. P. (1985). Meta-analysis of realistic job preview experiments. Journal of Applied Psychology, 70, 706-719.
Ramamurti, A. S. (1989). A systematic approach to generating excess returns using a multiple variable model. In F. J. Fabozzi (Ed.), Institutional investor focus on investment management. Cambridge, MA: Ballinger.
Rosenthal, R. (1984). Meta-analytic procedures for social research. Newbury Park, CA: Sage.
Rosenthal, R. (1991). Meta-analytic procedures for social research (2nd ed.). Newbury Park, CA: Sage.
Rosenthal, R., & Rubin, D. B. (1982). Comparing effect sizes of independent studies. Psychological Bulletin, 92, 500-504.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
Sacks, H. S., Berrier, J., Reitman, D., Ancona-Berk, V. A., & Chalmers, T. C. (1987). Meta-analysis of randomized controlled trials. New England Journal of Medicine, 316, 450-455.
Schmidt, F. L. (1988). The problem of group differences in ability scores in employment selection. Journal of Vocational Behavior, 33, 272-292.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173-1181.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115-129.
Schmidt, F. L., Gast-Rosenberg, I., & Hunter, J. E. (1980). Validity generalization results for computer programmers. Journal of Applied Psychology, 65, 643-661.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529-540.
Schmidt, F. L., & Hunter, J. E. (1978). Moderator research and the law of small numbers. Personnel Psychology, 31, 215-232.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.
Schmidt, F. L., Hunter, J. E., McKenzie, R. C., & Muldrow, T. W. (1979). The impact of valid selection procedures on work-force productivity. Journal of Applied Psychology, 64, 609-626.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. (1979). Further tests of the Schmidt-Hunter Bayesian validity generalization procedure. Personnel Psychology, 32, 257-281.
Schmidt, F. L., Hunter, J. E., & Urry, V. E. (1976). Statistical power in criterion-related validation studies. Journal of Applied Psychology, 61, 473-485.
Schmidt, F. L., Law, K. S., Hunter, J. E., Rothstein, H. R., Pearlman, K., & McDaniel, M. (1993). Refinements in validity generalization methods: Implications for the situational specificity hypothesis. Journal of Applied Psychology, 78, 3-13.
Snedecor, G. W. (1946). Statistical methods (4th ed.). Ames: Iowa State College Press.
Terborg, J. R., & Lee, T. W. (1982). Extension of the Schmidt-Hunter validity generalization procedure to the prediction of absenteeism behavior from knowledge of job satisfaction and organizational commitment. Journal of Applied Psychology, 67, 280-296.
Tett, R. P., Meyer, J. P., & Roese, N. J. (1994). Applications of meta-analysis: 1987-1992. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (Vol. 9, pp. 71-112). London: Wiley.
U.S. Equal Employment Opportunity Commission, U.S. Civil Service Commission, U.S. Department of Labor, and U.S. Department of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43(166), 38295-38309.
Wortman, P. M., & Bryant, F. B. (1985). School desegregation and Black achievement: An integrative review. Sociological Methods and Research, 13, 289-324.
Yusuf, S., Simon, R., & Ellenberg, S. S. (1986). Preface to proceedings of the workshop on methodological issues in overviews of randomized clinical trials, May 1986. Statistics in Medicine, 6, 217-218.