The Impact of Performance Management on Performance in Public Organizations: A Meta-Analysis
Ed Gerrish∗†
University of South Dakota
June 10, 2015
Abstract

Performance-based management is pervasive in public organizations; countless governments have implemented performance management systems with the hope that they will improve organizational effectiveness. However, there has been little comprehensive review of their impact. This article conducts a meta-analysis on the impact of performance management on performance in public organizations. It contributes to the current literature in three ways. First, it examines the effect of the “average” performance management system. Second, it examines the influence of management: whether beneficial performance management practices moderate the average effect. Third, it examines the effect of “time” on performance management. Using 2,188 effects from 49 studies, the analysis finds a small average effect of performance management. However, the effect is substantially larger when indicators of best practices in high quality studies are included; management practices have an important impact on the effectiveness of performance management systems. Evidence for the effect of time is mixed.
∗ This manuscript is dedicated to the memory of Evan J. Ringquist, a mentor. Thanks to Liz Baldwin, Tom Rabovsky, Dave Warren, Shannon Watkins, Zach Wendling, and Shuang Zhao for helpful comments and suggestions.
† The author has no conflicts of interest or funding to disclose.
Practitioner Points
• The act of measuring performance may not improve performance. Managing performance, however, might.
• Emphasize the use of benchmarking over time and to other entities to provide a valid comparison and to replicate success.
• Performance management can be found in a wide variety of policy areas. Ideas and best practices can be gleaned from many experiences.

Considering how common performance management systems have become in public organizations, from policing to social services, one might expect to find a consensus among practitioners and scholars that performance management systems are generally successful.1 Instead, one finds arguments that the values of performance systems are misguided (B. Radin 2006), poorly applied (Beryl A Radin 1998; Frederickson 2003; Frederickson and Frederickson 2006), or used for political ends (Lavertu and Moynihan 2012); evidence that performance management systems do not substantially improve public performance in the contexts in which they have been studied (Heckman, Heinrich, and Smith 1997; Rosenfeld, Fornango, and Baumer 2005; Hvidman and Andersen 2014; Gerrish 2014); and, finally, evidence that performance management systems tend to induce behaviors that increase measured performance while adversely impacting actual performance (Courty and Marschke 2004, 2008; Heinrich and Marschke 2010).

There are a number of important questions about performance management, but perhaps the most fundamental, the one addressed in this meta-analysis, is whether performance systems are associated with improved performance in public organizations. If there is little evidence that performance management improves performance, then it seems senseless to consider the tradeoffs with democratic values (B. Radin 2006) or unintended consequences.
Collecting data from original studies that evaluate performance management, this meta-analysis is able to combine data on performance with dummy variables representing beneficial performance management practices and indicators of study quality. In total, 2,188 effects were gathered from
49 original studies that examine the impact of a performance management system on performance in a public organization. This analysis explores three related concepts. First, it examines the impact of performance management on performance by combining all studies to estimate the effect of the average performance management system, termed the mean effect size. Second, this analysis uses moderating variables to explore whether some indicators of beneficial practices in performance management (such as benchmarking and bottom-up implementation) influence the mean effect size; this is a test of the influence of management practices on performance. Finally, this analysis explores the effect of “time” on the effectiveness of performance management systems. Time, in this analysis, is coded in two ways. First, a “second-generation” performance management system is defined as a system that has been in the same organization for at least two years and has been substantially changed (typically in response to perceived or actual failures); this definition is used in meta-regressions. Second, I examine the effect of time by examining the mean impact of performance management by the year of the data, accumulating the empirical evidence on performance management over the years.

This analysis significantly contributes to the extant literature by quantitatively examining the current state of performance management research. It combines studies from diverse fields and tests important theories about the impact of performance management, leveraging a large and sometimes contradictory body of existing empirical evidence. These results have implications for a wide range of policy areas.

The next section discusses the recent foundations of performance-based management. It identifies some theories tested by surveys about the moderating effect of managers on the relationship between performance management and performance.
Next, this article discusses meta-analysis as a research method and how it is employed in this article, including the literature search process, coding the original studies, and estimation of meta-regressions. After discussing results in detail, this article offers some suggestions for advancing the empirical research of performance management.
Managing for Performance: The Literature

Public organizations have been managing for performance since at least the early 1900s (Williams 2003), though many of the ideas that became the performance movement started gaining traction in the 1970s (Donald P Moynihan 2008). Most scholars peg the modern incarnation of performance management to the late 1980s and early 1990s as part of the New Public Management (NPM), based on experience in governance from the 1980s (Hood 1995). Performance-based management in this era caught the attention of politicians of all stripes with a few key publications (e.g., Osborne and Gaebler 1992; Wholey and Hatry 1992; Ammons 1994; Osborne and Plastrik 1997), leading to the National Performance Review in the U.S. (Gore 1993).

Performance-based management efforts have been criticized as being fundamentally misguided because they supplanted democratic values with technocratic ones (B. Radin 2006). Experiences with the Government Performance and Results Act (GPRA) at the federal level suggested that organizations may lack the capacity to implement sweeping performance reforms (Kimm 1995; Mihm 1995; Frederickson and Frederickson 2006), that GPRA had a one-size-fits-all problem (Beryl A Radin 1998; Beryl A. Radin 2000; Long and Franklin 2004), and that performance measurement might be inappropriate in some programs or departments, such as within the Department of Health and Human Services where, for example, programs find it difficult to measure performance on rare diseases (Frederickson and Frederickson 2006). Despite these cautions, governments at every level have bet “the future of governance on the use of performance information” (Donald P Moynihan 2008, p.5).
Performance management continued in the Bush Administration under GPRA’s successor, the Program Assessment Rating Tool (PART), and was modernized under President Obama. Numerous surveys have reported that local governments, especially cities, use performance measurement widely, though they use it for management less (Wang and Berman 2001; Melkers and Willoughby 2005). Two parallel trends in performance management research have occurred during the last two decades. The first is that management scholars have hypothesized how “management matters” to
performance (Ingraham, Joyce, and Donahue 2003; Donald P. Moynihan 2005) and have examined these hypotheses using policy case studies (e.g., Forsythe 2001) and surveys that examine self-reported performance from managers. These surveys have examined the effect of management generally (Donald P. Moynihan 2005) and performance management specifically (Cavalluzzo and Ittner 2003; de Lancer Julnes and Holzer 2001; Melkers and Willoughby 2005). In particular, surveys have been instrumental in generating and testing hypotheses. However, surveys that use self-reported performance have (at least) two drawbacks that limit their general applicability. The first is that surveys are subject to common-method variance bias (Lindell and Whitney 2001): when respondents are asked about the performance system and organization effectiveness in the same instrument, common-method variance bias tends to result in stronger correlations than multi-method instruments produce. The second is that there may be a positive response bias if, for example, respondents are the performance officers who are tasked with both implementing the performance system and reporting on perceived effectiveness for the survey instrument.

Nonetheless, surveys of performance management have found a few consistent results. Support from managers for performance management is associated with both adoption and implementation of performance management (Cavalluzzo and Ittner 2003). Use of performance information is both directly and indirectly related to the perception of performance management effectiveness (Yang and Hsieh 2007). Additionally, training and preparation for performance management implementation are associated with greater perceived effectiveness (Cavalluzzo and Ittner 2003; de Lancer Julnes and Holzer 2001; Kroll and Moynihan 2015).
Mission orientation activities, such as the establishment and re-evaluation of mission goals, are correlated with the implementation of performance management, but evidence for mission orientation’s impact on perceived effectiveness is lacking (Berman and Wang 2000; Wang and Berman 2001). Finally, voluntary performance management adoption may lead to “buy-in” and greater performance improvements (de Lancer Julnes and Holzer 2001). The second trend has been that public policy researchers began evaluating performance
management systems within their respective fields, bringing evidence to bear on the impact of performance management. However, as often as not, these studies (and citations) are isolated within policy subfields. Some examples of this research include: policing using Compstat-like programs both in the U.S. and abroad (Chilvers and Weatherburn 2004; Rosenfeld, Fornango, and Baumer 2005; Jang, Hoover, and Joo 2010; Mazerolle, Rombouts, and McBroom 2007), waiting times in the National Health Service in England (Propper et al. 2008; Besley, Bevan, and Burchardi 2009; Propper et al. 2010), education accountability systems (Hanushek and Raymond 2005; Dee and Jacob 2011; Dee and Wyckoff 2013; Hvidman and Andersen 2014), child support enforcement (Huang and Edwards 2009; Gerrish 2014), and job training, primarily from an experiment of the Job Training Partnership Act (Barnow 2000; Heckman, Heinrich, and Smith 2002; Heinrich 2002; Heinrich and Lynn 2001; Courty and Marschke 2008). Research in job training and educational accountability systems has been published in public management outlets more frequently than research in other areas and has also been linked to the incentives literature in economics (Baker, Jensen, and Murphy 1988; Holmstrom and Milgrom 1991).

This analysis leverages the findings from the management literature (the first trend) with data from policy research (the second trend) to address the three important questions about performance management described above. The first is whether there is evidence that the average performance management system “works.” There exists, however, no such thing as an average performance management system nor an average policy area; the average represents the result of a diverse body of research. The second is whether management matters to the performance of performance management: are the findings from surveys replicated when empirical measures of performance are used?
Third, this analysis explores the effect of time on performance, absent in the performance management literature thus far. As de Lancer Julnes and Holzer recommend: “the impact of time needs to be empirically assessed. . . it takes time to develop and implement good performance measures” (2001, p.703). This analysis examines the impact of time using two constructs, described in further detail below. If the answer to all three questions points to a lack of association between performance
management and performance, then it seems unnecessary to consider value tradeoffs or to use resources when we ought to be focusing on alternatives to performance management, like developing a public service motivation or ethic among managers (Rainey 1982; Perry and Wise 1990). These questions are amenable to meta-analytic techniques.
Meta-Analysis: Data and Methods

Meta-analysis, or analysis of analyses, combines quantitative findings from a number of different studies into a single study. It is more common in fields like medicine, where much of the research comes from randomized trials of the same treatment. In policy analysis and management, meta-analysis is less common for a few reasons. First, research is constantly shifting, meaning that we may not expect results to be replicated in new contexts or using different methods. However, it is possible in meta-analysis to tease out context- and method-specific results using independent variables in meta-regressions. Second, policy analysis and management does not have a strong culture of study replication, a culture inherited from other social sciences. To a large extent, however, we discount the amount of parallel research that occurs in the social sciences. Not all studies make a methodological or theoretical contribution on a particular subject, and such studies might therefore be omitted in a standard literature review. Perhaps the largest advantage of quantitative meta-analytic techniques is that they allow analysts to accumulate findings from the literature in a way that accounts for sample sizes (efficiency) and strengths of the original research. Much like in the original studies, it is difficult to convey the effect of X on Y by examining individual observations, so effects are summarized using parameter estimates in a regression model.

Meta-analysis has its own terminology. The original analyses are called studies (alternatively, original studies), a term that encompasses manuscripts, reports, books, and other publication outlets. Every study must have at least one statistical association between performance management (the X variable) and performance (Y). Each association is an effect. Every effect within every study is coded using the set of rules established below. The goal in meta-analysis is
to estimate the size (direction and magnitude) of the average effect, the mean effect size. This mean effect size can be examined either unconditionally or conditioned on independent variables of interest. The term meta-analysis describes the data collection process, but over time has also come to encompass statistical properties and a suite of tools (Card 2012; Wolf 1986; Borenstein et al. 2011; Ringquist 2013). The following sections provide more detail on data collection, variable coding, and empirical techniques.
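To make the idea of a mean effect size that "accounts for sample sizes" concrete, the following is a minimal sketch in Python. It assumes r-based effects, Fisher's z-transformation, and fixed-effect weights of n - 3. These are common textbook choices and are not the random effects meta-regressions estimated in this article. The function names and the example (r, n) pairs are hypothetical.

```python
import math


def fisher_z(r):
    """Transform a correlation r into Fisher's z (the inverse hyperbolic tangent)."""
    return math.atanh(r)


def mean_effect_size(effects):
    """Return a sample-size-weighted mean effect on the r scale.

    `effects` is a list of (r, n) pairs, one per coded effect.
    ASSUMPTION: each effect is weighted by n - 3, the inverse of the
    sampling variance of Fisher's z under a fixed-effect model.
    """
    weights = [n - 3 for _, n in effects]
    weighted_sum = sum(w * fisher_z(r) for w, (r, _) in zip(weights, effects))
    z_bar = weighted_sum / sum(weights)
    # Back-transform the mean z to the correlation scale.
    return math.tanh(z_bar)


# Hypothetical effects: (correlation, sample size)
example_effects = [(0.10, 203), (0.02, 1003), (-0.05, 103)]
print(round(mean_effect_size(example_effects), 4))
```

In this sketch, an effect drawn from a large sample pulls the mean toward itself far more than an effect drawn from a small sample, which is the efficiency property described above.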
Data Collection

Data collection in meta-analysis involves searching through the relevant literature to find the few studies that are acceptable. An exact phrase search in Google Scholar for “performance management” nets 306,000 results. It is important, then, to establish a clear framework for including studies before beginning a systematic search. Studies that are acceptable meet all of the criteria listed below.

Study Criteria. The research question for this synthesis, “What impact do performance management systems have on performance in public organizations?” suggests four criteria for identifying a study as acceptable. The first criterion is an acceptable outcome variable of interest, typically the dependent variable in a study that employs regression analysis. A broad definition of performance is employed here, though one that excludes self-reported measures of performance. As noted above, self-reported performance is likely to be positively biased for two reasons: first, managers in charge of performance systems may be more inclined to report success; second, surveys that ask both about performance systems and their performance may suffer from common method bias. I would, however, include survey responses of “customers,” but none of the studies examined employed such a survey. Some examples of performance used in this analysis include: child support orders established, student test scores, future earnings, and reported crimes. The second criterion is that the study evaluates a performance management system. There is no
single acceptable definition of performance management, but there are some important elements of performance management systems. In his book, Donald P Moynihan defines performance management as “a system that generates performance information through strategic planning and performance measurement routines and that connects this information to decision venues, where, ideally, the information influences a range of possible decisions” (2008, p. 5). Using this definition, and other literature in this area (e.g., Wholey and Hatry 1992; Kloot and Martin 2000; Robert D Behn 2003; Melkers and Willoughby 2005; Hatry 2006; Yang and Hsieh 2007; Moynihan and Pandey 2010), this analysis defines a few elements of performance management systems:

1. Setting performance goals or creating performance measures through fiat, negotiations, or models.
2. Using incentives to achieve performance goals, including monetary rewards.
3. Collecting performance information for use in strategic planning.
4. Evidence that performance information is used in organizational decision-making.
5. Benchmarking current performance to previous performance or performance of other entities, inside and outside of the organization. Similarly, grading, categorizing, or recognizing performance from benchmarking.
6. Linking agency, departmental, or organizational budgets or autonomy to achievement of performance goals.
7. Publishing performance targets and results for managers, staff, stakeholders, and the public.

To be included in this meta-analysis, two or more of the features described above ought to be evident in the narrative of the original study. In some cases, acceptability is evident from the body of work on the performance management system rather than a single work (e.g. JTPA). In other areas, it was necessary to find additional background information (e.g. Dee and Wyckoff 2013;
Fryer 2011, 2013). This typically excludes pay-for-performance and performance contracting unless the description of the performance management system in the original study makes it clear that other elements of performance systems are involved. Still, determining what is or is not a performance management system is as much art as science. For example, the National Institute for Excellence in Teaching (NIET) developed a teacher evaluation program, TAP, which has been evaluated by NIET researchers as well as independent evaluators. The program uses four elements of success: multiple career paths, ongoing applied professional growth, instructionally focused accountability, and performance-based compensation. While these share two elements of performance management systems, there also appears to be an absence of organization-level use of teacher performance information for strategic management. Moreover, goals are set almost completely at the individual level, which is atypical in this analysis. TAP also has a strong focus on professional development, which is again atypical. Nevertheless, by the rules established here, TAP and some TAP-like teacher evaluation programs warrant inclusion in this analysis. Studies of teacher evaluation and incentives programs are, however, different enough from the other studies to warrant a robustness check excluding them.2

The third criterion for inclusion in this meta-analysis is that the original study data must either be composed entirely of “public” organizations or must have a separate effect for public organizations.
The “publicness” of an organization is, of course, its own field of inquiry (Bozeman and Bretschneider 1994).3 The criterion used here is that individuals or organizations must ultimately respond to an elected authority and must not have an explicit profit motivation.4 The last criterion is that the study must have sufficient information to transform the reported results into an effect that can be compared on an equal basis to other studies. Generally speaking, this only requires that the original study conduct a statistical analysis with a hypothesis test. This includes t, z, χ², F, significance stars, p-values, or any other figure that signifies that a statistical test was performed against a null hypothesis.

Literature Search Process. The literature search finds studies that meet the four inclusion
criteria. Because of the overwhelming number of hits for any search for “performance management,” the search was limited to public policy areas and a few specific performance systems such as CompStat and GPRA/PART. These terms are numerous, including: crime, policing, prisons, welfare, food assistance, child support enforcement, job training, education, public health, transportation, construction, and the federal programs GPRA and PART. A list of the exact policy/program related search terms can be found in the endnotes.5 The above terms were then combined with performance management-related search terms. I used the following five exact phrases: “performance system,” “performance management,” “performance measurement,” “performance standard,” and “performance information.” This resulted in 100 total search permutations (20 policy search terms by five performance-related terms). Each permutation was then searched in academic search engines,6 online conference proceedings,7 working paper directories,8 and organizational web sites, including government research bodies,9 and nongovernmental researchers and think tanks.10

After the initial search, two more processes were used to ensure that as many acceptable studies as possible were found. The first is an ancestry search, searching both the references of acceptable studies as well as studies that have cited the original study (using Google Scholar). The second is to contact all authors of studies coded as acceptable, requesting any other studies on the same subject. Both strategies yielded additional studies that were not found in the original search; contacting authors resulted in two additional dissertations that did not appear in searches of ProQuest’s Dissertation and Theses Database. Literature searches resulted in just under 25,000 total “hits” (article titles that matched the search terms), ending May 10, 2014.
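The permutation step described above, in which each policy term is crossed with each of the five exact phrases, can be sketched in Python as follows. This is only an illustration and not the tooling used for the actual search. The policy terms shown are a small subset of those named in the text (the full list of 20 appears in the endnotes), and the query format, a policy term followed by a quoted phrase, is an assumption.

```python
from itertools import product

# The five exact phrases listed in the text.
PERFORMANCE_PHRASES = [
    "performance system",
    "performance management",
    "performance measurement",
    "performance standard",
    "performance information",
]

# ASSUMPTION: an illustrative subset only; the article uses 20 policy terms.
POLICY_TERMS = ["crime", "policing", "education", "job training"]


def build_queries(policy_terms, phrases):
    """Cross every policy term with every exact phrase.

    ASSUMPTION: the query format (term followed by a quoted phrase)
    is hypothetical; the article does not specify it.
    """
    return [f'{term} "{phrase}"' for term, phrase in product(policy_terms, phrases)]


queries = build_queries(POLICY_TERMS, PERFORMANCE_PHRASES)
print(len(queries))  # 4 policy terms x 5 phrases
for query in queries[:3]:
    print(query)
```

With the full list of 20 policy terms, the same cross product yields the 100 permutations reported in the text.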
In total, 49 acceptable studies are included in this analysis and are listed before the references. These studies contained 2,188 total effects, the unit of analysis.11 Figure 4 presents a flowchart of the literature search process. Studies cover a fairly wide swath of policy research but are dominated by studies in education: 19 studies in education, compared to 10 in policing, nine in job training (all from JTPA/WIA), six in public health, and five in other areas, including child support and general local government (e.g. English local governments). Twenty-eight of the
49 studies come from peer-reviewed journal outlets; the others come from a variety of sources, including government reports, doctoral dissertations, and working papers. Table 1 lists the authors of each of the studies used, the year each was published, and the number of effects coded within each study. Table 1 reveals that 949 effects, 43.4% of all effects, are coded from just four studies, one of which is the working paper of the journal article by Dee and Jacob (2011).12 These four studies are removed in a robustness check to examine the sensitivity of results; the results are robust to their removal.
Variable Coding

After identifying an acceptable study, features about the study and each effect are coded. Except for the effect size, all of the other characteristics listed here are dummy variables.13 Many have only two categories, represented by a single dummy variable (for example, peer reviewed or not). In other cases, there are more than two categories; three categories are represented by two dummy variables, each reflecting the difference between that category and the base. Discussion of variables has been grouped into five sections. The first section discusses the calculation of the dependent variable, the effect. Next is coding a second-generation performance management system. The third section discusses coding performance management best practices. The fourth section discusses coding indicators of the quality of the original study. Last is a discussion of some other characteristics of the performance management systems, such as dummy variables for the policy area. Descriptive statistics of these variables, by group, are presented in Table 2.

Calculating the Effect Size. Effects, both the dependent variable and unit of analysis, are calculated using a straightforward method, but one that requires some introduction for those versed in statistics but not meta-analysis. Because parameter estimates measure associations from different samples, it is necessary to convert parameter estimates from the original studies into a standardized measure. As a technique, meta-analysis is able to combine such disparate measures of performance into a single variable, an effect. The use of random effects meta-regression
(discussed in more detail below) explicitly assumes that these effects come from different but related measures of performance, adjusting (widening) confidence intervals to account for different study settings. The method used here relies on the distributions of Pearson’s r and Fisher’s Z. Effects using these distributions are most common in social sciences and are called r-based effects. Pearson’s r is first calculated from the original studies. Pearson’s r takes values from -1 to +1, where |r| = 1 represents a perfect linear relationship and 0 indicates no relationship. For example, the equation for Pearson’s r using the t statistic from ordinary least squares is: $r = \sqrt{t^2 / (t^2 + df)}$, where t is the t statistic and df is degrees of freedom. Since distributions of z, t, F, χ², and others are related, similar calculations can convert other associations into r using the test statistic and the degrees of freedom (Ringquist 2013). Conservative estimates of r are used when original studies only report p-values or significance stars. For instance, a reported p-value of