Performance Measures: Prevalent and Important but Methodologically Challenging

WILLIAM H. STARBUCK
New York University
Performance measures are important. They shape the future and indirectly determine the quality of human life. However, performance measures often assess something other than what researchers assume they do, and their meaning is made ambiguous by the fact that they subsume conflicting subgoals. Performance measures contain correlated errors that distort inferences, and the errors in performance measures often exceed the limitations imposed by prevalent statistical techniques. Thus, researchers should be cautious about inferring that they understand the determinants or consequences of performance.

Keywords: performance; measurement; methodology
WHY PERFORMANCE MATTERS

When I moved to New York City in 1985, it was widely perceived as a dangerous place. The newspapers told of wolf packs: gangs of young men who roamed the streets watching for well-dressed older men who seemed to be alone and thus vulnerable to attack. New statistics about murders, rapes, and robberies appeared regularly, documenting that New York City had one of the highest crime rates in the United States. In December 2003, the New York Times reported that 193 American cities with populations of more than 100,000 had higher crime rates than New York City and only five such cities had lower crime rates.

So what happened? Several things happened, including better management of the subway system, better economic conditions, and two somewhat more competent mayors. However, there is wide agreement that one of the most important things that happened was a program called Compstat, which was introduced by Police Commissioner William Bratton (Smith & Bratton, 2001). Two decades ago, performance reports for the commanders of police precincts were limited to annual reports of how many patrolmen and police cars had been on the streets, how many raids had occurred, how many arrests officers had made, and so on. All the statistics concerned activities of the police themselves, and the reports were compiled only once a year. Bratton obtained a grant that paid for 1,000 computers, which he placed in police stations and staffed with young officers who understood computers. The officers used the computers to compile daily summaries of crimes committed in their precincts: burglaries, murders, rapes, robberies, and so forth. Thus, precinct commanders began to receive frequent, up-to-date measures of the events that police activity is supposed to affect. As well, each precinct commander had to appear periodically before a review board to discuss the crime statistics and to present plans for reducing crimes. Commanders who failed to lower crime rates received public scoldings in front of their peers, and in some cases, ineffective commanders were dismissed. Crime rates began to decline, and streets and subways became much safer for both residents and tourists. The city extended Compstat-like practices to the care of parks and to street cleaning and rubbish collection. New York City became a model for police management in other cities.

Obviously, performance measures deserve careful attention because they can alter performance by motivating people. Effective measures of performance can
elicit dramatic improvements in human and organizational performance, ineffective performance measures can spawn wasteful activities, and erroneous performance measures can produce very undesirable results. So we must strive to create performance measures that are close to the phenomena we want to influence and likely to influence them in desired directions. However, there are many reasons for researchers and managers to regard performance measures with skepticism and to regard contemporary studies of performance measures with great skepticism. The next sections discuss some of the more prevalent deficiencies of performance measures used in academic research and the methods that academics use to study them.
PERFORMANCE MEASURES OFTEN ASSESS SOMETHING OTHER THAN WHAT RESEARCHERS ASSUME THEY DO

Some years ago, I edited a manuscript that reported that coal mining and garbage collection are among the occupations producing very high levels of job satisfaction. This observation caught my attention because these are extremely dangerous and unpleasant jobs. Why would coal miners and garbage collectors say they receive great satisfaction from dangerous and unpleasant jobs? To me, the answer lies in the nature of job-satisfaction data. Reports of the job satisfactions of coal miners do not come from people who have tried many different jobs, such as dentistry, teaching, or computer programming, in addition to coal mining and who thus have bases for comparing coal mining with other occupations. Job-satisfaction data are the reactions of people who have remained coal miners throughout their lives and few of whom have tried other jobs. In fact, they would not have remained coal miners if they could have switched to other occupations that they preferred, and they may be coal miners because they see no other occupations as being available to them. Subjective reactions to experience tend to have a profit-and-loss quality. Where the costs are high, people can continue their activities only if they can persuade themselves that the revenues are even higher than the costs. Outside observers might judge the benefits and costs very differently from the job incumbents. Coal miners have to believe that factors such as collegiality among miners, community solidarity, high pay, or job security compensate for the dangers and discomfort of coal mining.
But imagine the consequences of using job-satisfaction data to design satisfying jobs. Because the holders of dangerous and unpleasant jobs report high satisfaction, we might infer that we ought to make jobs more unpleasant and more dangerous to make them more satisfying. Indeed, effects like this do sometimes occur. A former chief of the New York City fire department told my students that some firemen prefer to work in firehouses that respond to many alarms whereas other firemen prefer quiet firehouses. I do not know for certain, but I deem it very likely that the risk-averse firemen express lower job satisfaction than those who seek challenges and risks.

The contradictions of job satisfaction illustrate a general principle about efforts to compare measures of performance across diverse people: There are no reliable ways to compare the subjective reactions (such as satisfactions and dissatisfactions) of different people (Elster & Roemer, 1993). If you and I each eat half of the same apple, how can we decide whether you enjoyed the apple more than I did? Payne and Pugh (1976) reviewed numerous studies of organizational properties and concluded that different members of a given organization disagree so strongly with each other about the properties of their organization that it makes no sense to talk about average perceptions. Similarly, Friedlander and Pickle (1968) found that communities, creditors, customers, employees, owners, and suppliers disagree considerably in their evaluations of organizational effectiveness. Noncomparability of subjective reactions poses serious challenges to the development of performance measures because many of the phenomena that we might like to regard as performances exist only as subjective reactions.
PERFORMANCE MEASURES OFTEN SUBSUME CONFLICTING SUBGOALS THAT ORGANIZATIONS FIND PROBLEMATIC

In this age of questionnaires and scales, there has been a tendency to convert subjective reactions into numbers that can be analyzed statistically. However, numbers based on rather arbitrary scales lack properties that one expects numbers to possess, and one result may be implicit assumptions about the relative importance of different people. Figure 1 shows the ratings of introductory management courses taught to MBA students by my department.

[Figure 1: Frequencies of students' ratings of management courses, on a rating scale from 1 to 7 (vertical axis: percentage of students, 0% to 35%).]

Because most students give fairly high ratings, course ratings typically have high averages, and differences between courses
are dominated by the small fractions of students who give low ratings. Each evaluation by a very unhappy student counts 2.4 times as much as each evaluation by a very happy student (presumably because most ratings cluster near the top of the scale, so a rating of 1 deviates far more from the mean than a rating of 7 does). As one result, to receive an above-average rating, a course must disappoint fewer than 10% of students and elate more than 48%, whereas below-average ratings go to courses that disappoint more than 10% and elate fewer than 40%. As well, in colleges where grades range from A to D, there is a correlation of around 0.6 between the ratings students give their courses and the grades students say they expect to receive, which implies that the students who expect lower grades exert a stronger influence on the differences in course ratings. Of course, some teachers react to these patterns by awarding uniformly high grades to receive high ratings; I have had colleagues who gave only As and A–s. Therefore, when schools emphasize course ratings, they are making assumptions about the relative importance of unhappy students versus happy students and about the usefulness of making distinctions among students.

One characteristic of course ratings is that the performance measure subsumes conflicting yet desirable organizational subgoals, with the result that it may be unclear whether performance can or should rise. Although awarding uniformly high grades to all students is likely to raise a course's rating, schools are likely to disapprove of this tactic in formal policies if not in actual practices. Likewise, although aiming a course at the students who have the least interest in it is likely to raise a course's rating, professors are likely to label this tactic pandering or watering down. Thus, the performance measure is implicitly asking teachers to resolve as individuals issues that the organization as a system finds problematic. Meyer and Rowan (1977) pointed out that schools tend to avoid looking closely at the actual actions teachers and administrators take in the face of conflicting subgoals; instead, they adhere to the public mythology that everyone is acting in good faith.

It is not only schools that pursue conflicting subgoals and that deal with such conflicts by devolving decision making to individuals. For business firms to maximize profits, they must both try to obtain as much revenue as possible and try to keep costs as low as possible. Marketing personnel try to increase revenues by urging their firms to produce customized products that are exactly what customers seek and to make these products available whenever customers ask for them. However, production personnel try to reduce costs by minimizing inventories and machine downtime, which implies producing different products in large quantities on efficient schedules. As a result, marketing personnel and production personnel often disagree about what actions to take or when to take them.

One consequence of conflicting subgoals is volatility of the overall performance measures. For instance, profit is the difference between two large numbers: revenues and costs. Total revenues and total costs are both more stable from period to period than their difference, profit. Stock prices, which are sometimes used as indicators of performance, are even more volatile than the profits of individual companies, partly because they reflect unstable balances between large numbers of optimistic investors and large numbers of pessimistic ones. When new information arrives (possibly a change in profit expectations, possibly changes in the world economy that have no specific relationship to single companies), some investors become more optimistic or more pessimistic and the balances are upset.
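To make the volatility argument concrete, here is a minimal numerical sketch in Python, with invented figures rather than data from any real firm. Revenues and costs are each large and fairly stable, yet their difference swings widely in relative terms.

```python
import numpy as np

rng = np.random.default_rng(0)
periods = 10_000

# Invented figures: revenues near 100 and costs near 95,
# each fluctuating by roughly 2% from period to period.
revenues = rng.normal(100.0, 2.0, periods)
costs = rng.normal(95.0, 1.9, periods)
profits = revenues - costs

for name, series in (("revenues", revenues), ("costs", costs), ("profit", profits)):
    cv = series.std() / series.mean()  # coefficient of variation
    print(f"{name:8s}  mean {series.mean():7.2f}  relative variability {cv:6.1%}")
```

With these numbers, revenues and costs each vary by about 2% from period to period, but profit, whose mean is only about 5, varies by more than 50%; the difference of two large, noisy numbers is proportionally far noisier than either number alone.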
THE ERRORS IN PERFORMANCE MEASURES OFTEN EXCEED THE LIMITATIONS IMPOSED BY PREVALENT STATISTICAL TECHNIQUES

One of the central issues in social science research is that statistically significant correlations are ridiculously easy to obtain. In effect, tests of statistical significance classify random noise as meaningful discoveries. Jane Webster and I assembled a database of more than 13,000 correlations reported in studies in Administrative Science Quarterly, the Academy of Management Journal, and the Journal of Applied Psychology (Webster & Starbuck, 1988). We examined all of the correlations among all variables observed, not merely correlations relating to hypotheses. The distributions of correlations were very similar in all three journals, and the mean correlation was close to +0.09. Of all correlations, 69% were positive, and 65% were statistically significant at the 5% level. Thus, finding statistical significance is extremely easy. A researcher who chooses correlations utterly at random, without regard for hypotheses, has 2-to-1 odds of finding a significant correlation on the first try and 24-to-1 odds of finding a significant correlation within three tries (if 65% of correlations are significant, the chance of at least one significant correlation in three independent draws is 1 − 0.35³ ≈ 0.96). Furthermore, the odds are better than 2-to-1 that an observed correlation will be positive, and positive correlations are more likely than negative ones to be statistically significant (also see Hubbard & Armstrong, 1992). Not only are correlations among randomly chosen variables likely to be statistically significant, they are likely to rival the correlations that researchers deem meaningful. Peach and Webb (1983) showed that random combinations of macroeconomic variables produce multiple correlation coefficients just as large as the ones that economists report as demonstrations of the effectiveness of their macroeconomic models. Thus, errors in variables cause statistical procedures to identify incorrect associations among variables, not merely incorrect coefficients for relations.

Statistical methods that rely on squared errors make these issues even more troubling because squaring errors places extreme emphasis on outlying observations. At least for studies of central tendencies, outliers represent low-probability events that are idiosyncratic to specific samples and unlikely to be replicated. The deficiencies of squared-error statistics became evident during the 1970s, when psychometricians discovered that predictions about the success of potential students or potential employees are more likely to be accurate when based on a priori assumptions than when based on models derived from regression analyses (Starbuck & Mezias, 1996). A central problem is that the samples used in the regression analyses include quite different outliers than do the samples for which predictions are made; there is a lack of correspondence between the least likely events. Of course, with large enough samples, these sample idiosyncrasies become less troublesome, but (a) models based on regression do not allow much more accurate predictions even with samples of thousands, and (b) one needs rather large samples before models based on regression allow predictions that are even as accurate as a priori assumptions. Specifically, a priori assumptions generally allow more accurate predictions than regression does if the multiple correlation is less than 0.5 and samples are smaller than 400. Even with multiple correlations above 0.5, a priori assumptions generally allow more accurate predictions than regression does if samples are smaller than 200.

Outliers that represent valid observations have value in that they suggest the need for multiple theories or additional contingency variables. It is hard to see the value, however, in outliers that represent errors of various sorts, and errors in independent variables tend to be more troublesome than errors in dependent variables (Rousseeuw & Leroy, 1987). Figures 2 and 3 demonstrate the difference. Figure 2 shows some data and a line fitted to these data. One value of the dependent variable is displaced from its correct value, possibly by a data-entry error. The regression line is not the line that would have been computed with correct data, and indeed, the slope has shifted from positive to negative, but the error has tilted the regression line without making it wildly inappropriate. Figure 3 shows the same original data, but in this instance, the identical error occurred in the independent variable. The regression line is obviously quite different from the one that would have been computed with error-free data.

[Figure 2: Error in a dependent variable]
[Figure 3: Error in an independent variable]

Thus, errors in performance measures cause more trouble when performance is an independent variable. Organization-theoretic studies have tended to use performance measures as independent variables, as have some studies of employment turnover. Studies of business strategies have generally used measures of financial performance as dependent variables, and studies of work and workers have used measures of job performance or job satisfaction as dependent variables.
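To see this asymmetry numerically, the sketch below (illustrative only; the data and the size of the error are invented) injects the same defective value into the dependent and then the independent variable of a small simulated sample. How much damage the error does depends on where it falls; placed at a central x value, it barely tilts the line when it lands in y, but it becomes a high-leverage point when it lands in x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.5, n)  # true slope is 0.5

def ols_slope(x, y):
    # ordinary least-squares slope
    return np.polyfit(x, y, 1)[0]

print(f"slope, clean data:  {ols_slope(x, y):5.2f}")

# The same large error (+40), first in the dependent variable...
y_bad = y.copy()
y_bad[n // 2] += 40.0
print(f"slope, error in y:  {ols_slope(x, y_bad):5.2f}")

# ...and then in the independent variable, where it becomes a
# high-leverage point that flattens the fitted line.
x_bad = x.copy()
x_bad[n // 2] += 40.0
print(f"slope, error in x:  {ols_slope(x_bad, y):5.2f}")
```

In runs of this sketch, the error in y mainly shifts the intercept and leaves the slope near 0.5, whereas the identical error in x drags the slope toward zero and destroys the apparent relationship.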
PERFORMANCE MEASURES CONTAIN ERRORS THAT DISTORT INFERENCES

I have been explaining the effects of errors because performance data and the variables correlated with them contain errors. Despite the assumptions conventionally made by statistical models, data errors do not simply reduce correlations toward zero. Large errors in two variables may create the appearance of a significant correlation, and spurious relationships are probable if even one variable has large errors. For squared-error statistical inferences to yield theoretically meaningful relations, the data need to be rather error-free. However, the data used in management research may contain significant errors.

One common source of statistical data is a large database such as Compustat. San Miguel (1977) found a 30% error rate in Compustat's reporting of R&D expenditures. These errors originated both in firms' reports and during Compustat's processing (e.g., data-entry errors). Similarly, Rosenberg and Houglet (1974) audited stock prices reported by Compustat and by the Center for Research in Security Prices at the University of Chicago. They concluded, "There are a few large errors in both data bases, and these few errors are sufficient to change sharply the apparent nature of the data" (Rosenberg & Houglet, 1974, p. 1303). Few errors are necessary because sources such as Compustat or the Center for Research in Security Prices allow very large sample sizes, which in turn enable researchers to find statistical significance even though correlations are close to zero. In these situations, observations have nearly spherical distributions, and the correlations are produced by very small numbers of outlying observations.

Error rates as large as 20% to 30% pose serious problems for squared-error statistics. One criterion that statisticians use to evaluate regression methods is their breakdown point. The breakdown point for ordinary least-squares regression is a single observation; just one defective observation can turn an ordinary regression calculation into garbage. However, there are alternative, robust statistical methods that can tolerate high error rates. The most robust methods can cope with errors in nearly 50% of the data. Thus, by adopting robust statistical methods, researchers may be able to obtain valid inferences despite the prevalent errors in large databases.

A second source of error is the people who provide data through interviews or questionnaires. Much management research relies on people's perceptions, and perceptual data underlie many of the numbers that convention labels objective. Payne and Pugh (1976) surveyed roughly 100 studies in which people had characterized their organizations' structures and cultures. Payne and Pugh found that people's perceptions of their organizations correlate very weakly with measurable characteristics of their organizations. Likewise, John Mezias and I (Mezias & Starbuck, 2003) made two attempts to assess the accuracy of managers' perceptions and got similar results in both studies. According to our data, only three out of eight managers have reasonably accurate perceptions, and a surprisingly large fraction of managers have grossly erroneous perceptions; some perception errors run into the thousands of percent. Managers' job specializations and experience do not correlate with the accuracy of their perceptions. That is, people having experience in specific domains do not perceive these domains more accurately than do people without such experience. Because more than half of managers have very erroneous perceptions, the error rates in managers' perceptions may be too high for research methods to overcome. As far as I have been able to determine, no statistical technique can produce accurate analyses when more than half of the data are unreliable.
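As a sketch of what a robust method buys, the code below contaminates 20% of a simulated sample and compares ordinary least squares with the Theil-Sen estimator (the median of pairwise slopes, as implemented in scipy.stats.theilslopes). The data are invented, and the choice of Theil-Sen is mine rather than a method named by the sources cited here; its breakdown point is roughly 29%, whereas the most robust estimators, such as least median of squares, approach 50%.

```python
import numpy as np
from scipy.stats import theilslopes

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.5, n)  # true slope is 0.5

# Contaminate 20% of the independent variable with a gross,
# systematic error, e.g., a unit-conversion mistake adding 30.
bad = rng.choice(n, size=n // 5, replace=False)
x_dirty = x.copy()
x_dirty[bad] += 30.0

ols = np.polyfit(x_dirty, y, 1)[0]
ts_slope, ts_intercept, _, _ = theilslopes(y, x_dirty)

print("true slope:       0.50")
print(f"OLS slope:       {ols:5.2f}")       # dragged toward zero by leverage points
print(f"Theil-Sen slope: {ts_slope:5.2f}")  # stays near the true value
```

Because Theil-Sen takes the median of the pairwise slopes, a minority of defective observations cannot move the estimate far; least squares, by contrast, lets a few high-leverage points dominate the fit.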
PERFORMANCE MEASURES CONTAIN CORRELATED ERRORS

Another knotty problem with data provided by people is that such data have correlated errors. Statistical models assume that the errors in each observation are uncorrelated with those in other observations and that the errors in each variable are uncorrelated with those in other variables. However, these assumptions are difficult to satisfy when obtaining data from people. The main reason is that human brains create logical order. A person who overestimates one variable is likely to overestimate or underestimate other variables that the person believes to relate to the first one. Indeed, human brains imagine events that their logic says should have occurred even though the events did not actually occur, and human brains so love relationships that they see patterns in sequences of random numbers.

A study that obtains data about organizational characteristics from the same people who supply data about the organization's environment is almost certain to discover relationships between organizational and environmental properties that have no basis beyond a mythology constructed by common sense. For instance, a person who perceives an organization as stable and orderly is likely to perceive the organization's environments as stable and orderly, whereas a person who perceives an organization as changing and disorderly is likely to perceive the organization's environments as changing and disorderly. Such mythological relationships are likely to be strongest when researchers obtain data from respondents at one time and through a single method. By including items in a single questionnaire or interview, researchers suggest to respondents that they ought to see relationships among these items. As well, many studies obtain data from multiple respondents in a single organization, which allows shared mythologies to influence the respondents' perceptions. One result is that errors in one respondent's perceptions are likely to correlate with the errors in other respondents' perceptions. Such distortions can both create correlations between dependent and independent variables, which distort the apparent determinants of performance, and create collinearity among independent variables, which also distorts the apparent determinants of performance.

The issues I am raising about errors in observations apply to all sorts of variables, not only performance measures. Do performance measures contain more errors or more troublesome errors than other variables? Performance measures often have direct consequences for people (bonuses, promotions, praise, dismissal, embarrassment) that foster wishful thinking and searches for patterns or relationships. On the other hand, the consequences of performance measures give people incentives to make them accurate and to challenge values that appear to be erroneous. Perhaps someone will investigate this issue.
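A small simulation shows how a single-source, single-method design can manufacture a relationship out of nothing. In this sketch (all variable names and weights are invented), the true organizational and environmental properties are independent by construction, but each respondent's general outlook seeps into both reports.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # respondents, one per organization

# True properties: independent, so their correlation is near zero.
org_true = rng.normal(size=n)
env_true = rng.normal(size=n)

# Each respondent's outlook (optimism, a sense of orderliness)
# colors both answers on the same questionnaire.
outlook = rng.normal(size=n)
org_reported = org_true + outlook + rng.normal(0.0, 0.5, n)
env_reported = env_true + outlook + rng.normal(0.0, 0.5, n)

print("correlation of true properties:     %+.2f" % np.corrcoef(org_true, env_true)[0, 1])
print("correlation of reported properties: %+.2f" % np.corrcoef(org_reported, env_reported)[0, 1])
```

With these invented weights, the reported correlation comes out near +0.45 although the true correlation is essentially zero; an analysis that treated such reports as error-free would discover a relationship that exists mainly in respondents' heads.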
CONCLUSION

Performance measures are a dominant and pervasive theme in contemporary, industrialized societies. Business firms issue performance reports at least
quarterly, and newspapers and television report the financial performances of many firms nearly daily. The compensation of executives depends on the numbers in corporate performance reports, and these numbers appear in the media together with executives' statements calling attention to good performance or rationalizing poor performance. The nonbusiness sections of newspapers report numerous statistics about athletic achievements, the qualities of wines, the fuel efficiency of cars, and attendance at plays, concerts, or movies. The prevalence of such reports reflects their social importance as bases for rewards, punishments, recognition, and disapprobation, and as possible guides to successful behaviors. It is no wonder that many academic studies emphasize performance.

I do wonder, however, whether the desire to understand the determinants and consequences of performance measures has led researchers to jump too hastily and without sufficient reflection to the conclusion that they do understand them. In some cases, performance measures may not mean what researchers would like them to mean. In other cases, researchers are analyzing performance data with statistical techniques that are very prone to error. The very centrality of performance measures implies that researchers should proceed with caution.

What makes performance measures important is not that they correlate with other variables but that they can alter performance. Because effective performance measures can dramatically improve human and organizational performance and ineffective performance measures can waste effort and degrade performance, it would be useful to know more about the properties of performance measures that make them effective. This probably implies longitudinal research that attends to causal relations rather than cross-sectional research that seeks noncausal generalizations.
REFERENCES

Elster, J., & Roemer, J. E. (1993). Interpersonal comparisons of well-being. Cambridge: Cambridge University Press.
Friedlander, F., & Pickle, H. (1968). Components of effectiveness in small organizations. Administrative Science Quarterly, 13, 289-304.
Hubbard, R., & Armstrong, J. S. (1992). Are null results becoming an endangered species in marketing? Marketing Letters, 3(2), 127-136.
Meyer, J. W., & Rowan, B. (1977). Institutionalized organizations: Formal structure as myth and ceremony. American Journal of Sociology, 83, 340-363.
Mezias, J. M., & Starbuck, W. H. (2003). Studying the accuracy of managers' perceptions: A research odyssey. British Journal of Management, 14, 3-17.
Payne, R. L., & Pugh, D. S. (1976). Organizational structure and climate. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 1125-1173). Chicago: Rand McNally.
Peach, J. T., & Webb, J. L. (1983). Randomly specified macroeconomic models: Some implications for model selection. Journal of Economic Issues, 17, 697-720.
Rosenberg, B., & Houglet, M. (1974). Error rates in CRSP and Compustat data bases and their implications. Journal of Finance, 29, 1303-1310.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley.
San Miguel, J. G. (1977). The reliability of R&D data in Compustat and 10-K reports. Accounting Review, 52, 638-641.
Smith, D. C., & Bratton, W. J. (2001). Performance management in New York City: Compstat and the revolution in police management. In D. W. Forsythe (Ed.), Quicker, better, cheaper? Managing performance in American government (pp. 453-482). Albany, NY: Rockefeller Institute.
Starbuck, W. H., & Mezias, J. (1996). Opening Pandora's box: Studying the accuracy of managers' perceptions. Journal of Organizational Behavior, 17(2), 99-117.
Webster, E. J., & Starbuck, W. H. (1988). Theory building in industrial and organizational psychology. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (pp. 93-138). London: Wiley.

William H. Starbuck is the ITT professor of creative management in the Stern School of Business at New York University. He has held faculty positions at Purdue, Johns Hopkins, Cornell, and Wisconsin-Milwaukee as well as visiting positions in England, France, New Zealand, Norway, Oregon, and Sweden. He was also a senior research fellow at the International Institute of Management, Berlin. He has been the editor of Administrative Science Quarterly; he chaired the screening committee for senior Fulbright awards in business management; and he was the president of the Academy of Management. He has published more than 130 articles on accounting, bargaining, business strategy, computer programming, computer simulation, forecasting, decision making, human-computer interaction, learning, organizational design, organizational growth and development, perception, scientific methods, and social revolutions.