What Are the Odds?
Another look at DiMaggio’s streak Don M. Chance
Joe DiMaggio lines a single to left field in the seventh inning of the second game of a doubleheader at Washington on June 29, 1941, to set a record for hitting safely in 42 consecutive games. In the first game, DiMaggio tied George Sisler’s record of 41 games, set in 1922. The catcher is Jake Early of the Washington Senators. Yankees won both games, 9–4, 7–5. AP Photo
One of the most amazing athletic feats is the celebrated 56-game hitting streak of Joe DiMaggio in the 1941 baseball season. Popular opinion is that the streak was an extremely unlikely event, and discussion of the streak has been widespread. Calculating the probability that DiMaggio would hit in 56 consecutive games is a common exercise for a probability class, but there are many other related questions that arise. The probability of a specific person winning a lottery is clearly
not very high, but the chance that someone will win the lottery could be quite high. When we allow the possibility that some person, that is, any person, will win the lottery, the probability is much higher. If we conduct the lottery many times, the likelihood of there being a winner increases even further. The analogy to a hitting streak in baseball is that the chance of someone having such a long hitting streak is considerably higher than the chance that this person is Joe DiMaggio or any particular player of interest. Given enough players and enough opportunities, why are streaks of this length so rare? How unusual is it, given the large number of players and of games in the history of the sport? According to www.baseball-reference.com, by the end of the 2007 season, 382,852 games had been played. Considering that almost every game consists of 20 to 25 participants, each with an opportunity to start or continue a hitting streak, there were quite a few opportunities for 56-game hitting streaks. As long as someone gets a hit in at least 56 games in a row, a DiMaggio-like streak will occur. When considering the probability of such a streak, we should not care who did it. That the streak was achieved by one of the most famous and popular baseball players in history is as irrelevant as who won the lottery, as long as someone did.
Definition of a Streak and Other Long Streaks
There are some subtleties in how a consecutive-game hitting streak is defined. On the Major League Baseball web site, the official rules state: Rule 10.23 (b) Consecutive-game Hitting Streak: A consecutive-game hitting streak shall not be stopped if all of a batter’s plate appearances (one or more) in a game result in a base on balls, hit batsman, defensive interference, or obstruction, or a sacrifice bunt. The streak will end if the player has a sacrifice fly and no hit. A player’s individual consecutive-game hitting streak shall be determined by the consecutive games in which such player appears and is not determined by his club’s games. Thus, a streak is established only if the player has at least one plate appearance that does not involve one of the abovementioned outcomes. At the end of the 2007 season, there were 43 streaks of at least 30 games, achieved by 41 players, with two players, Ty Cobb and George Sisler, doing it twice. There have been 20 streaks in the American League and 23 in the National League. Two streaks have spanned seasons, the most recent being a 38-game streak in 2005-2006 by Jimmy Rollins of the Philadelphia Phillies. Interestingly, not all streak hitters were outstanding hitters. Streak hitting may well reflect the ability to make effective contact with pitches outside the strike zone, a tendency that can lower one’s average over the long run. For example, in 1987, Benito Santiago hit in 34 consecutive games, but batted only 0.300 that year and only 0.263 in his career. Ken Landreaux batted only 0.281 in the year of his 31-game streak and only 0.268 in his career. The list contains another DiMaggio, Joe’s brother Dom, who compiled a 34-game hitting streak for the Boston Red Sox in 1949 and had a lifetime batting average in 11 seasons of 0.298.
Ty Cobb, outfielder for the Detroit Tigers, is shown in action during practice in March of 1921. AP Photo
CHANCE, VOL. 22, NO. 2, 2009
A (Very) Simplified Estimate of the Probability of a Streak
Consider a player who has four official at-bats per game and gets a hit in one of every three at-bats. For now, I will treat batting average, the ratio of hits to official at-bats, as indicative of the player’s probability of getting a hit. Intuition suggests it would be difficult to hit in a large number of consecutive games if the player makes an out twice as often as he gets a hit. Define $b$ as batting average and $A$ as the number of official at-bats per game. An initial estimate of the probability of getting a hit in a game, $p^*$, is one minus the probability of not getting a hit in a game, or $p^* = 1 - (1 - b)^A$, assuming $b$ is constant and outcomes of at-bats are independent of one another. With a batting average of 0.333, we get $p^* = 1 - (1 - 0.333)^4 = 0.802$. Thus, a 0.333 hitter with four official at-bats per game has a more than 80% chance of getting a hit in a game. For even lower averages, these probabilities may seem surprisingly large. A mediocre 0.250 hitter with four official at-bats has a more than two-thirds chance of getting a hit in a game ($1 - (1 - 0.250)^4 = 1 - 81/256 = 0.684$). In fact, even if a player has a pitcher-like batting average of 0.160, he is more likely than not to get a hit in a game if he has four official at-bats ($1 - (1 - 0.160)^4 = 0.502$). The probability of getting a hit in $s$ consecutive games, $p(s)^*$, is found by multiplying this value ($p^*$) by itself $s$ times, that is, raising it to the power $s$: $p(s)^* = (p^*)^s$. Thus, for a 0.333 hitter with only three official at-bats per game, for whom $p^* = 1 - (1 - 0.333)^3 = 0.703$, the probability of getting a hit in 56 consecutive games would appear to be $p(56)^* = (0.703)^{56} = 0.0000000027$, or about 1-in-370 million. With four official at-bats per game, the same hitter has $p(56)^* = (0.802)^{56} \approx 0.0000043$, or about 1-in-230,000. In DiMaggio’s case, his 0.409 batting average and 3.98 at-bats per game during the streak lead to a probability of 0.00063, or about 1-in-1,585. The asterisk in the notation above is used because there are some problems with these measures. For one, batting average is not the probability of a hit. Although baseball rules allow that a streak is not stopped if a player does not receive an official at-bat, a streak is stopped if a player has received an official at-bat during a game, has so far failed to get a hit, and then walks in his last plate appearance. In fact, walks are a strong determinant of the likelihood that a player will have a streak. The old expression, “A walk is as good as a hit,” is not true for a player trying to extend a streak. So what we need is not batting average, but the probability that a player will get a hit when he has an opportunity to get a hit. Denote the number of hits as $h$ and the number of hitting opportunities as $H$. Then, the probability that he will get a hit when he has an opportunity is estimated as $h/H$. Now, we need an estimate of $H$. Baseball statisticians use a measure called plate appearances, which are official at-bats plus unofficial at-bats, the latter including walks, sacrifice hits, sacrifice flies, and hits-by-pitch. This measure comes closer to hitting opportunities than at-bats, but it is not exact. Suppose in a game, a player has three official at-bats with one hit, plus one walk, and one sacrifice bunt.
It can be argued that his probability of a hit is 1-in-4, because on the sacrifice bunt, he was not attempting to get a hit. He was, however, attempting to get a hit when he walked. If instead of the sacrifice bunt, he had a sacrifice fly, however, we should count the sacrifice fly as an attempt to get a hit because the player was swinging the bat. If the walk was intentional, we should not count it because the player did not have a chance to get a hit. There are some hitting opportunities in which a player receives a few ordinary pitches and then is intentionally walked, but these are not common. There also are plate appearances in which a player walks but does not see any hittable pitches, thereby resulting in a nonintentional walk, but a lost opportunity to extend the streak. These types of walks are not tallied in official baseball statistics, however, and are unlikely to affect the overall figures by much. If the player is hit by a pitch, it is not definitive as to whether he had a chance to get a hit, but we will assume that such an outcome should not be counted against the player. Fortunately, hit batsmen are relatively few compared to total plate appearances. For example, Joe DiMaggio had 7,671 plate appearances in his career and was hit only 46 times. Recall that the initial objective is to estimate the probability in a game that when the player was attempting to get a hit, he succeeded. Thus, we should calculate plate appearances minus hits-by-pitch, sacrifice bunts, and intentional bases on balls. Unfortunately, hit batsmen were not recorded until 1887, records on sacrifice bunts were not kept until 1895, and intentional bases on balls were not tallied until 1955, so the official baseball statistics do not reflect these factors for
some players over all or a portion of their careers. (For hit batsmen, this will have no effect because it is not counted as a plate appearance.) This overall figure for adjusted plate appearances is $H$, hitting opportunities. Thus, the probability of a hit in a game is 1 minus the probability of no hit in a single opportunity, $1 - h/H$, raised to the power of the number of hitting opportunities per game. We define the latter as $H/g$, where $g$ is the number of games. Thus, $p = 1 - (1 - h/H)^{H/g}$, assuming again that attempts have a constant probability and are independent. Note also that $H/g$ might not be an integer. For DiMaggio, during the streak, there were 246 hitting opportunities with 91 hits, for a probability of a hit of 0.370 per hitting opportunity. The number of hitting opportunities per game is, therefore, $246/56 = 4.39$. An estimate of the probability of DiMaggio getting a hit in a game during the streak is therefore $p = 1 - (1 - 0.370)^{4.39} = 0.868$. His probability of getting a hit in every game for a single run of 56 straight games is $p(56) = (0.868)^{56} = 0.00036$, or about 1-in-2,772. Of course, we have to acknowledge that these measures assume independence from one hitting opportunity to another and do not reflect how pressure can affect the player or even his teammates or opponents. But that is a problem with any analysis of a seemingly rare human event.
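The two per-game estimates in this section can be checked numerically. The following is a minimal Python sketch, not the author's own code; the function names `hit_prob_per_game` and `single_run_prob` are illustrative, and the inputs are the figures quoted in the text.

```python
def hit_prob_per_game(success_rate, chances_per_game):
    """Probability of at least one hit in a game: 1 - (1 - rate)^chances.

    Assumes a constant success rate and independent chances, as in the text.
    The exponent may be fractional (for example 246/56 opportunities per game).
    """
    return 1.0 - (1.0 - success_rate) ** chances_per_game


def single_run_prob(p_game, streak_length):
    """Probability of a hit in every game of one fixed run of games."""
    return p_game ** streak_length


# Naive estimate p*: batting average b = 0.333 with A = 4 official at-bats.
p_star = hit_prob_per_game(0.333, 4)

# Adjusted estimate p for DiMaggio during the streak:
# h = 91 hits in H = 246 hitting opportunities over g = 56 games.
h, H, g = 91, 246, 56
p_dimaggio = hit_prob_per_game(h / H, H / g)
run_56 = single_run_prob(p_dimaggio, 56)

print(f"p* for a 0.333 hitter with 4 at-bats: {p_star:.3f}")
print(f"p for DiMaggio during the streak:     {p_dimaggio:.3f}")
print(f"single 56-game run: {run_56:.5f} (about 1-in-{1 / run_56:,.0f})")
```

Because the sketch does not round intermediate values, its final figure can differ slightly from the 1-in-2,772 quoted in the text, which is based on the rounded value 0.868.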
Probability of a Streak in a Career
Another problem with this approach is that it gives the player only a single run of 56 games in which to obtain a hit in each game. Over the course of a season in which DiMaggio played 139 games, there are $139 - 56 + 1 = 84$ possible (overlapping) 56-game periods. Considering that a streak can span seasons, we also should allow the possibility that the streak might have started in 1940 or ended in 1942. Carrying that argument further, however, we should consider the possibility of his having such a streak during any 56-game period in his career of 1,736 games. But, if we consider his entire career as a possibility, we can hardly use his performance during the 56-game period or in the 1941 season. Instead, his career performance would be more appropriate. Adapting a formula from the classic An Introduction to Probability Theory and Its Applications, by William Feller, the probability of a streak of length $s$ during $n$ games with a probability of a hit $p$ per game is
$$p(n,s) = 1 - \frac{1 - px}{(s + 1 - sx)(1 - p)} \cdot \frac{1}{x^{n+1}},$$
where $x$ is approximately $1 + (1 - p)p^s + (s + 1)\left((1 - p)p^s\right)^2$. The measure $x$ is technically found by iterative solution as $x_{k+1} = 1 + (1 - p)p^s x_k^{s+1}$ with $x_0 = 1$. The above specification is a quadratic approximation found at MathForum.org. In virtually all cases in this study, $x$ is essentially 1.0 to several decimal places. Of course, if $n$ were small, then exact calculations could be made. In this application, the approximation is very useful. In DiMaggio’s career, using hitting opportunities to determine the probability of hitting in a game, $p$ is 0.778, and using the above formula, the probability of a streak of 56 games in his career is 0.000295, or about 1-in-3,394. G. Warrack, in a 1995 CHANCE article titled “The Great Streak,” undertook a similar analysis and reported a value of $p = 0.777$ and a streak probability of 0.000274, or about 1-in-3,650. M.
Freiman, in a Baseball Research Journal article, estimates the probability for DiMaggio over his lifetime at 0.00121, or 1-in-826. Several other researchers estimate probabilities for a single season and for various other players over single seasons, their best seasons, and their lifetimes (see Further Reading).

Table 1. Probabilities of a 56-Game Hitting Streak by the Top 50 Hitters of All Time

Player               Batting   Hitting         Probability of a     Probability     Likelihood   Streak
                     Average   Opportunities   Hit in a Game (p)    of the Streak   (1-in)       Rank
Ty Cobb              0.366     12,777          0.812                0.004890        204          1
Rogers Hornsby       0.358     9,259           0.790                0.000842        1,187        16
Joe Jackson          0.356     5,559           0.798                0.000869        1,151        15
Lefty O'Doul         0.349     3,620           0.756                0.000036        27,916       66
Ed Delahanty         0.346     8,340           0.816                0.003806        263          2
Tris Speaker         0.345     11,679          0.777                0.000434        2,302        21
Ted Williams         0.344     9,700           0.742                0.000031        32,440       70
Billy Hamilton       0.344     7,544           0.798                0.000988        1,013        14
Dan Brouthers        0.342     7,656           0.804                0.001610        621          11
Babe Ruth            0.342     10,503          0.738                0.000027        37,015       72
Dave Orr             0.342     3,411           0.822                0.002246        445          7
Harry Heilmann       0.342     8,683           0.772                0.000244        4,099        34
Pete Browning        0.341     5,315           0.811                0.001700        588          10
Willie Keeler        0.341     9,244           0.810                0.002967        337          3
Bill Terry           0.341     6,974           0.783                0.000419        2,385        23
George Sisler        0.340     8,787           0.808                0.002477        404          5
Lou Gehrig           0.340     9,554           0.772                0.000251        3,989        32
Jake Stenzel         0.339     3,381           0.797                0.000428        2,338        22
Jesse Burkett        0.338     9,525           0.806                0.002202        454          8
Tony Gwynn           0.338     9,984           0.787                0.000752        1,329        17
Nap Lajoie           0.338     10,239          0.792                0.001101        909          13
Riggs Stephenson     0.336     5,043           0.747                0.000026        38,332       74
Al Simmons           0.334     9,404           0.795                0.001142        876          12
John McGraw          0.334     4,894           0.750                0.000026        38,071       73
Ichiro Suzuki*       0.333     5,046           0.819                0.002738        365          4
Paul Waner           0.333     10,588          0.770                0.000246        4,069        33
Eddie Collins        0.333     11,525          0.749                0.000066        15,116       59
Mike Donlin          0.333     4,186           0.768                0.000085        11,708       53
Cap Anson            0.333     11,292          0.801                0.001950        513          9
Todd Helton*         0.332     6,590           0.754                0.000050        20,158       63
Albert Pujols*       0.332     4,620           0.767                0.000084        11,861       54
Stan Musial          0.331     12,550          0.757                0.000125        7,978        48
Sam Thompson         0.331     6,497           0.813                0.002360        424          6
Bill Lange           0.330     3,570           0.786                0.000227        4,407        36
Heinie Manush        0.330     8,230           0.777                0.000322        3,110        26
Wade Boggs           0.328     10,531          0.766                0.000187        5,357        40
Rod Carew            0.328     10,278          0.769                0.000235        4,249        35
Honus Wagner         0.327     11,518          0.766                0.000205        4,872        39
Tip O'Neill          0.326     4,720           0.789                0.000369        2,710        24
Bob Fothergill       0.325     3,491           0.683                0.000000        5,807,999    100
Jimmie Foxx          0.325     9,599           0.737                0.000023        44,000       78
Earle Combs          0.325     6,433           0.780                0.000282        3,545        29
Joe DiMaggio         0.325     7,657           0.778                0.000295        3,394        28
Vladimir Guerrero*   0.325     6,596           0.767                0.000131        7,652        47
Babe Herman          0.324     6,134           0.751                0.000040        25,087       64
Hugh Duffy           0.324     7,733           0.789                0.000621        1,611        19
Joe Medwick          0.324     8,098           0.774                0.000252        3,974        31
Edd Roush            0.323     7,900           0.762                0.000114        8,748        50
Sam Rice             0.322     10,033          0.771                0.000259        3,866        30
Ross Youngs          0.322     5,214           0.765                0.000086        11,631       52

The top 100 players are analyzed, but only the top 50 are shown here. The highest values among the top 100 hitters are in bold in the original. A more detailed table including the top 100 players is available at www.amstat.org/publications/chance/supplemental.cfm. *Indicates the player was active as of the end of the 2007 season.

[Figure 1. Probabilities of streaks based on 30 to 70 games by the top 100 hitters of all time. Vertical axis: Probability of Streaks (0.00 to 1.00); horizontal axis: Game Streak (30 to 70).]
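The career calculation described in the section "Probability of a Streak in a Career" can be sketched numerically. The Python below is an illustrative sketch, not the author's code: `feller_streak_prob` implements the Feller approximation, with the root x found by fixed-point iteration on x = 1 + (1 - p) p^s x^(s+1), and `exact_streak_prob` is a standard recurrence added here only as a cross-check. Both function names are hypothetical, and the inputs are DiMaggio's career figures quoted in the text (p = 0.778, n = 1,736, s = 56).

```python
def feller_streak_prob(p, n, s, iterations=50):
    """Approximate probability of at least one run of s successes in n trials.

    Uses Feller's approximation for the probability of NO run,
    (1 - p*x) / ((s + 1 - s*x) * q) / x**(n + 1), where x solves
    x = 1 + q * p**s * x**(s + 1) and q = 1 - p.
    """
    q = 1.0 - p
    x = 1.0
    for _ in range(iterations):  # fixed-point iteration starting from x0 = 1
        x = 1.0 + q * p**s * x ** (s + 1)
    no_run = (1.0 - p * x) / ((s + 1 - s * x) * q) / x ** (n + 1)
    return 1.0 - no_run


def exact_streak_prob(p, n, s):
    """Exact probability of at least one run of s successes in n trials.

    u[m] is the probability of no run in m trials. A first run ending at
    trial m needs a failure followed by s successes, preceded by
    m - s - 1 run-free trials.
    """
    if n < s:
        return 0.0
    q = 1.0 - p
    u = [1.0] * s + [1.0 - p**s]  # u[0..s-1] = 1, u[s] = 1 - p^s
    for m in range(s + 1, n + 1):
        u.append(u[m - 1] - q * p**s * u[m - s - 1])
    return 1.0 - u[n]


approx = feller_streak_prob(0.778, 1736, 56)
exact = exact_streak_prob(0.778, 1736, 56)
print(f"Feller approximation: {approx:.6f}")
print(f"Exact recurrence:     {exact:.6f}")
```

Both routes should land close to the 0.000295 reported for DiMaggio; small differences are expected because the text's p = 0.778 is itself a rounded value.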
Probability of a Streak Among the Top 100 Hitters
It is not feasible to analyze data for every player in the history of the game. The most likely candidates for streaks would seem to be the best hitters. I focus initially on the top 100 hitters of all time through the 2007 season, based on at least 3,000 plate appearances as listed at www.baseball-reference.com/leaders/BA_career.shtml. This group consists of nine current and 91 retired players. The batting averages range from Cobb’s 0.366 to Bug Holliday’s 0.311. The results are presented in Table 1. Only the top 50 players by batting average are shown, but the complete table is available at www.amstat.org/publications/chance/supplemental.cfm. Note a few interesting results. The player most likely to get a hit in a game is Dave Orr, who played in the 1880s and is 11th in all-time batting. Of modern players, the most likely to get a hit in a game is Ichiro Suzuki, although his lifetime batting average of 0.333 puts him only 25th in all-time batting. Ed Delahanty is third in likelihood of getting a hit in a game, Sam Thompson is fourth, and Cobb, the all-time best hitter, is fifth. But Cobb is the player most likely to have had a 56-game hitting streak at least once in his career at a probability of 0.00489, or 1-in-204. The second-best hitter of all time, Rogers Hornsby, ranks only 16th in likelihood of having the streak. Thompson, only the 33rd-best hitter, ranks sixth in streak likelihood. Ted Williams, the seventh-best hitter, ranks only 70th in streak likelihood, and Babe Ruth, the 10th-best hitter, ranks only 72nd. And what about Joe DiMaggio, the 43rd-best hitter? He ranks 28th in streak likelihood at a probability of 0.000295, or 1-in-3,394. There is something to be said for DiMaggio ranking much higher in streak likelihood than in hitting. Clearly, there is more to achieving a streak than merely batting average.
In fact, games played is a major determinant, as it is a significant driver of hitting opportunities. Consider Jake Stenzel and the aforementioned Orr. Stenzel ranks 22nd in streak probability, but played only 766 games. Orr ranks seventh with only 791 games. If we extrapolate to a career three times as long, which puts their career longevity much closer to the other top hitters, Stenzel moves up to 12th and Orr surpasses Cobb as the most likely player to have achieved the streak. Of course, we do not know whether they would have maintained the same level of performance, but we can clearly see playing a lot of games is a major factor. It is interesting to consider why some of the best hitters rank so low and some of the lowest-ranked hitters, although still outstanding hitters, rank so high in streak probability. One explanation is the number of walks. Williams averaged one walk every 4.84 plate appearances, and Babe Ruth averaged one every 5.15 plate appearances. Thompson, in contrast, averaged one walk every 14.5 plate appearances. Orr walked only once every 34.8 plate appearances, which was partly a result of a rule change I will discuss later. Nonetheless, even after accounting for this effect, Orr walked with about the same infrequency. He ranked 11th in batting, but seventh in streak likelihood. Players who received few walks are likely to be either players who preceded powerful hitters in the batting order or players capable of getting hits on bad pitches. Now let us estimate the overall probability that there will be at least one 56-game hitting streak over the careers of these 100 players. Let $p_i(n_i,s)$ be the probability for player $i$, where $i = 1, 2, \ldots, 100$; $n_i$ be the number of games in the player’s career; and $s$ be the streak of interest, which is 56 in this case, but we will change that figure later. To estimate the overall probability, we cannot simply add the probabilities for the individual players because the players' streaks are not mutually exclusive events. More than one player can have the streak. Thus, we must find the probability that no player has the streak and subtract that figure from 1. The probability that player $i$ does not have a streak is $1 - p_i(n_i,s)$. The overall probability of at least one streak is 1 minus the probability of no streaks among all the 100 players. The probability of no streaks for all 100 players is $(1 - p_1(n_1,s)) \times (1 - p_2(n_2,s)) \times \cdots \times (1 - p_{100}(n_{100},s))$, or in other words, the product of the probabilities of no streak for each player. Then, we subtract this number from 1 to get the probability of at least one streak. The answer is 0.0442, or about 1-in-23. This means that if the entire history of baseball could be played 23 times, we would expect roughly one of those histories to contain a 56-game hitting streak by one of these 100 players. It is simple to examine the likelihood of other hitting streaks. Figure 1 shows the probability of each hitting streak from 30 to 70 games. Note that the probability of at least one streak up to about 40 games is quite high and then drops off rapidly. In the history of baseball, there have been 41 streaks of 30 to 39 games. The probability of at least one 40-game streak is about 0.821 (1-in-1.2), while the probability of a 45-game streak is only about half that, at 0.419 (1-in-2.4). There have been only five streaks of 40 to 45 games. The probability of a 50-game hitting streak is 0.16 (1-in-6.3), and the probability of a 60-game hitting streak is 0.018 (1-in-55.6). It is also interesting to note that the probability of at least one 79-game hitting streak among these 100 players is approximately the same as the probability that DiMaggio would have at least one 56-game hitting streak in his career. It is also easy to bump up the lifetime batting averages of these players to see how much a given collective increase in hitting ability would increase the chance of a streak. Adding 10 points (0.010) to each player puts the overall probability at 0.089 (1-in-11.2), and adding 20 points puts the overall probability at 0.168, or about 1-in-6. These estimations are somewhat unrealistic in that they assume that all players have significantly higher lifetime averages, but they do give an idea of how much better these players would have had to hit to raise the probability to a given level. Of course, we also know that if a player is hitting much better than his lifetime average, the probability of a streak is much higher, but it is not possible to say definitively how much higher the probability is because we cannot hypothesize how long the player’s average would remain high.
Rule Changes and Their Effect on Streaks
Rule changes over the years have altered the interpretation of a streak. For example, the number of balls required for a “base-on-balls” has changed. In addition, at one time, foul balls with less than two strikes did not count as strikes. In fact, the second-longest streak, 45 games by Willie Keeler, was achieved during that era. Another player from that era with a high probability of a long streak is Cal McVey. McVey was a versatile player, who played catcher but also all infield positions as well as outfielder and pitcher. During his brief career in the 19th century, he was indeed a good hitter. He played four years for the Boston Red Stockings of the National Association, a league that lasted only those four years and later became the modern-day National League. McVey then played two years for the Chicago Cubs and two years for the Cincinnati Reds. Seasons were shorter at that time, and McVey played only 530 games in his nine-year career, batting 0.346. But that average does not officially place him in the top 100 hitters because he did not have at least 3,000 plate appearances. In fact, he had the second-fewest games played of all of the players we examined. McVey achieved his streak of 30 games during the period of June 1 to August 8, 1876. To put that summer in perspective, it was during McVey’s streak that General George Custer and his 7th Cavalry were defeated at the Battle of Little Big Horn, the first transcontinental train ride was completed, the United States celebrated its centennial, and Colorado became a state. During his career, McVey walked only 30 times in 2,543 plate appearances. Walks, however, were not common then because seven balls were required for a walk. In 1887, the number of balls for a walk was reduced to five, and in 1888, it was reduced to the current rule of four. Thus, it is not clear that McVey’s record should count. We can alter his bases-on-balls to a reasonable number and determine whether he was really a threat to establish a long streak.
Cal McVey, Catcher (Boston Red Stockings), 1874. Courtesy of New York Public Library Digital Gallery
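The product rule for "at least one streak among many players" is easy to check numerically. The following Python sketch is illustrative only; the helper name `prob_at_least_one` is hypothetical, and the inputs are just the five highest career streak probabilities from Table 1, so the result is a partial figure for those five players and not the article's 0.0442 for all 100 hitters.

```python
import math


def prob_at_least_one(streak_probs):
    """Probability that at least one player has the streak.

    Assumes the players' outcomes are independent, so the chance that
    nobody has the streak is the product of (1 - p_i).
    """
    prob_none = math.prod(1.0 - p for p in streak_probs)
    return 1.0 - prob_none


# Career probabilities of a 56-game streak from Table 1 (five highest only).
top_five = {
    "Ty Cobb": 0.004890,
    "Ed Delahanty": 0.003806,
    "Willie Keeler": 0.002967,
    "Ichiro Suzuki": 0.002738,
    "George Sisler": 0.002477,
}

combined = prob_at_least_one(top_five.values())
naive_sum = sum(top_five.values())
print(f"at least one streak among these five: {combined:.6f}")
print(f"simple sum of the five probabilities: {naive_sum:.6f}")
```

When every individual probability is tiny, the product form and the simple sum are close, but only the product form stays a valid probability as more players are added.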
Table 2. Probabilities of a 56-Game Hitting Streak by the Players Who Achieved a Streak of at Least 30 Games and Are Not Among the Top 100 Hitters of All Time*

Player               Batting   Hitting         Probability of a     Probability     Likelihood
                     Average   Opportunities   Hit in a Game (p)    of the Streak   (1-in)
Pete Rose            0.303     15,638          0.752                0.000102        9,764
Bill Dahlen          0.272     10,235          0.683                0.000000        2,396,802
Paul Molitor         0.306     11,985          0.765                0.000190        5,272
Jimmy Rollins        0.277     5,107           0.742                0.000015        65,539
Tommy Holmes         0.302     5,496           0.737                0.000012        81,104
Luis Castillo        0.294     6,140           0.736                0.000012        82,721
Chase Utley          0.300     2,409           0.725                0.000002        455,352
George McQuinn       0.276     6,467           0.691                0.000000        2,048,046
Dom DiMaggio         0.298     6,421           0.751                0.000038        26,565
Benito Santiago      0.263     7,437           0.654                0.000000        31,328,349
George Davis         0.295     9,976           0.729                0.000013        76,272
Hal Chase            0.291     7,723           0.733                0.000013        74,227
Willie Davis         0.279     9,664           0.706                0.000002        412,069
Rico Carty           0.299     6,250           0.694                0.000001        1,621,939
Ken Landreaux        0.268     4,431           0.632                0.000000        328,544,907
Cal McVey            0.346     2,543           0.866                0.019723        51
Elmer Smith          0.310     5,362           0.747                0.000024        41,470
Ron LeFlore          0.288     4,843           0.742                0.000015        65,162
George Brett         0.305     11,369          0.745                0.000045        22,039
Jerome Walton        0.269     1,742           0.555                0.000000        830,003,617,282
Sandy Alomar, Jr.    0.273     4,802           0.646                0.000000        92,246,664
Eric Davis           0.269     6,085           0.633                0.000000        228,299,708
Luis Gonzalez        0.284     9,985           0.691                0.000001        1,353,552
Willy Taveras        0.293     1,605           0.715                0.000001        1,472,527
Moises Alou          0.303     7,759           0.723                0.000007        151,491

*A more detailed table can be found at www.amstat.org/publications/chance/supplemental.cfm.
Probability of a Streak Including Other Players Who Had Streaks
The 100 best hitters would seem to be the most likely group from which to examine the overall probability of a 56-game streak. But of the 41 players who achieved the 43 streaks of at least 30 games, only 16 are among the top 100 all-time hitters. There may be isolated cases of players not on this list who might have had significant probabilities of a streak, but those players would have had a remarkable number of plate appearances and hits without a high enough batting average to make the top 100. However, to extend the analysis to other possible candidates, we should consider those players who actually did achieve significant hitting streaks but were not in the top 100 batters. There are 25 such players who obtained hitting streaks of at least 30 games. Let us take a look at this group. This set of 25 players contains only seven who batted at least 0.300 over their careers (including Pete Rose, whose 1978 streak of 44 games is third) and four who batted below 0.270. Table 2 shows the estimates for these 25 players. One player, Cal McVey, stands out. Of the remaining players, Paul Molitor had the highest probability of a streak at 0.00019, or 1-in-5,272. Rose was second at 0.000102, or 1-in-9,764. The average probability of the top 100 players is 0.00045, so both Rose and Molitor would rank below average in streak likelihood. None of the other players is close to Rose and Molitor in terms of likelihood of the streak, except McVey, whose likelihood is 1-in-51, the highest by far of all 125 players. Among the top 100 hitters, the average number of plate appearances per walk is 10.46. McVey averaged 84.77 plate appearances per walk. Suppose we change the number of McVey’s walks to the number equivalent to one every 10.46 plate appearances. In that case, McVey would have 266 walks, which corresponds to keeping his 2,513 non-walk plate appearances fixed and adding walks until that ratio is reached.
Using that figure, McVey’s probability would go down to 0.0144, or 1-in-69, which is still much higher than that of Cobb, the player who had the highest streak likelihood among the top 100 hitters. The player with the lowest ratio of plate appearances to walks is Williams at 4.84. Even changing McVey’s walks so he has one for every 4.84 plate appearances, we obtain a probability of 0.0095, or 1-in-105, which is still ahead of Cobb. In other words, when we penalize him by giving him an inordinately large number of walks, McVey is more than twice as likely as Cobb to obtain a 56-game hitting streak. We have no information about McVey’s sacrifice bunts, though this is not likely to dramatically alter the overall results. However, the historical requirement of seven balls for a walk could have reduced McVey’s batting average. If a pitcher can make more bad pitches without giving up a walk, the hitter is less likely to see as many good pitches. On the other hand, conventional baseball wisdom argues that the more pitches a hitter sees at a plate appearance, the more likely he is to get a hit. Another factor that helped McVey is hitting opportunity, as he had 4.798 per game, the highest of the 125 players examined here. Billy Hamilton, who was 14th in streak probability and 8th in batting average, was second at 4.742. Thus, McVey had such a high probability of a streak because of a combination of two effects: he came to the plate
often and he made the most of his opportunities. His probability of a hit in a game was 0.866, much higher than Orr’s 0.822. And his effective average, hits per hitting opportunity, was 0.342, much higher than Cobb’s. In fact, his batting average of 0.346 would have put him sixth all-time if his 2,543 plate appearances had been sufficient to qualify. His infrequency of walks may have appeared to play a small role, but after adjustment, we see this factor had very little effect. So, coming up often and having a very high probability of a hit were probably the key factors, and certainly these attributes affected the streak probability for other players. Not counting McVey and counting only the top 100 batters plus the 24 other players with streaks of at least 30 games, the overall probability comes to 0.0447, or 1-in-22. Counting McVey, the probability goes to 0.0635, or 1-in-16.
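The walk adjustment for McVey can be sketched as follows. This Python is an illustration under stated assumptions, not the author's calculation. McVey's hit total is not given in the text, so `hits` is backed out from his 0.346 average over 2,513 non-walk plate appearances (2,543 minus 30 walks). Extra walks are assumed to be added on top of those 2,513 appearances, which is the reading that reproduces the 266 walks quoted above. The streak probability uses an exact recurrence in place of the Feller approximation. All names are hypothetical.

```python
def exact_streak_prob(p, n, s):
    """Exact probability of at least one run of s successes in n trials."""
    if n < s:
        return 0.0
    q = 1.0 - p
    u = [1.0] * s + [1.0 - p**s]  # u[m] = P(no run of s in m trials)
    for m in range(s + 1, n + 1):
        u.append(u[m - 1] - q * p**s * u[m - s - 1])
    return 1.0 - u[n]


def mcvey_streak_prob(pa_per_walk, games=530, non_walk_pa=2513, avg=0.346, s=56):
    """Career streak probability after resetting McVey's walk rate."""
    hits = round(avg * non_walk_pa)  # assumption: about 869 hits
    walks = round(non_walk_pa / (pa_per_walk - 1.0))
    opportunities = non_walk_pa + walks  # H: walks count as hitting chances
    p_game = 1.0 - (1.0 - hits / opportunities) ** (opportunities / games)
    return walks, exact_streak_prob(p_game, games, s)


scenarios = [("actual", 2543 / 30), ("top-100 average", 10.46), ("Williams", 4.84)]
for label, ratio in scenarios:
    walks, prob = mcvey_streak_prob(ratio)
    print(f"{label:16s} walks={walks:4d}  streak probability={prob:.4f}")
```

Under these assumptions the three scenarios should come out near the article's 1-in-51, 1-in-69, and 1-in-105, which suggests the reading of the adjustment is at least consistent with the reported figures.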
Who Was Most Likely to Have Set the Record?
Given that we know a streak has occurred, an interesting question is how likely it is that DiMaggio or any particular player was the one who did it. Suppose we assume we know only the following: at least one streak has been achieved by only one player. Given the assumptions we have made (constant probability of success, independent trials, using $p$ instead of $p^*$), for the sample of the top 100 batters, the probability of seeing at least one streak by one and only one player is
$$\sum_{i=1}^{100} p_i(n_i,s) \prod_{j=1,\, j \neq i}^{100} \left(1 - p_j(n_j,s)\right).$$
That is, each player’s probability of a streak is multiplied by the joint probability that all the other players did not have a streak. This calculation is done over all players and summed to obtain the probability of one, and only one, player achieving at least one streak. The probability that player $i$ had the streak is then
$$\frac{p_i(n_i,s) \prod_{j=1,\, j \neq i}^{100} \left(1 - p_j(n_j,s)\right)}{\sum_{k=1}^{100} p_k(n_k,s) \prod_{j=1,\, j \neq k}^{100} \left(1 - p_j(n_j,s)\right)}.$$
Of course, an important caveat is that we cannot rule out that these probabilities account for a person achieving more than one streak. The Feller formula is technically the probability of a streak not occurring, and we adapted it to obtain its complement—the probability of at least one streak occurring. Thus, the probability that a streak does occur accounts for the possibility of multiple occurrences. Rather than present a lengthy table, we show an abbreviated version, Table 3, with the 15 players with the highest probabilities. Part A is from the results for only the top 100 hitters, and Part B incorporates all 125 players. The players line up as they did before, but we now have information about the likelihood that the streak was achieved by a particular player. For the top 100 hitters, the probability of it being Cobb is about 11%. Using all 125 players, the most likely is, of course, McVey, at almost 31%. Cobb falls to about 7.5%. Not surprisingly, DiMaggio is not in the top 15, as he accounts for only 0.45%.
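The conditional calculation in this section can be sketched in a few lines of Python. This is a hedged illustration, not the author's code: the function name `share_given_exactly_one` is hypothetical, and the three input probabilities are taken from Tables 1 and 2 purely as a toy example, so the printed shares will not match Table 3, which uses all 100 or 125 players.

```python
import math


def share_given_exactly_one(streak_probs):
    """Return (P(exactly one player has the streak), each player's share).

    For player i the numerator is p_i times the product of (1 - p_j) over
    all other players j; the shares are those terms divided by their sum.
    """
    terms = []
    for i, p_i in enumerate(streak_probs):
        others = math.prod(
            1.0 - p_j for j, p_j in enumerate(streak_probs) if j != i
        )
        terms.append(p_i * others)
    exactly_one = sum(terms)
    return exactly_one, [t / exactly_one for t in terms]


players = ["Cal McVey", "Ty Cobb", "Joe DiMaggio"]
probs = [0.019723, 0.004890, 0.000295]  # career values from Tables 1 and 2

exactly_one, shares = share_given_exactly_one(probs)
for name, share in zip(players, shares):
    print(f"{name:14s} {share:.4f}")
```

Dividing the numerator and denominator by the product of all (1 - p_j) shows that each share is proportional to the odds p_i / (1 - p_i), so a player's share depends only on his own odds relative to the sum of everyone's odds.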
Table 3: The Probability That a Particular Player Is the One Who Achieved at Least One 56-Game Hitting Streak, Given That We Know at Least One Such Streak Occurred

Part A: 15 Highest from the Top 100 Hitters Only
Part B: Top 100 Hitters Plus 25 Other Players Who Achieved Streaks of at Least 30 Games

Part A player     Probability    Part B player     Probability
Ty Cobb           0.1086         Cal McVey         0.3054
Ed Delahanty      0.0844         Ty Cobb           0.0746
Willie Keeler     0.0658         Ed Delahanty      0.0580
Ichiro Suzuki     0.0606         Willie Keeler     0.0452
George Sisler     0.0549         Ichiro Suzuki     0.0417
Sam Thompson      0.0523         George Sisler     0.0377
Dave Orr          0.0497         Sam Thompson      0.0359
Jesse Burkett     0.0487         Dave Orr          0.0342
Cap Anson         0.0432         Jesse Burkett     0.0335
Pete Browning     0.0376         Cap Anson         0.0297
Dan Brouthers     0.0356         Pete Browning     0.0259
Al Simmons        0.0253         Dan Brouthers     0.0245
Nap Lajoie        0.0243         Al Simmons        0.0174
Billy Hamilton    0.0218         Nap Lajoie        0.0167
Joe Jackson       0.0192         Billy Hamilton    0.0150
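One consequence of the conditional-probability formula can be checked against Table 3. Dividing its numerator and denominator by the product of (1 − p_j) over all players leaves each player's share proportional to the odds p_i/(1 − p_i). Adding the 25 extra players therefore changes only the normalizing sum, so every player who appears in both parts should shrink by the same factor. The following is a minimal check using only the rounded values printed in the table; the variable names are illustrative.

```python
# Values copied from Table 3 (rounded to four decimals in the source).
part_a = {
    "Ty Cobb": 0.1086, "Ed Delahanty": 0.0844, "Willie Keeler": 0.0658,
    "Ichiro Suzuki": 0.0606, "George Sisler": 0.0549, "Sam Thompson": 0.0523,
    "Dave Orr": 0.0497, "Jesse Burkett": 0.0487, "Cap Anson": 0.0432,
    "Pete Browning": 0.0376, "Dan Brouthers": 0.0356, "Al Simmons": 0.0253,
    "Nap Lajoie": 0.0243, "Billy Hamilton": 0.0218, "Joe Jackson": 0.0192,
}
part_b = {
    "Cal McVey": 0.3054, "Ty Cobb": 0.0746, "Ed Delahanty": 0.0580,
    "Willie Keeler": 0.0452, "Ichiro Suzuki": 0.0417, "George Sisler": 0.0377,
    "Sam Thompson": 0.0359, "Dave Orr": 0.0342, "Jesse Burkett": 0.0335,
    "Cap Anson": 0.0297, "Pete Browning": 0.0259, "Dan Brouthers": 0.0245,
    "Al Simmons": 0.0174, "Nap Lajoie": 0.0167, "Billy Hamilton": 0.0150,
}

# Ratio of Part A to Part B for every player listed in both parts.
ratios = {name: part_a[name] / part_b[name] for name in part_a if name in part_b}

for name, ratio in ratios.items():
    print(f"{name:15s} {ratio:.3f}")
```

Up to rounding in the published figures, the ratios should agree closely, which is what the formula predicts when the same per-player probabilities are renormalized over a larger group.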
Summary

I estimate that DiMaggio had a lifetime chance of 1-in-3,394 and is only the 28th most likely player among the top 100 all-time hitters to achieve the streak. The top 100 hitters collectively had a chance of 1-in-22. The most likely player among the top 100 hitters was Cobb, the best all-time hitter, who also had the most games played and plate appearances. But batting average and longevity are not the sole determinants. And of course, there is the intriguing case of McVey from the 19th century, who was far more likely than anyone else to achieve the streak. Counting McVey and the 124 other players analyzed here, the chance someone would achieve at least one 56-game hitting streak is about 1-in-16.

Of course, there are limitations to any such estimates. Clearly, a large number of other players have been omitted, but the marginal contributions of the omitted players should be extremely small. Even the 100th best hitter of all time has a probability of only 0.00005. Also, the formulas assume each hitting opportunity is independent and the probability of hitting is based on the career average. The notion of a hot streak belies the principle of independence. If a player has what appears to be a hot streak, the likelihood of extending the streak is greater.

Other factors could also affect the likelihood of a streak. A team playing well is likely to help a player achieve a streak, though a player in a streak is also likely to help a team play well, so causality is not clear. Teammates, opposing pitchers, and even umpires might behave differently during a streak. But some of these factors would increase the likelihood of extending the streak, and some would decrease it. Rarely in life are events completely independent, as we are all parts of a system of complex interacting factors.

So the streak does seem fairly improbable, but perhaps not as improbable as we might have thought. And we should not act as though the rarity of the streak is that DiMaggio did it. Probability analysis removes our subjectivity and allows us to analyze without bias. It cares not about the mystical aura of a player like DiMaggio (who married movie stars Dorothy Arnold and Marilyn Monroe and ultimately became the historical persona of the Yankee franchise) in comparison to a player like Ed Delahanty, whose name would not be recognized by most Americans, and yet whose career was of similar length and who was 13 times more likely to have produced the streak. And of course, almost no one has heard of McVey, who was far more likely than anyone to have done it, but has virtually no name recognition even in the annals of baseball history.

VOL. 22, NO. 2, 2009
Further Reading

Arbesman, S. and S. Strogatz (2008). “A Journey to Baseball’s Alternative Universe.” The New York Times, March 30.
Berry, S. (1991). “The Summer of ’41: A Probability Analysis of DiMaggio’s Streak and Williams’ Average of .406.” CHANCE 4(4):8–11.
Brown, B. and P. Goodrich (2003). “Calculating the Odds: DiMaggio’s 56-Game Hitting Streak.” Baseball Research Journal 32:35–40.
D’Aniello, J. (2003). “DiMaggio’s Hitting Streak: High ‘Hit’ Average the Key.” Baseball Research Journal 32:31–34.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, 3rd ed., Vol. 1. New York: Wiley.
Freiman, M. (2002). “56-Game Hitting Streaks Revisited.” Baseball Research Journal 31:11–15.
Gould, S. J. (1989). “The Streak of Streaks.” CHANCE 2(2):10–16.
Levitt, D. (2004). “Resolving the Probability and Interpretations of Joe DiMaggio’s Hitting Streak.” By the Numbers 14(2):5–7.
Seidel, M. (2002). Streak: Joe DiMaggio and the Summer of ’41. Lincoln: University of Nebraska Press (originally published in 1988 by McGraw-Hill).
Short, T. and L. Wasserman (1989). “Should We Be Surprised at the Streak of Streaks?” CHANCE 2(2):13.
Warrack, G. (1995). “The Great Streak.” CHANCE 8(3):41–43, 60.