Editor's Letter
Sam Behseta, Executive Editor
Dear Readers,
In this issue, Stephen Fienberg puts the very notion of privacy in perspective: It would be more meaningful to elucidate privacy in conjunction with confidentiality and disclosure. Fienberg argues that the confidentiality pledge of the U.S. Census Bureau would, in fact, protect the privacy of individuals from harmful disclosure. By contrast, when the privacy policies of online social forums, such as Facebook, evolve into a labyrinth of lengthy legal documents, there is ample justification for the same individuals to be concerned about sharing their personal information with the rest of the world. According to Fienberg, the main challenge for the statistical community is to develop methods that contribute to the efficiency of policymaking through amassing data while guarding the privacy of the individual.

Also, Roland Deutsch looks back at the 2010 World Cup in South Africa. Using pre-tournament data obtained from the monthly rankings of FIFA, soccer's primary international association, and three other well-regarded ranking systems, he develops a nonlinear model to calibrate the performance of the major contenders in the last World Cup. As for the U.S. national team, Deutsch paints a rosy portrait: That the Americans made it to the final 16, finishing at the top of their group ahead of England, is by itself a remarkable achievement.

Golf is unique among sports in its administration of a handicap system, which allows two players with different skills to compete at the same predetermined level. James Lackritz examines the fairness of this system using simulation studies, leading to a somewhat reassuring conclusion: The handicap system is indeed advantageous to the better players.

David Rockoff and Heike Hofmann look into the quality of visual judgments by analyzing a considerably large set of data from a web page devoted to a series of eyeballing games. There are two main inquiries of interest: what commonalities can be extracted from the patterns of performance of players who, regardless of the complexity of the games, consistently score higher—or lower—than the others, and how statisticians can use these results to improve the accuracy of the graphical tools they employ for data analysis.

Tristan Barnett revisits the problem of optimal betting strategy in video poker. Conservative or aggressive game plans may well put the poker player at a disadvantage. However, there is a long-established betting benchmark, known as the Kelly criterion, for maximizing the player's bank from one hand to the next. Barnett introduces us to two versions of this criterion, for games with two and with multiple outcomes. The criterion is used to calculate the expected profit of various hands in the All-American Poker machine.

It has long been claimed that blackjack players familiar with card-counting systems would greatly profit from their skills. This assumption is put to the test by Bill Hurley and Andrey Pavlov through studying variations of the Hi-Lo counting system, popularized by Edward Thorp in his classic volume Beat the Dealer. Hurley and Pavlov, however, are not as optimistic as Thorp was: To guarantee success, a team of highly skilled and well-organized players is needed to bet large amounts of money over a long period of time.

As stated by Meena Doshi in Mark Glickman's column Here's to Your Health, a significant portion of the U.S. population is affected by influenza every year. This leads to a considerable number of deaths, and the costs of its socioeconomic aftermath add up to billions of dollars. There have been a number of statistical models for estimating mortality due to influenza, chief among which is Serfling's seasonal regression. Doshi's article highlights some of the major challenges associated with devising more sophisticated statistical tools for assessing the number of deaths due to the disease.

In his column Visual Revelations, Howard Wainer lists a number of technical misnomers in the statistical nomenclature and suggests alternatives for them. Wainer opens his article by challenging the term "teacher effect," as it is frequently used in value-added models. Jonathan Berkowitz offers yet another exciting puzzle by paying tribute to the 125th anniversary of the publication of an iconic statistical work.

Finally, two announcements. First, it is my pleasure to introduce a new editor, Shane Reese from Brigham Young University. Second, Statistics Forum, the statistical blog sponsored by CHANCE and the American Statistical Association and edited by Andrew Gelman, is now up and running. In the short time since its inception, the forum has showcased articles by Michael Lavine, Howard Wainer, Christian Robert, and Andrew Gelman, among others. I urge readers of CHANCE to visit the forum at http://statisticsforum.wordpress.com.

Sam Behseta
O Privacy, Where Art Thou?
Aleksandra Slavkovic, Column Editor
Stephen E. Fienberg
Are you on Facebook? Do you post personal information online to other websites or applications? If so, have you read the privacy policies and realized how your information might actually be used by others? The World Wide Web and social media have dramatically changed the way we communicate and the way we share personal information with others. And data linger in some of the most surprising places. People think nothing of posting photos and information about personal activities, their friends, and even their sexual preferences on Facebook. But once such data are online, they may be shared in unanticipated ways. While Facebook offers options to share content with Everyone, Friends of Friends, or Friends only, you simply have lost control once the information is in other people's hands; once available online, data are rarely retrievable. And the links to others can reveal information about you, either directly or indirectly. A year ago, an article in The New York Times noted that Facebook's privacy policy was "5,830 words long; the United States Constitution, without any of its amendments, is a concise 4,543 words." And just this February, the company revised its information to unfold over myriad web pages, while not clearly changing any of its policies and practices. To see how Facebook's privacy protection policy has evolved from 2005 through 2010, check out the graphical display prepared by Matt McKeon
at http://mattmckeon.com/facebook-privacy/#. Despite your best efforts at restricting access to your information on Facebook, you may have far less protection than you think. At the same time as some people seem willing to post lots of personal information online, they often object to sharing the same information with government agencies, such as the U.S. Census Bureau as part of the decennial census, even though the bureau makes clear that individual information will be kept strictly confidential and not made public in individually identifiable form. Take Rep. Michele Bachmann from Minnesota, who announced in 2009 that she and her family would not be fully filling out the 2010 census form, but would only supply the number of people living in her home. According to The Washington Times (June 17, 2009), "I know for my family the only question we will be answering is how many people are in our home," said Bachmann. "We won't be answering any information beyond that, because the Constitution doesn't require any information beyond that." Bachmann's congressional web page notes, "Congresswoman Bachmann is a graduate of Anoka High School and Winona State University. Bachmann and her husband, Marcus, live in Stillwater, where they own a small business mental health care practice that employs 42 people. The Bachmanns have five children: Lucas, Harrison, Elisa, Caroline, and Sophia." Other accessible public records will provide her address and information about her home in Stillwater.

How should you square your personal privacy, online activities, and pledges of confidentiality like those offered by the U.S. Census Bureau? What harm might come from providing personal data in various settings?

Figure 1. Relationship among confidentiality, disclosure, and harm. The figure comes from Stephen E. Fienberg's "Statistical Perspectives on Confidentiality and Data Access in Public Health," published in Statistics in Medicine.
Privacy, Confidentiality, Disclosure, and Harm

Privacy, in the context of data, is the right of individuals to control the dissemination of, or access to, information about themselves. You can protect privacy by not giving others access to information. But we do provide such access for various reasons, either because we are required to by law or because it is the way we acquire crucial services such as health care. Protection can be provided through a formal legally enforceable policy such as the Privacy Rule associated with the Health Insurance Portability and Accountability Act of 1996 (HIPAA). Every medical provider must have a formal policy that complies with the privacy rule about who can access your data and for what purposes. It is much harder to protect your information once data about you are
already publicly available. This is why we need to ask what online companies, such as Facebook, say about their privacy policies and how they actually implement them once participants specify their privacy controls. Confidentiality, on the other hand, is, in effect, a contractual arrangement referring to the nature of protection accorded to statistical information provided by the data collector to the data provider (e.g., a promise not to share the information in identifiable form). The U.S. Census Bureau promises confidentiality to its respondents under Title 13 of the U.S. Code, which specifies:

Neither the Secretary, nor any other officer or employee of the Department of Commerce or bureau or agency thereof, or local government census liaison may …

– Use the information furnished under the provisions of this title for any purpose other than the statistical purposes for which it is supplied; or

– Make any publication whereby the data furnished by any particular establishment or individual under this title can be identified; or

– Permit anyone other than the sworn officers and employees of the Department or bureau or agency thereof to examine the individual reports.

Disclosure means the inappropriate attribution of information to the provider of that information (e.g., by an intruder resulting from the release of information that was gathered under a pledge of confidentiality). Government statistical agencies typically have a dual obligation to protect the confidentiality of their respondents and make data available for public use, specifically for public policy. To accomplish the latter, agencies typically redact obvious personal identifiable information from their databases and then apply statistical methods to add further protection. In my writings on confidentiality, I have often used the visualization in Figure 1. For the Current Population Survey, the Census Bureau offers a promise of confidentiality to those selected for inclusion in the sample. But because of links between those in the sample and those not and the data available on individuals from other possibly public sources, disclosure can occur for those not in the sample and thus not protected by the bureau's pledge. And not all
disclosable information is harmful, or at least perceived as such. You will notice I have not included privacy in the visualization. For me, privacy is a personal right, and each of us needs to determine what we wish to share and with whom. Protecting privacy requires us to either (a) gather individual preferences and specifications and then follow through by preventing unauthorized access or (b) 'protect everything.' Because protecting everything is statistically impossible, we need definitions of privacy that accomplish something close to such protection probabilistically, but also allow us to share some form of aggregate information, such as tabulations, and samples of files that cannot be easily used to accomplish disclosure. How statisticians and computer scientists are attempting to accomplish this goal involves statistical disclosure limitation, privacy-preserving data mining, and a new approach from cryptography known as differential privacy. For recent contributions to these areas, see the Journal of Privacy and Confidentiality.
Back to Online Privacy and the Census

As we roam the Internet, we leave digital traces of our purchases, preferences, interests, and, in the case of Facebook, friends and relatives. Some of this occurs with our assent and other times without, as we collect cookies from the web pages we visit. On Facebook, you may set your privacy settings, but you do not have full control over what others you entrust with your data do, nor how the company uses the data or shares it with advertisers and those who implement applications on its platform. In a widely cited study done as a class project at MIT and reported in a 2009 issue of First Monday, students scanned the Facebook friends of fellow students and used information posted about sexual preference to accurately predict the sexual orientation of 10 men the students knew to be gay, but who had not declared so on Facebook. It is the unexpected use of networked information that may be especially pernicious. Facebook and other networking websites can plug one privacy hole, but a dozen others can open up with new data mining tools and methods.

The detailed results of the 2010 decennial census are now being released and the data are being protected by redacting obvious identifiers, by a method known as "data swapping," and by reporting only aggregate information. I believe we can trust the Census Bureau in this regard. And besides, the Census Bureau actually collected little personal information and will release far less––at least until 2082, when the 2010 records become public by law. In this sense, Rep. Bachmann had far less to fear from the Census Bureau than she did from her own public postings and the data in publicly accessible records. It is probably too much to ask her to advocate stricter online privacy protection for consumers instead of complaining about the intrusiveness of the Census Bureau. And, by the way, the Constitution specifies that the census be conducted "in such Manner as [Congress] shall by Law direct." Thus, her concerns about the constitutionality of the decennial census seem misplaced. I did participate in the 2010 decennial census and answered all of the questions. For me, this was a civic duty. But I do not belong to or use Facebook or any other networking site. I simply do not trust their privacy policies or their concern to protect my information.

Data Privacy, Confidentiality, and Disclosure

Privacy is the right of individuals to control the dissemination of, or access to, information about themselves. Confidentiality is a contractual arrangement referring to the nature of protection accorded to statistical information provided by the data collector to the data provider (e.g., a promise not to share the information in identifiable form). Disclosure is the inappropriate attribution of information to the provider of that information. For more about these and other related definitions, visit the ASA Committee on Privacy and Confidentiality site at www.amstat.org/committees/pc/index.html.
Some Advice

My advice to readers of CHANCE is to think carefully about what you would like everyone else to know about you. Until such time as there is serious online privacy protection, you should think beyond
the moment. What you reveal directly, or through your links to friends and acquaintances today, may not be what you want others to know about you now or prospective employers to know about you in the future.
Further Reading

Bilton, N. 2010. Price of Facebook privacy? Start clicking. The New York Times. May 12. www.nytimes.com/2010/05/13/technology/personaltech/13basics.html.

Dinan, S. 2009. Exclusive: Minn. lawmaker vows not to complete census. The Washington Times. June 17. www.washingtontimes.com/news/2009/jun/17/exclusive-minn-lawmaker-fears-census-abuse.

Facebook Privacy Policy. www.facebook.com/about/privacy.

Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule. www.hhs.gov/ocr/privacy/hipaa.

Jernigan, C., and B. Mistree. 2009. Gaydar: Facebook friendships expose sexual orientation. First Monday 14(10). http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2611/2302.

Journal of Privacy and Confidentiality. http://repository.cmu.edu/jpc.

McKeon, M. The Evolution of Privacy on Facebook. http://mattmckeon.com/facebook-privacy.

The U.S. Census Bureau's Privacy Principles. www.census.gov/privacy/data_protection/our_privacy_principles.html.
How Much to Bet on Video Poker
Tristan Barnett
A question that arises whenever a game is favorable to the player is how much to wager on each event. While conservative play (or minimum bet) minimizes "large" fluctuations, it lacks the potential to maximize long-term growth of the bank. At the other extreme, aggressive play (or maximum bet) runs the risk of losing an entire bankroll, even though the player has an advantage in each trial of the game. What
is required is a mathematical formulation that informs the player of how much to bet with the objective of maximizing the long-term growth of the bank. The famous Kelly criterion, as developed by John L. Kelly in “A New Interpretation of Information Rate,” achieves this objective. The Kelly criterion has been most recognized in games in which there are two outcomes: win $x with probability p and lose $y with probability 1-p. When there are more than two outcomes, a generalized Kelly formula is required. This article will apply the Kelly criterion when multiple (more than two) outcomes exist through working examples in video poker. The methodology could be used to assist “advantage
players" in the decision-making process of how much to bet on each trial in video poker.
Kelly Criterion

Analysis of casino games and percent house margin

A casino game can be defined as follows: There is an initial cost C to play the game. Each trial results in an outcome Oi, where each outcome occurs with profit ki and probability pi. A profit of zero means the money paid to the player for a particular outcome equals the initial cost. Profits above zero represent a gain for the player; negative profits represent a loss. The probabilities are all non-negative and sum to one over all the possible outcomes. Given this information, the total expected profit is E = Σi pi ki. The percent house margin (%HM) is then -E/C, and the total return is 1 + E/C. Positive percent house margins indicate the gambling site, on average, makes money and the players lose money. Negative percent house margins indicate the game is favorable to the player and could possibly generate a long-term profit. Table 1 summarizes this information when there are m possible outcomes.
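As a quick illustration of these definitions (not part of the original article), the following Python sketch computes the expected profit, percent house margin, and total return for a game specified as a list of outcomes; the helper name and the example game are ours.

    # A minimal sketch: expected profit, percent house margin, and total return
    # for a game given as a list of (profit k_i, probability p_i) outcomes.
    def game_summary(outcomes, cost=1.0):
        """outcomes: list of (profit, probability); cost: price C per trial."""
        total_prob = sum(p for _, p in outcomes)
        assert abs(total_prob - 1.0) < 1e-9, "probabilities must sum to one"
        expected_profit = sum(k * p for k, p in outcomes)   # E = sum p_i * k_i
        house_margin = -expected_profit / cost              # %HM = -E/C
        total_return = 1.0 + expected_profit / cost         # 1 + E/C
        return expected_profit, house_margin, total_return

    # Example used in the classical Kelly discussion below: win $2 with
    # probability 0.35, lose $1 with probability 0.65, on a $1 wager.
    print(game_summary([(2, 0.35), (-1, 0.65)]))   # E = 0.05, %HM = -5%, return 105%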
Classical Kelly criterion

The well-established classical Kelly criterion is given by the following result: Consider a game with two possible outcomes: win or lose. Suppose the player profits k units for every unit wager and the probabilities of a win and loss are given by p and q, respectively. Further, suppose that on each trial, the win probability p is constant with p + q = 1. If kp - q > 0, so the game is advantageous to the player, then the optimal fraction of the current capital to be wagered is given by b* = (kp - q)/k.

Consider the following example. A player profits $2 with probability 0.35 and profits -$1 with probability 0.65, as represented by Table 2. Since the expected profit of 2 x 0.35 - 0.65 = 0.05 > 0, the game is advantageous to the player and the optimal fraction is given by b* = (2 x 0.35 - 0.65)/2 = 0.025. If a player has a $100 bankroll, then wagering 100 x 0.025 = $2.50 on the next hand will maximize the long-term growth of the bank. If the player loses $1 on that hand, then the next wager should be exactly 99 x 0.025 = $2.475 under the classical Kelly criterion. Since fractions are often not allowed in gambling games, this figure should be rounded down to an allowable betting amount.
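The two-outcome rule is easy to check numerically. The short Python sketch below (not from the article) reproduces the worked example: b* = 0.025 and a $2.50 bet on a $100 bankroll.

    # A minimal sketch of the classical two-outcome Kelly fraction b* = (k*p - q)/k.
    def kelly_two_outcome(k, p):
        """k: profit per unit wagered on a win; p: win probability."""
        q = 1.0 - p
        edge = k * p - q
        if edge <= 0:
            return 0.0          # no winning strategy: bet nothing
        return edge / k

    b_star = kelly_two_outcome(k=2, p=0.35)
    print(b_star)               # 0.025, i.e., bet 2.5% of the bank
    print(100 * b_star)         # $2.50 on a $100 bankroll, as in the text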
Kelly criterion for multiple outcomes

When there are multiple (more than two) outcomes, as is the situation for video poker, a generalized Kelly formula is required from the classical Kelly formula. This generalized Kelly formula is given by Theorem 1 (see "A Proof of Theorem 1").

Theorem 1

Consider a game with m possible discrete finite mixed outcomes. Suppose the profit for a unit wager for outcome i is ki with probability pi for 1 ≤ i ≤ m, where at least one outcome is negative and at least one outcome is positive. Then, if a winning strategy exists, the maximum growth of the bank is attained when the proportion of the bank bet at each turn, b*, is the smallest positive root of

p1k1/(1 + k1b) + p2k2/(1 + k2b) + … + pmkm/(1 + kmb) = 0.
Let g(b) represent the rate of growth of the bank; this is the quantity to be maximized.

A Proof of Theorem 1

Assume a constant proportion b of the bank is bet with m discrete finite mixed outcomes. Let B(1)/B(0) equal 1 + kib with probability pi for i = 1 to m, where B(t) represents the player's bank at time t. Assume the player wishes to maximize

g(b) = E[log(B(1)/B(0))] = Σi pi log(1 + kib).

Without loss of generality, let k1 be the maximum possible loss. In the interval 0 < b < -1/k1, each term 1 + kib > 0 since ki ≥ k1 for i = 1 to m, so the logarithm of each term is real. Taking derivatives with respect to b,

g'(b) = Σi piki/(1 + kib)

and

g''(b) = -Σi piki²/(1 + kib)².

Note that (a) g(0) = 0, (b) g'(0) > 0 follows directly from the requirement for a winning strategy (so you should bet something), and (c) g''(b) < 0 for 0 < b < -1/k1.

An Example of a Multiple (m > 2) Outcome Game

Suppose m = 3 and the outcome k1 = -1 occurs with probability 0.45, k2 = 1 with probability 0.45, and k3 = 2 with probability 0.10. The expected outcome, (-1)(0.45) + (1)(0.45) + (2)(0.10) = 0.20, is positive. The approximations are b' = 0.2/[(-1)²(0.45) + (1²)(0.45) + (2²)(0.10)] = 0.2/1.3 = 0.1538 and b'' = 0.2/[1.3 - 0.2²] = 0.2/1.26 = 0.1587. Here k1 is -1 and both b' and b'' are less than -1/k1 = 1. Also, g'(b') = (-1)(0.45)/(1 - 1(b')) + (1)(0.45)/(1 + 1(b')) + 2(0.10)/(1 + 2(b')) = 0.0111 and, similarly, g'(b'') = 0.0053. Both conditions (a) and (b) are satisfied, so it is preferable to work with b''. If someone has $1,000, the bet should be $158.73, which likely would be rounded to $158.
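For readers who want to verify these figures, the following Python sketch (not from the article) finds the root of the Theorem 1 equation by bisection for the three-outcome example above; the exact optimum comes out slightly above both approximations b' and b'', consistent with g'(b') and g'(b'') being positive.

    # A minimal sketch: solve Theorem 1 numerically by bisection on
    # g'(b) = sum p_i k_i/(1 + k_i b) over 0 < b < -1/k_1, where k_1 is the
    # maximum possible loss.
    def g_prime(b, outcomes):
        return sum(p * k / (1.0 + k * b) for k, p in outcomes)

    def kelly_multi(outcomes, tol=1e-10):
        k_min = min(k for k, _ in outcomes)          # maximum possible loss k_1 (< 0)
        lo, hi = 0.0, -1.0 / k_min - 1e-12           # search interval (0, -1/k_1)
        if g_prime(lo, outcomes) <= 0:
            return 0.0                               # no winning strategy
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if g_prime(mid, outcomes) > 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    example = [(-1, 0.45), (1, 0.45), (2, 0.10)]
    print(kelly_multi(example))   # about 0.163, a bit above b' = 0.1538 and b'' = 0.1587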
Table 3—Profits and Probabilities for the All-American Poker Game

Outcome            Return ($)   Profit ($)   Probability    Expected Profit ($)
Royal Flush        800          799          1 in 43,450    0.018
Straight Flush     200          199          1 in 7,053     0.028
Four of a Kind     40           39           0.00225        0.088
Full House         8            7            0.01098        0.077
Flush              8            7            0.01572        0.110
Straight           8            7            0.01842        0.129
Three of a Kind    3            2            0.06883        0.138
Two Pair           1            0            0.11960        0.000
Jacks or Better    1            0            0.18326        0.000
Nothing            0            -1           0.58076        -0.581
Total                                        1.00           0.0072
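As an illustration (not from the article), the same bisection approach can be applied to the rounded profits and probabilities in Table 3. With these rounded inputs the optimal fraction comes out at roughly 0.03% of the bankroll, close to the 0.0307% that Table 5 reports for the $800 jackpot, which was presumably computed from unrounded probabilities; on a $1,000 bankroll that is well under a typical minimum bet, which foreshadows the practical difficulties discussed later. The same function can be applied to the progressive-jackpot distributions in Table 4.

    # A minimal sketch: the multi-outcome Kelly solver applied to Table 3.
    # The bisection is repeated here so the block runs on its own.
    def g_prime(b, outcomes):
        return sum(p * k / (1.0 + k * b) for k, p in outcomes)

    def kelly_multi(outcomes, tol=1e-10):
        k_min = min(k for k, _ in outcomes)
        lo, hi = 0.0, -1.0 / k_min - 1e-12
        if g_prime(lo, outcomes) <= 0:
            return 0.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if g_prime(mid, outcomes) > 0 else (lo, mid)
        return 0.5 * (lo + hi)

    all_american = [                      # (profit per $1 bet, probability), from Table 3
        (799, 1 / 43450), (199, 1 / 7053), (39, 0.00225), (7, 0.01098),
        (7, 0.01572), (7, 0.01842), (2, 0.06883), (0, 0.11960),
        (0, 0.18326), (-1, 0.58076),
    ]
    b_star = kelly_multi(all_american)
    print(round(100 * b_star, 4), "% of bankroll")    # roughly 0.03%
    print(round(1000 * b_star, 2))                    # bet on a $1,000 bankroll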
Progressive machines

Often, a group of machines is connected to a common jackpot pool, which continues to grow until someone gets a royal flush. When this occurs, the jackpot is reset to its minimum value.
Table 4—Probabilities of Outcomes for Different Jackpot Levels for the All-American Poker Game

Outcome            Return ($)   Prob: $250 Jackpot   Prob: $800 Jackpot   Prob: $1,200 Jackpot
Royal Flush        Jackpot      1 in 58,685          1 in 43,450          1 in 35,848
Straight Flush     200          1 in 7,272           1 in 7,053           1 in 6,999
Four of a Kind     40           0.00226              0.00225              0.00225
Full House         8            0.01101              0.01098              0.01096
Flush              8            0.01588              0.01572              0.01505
Straight           8            0.01851              0.01842              0.01846
Three of a Kind    3            0.06899              0.06883              0.06888
Two Pair           1            0.11988              0.11960              0.11954
Jacks or Better    1            0.18406              0.18326              0.18336
Nothing            0            0.57924              0.58076              0.58132
Total                           1.00                 1.00                 1.00
Table 5—Kelly Criterion Analysis for Progressive Jackpot Machines

Jackpot   Return    Theorem 1 (b*)   Bet ($11,000 Bankroll)   Bet ($17,000 Bankroll)
$250      99.62%    –                –                        –
$800      100.72%   0.0307%          $3.38                    $5.22
$1,200    101.74%   0.0468%          $5.15                    $7.96
Usually, this minimum value would give a return of less than 100%, which creates a win-win situation for the astute player and the house. The amount bet to obtain the jackpot is a fixed amount. Table 4 represents the probabilities of outcomes with three jackpot levels for the "All-American Poker" game. The $800 jackpot was the game analyzed earlier. The $250 and $1,200 jackpots give returns of 99.62% and 101.74%, respectively. Notice the probability of obtaining a royal flush increases as the jackpot increases. This is logical, as a player would be more aggressive toward obtaining a royal flush with a larger jackpot. Suppose a player has a bankroll of $11,000 and is required to bet $5 hands. What jackpot level is required to maximize the long-term growth of the player's bank under the Kelly criterion? Table 5 gives the results and can
conclude that a jackpot level of $1,200 is required. A player would need a bankroll of about $17,000 to play the game at a jackpot level of $800.

Practical Difficulties

Despite the theoretical advances made above, it is impossible to effectively implement the optimal Kelly betting strategy on an All-American Poker machine, or any other video poker game. There are three main sources of difficulty. The first is the existence of a minimum betting unit on a machine. The second is the need to round the bet to avoid fractions of a unit. Third, to gain an edge in the long run requires hitting royal flushes. In the nonprogressive All-American Poker machine, this occurs on average once every 43,450 trials. Therefore, a player's bankroll would need to withstand the downward drift between hitting jackpots to avoid overbetting.

Conclusions

An analysis of casino games was given to identify when games are favorable to the player and could possibly generate a long-term profit. Analyses were given for both the classical Kelly criterion (two outcomes) and the Kelly criterion when multiple (more than two) outcomes exist. The Kelly criterion for multiple outcomes was applied to favorable video poker machines. In the case of nonprogressive machines, an optimal betting fraction was obtained for maximizing the long-term growth of the player's bankroll. In the case of progressive machines, the minimum jackpot size was obtained as an entry trigger to avoid overbetting, based on the player's bankroll. Approximation formulas for multiple outcomes were applied to video poker and shown to be useful for managing the risk. The analysis developed in this paper could be used by "advantage players" to assist with bankroll management, which is recognized as an important component of long-term success.
Further Reading

Barnett, T., and S. Clarke. 2004. Optimizing returns in the gaming industry for players and operators of video poker machines. Proceedings of the International Conference on Advances in Computer Entertainment Technology. National University of Singapore, 212–216.

Barnett, T. 2009. Gambling your way to New York: A story of expectation. CHANCE 22(3).

Epstein, R. A. 1977. The theory of gambling and statistical logic. California: Academic Press.

Haigh, J. 1999. Taking chances: Winning with probability. New York: Oxford University Press.

Kelly, J. L. 1956. A new interpretation of information rate. The Bell System Technical Journal July, 917–926.

Thorp, E. O. 2000. The Kelly criterion in blackjack, sports betting, and the stock market. In Finding the edge, ed. O. Vancura, J. Cornelius, and W. Eadington, 163–215. Institute for the Study of Gambling and Commercial Gaming.

Wong, S. 1981. What proportional betting does to your win rate. Blackjack World 3:162–168.
Looking Back at South Africa: Analyzing and Reviewing the 2010 FIFA World Cup
Roland C. Deutsch
The FIFA World Cup is the world's largest single sports event, and its worldwide impact is only rivaled by the Olympic Games. For the 2010 edition, held in South Africa, 204 federations entered the qualification rounds starting in 2007. By November 2009, the 32 finalists were determined and set for the World Cup Draw, which was held on December 5, 2009, in Cape Town. The draw is the unofficial start of every World Cup tournament and is watched by approximately as many viewers worldwide as the Super Bowl. Besides giving the host country an opportunity to present itself to a worldwide audience, the World Cup Draw first and foremost determines the final schedule of the World Cup and thus can make or break a team's chance of success, even before the first game is played.

For the first round of World Cup play, the 32 finalists were divided into eight groups of four teams each. The teams in each of the eight groups played a round-robin tournament to determine the two teams that would advance to the knockout stage. A win was worth three points, a draw one point, and a loss zero points. The two teams with the most points from each group advanced to the knockout stage. In case two or more teams were tied, a series of tie-breakers was employed, taking into account the number of goals scored and conceded in all three games. The advancing teams were then arranged in a predetermined bracket for the four knockout stages. In the first knockout stage—the round of 16—the eight group winners were paired with the eight runners-up (i.e., the winner of Group A plays the runner-up of Group B and vice versa), with the winners advancing to the quarter-finals and subsequently to the semifinals and final. In case a knockout game ended in a draw, two 15-minute overtime periods were played to determine a winner. If the teams were still tied after overtime, the winner was determined in the dreaded penalty shoot-out.
Table 1—Eight First-Round Groups of the 2010 FIFA World Cup

Pot   Group A             Group B            Group C              Group D
1     South Africa (85)   Argentina (7)      England (6)          Germany (5)
2     Mexico (18)         South Korea (48)   United States (11)   Australia (24)
3     Uruguay (25)        Nigeria (32)       Algeria (29)         Ghana (38)
4     France (9)          Greece (16)        Slovenia (49)        Serbia (20)

Pot   Group E             Group F            Group G              Group H
1     Netherlands (3)     Italy (4)          Brazil (1)           Spain (2)
2     Japan (40)          New Zealand (83)   North Korea (91)     Honduras (35)
3     Cameroon (14)       Paraguay (21)      Ivory Coast (19)     Chile (17)
4     Denmark (27)        Slovakia (33)      Portugal (10)        Switzerland (13)

Within each group, the teams are arranged by pots. Each team's FIFA ranking from October 2009 is given in parentheses.
To determine the eight first-round groups in the draw, the 32 finalists were divided into four pots of eight teams. Each group was then determined by selecting one team from each pot. The first pot included host South Africa with the top seven teams from the October 2009 FIFA ranking. As a result, the seeding ensured the teams from Pot 1 could not face each other in the first round. The other pots contained the qualified teams from North America, Asia and Oceania (Pot 2), Africa and South America (Pot 3), and Europe (Pot 4). Further, the drawing protocol also ensured that no two teams from the same continent could be placed in the same group, with the exception of Europe, where a maximum of two teams per group were permitted. Table 1 summarizes the pots and displays the result of the draw.

After the draw, football fans and journalists around the world discussed the varying strengths of the groups, as well as which teams got lucky and which received an unfavorable draw. These discussions were of a subjective nature. As statisticians, we have all the necessary tools to objectively evaluate the draw's impact on the 32 finalists and the performance of each team. One possible way of doing so is to use existing rankings to quantify a team's ability and then simulate the World Cup multiple times
to estimate the expected performance for each team.
Assessing the Strength of Teams

To quantify a team's strength for the simulation study, three rating systems for international football were used: the FIFA/Coca-Cola World Ranking (FIFA Rating), World Football Elo Rating (Elo Rating), and Soccer Power Index (SPI Rating) published by ESPN. All collected ratings represent the latest standings (June 1, 2010) prior to the 2010 World Cup (see Table 2). The FIFA rating (www.fifa.com/worldfootball/ranking/lastranking/gender=m/fullranking.html) is published roughly every month and is a points rating that includes performances over the previous 48 months, with more recent results weighted more heavily. Points are earned for each match played based on the result, importance of the game, and strength of the opponent. Like the FIFA rating, the Elo rating (www.eloratings.net/world.html) is based on historical data. However, instead of using a points system, the Elo rating uses a statistical model to determine pairwise winning probabilities. The Elo rating allows the calculation of winning probabilities for the two teams involved in a match, but it does not allow the computation of the probability of a draw. For this reason, that feature of the Elo rating was not used in the simulation study.
In contrast to the FIFA and Elo ratings, the SPI rating (http://soccernet.espn.go.com/spi) was designed by Nate Silver to be a forward-looking rating. It involves a multi-step procedure that takes into account the status/competitiveness of a match, assessment of offensive and defensive strength of the teams, and lineup of the players involved, as long as they are playing for club teams in the world's top four leagues (i.e., England, Germany, Italy, and Spain).

In addition, the odds from six bookmakers for an outright World Cup win on June 1, 2010, were used to determine a fourth rating system, the Bookie rating (see Table 2). The other ratings undoubtedly strive for a good representation of a team's strength, but errors in the assessment of the teams are generally of little consequence to their publishers. In stark contrast, a bookmaker's livelihood largely depends on offering accurate and reliable odds and thus forces him to be an expert on the offered bets. Consequently, betting odds have frequently and successfully been employed as a reliable source of forecasting the outcome of sporting events. To balance each bookmaker's subjective judgment, the decimal odds were averaged for each team. Then, the decimal odds, d, were first converted into winning probabilities, p, according to

p = 1 / [(1 + γ)d].   (1)
Table 2—Country Codes, FIFA, Elo, SPI, Bookie, and Penalty-Kick Ratings (Historical Record in Parentheses) for Each of the 32 Finalists, Collected and Computed on June 1, 2010

Code   Nation          Group   FIFA   Elo    SPI    Bookie   PK-Rating
BRA    Brazil          G       1611   2085   88.6   -1.74    0.64 (6-3)
ESP    Spain           H       1565   2078   87.5   -1.56    0.50 (3-3)
POR    Portugal        G       1249   1838   81.6   -3.32    0.75 (2-0)
NED    Netherlands     E       1231   2011   84.0   -2.47    0.29 (1-4)
ITA    Italy           F       1184   1922   80.0   -2.76    0.33 (2-5)
GER    Germany         D       1082   1919   82.0   -2.71    0.75 (5-1)
ARG    Argentina       B       1076   1899   84.7   -2.08    0.64 (6-3)
ENG    England         C       1068   1972   86.2   -2.00    0.25 (1-5)
FRA    France          A       1044   1855   79.5   -3.02    0.50 (3-3)
GRE    Greece          B       964    1726   71.5   -5.18    0.50 (0-0)
USA    United States   C       957    1741   78.4   -4.51    0.71 (4-1)
SRB    Serbia          D       947    1833   80.3   -4.23    0.33 (0-1)
URU    Uruguay         A       899    1819   81.4   -4.75    0.40 (3-5)
MEX    Mexico          A       895    1870   77.8   -4.61    0.45 (4-5)
CHI    Chile           H       888    1851   81.7   -4.29    0.33 (0-1)
CMR    Cameroon        E       887    1698   75.0   -4.74    0.50 (3-3)
AUS    Australia       D       886    1766   74.5   -5.15    0.60 (2-1)
NGA    Nigeria         B       883    1696   73.1   -4.90    0.60 (5-3)
SUI    Switzerland     H       866    1746   73.1   -5.34    0.33 (0-1)
SVN    Slovenia        C       860    1648   72.9   -5.82    0.50 (0-0)
CIV    Ivory Coast     G       856    1725   78.2   -3.64    0.50 (3-3)
ALG    Algeria         C       821    1536   63.3   -6.16    0.50 (2-2)
PAR    Paraguay        F       820    1730   76.0   -4.51    0.25 (0-2)
GHA    Ghana           D       800    1682   72.6   -4.50    0.33 (0-1)
SVK    Slovakia        F       777    1626   67.6   -5.77    0.75 (2-0)
DAN    Denmark         E       767    1761   76.6   -5.00    0.50 (1-1)
HON    Honduras        H       734    1725   74.8   -6.95    0.50 (1-1)
JAP    Japan           E       682    1690   70.2   -6.01    0.60 (2-1)
KOR    South Korea     B       632    1766   73.9   -5.55    0.63 (4-2)
NZL    New Zealand     F       410    1531   57.4   -7.69    0.50 (0-0)
RSA    South Africa    A       392    1548   66.7   -5.03    0.50 (1-1)
PRK    North Korea     G       285    1533   60.2   -7.49    0.50 (0-0)

The teams are sorted in descending order by their FIFA rating. The penalty-kick rating was computed as a weighted proportion of wins from a default 1-1 record and the historical record.
In Equation 1, γ represents the bookmaker's profit margin included in the decimal odds and was chosen such that the probabilities for all 32 teams add to 1. Finally, these probabilities were transformed with the logit function to obtain the Bookie rating. As a last step, all four ratings were standardized to mean 0 and standard deviation 1.
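A small Python sketch of this conversion (not the article's code) is given below; the odds list is illustrative, not the six-bookmaker averages used by the author.

    # A minimal sketch: averaged decimal odds -> normalized implied probabilities
    # (Equation 1) -> logits -> standardized Bookie rating.
    import math

    def bookie_ratings(decimal_odds):
        raw = [1.0 / d for d in decimal_odds]         # unnormalized implied probabilities
        overround = sum(raw)                          # 1 + gamma, the bookmakers' margin
        probs = [r / overround for r in raw]          # Equation 1: probabilities sum to 1
        logits = [math.log(p / (1.0 - p)) for p in probs]
        mean = sum(logits) / len(logits)
        sd = math.sqrt(sum((x - mean) ** 2 for x in logits) / (len(logits) - 1))
        return [(x - mean) / sd for x in logits]

    # Hypothetical outright-win odds for a four-team field, favorites first.
    print(bookie_ratings([4.5, 5.0, 12.0, 50.0]))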
Last, a penalty-kick rating (Table 2) for each team was computed to reflect the strength of teams historically performing well (e.g., Argentina, Brazil, and Germany) or poorly (e.g., England, Italy, and The Netherlands) in the penalty shoot-out. This rating relied on historical data of penalty shoot-outs from all competitive games since 1976 up to (but not including) the 2010 World Cup. The penalty-kick rating was computed as a weighted average of a default 1-1 record and the actual record,

rPK = (wPK + 1) / (nPK + 2),   (2)

where wPK denotes the number of shoot-out wins and nPK the number of shoot-outs in which a team participated.

A Statistical Model to Determine the Outcome of a Game

Given that Team A plays Team B, a reliable model relating the respective ratings, rA and rB, to win-draw-loss probabilities for Team A is needed. Since draws are more frequent in soccer than in most other sports, the model should realistically account for a draw. Further, it is desirable that the model be symmetric, in the sense that the probability of Team A winning against Team B does not depend on whether Team A is listed first or second in the match-up. Also, if the teams are of equal strength, the probabilities for a win and a loss should be equal.

Logistic Regression

This statistical method is used for prediction of the probability of occurrence of an event by fitting data to a logistic curve—e^t/(1 + e^t)—instead of a line, as in linear regression. Like many forms of regression analysis, it may involve several numerical predictor variables. However, the response variable is generally binary (occurrence or nonoccurrence). For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex, and body mass index.

A quick—and surprisingly reliable—method to obtain proper and realistic probabilities is to use single game odds from bookmakers and match them with a measure for the competitive balance between the two teams. Toward this end, the difference of the standardized Bookie ratings (the 'match rating'), mA,B = rA - rB, was chosen. The single game decimal odds for all 48 first-round World Cup matches were collected from the same six online bookmakers as above. To ensure symmetry, the odds were arranged such that the bookmaker's favorite was always listed first. As before, these decimal odds were then averaged for all games and converted into probabilities using Equation 1. This data was then used to model the probability of a win for Team A vs. Team B based on the match rating, mA,B, using a logistic regression model,

W(mA,B) = exp(β0 + β1 mA,B) / (1 + exp(β0 + β1 mA,B)).   (3)

Due to the aforementioned symmetry, the probability of a loss of Team A vs. Team B is then given by

L(mA,B) = exp(β0 - β1 mA,B) / (1 + exp(β0 - β1 mA,B)).   (4)

Obviously, the probability of a draw is computed as D(mA,B) = 1 - (W(mA,B) + L(mA,B)). Note that W(mA,B) + L(mA,B) < 1 (and consequently D(mA,B) > 0) when β0 is less than zero. For the collected data, the fitted parameter values were β0 = -0.6111 and β1 = 0.6262. The fitted probability functions and the single-game probabilities obtained from the bookmakers are displayed in Figure 1. To use this model for all ratings described above, the match rating, mA,B, was determined as the difference in the standardized ratings of teams A and B and then entered accordingly into equations 3 and 4. Since all standardized ratings are on the same scale, the model was not modified further. This win-draw-loss model was then used for the simulations of the group and knockout stages of the World Cup.
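The following Python sketch (not the article's code) implements the win-draw-loss model as reconstructed in equations 3 and 4, with the fitted values β0 = -0.6111 and β1 = 0.6262 reported above.

    # A minimal sketch of the win-draw-loss model; m is the match rating rA - rB.
    import math

    BETA0, BETA1 = -0.6111, 0.6262

    def win_draw_loss(m, b0=BETA0, b1=BETA1):
        win = math.exp(b0 + b1 * m) / (1.0 + math.exp(b0 + b1 * m))    # Equation 3
        loss = math.exp(b0 - b1 * m) / (1.0 + math.exp(b0 - b1 * m))   # Equation 4
        return win, 1.0 - win - loss, loss                             # draw fills the rest

    print(win_draw_loss(0.0))    # equal teams: win and loss probabilities match
    print(win_draw_loss(1.0))    # the stronger team is one standard unit better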
Simulating the World Cup

To analyze the chances of the 32 finalists prior to the 2010 World Cup, the tournament was simulated 10,000 times for all four ratings. Historically, besides
the host's 'home-field advantage,' teams from the same continent as the host have a 'home-continent advantage.' (All seven World Cups held in the Americas were won by South-American teams and nine of the 10 World Cups held in Europe were won by European teams.) To reflect this fact, previous simulation studies using the Elo rating added 100 points to the host's Elo rating and 50 points to the ratings of teams from the same continent as the host. For the standardized Elo rating, this corresponds to an increase of 0.7 points for South Africa and of 0.35 points for the other African nations. The standardized FIFA, Elo, and SPI ratings were thus adjusted accordingly for the simulations. No adjustment was made to the standardized Bookie rating, since the odds for an outright win already accommodate home-field and home-continent advantage.

To simulate each of the first-round groups, the outcomes of all six games per group were randomly determined. The win-draw-loss model was employed to calculate the three outcome probabilities of every match based on the difference in standardized ratings of the teams involved. Once the outcome of a first-round game was determined, three points were allocated to the winning team or one point was allocated to both teams in case of a draw. At the end, the four teams in a group were ranked based on their point total from largest to smallest. Since only the outcomes—not the scores—were modeled, ties were likely to occur at the conclusion of any group simulation. To resolve the ties, a random draw of the tied teams was employed to determine the final standing of the tied teams. The teams advancing from the group stage were then arranged according to the World Cup schedule for the knockout stage.

The games in the knockout stage were simulated in a similar manner as in the group stage. However, a winner had to be determined in every game. Thus, in case the simulated outcome was a draw, another outcome was generated from the win-draw-loss model to simulate overtime. If the teams were still tied, a penalty shoot-out was simulated using the penalty-kick rating, rPK, to determine a winner. In this case, the probability of Team A defeating Team B was taken as rPK,A / (rPK,A + rPK,B).
Figure 1. The averaged win-draw-loss probabilities for the better of the two teams, plotted against the match rating, obtained from the decimal odds of six online bookmakers. The observations were determined by Equation 1. Also displayed are the fitted curves for win, draw, and loss as obtained from equations 3 and 4.
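To make the group-stage procedure concrete, here is a minimal Python sketch (not the article's code) that simulates one group under the win-draw-loss model; the ratings in the example are placeholders on the standardized scale rather than values from Table 2.

    # A minimal sketch: draw an outcome for each of the six group games, award
    # 3/1/0 points, and break point ties at random, as described in the text.
    import math, random

    BETA0, BETA1 = -0.6111, 0.6262

    def win_draw_loss(m):
        w = math.exp(BETA0 + BETA1 * m) / (1 + math.exp(BETA0 + BETA1 * m))
        l = math.exp(BETA0 - BETA1 * m) / (1 + math.exp(BETA0 - BETA1 * m))
        return w, 1 - w - l, l

    def simulate_group(ratings):
        """ratings: dict of team -> standardized rating; returns teams ranked 1st-4th."""
        points = {t: 0 for t in ratings}
        teams = list(ratings)
        for i in range(len(teams)):
            for j in range(i + 1, len(teams)):
                a, b = teams[i], teams[j]
                w, d, _ = win_draw_loss(ratings[a] - ratings[b])
                u = random.random()
                if u < w:
                    points[a] += 3
                elif u < w + d:
                    points[a] += 1
                    points[b] += 1
                else:
                    points[b] += 3
        # rank by points; a random secondary key mimics the random tie-break
        return sorted(teams, key=lambda t: (points[t], random.random()), reverse=True)

    group_c = {"England": 1.1, "United States": 0.4, "Algeria": -0.9, "Slovenia": -0.5}
    print(simulate_group(group_c))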
For every simulated game, each team's standardized opponent rating also was saved. At the conclusion of every simulation run, each team's final position was determined and the number of games calculated. Also, each team's opponent ratings were averaged to determine the strength-of-schedule rating of the simulation run. Once all 10,000 simulation runs concluded, these results were used to determine each team's simulated probability of being eliminated in the group stages, the round of 16, the quarter-final, the semi-final, and the final, as well as the probability of winning the World Cup. Additionally, the strength-of-schedule ratings were averaged to estimate the expected strength-of-schedule ratings for the 2010 World Cup.

As mentioned earlier, the draw fully determined the schedule of the World Cup. Not only did it assign the group stage opponents, it also determined which opponents a team would face or avoid in the latter stages of the tournament and, thus, could have a big impact on a team's chances. To be able to judge this impact, 1,000 alternate draws were generated using
FIFA’s seeding procedure and drawing protocol. For each of the 1,000 alternate draws, the World Cup tournament was then simulated 1,000 times in the same manner as the actual draw. Thus, in this second simulation run, the effect of the actual draw on the performance of the participating teams is largely eliminated and each team’s simulated pre-draw probabilities of advancing to a given round, as well as the pre-draw expected strength-of-schedule ratings, were computed. Selected results from both simulation runs are given in tables 3 and 4 and Figure 2.
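The alternate draws can be produced in many ways; the Python sketch below (neither FIFA's official procedure nor the article's code) uses simple rejection sampling that enforces the two constraints stated earlier—at most two European teams per group and no other confederation represented twice. The pots follow Table 1; the confederation labels are added here for the check.

    # A minimal sketch: generate one alternate draw by shuffling each pot and
    # redrawing until every group satisfies the confederation constraints.
    import random

    POTS = [
        [("South Africa", "AFR"), ("Argentina", "SAM"), ("England", "EUR"), ("Germany", "EUR"),
         ("Netherlands", "EUR"), ("Italy", "EUR"), ("Brazil", "SAM"), ("Spain", "EUR")],
        [("Mexico", "NAM"), ("South Korea", "ASI"), ("United States", "NAM"), ("Australia", "ASI"),
         ("Japan", "ASI"), ("New Zealand", "OCE"), ("North Korea", "ASI"), ("Honduras", "NAM")],
        [("Uruguay", "SAM"), ("Nigeria", "AFR"), ("Algeria", "AFR"), ("Ghana", "AFR"),
         ("Cameroon", "AFR"), ("Paraguay", "SAM"), ("Ivory Coast", "AFR"), ("Chile", "SAM")],
        [("France", "EUR"), ("Greece", "EUR"), ("Slovenia", "EUR"), ("Serbia", "EUR"),
         ("Denmark", "EUR"), ("Slovakia", "EUR"), ("Portugal", "EUR"), ("Switzerland", "EUR")],
    ]

    def valid(group):
        confs = [c for _, c in group]
        if confs.count("EUR") > 2:
            return False
        return all(confs.count(c) <= 1 for c in set(confs) - {"EUR"})

    def alternate_draw():
        while True:
            pots = [random.sample(pot, len(pot)) for pot in POTS]
            groups = [[pots[p][g] for p in range(4)] for g in range(8)]
            if all(valid(g) for g in groups):
                return groups

    for letter, group in zip("ABCDEFGH", alternate_draw()):
        print(letter, [team for team, _ in group])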
Who Got Lucky? The Impact of the Draw

Judging from Table 3, it is evident that the teams in Group F (Italy, New Zealand, Paraguay, and Slovakia) got the most favorable draws. It can be argued that all these teams were at the lower end in their respective pots and also were to expect a manageable opponent from Group E in case they
finished at the top of their group, thus boosting their chances. North-American rivals Mexico and the United States (the strongest teams from the weak Pot 2) also profited from the draw by being placed into almost ideal groups. Due to their fortunate draws, both teams had realistic chances for advancing to the round of 16. Rather favorable draws could be attested to France, South Korea, and Nigeria. In the case of the latter two, both were placed in the relatively weak Group B. France was the favorite in a relatively balanced group, but was likely to face a beatable opponent from Group B in the next round. England (with the exception of the FIFA rating) also received a largely favorable draw based on a manageable group draw and the possibility of facing only opponents from the not-so-loaded groups A to D until the semi-finals. On the other hand, North Korea and South Africa received the most unfavorable draws. In the case of North Korea, being placed in arguably the toughest group all but eliminated their chances of advancing past the group stage.
Table 3—Difference in Simulated Expected Strength-of-Schedule Rating and the Simulated Pre-Draw Strength-of-Schedule Rating for Each of the Four Standardized Ratings

Team   FIFA           Elo            SPI            Bookie
RSA    0.3412 (28)    0.5475 (30)    0.4689 (30)    0.4659 (31)
MEX    -0.5237 (1)    -0.2457 (7)    -0.1117 (8)    -0.1687 (7)
URU    -0.1628 (11)   0.0228 (18)    0.0604 (20)    0.1871 (30)
FRA    -0.3952 (2)    -0.1428 (11)   -0.0206 (15)   -0.1749 (6)
ARG    -0.0756 (15)   0.0620 (21)    -0.0269 (12)   -0.0354 (14)
KOR    -0.0457 (18)   -0.1637 (10)   -0.1514 (7)    -0.0666 (12)
NGA    -0.2639 (6)    -0.1203 (12)   -0.0628 (10)   -0.0156 (16)
GRE    -0.1753 (10)   -0.0108 (15)   0.0943 (23)    0.0741 (19)
ENG    0.0956 (22)    -0.3470 (6)    -0.2000 (6)    -0.1652 (8)
USA    -0.2286 (8)    -0.4648 (1)    -0.3785 (5)    -0.3623 (1)
ALG    -0.0669 (16)   -0.1909 (9)    0.2407 (26)    0.0849 (21)
SVN    0.1315 (26)    -0.1957 (8)    -0.0225 (14)   0.0991 (23)
GER    0.0998 (23)    0.1298 (26)    0.1876 (24)    0.1761 (29)
AUS    -0.1322 (12)   0.0497 (19)    0.0717 (21)    0.0878 (22)
GHA    -0.0404 (19)   0.1060 (24)    0.1893 (25)    0.1078 (24)
SRB    0.0117 (20)    -0.0054 (16)   0.0112 (18)    0.1238 (27)
NED    -0.0781 (14)   -0.0412 (13)   -0.0076 (17)   -0.0566 (13)
JAP    -0.0579 (17)   0.1262 (25)    0.0814 (22)    -0.0706 (11)
CMR    -0.1989 (9)    0.0082 (17)    -0.0759 (9)    -0.1315 (9)
DAN    0.1023 (24)    0.0752 (22)    -0.0294 (11)   -0.0283 (15)
ITA    -0.3595 (3)    -0.4219 (2)    -0.6457 (3)    -0.3332 (3)
NZL    -0.2896 (5)    -0.3497 (5)    -0.5048 (4)    -0.2356 (5)
PAR    -0.2336 (7)    -0.4054 (4)    -0.6774 (2)    -0.3583 (2)
SVK    -0.3536 (4)    -0.4055 (3)    -0.6809 (1)    -0.3190 (4)
BRA    -0.0837 (13)   -0.0141 (14)   0.0241 (19)    0.1167 (25)
PRK    0.9250 (32)    0.5786 (32)    0.7062 (32)    0.6802 (32)
CIV    0.2968 (27)    0.0821 (23)    -0.0163 (16)   0.0632 (18)
POR    0.1153 (25)    0.0547 (20)    -0.0227 (13)   0.0745 (20)
ESP    0.0868 (21)    0.1926 (27)    0.2449 (27)    -0.0725 (10)
HON    0.3614 (29)    0.4917 (29)    0.3077 (28)    0.1683 (28)
CHI    0.6079 (31)    0.4371 (28)    0.3426 (29)    0.0535 (17)
SUI    0.5076 (30)    0.5555 (31)    0.5562 (31)    0.1229 (26)

The respective ranks are in parentheses. Positive values indicate the expected strength-of-schedule rating is higher than the pre-draw strength-of-schedule rating. Differences with absolute value greater than 0.40 are bold. Teams are arranged according to their first-round groups.
South Africa received one of the toughest draws for any World Cup host in history, with Mexico and France being arguably the strongest teams out of their
respective pots. Chile, Honduras, and Switzerland were not only placed into a group with top-favorite Spain (leaving them to compete for second place),
but advancing past the group stage would also have paired them up with one of the strong teams from Group G.
Table 4—Top 10 Over- and Under-Achieving Teams at the 2010 FIFA World Cup for Each of the Four Ratings

P(matching or exceeding actual result)

FIFA           Elo            SPI            Bookie
URU   0.1064   URU   0.1244   ESP   0.1206   URU   0.0619
NED   0.1191   GHA   0.1361   NED   0.1505   ESP   0.1538
PAR   0.1971   ESP   0.1739   GHA   0.1542   GHA   0.1768
GHA   0.2006   PAR   0.2000   URU   0.1904   NED   0.1813
GER   0.2193   NED   0.2017   GER   0.2111   JAP   0.2664
KOR   0.2233   GER   0.2800   JAP   0.2385   PAR   0.2717
ESP   0.2303   JAP   0.3035   PAR   0.3043   GER   0.2747
JAP   0.2747   SVK   0.3500   NZL   0.3498   KOR   0.3118
ARG   0.3981   ARG   0.4325   SVK   0.3518   SVK   0.3795
NZL   0.4180   KOR   0.4678   KOR   0.4048   NZL   0.3917
P(matching or falling short of actual result)

FIFA           Elo            SPI            Bookie
ITA   0.0378   ITA   0.0365   ITA   0.0402   ITA   0.0331
FRA   0.1176   FRA   0.1526   SRB   0.1758   FRA   0.0975
CMR   0.1705   SRB   0.2088   FRA   0.1841   SRB   0.2348
NGA   0.1751   CMR   0.2973   CMR   0.2055   CMR   0.2357
SRB   0.2578   NGA   0.2982   NGA   0.2398   NGA   0.2656
ALG   0.2762   HON   0.4092   HON   0.3588   ENG   0.3784
HON   0.4234   ENG   0.4244   CIV   0.4398   ALG   0.4290
BRA   0.4291   ALG   0.4530   ENG   0.4427   CIV   0.4754
GRE   0.4498   BRA   0.5287   DAN   0.5108   HON   0.5676
AUS   0.5796   DAN   0.5573   ALG   0.5505   DAN   0.5712

These rankings are based on the simulated probabilities for matching or exceeding the actual result (overachieving) or matching or falling short of the actual result (underachieving) before the start of the World Cup.
Of the favorites, Germany and Spain received draws with stronger opponents on average than expected. In Germany’s case, this was mostly due to a balanced group with dark horses Serbia and Ghana. In contrast, Spain was the overwhelming favorite for finishing on top of Group H. However, the choice of strong opponents for the round of 16 from Group G had a rather negative impact.
Surprises, Disappointments, and Upsets

The simulation studies not only enable us to judge the impact of the draw, but also to look back and evaluate the
performances at the World Cup. To evaluate performances of the 32 teams, a ‘p-value’ approach was taken to judge a team’s result based on its simulated probabilities for the four ratings. To judge the most surprising performances, the p-value was computed as the (simulated) probability of reaching at least the actual result. To illustrate, the p-value for the U.S. team—which was eliminated in the round of 16—would be the probability of reaching the round of 16, the quarterfinals, the semi-finals, the final, or winning the World Cup. Similarly, to evaluate the most disappointing performances, a p-value can be
computed as the (simulated) probability of reaching, at most, the actual result. Again, for the U.S. team, this would be computed as the probability of finishing at the bottom of Group C, finishing third in Group C, or reaching the round of 16. The 10 smallest p-values of each kind are displayed in Table 4 for all four ratings. From these p-values, Uruguay's march to the semi-finals can be seen as the most surprising performance, while Italy's exit at the bottom of Group F is the most disappointing. Besides these, positively surprising performances also can be attested to the finalist, The Netherlands; the best African team, Ghana; winner, Spain; Paraguay; and the entertaining German team.
Table 5—Top 10 Most Surprising Results at the World Cup

FIFA                    Elo                     SPI                     Bookie
ESP-SUI  0-1  0.103     ESP-SUI  0-1  0.117     ENG-ALG  0-0  0.140     ESP-SUI  0-1  0.111
FRA-RSA  1-2  0.113     FRA-RSA  1-2  0.129     ESP-SUI  0-1  0.140     ITA-NZL  1-1  0.139
ITA-NZL  1-1  0.160     SVK-ITA  3-2  0.134     ITA-NZL  1-1  0.142     SVK-ITA  3-2  0.144
SVK-ITA  3-2  0.180     ENG-ALG  0-0  0.146     FRA-RSA  1-2  0.157     ENG-ALG  0-0  0.170
KOR-GRE  2-0  0.206     ITA-NZL  1-1  0.166     SVK-ITA  3-2  0.161     FRA-RSA  1-2  0.199
RSA-MEX  1-1  0.225     RSA-MEX  1-1  0.198     PAR-NZL  0-0  0.177     PAR-NZL  0-0  0.212
PAR-NZL  0-0  0.246     SRB-GHA  0-1  0.222     SRB-GHA  0-1  0.222     FRA-MEX  0-2  0.227
CIV-POR  0-0  0.250     POR-BRA  0-0  0.233     DAN-JAP  1-3  0.241     GER-SRB  0-1  0.231
NZL-SVK  1-1  0.255     ENG-USA  1-1  0.240     RSA-MEX  1-1  0.245     ENG-USA  1-1  0.240
JAP-CMR  1-0  0.256     PAR-NZL  0-0  0.253     AUS-SRB  2-1  0.250     JAP-CMR  1-0  0.250

The probabilities are based on the match outcomes (win, draw, loss), not the match scores.
Also, the hosts of the 2002 World Cup—South Korea and Japan—were performing better than expected. Although being eliminated in the group stage, New Zealand's third-place finish in Group F (ahead of defending champion Italy) also should be counted as one of the tournament's most surprising moments. On the flip side, Italy was joined by fellow 2006 finalist France, European dark horse Serbia, and two of Africa's most hopeful teams—Cameroon and Nigeria—as the most disappointing teams of the 2010 World Cup. It is also interesting to see that although Brazil reached the quarter-finals, its performance is seen as one of the biggest disappointments of the World Cup. Also, the English exit in the round of 16 at the hands of Germany was rather disappointing. We also can use the match-ratings and the win-draw-loss model to determine the least likely outcomes of the 64 matches played in South Africa (see Table 5). Judging from these, Switzerland's win against Spain at the beginning of the World Cup was the most surprising result of the 2010 FIFA World Cup, followed by South Africa's win against France, Italy's draw against New Zealand and loss to Slovakia, and England's draw with Algeria. It should be noted that most of the tournament's upsets happened in the group stage. The most surprising result from the knockout stages was
The Netherlands’ defeat of Brazil in the quarter-final, with probabilities ranging from 0.257 (FIFA) to 0.400 (Bookie).
The U.S. Team: Mission Accomplished or Missed Chance?

When Ghana beat the U.S. team during overtime in Rustenburg, many pundits and fans pointed out that while reaching the round of 16 was the announced U.S. goal, a major opportunity of making a deep run had been missed. Based on the simulation results, we also can evaluate these statements further. The p-values introduced earlier show the probabilities of the U.S. team reaching at least the round of 16 ranged from 0.509 (FIFA) to 0.607 (SPI), while the probabilities of reaching at most the round of 16 ranged from 0.690 (SPI) to 0.784 (Elo). Thus, an early exit was more likely for the U.S. team than a deep run into the latter stages of the World Cup, and based on pre-tournament expectations, the performance can be considered a success. In particular, topping Group C in front of England was an unlikely feat, with simulated probabilities ranging from 0.187 (Bookie) to 0.268 (FIFA). However, the U.S. team finishing on top of Group C is the basis for the 'missed chance' argument: Germany was avoided in the round of 16 and Uruguay and South Korea seemed comparatively manageable opponents for the quarter-final.
To investigate this claim further, simulated (conditional) probabilities were used to assess the chances of the U.S. team reaching the quarter-final and beyond under the condition that the round of 16 was reached. These probabilities were computed for the expected performance before the draw, before the start of the tournament, and once the actual knock-out bracket was set (see Figure 2). Again, it can be seen that the draw was favorable regarding a deep run into the latter stages of the tournament. These chances improved by an even bigger margin once the final bracket was set. Indeed, these results indicate that a place in the semi-finals would not have been unrealistic and the 2010 World Cup could be viewed as a missed chance. One might conclude that while being initially placed in the weak Pot 2 prior to the draw, a favorable series of events would have allowed the U.S. team a deep run, but also enabled them to achieve their initial goal for the World Cup (i.e., the round of 16).
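The 'p-value' and conditional summaries used in this section are straightforward to compute from simulated finishes. The Python sketch below (not the article's code) illustrates the idea with a made-up toy sample; unlike the article, it does not distinguish third- and fourth-place group-stage exits.

    # A minimal sketch: tail probabilities of matching-or-exceeding and
    # matching-or-falling-short of an actual result, plus conditional
    # probabilities of later stages given that the round of 16 was reached.
    STAGES = ["group", "round of 16", "quarter-final", "semi-final", "final", "champion"]

    def summaries(sim_stages, actual):
        n = len(sim_stages)
        rank = {s: i for i, s in enumerate(STAGES)}
        a = rank[actual]
        at_least = sum(rank[s] >= a for s in sim_stages) / n     # over-achievement p-value
        at_most = sum(rank[s] <= a for s in sim_stages) / n      # under-achievement p-value
        reached_r16 = [s for s in sim_stages if rank[s] >= 1]
        conditional = {
            stage: sum(rank[s] >= rank[stage] for s in reached_r16) / len(reached_r16)
            for stage in STAGES[2:]
        }
        return at_least, at_most, conditional

    # Hypothetical 10-run toy sample for illustration only.
    sims = ["group", "round of 16", "group", "quarter-final", "round of 16",
            "group", "round of 16", "semi-final", "group", "round of 16"]
    print(summaries(sims, "round of 16"))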
Figure 2. Simulated probabilities for the U.S. team of reaching at least any of the latter stages of the World Cup under the condition that the second round was reached (one panel for each of the FIFA, Elo, SPI, and Bookie ratings). Represented by circles are the simulated probabilities for the actual World Cup after the conclusion of the group stage. The simulated (conditional) probabilities for the actual draw and pre-draw are displayed by triangles and pluses, respectively. The x-axis indicates quarter-final (QF), semi-final (SF), final (F), and winning the World Cup (1st).

Looking Back One Last Time

Analyzing the World Cup schedule, it becomes evident that the draw leveled the field more than expected by handing Brazil and Spain unfavorable draws, while boosting the chances of top teams such as France and Italy. However, the latter two delivered the most disappointing performances despite a favorable draw. It also is notable that the most surprising results of the World Cup happened almost exclusively in the group stage and the second round had only minor upsets (e.g., The Netherlands over Brazil and Ghana over the USA). In the end, the World Cup delivered a final between two top-four teams and was won by a Spanish team that was always considered a front-runner for the world's crown.
Further Reading Elo, A.E. 2008. The rating of chess players, past and present. San Rafael: Ishi Press.
Fédération Internationale de Football Association. 2007. How are points calculated in the FIFA world ranking? www.fifa.com/mm/document/ fifafacts/r&a-wr/52/00/97/fs-590_10e_ worldrankingpointcalculation.pdf. Hosmer, D., and S. Lemeshow. 2000. Applied logistic regression (2nd ed.). New York: John Wiley & Sons. Hunt, C. 2006. The complete book of soccer. Buffalo, New York: Firefly Books Inc. Leitner, C., A. Zeileis, and K. Hornik. 2009. Forecasting sports tournaments by ratings of (prob)abilities: A comparison for the EURO 2008. International Journal of Forecasting.
Rathgeber, A., and H. Rathgeber. 2007. Why Germany was supposed to be drawn into the group of death and why it escaped. CHANCE 20:22–24. Silver, N. 2009. A guide to ESPN’s SPI rating. http://soccernet.espn.go.com/worldcup/story/_/id/4447078/ce/us/guide-espnspiratings?cc=5901&ver=us. Stark, S. D., and H. Stark. 2010. World Cup 2010: The indispensable guide to soccer and geopolitics. Indianapolis, Indiana: Blue River Press.
CHANCE
23
A New Analysis of the Golf Handicap System: Does the Better Player Have an Advantage? James R. Lackritz
G
olf is one sport that allows two (or more) people of unequal abilities to compete on a predefined level, as established by its handicap system. Golfers with established handicaps can get together on any course and have a (perceived) fair competition. A “people vs. pros” event (most recently held in October 2009) gives amateurs at all levels the opportunity to qualify to play against PGA tour professionals. According to PGA Tour player Fred Couples, “I have watched this event for many years and think it is great for the game of golf. No other game can offer an amateur the ability to successfully compete with a professional. The handicap system makes this event a unique stop on the PGA Tour. I look forward to a great match with an amateur who will be doing their best to beat me at Pinehurst in October.” Individual, team, and tournament competitions exist on a daily basis for golfers of all levels. Regularly scheduled amateur golf events give even the worst golfers an opportunity to compete with or against the best golfers. Handicapping exists in other sports, but methods of determining the appropriate differential are not as scientific. In swimming or running, the weaker athlete could be given a time or a distance advantage. In tennis, there is a rating system from 2.0–7.0. But it is unclear when a 5.0 player goes against a 4.0 player what the fair handicap is that the lesser (4.0) player receives? In golf, the handicap system dictates exactly how
24
VOL. 24, NO. 2, 2011
many strokes the weaker player should be getting from the stronger player. Francis Scheid and S. M. Pollock are pioneers of research on the handicap system with Scheid’s 1971 article in Golf Digest and each with separate 1977 articles in Optimal Strategies in Sports. Scheid suggested the better golfer would have a winning proportion of 60–85%, if the handicap difference between the two players was more than three strokes. Pollock suggested that, under certain modeling assumptions, the better golfer has an advantage in both medal (total strokes) and match play. In match play, each hole is an individual contest. Whoever wins the most holes wins the match, regardless of the amount by which the hole is won.
Table 1—An Example of the Calculation of a Golfer’s Handicap from 20 Rounds of Golf Round
Adjusted Score (X)
Course Rating (R)
Course Slope (S)
D = 113 x (X-R)/S
1
83
70.5
121
11.7*
2
79
71.8
125
6.5*
3
89
70.5
121
17.3
4
90
69.8
118
19.3
5
79
70.5
121
7.9*
6
87
70.5
121
15.4*
7
94
70.8
127
20.6
8
82
71.8
125
9.2*
9
87
70.5
121
15.4*
10
90
70.8
127
17.1
New Handicap System
11
84
71.8
125
11.0*
Scheid and Pollock’s studies were based on the old handicap system, which looked strictly at the differential between the score and the course rating and adjusted for the maximum scores allowable for a given level of handicap. In the early 1990s, the handicap system was adjusted by factoring in the course slope rating, reflecting the difficulty of the course. The slope rating suggests a multiplier of how much difference there should be between a scratch golfer (one with a 0 handicap) and one who would be considered a “bogey” golfer, who might average one over par on each hole, or approximately 90 on a par 72 course of average difficulty. Previously, the handicap was only a function of the course rating, designed as the mean score for a scratch golfer. Therefore, under the previous system, golfers who had the same distribution of scores, regardless of where they played, had the same handicap. Under the newer system, the golfer that plays at a more difficult course (or courses) would have a lower handicap and be considered a better golfer. The individual-round differential (D), the score entered into the handicap system, is calculated from:
12
80
68.8
116
10.9*
13
90
69.8
118
19.3
14
91
70.5
121
19.1
15
80
70.8
127
16
88
69.5
120
17.4
17
95
71.8
125
21.0
18
83
69.8
118
12.6*
19
88
69.8
118
17.4
20
91
70.8
127
18.0
D 113(X R)/S. In this equation, X is a player’s adjusted gross score as computed from his actual 18-hole total, adjusted for a maximum allowable number of strokes on each hole, which depends upon the player’s previous handicap; R is the course rating;
8.2*
Average of best 10 rounds
10.9
Index (Average x 0.96)
10.5
* indicates the round was one of the best 10 out of these 20
S is the slope rating of the course, which depends on the positioning of the tees that launch each hole and other factors relating to the difficulty of the course; and 113 is considered the baseline slope rating for a course of average difficulty. For example, according to the USGA handicap system, a player with a 25 handicap is allowed a maximum of eight strokes on each hole in computing her adjusted gross score. Suppose she shot a round of 18 holes and scored 100, which included taking 9 shots on each of two separate holes. In this case, her adjusted gross score would be X = 100-2 = 98. If the course is rated R = 73 and has an above-average slope rating of S = 125, the player would be given a reduced
handicap of D = 113(98 – 73)/125 = 22.6 (all differentials are rounded to the first decimal place). In actually implementing the procedure, the current handicap system averages the best 10 of a player’s most recent 20 rounds, based on the calculation of the handicap differential (D) from the earlier equation, and then multiplies this average by 0.96. Table 1 gives an example of the calculation of a golfer’s handicap. One compelling reason for including only the best 10 scores is to keep golfers from inflating their indexes from one bad round, so that one’s index reflects something closer to the lower quartile of the golfer’s differential scores.
CHANCE
25
Simulation Analysis Methods In a statistically fair system, one would expect the chance of the better golfer and weaker golfer winning to be the same (below 50%, due to possible ties). However, better golfers suggest the system should be designed so the stronger golfer has an advantage. However, in a 2000 article in The American Statistician, Derek Bingham and Tim Swartz suggest that when both players play well, the weaker golfer wins more than half the time, and there are also approximately 13% ties. But, what happens when both golfers do not play at their best level? In previous studies, all scores used in the data analyses were obtained at the same course. What happens when two people from different clubs play against each other, or people from the same club go to a different course to compete? Do Bingham and Swartz’s conclusions apply to the general golfing population? To deal with this concern, the author collected handicap and score data from golfers who belong to the same club, but who played many of their rounds at other golf courses. The handicaps established for the golfers in this study are designed to apply across any golf course that has a rating and slope. This study is accomplished by using as basic inputs in a simulation analysis the scores of a sample of golfers of greatly varying abilities and handicaps, 20 rounds for each golfer. It is first shown that their scores do not follow a normal distribution, as is generally assumed in the literature, but rather
26 26
VOL. 24,24, NO. 2, 2, 2011 VOL. NO. 2011
that the distribution is skewed. Based on 10,000 simulated matches conducted through the maximum-likelihood estimated distributions, it is shown that the less-able golfer carrying the higher handicap is much less likely to win over either a given 18-hole round or in a two-, three-, or four-round tournament than is her superior opponent.
Data and Descriptive Statistics Current handicap indexes for members of the Torrey Pines Men’s Golf Club in La Jolla, California, as of August 2009 were sorted. Torrey Pines is a municipal golf club with two courses, North and South, and annually hosts the PGA tour stop in February, formerly known as the Buick Open, this year played with the Farmers Insurance Group as a sponsor. The South course, host of the 2008 U.S. Open and 2009 LPGA Samsung World Championship, has four sets of tees for men and one for women, course ratings from 72.5–78.1, and slope ratings from 125–143. The North course has three sets of tees for men (one for women), course ratings from 70.8–73.2, and slope ratings from 125–130. As a particularly attractive and scenic public course, it is subject to a high demand from both local and visiting golfers, so tee times are difficult to obtain. As a result, almost all members of the golf club have a significant number of their rounds played away from their “home” course as part of their handicap calculation. The Torrey Pines Men’s Club has slightly more than 1,000 members with established handicaps, ranging from 0–36.4. The study focused on players at one of seven handicap levels: zero, five, 10, 15, 20, 25, and 30. The five units (strokes) of separation from one level to the next allow for a wide range of abilities from which to do the analysis. Five people were chosen from each of these seven index levels, all of whose index was within + .2 of (or indexes closest to) the desired level. (It would have been desirable to have all five
golfers in each group at the exact same index, but this was not possible.) The author then downloaded each member’s scores from their most recent 20 rounds (both raw scores and differential scores [D] adjusted for handicap) from the Southern California Golf Association website. Therefore, there are 35 people, each with 20 rounds, for a total of 700 D-scores, which are used in the analysis for this study. Most scores did not come from either Torrey Pines course. Having five golfers at each index level allows a check for consistency among golfers at the same level. In that a handicap is calculated only on the best 10 of the most recent 20 D scores, there is a possibility at any level that while two players might be similar in their distribution of their best 10 scores, one golfer could be significantly higher or lower in their other 10 scores. When two golfers play, there is no guarantee as to what part of their distribution they will shoot (in spite of their belief that they should shoot their best round every time), so a player with a lower distribution of their highest 10 scores would have a distinct advantage if the players score in the weaker half of their distribution. Also, two players that have the same handicap may have different distributions of their best 10 scores, in that some players may be more consistent, while the other players could have better low scores and higher high scores, resulting in more variation.
Results Means and standard deviations were compared for the 20 D-scores from each golfer within a given handicap group (see Figure 1). The smallest difference in means for D-scores was in the handicap groups five, 10, and 15 (each about one stroke from the lowest mean to the highest), while the largest difference was almost 2.5 strokes for the group with the 25 handicaps. The gap in standard deviations from one player to another within the same handicap group was
Figure 1. Distributions of differential (D-scores) and raw scores by handicap
CHANCE
27
Figure 2. Distribution of merged D-scores and adjusted raw scores by handicap
lowest among the 15 handicappers and highest among the 25 handicappers. The next step in the analysis was a test for differences across the handicap groups in overall distribution of net scores. The net score is the difference between the D-score and the player’s index. A player with a five handicap and a D-score of 7.2 for a round would have a net score of 2.2, whereas a person with a 10 handicap and 13.8 D-score would have a net score of 3.8. This allows for scores adjusted by handicap level. If the handicap system is fair, one would expect consistency in the distribution of net scores across each handicap group. The Kruskal-Wallis test for equal 28
VOL. 24, NO. 2, 2011
distributions across the seven groups resulted in a p-value of .00, concluding there is a difference in distributions across the handicap groups. The next question that arises is whether there is consistency within the handicap groups. Is the distribution of D-scores for all players with a 15 index consistent? A series of one-way ANOVAs tested for equal means across all scores for each golfer within each index group. The F-tests resulted in p-values of .23, .89, .85, .90, .62, .34, and .89 for the groups with handicaps 0–30, respectively. Nonparametric Kruskal-Wallis tests resulted in p-values of .23, .41, .87, .91, .77, .49, and .99, respectively.
After concluding there are not significant differences in the scoring distributions within a given handicap level, scores were merged within each handicap group to be used for further testing and comparisons among the groups. This results in 100 scores for each group. The boxplots for each group are given in Figure 2, noting that the mean score for each group is more than one stroke (for the 0 handicap group) to four strokes (for the 30 handicap group) higher than the actual index of the group. Not surprisingly, the mean increases by more than five strokes from one handicap group to the next, even though the handicap increases by exactly five strokes per
Figure 3. Histogram and probability plots for D-scores for the 15 handicap group
group. This suggests that for the overall distributions of differential scores, the differences on those final 10 scores for each player gets farther away from the index as the index gets higher. Intuitively, better golfers should be more consistent than higher-handicap golfers. Figure 2 confirms the general increase in variation as the handicaps get higher. Boxplots for the raw adjusted scores are given at the bottom of Figure 2.
How Often Does the Better Golfer Win? To address the issue of how often the better golfer wins, it is first necessary to analyze the distributions of differentials
within each handicap group. Previous studies all assumed approximate normal distributions of scores to obtain their estimates, even conceding that there could be a longer right tail. Most golfers concede that their worst scores are farther above the mean than are their best scores below the mean, which casts doubt on the normality assumption. Bingham and Swartz viewed the total score as the sum of strokes for 18 holes. Although on a typical course, there may be four par threes, 10 par fours, and four par fives, the scores for each hole are not generally considered to be independent and identically distributed (iid). Furthermore, even within holes with the same par rating, the level of difficulty
will vary. And even if the 10 par-four holes were iid (personal experience suggests they are not), the number of holes of a given level of par are fewer than what is commonly needed to satisfy the Central Limit Theorem for nonsymmetrical distributions. Figure 3 illustrates a histogram and probability plot for the D-scores for the group that had handicaps of 15. As suspected, the distribution appears to be nonsymmetrical. The histograms and probability plots for the other handicap groups are available in the supplemental material at http://chance.amstat.org. To test the normality assumption, the differential scores for each handicap group were used as inputs into the CHANCE
29
Table 2—P-Values for Distribution Goodness-of-Fit Tests (Best Fit in Bold) 3-Parameter Group
Normal
Lognormal
Weibull
0
.256
.01
5
.14
10
.22
.05
.25
.29
.13
.04
15
.02
.20
.01
.01
.18
.01
20
.04
.30
.01
.04
.25
.21
25
.39
.43
.02
.47
>.250
>.25
30
.14
.02
.01
.01
.09
>.25
.15 *
Weibull >.50
*
Gamma
Logistic
.01
.09
>.50
*
>.25
Weaker Golfer Handicap
* no convergence reached
Better Golfer Handicap Figure 4. Percentage of time better golfer wins (ties)
distribution-identification algorithm of Minitab, version 15.1. The results are the possible distribution options given in Table 2. The p-values are given for each test for each group. At the =.05 level of statistical significance, the normal distribution is rejected for the groups with 15- and 20-handicap golfers. Furthermore, for no group does the normal distribution give the best fit. Based on the results from the tests in Table 2, the individual distributions that give the closest fit were identified with the maximum likelihood estimates (MLE) of the parameters of the distribution. To give a more accurate evaluation of the overall chances of golfers winning under any conditions, the entire distribution of scores is considered. From the distributions determined to be the best 30
VOL. 24, NO. 2, 2011
fit from Table 2 with the MLE, 10,000 random differentials were generated for each handicap group. These scores were compared across each pair of handicap groups. After the differentials were adjusted for handicaps, the results were compared to see who would win in the 10,000 “random” rounds. As the assumed distributions are continuous (differentials are normally rounded to the first decimal place), the decision rule was that the player had to win by more than .5 strokes (after handicap adjustments) to win the match. Therefore, a golfer with a five handicap playing a golfer with a 15 handicap has to win the simulated match by more than 10.5 strokes, and the weaker golfer wins if his simulated score is no more than 9.5 strokes worse than that of his opponent.
The results from each pair of 10,000 matches are given in a heat map in Figure 4, for which the darkest cells are the matchups that give the strongest chance of the better player winning. The first number is the percentage of the time the better golfer wins, and the number in parentheses represents the percentage of ties. For example, in the 0/5 matchup, the 0-handicap golfer wins the match 51.1% of the time, with 8.1% ties, so the 5-handicapper wins 40.8% of the time. While the percentages of the better golfer winning are not as high as Scheid originally suggested, the better golfer has a clear advantage. Only one match (5 vs. 10) is even, and the closest matches are ones in which the gap between the two golfers is only five strokes. As the gap gets larger,
Chance of Winning Chance of Winning
Index
Index Figure 5. Percentage of time each golfer wins in a stroke play tournament
CHANCE
31
Figure 6. Results for better golfer if golfers play at same percentile of their distribution
32
VOL. 24, NO. 2, 2011
the odds of the better golfer winning increase, up to the 65.5% chance that the scratch (0-handicap) golfer has over the 30 handicap golfer, who only wins this match 28% of the time. As the handicap gap increases between the two golfers, the odds of the better golfer winning seem to improve linearly—so consistently that linear regression trend models based to predict the percentage of how often a golfer wins have an R-squared value of almost 89%. The trend models, with all coefficients having p-values of .00, are W1 = 45.545 - 0.631 H1 + 0.663 H2 and W1+T = 54.810 - 0.651 H1 + 0.567 H2. In these models, W1 is the percentage of the time the better golfer wins; W1+T is the percentage of the time the better golfer wins or ties; H1 is the handicap of the better golfer; and H2 is the handicap of the weaker golfer. This model now allows an estimate of the winning odds for any match between two golfers. It is plausible to suspect that the better golfers would have an inherent advantage in a multiple-round tournament. Using the simulated differential scores, one-, two-, and four-round tournament results were generated in which one player from each handicap group participated. The results of this competition are given in Figure 5. In a one-round tournament, the scratch golfers and five-handicap golfers each win about 18% of the time; the 30-handicap golfer only wins about 10% of the time. In a two-round tournament, the winning percentages increase for the better golfers and decrease for the weaker ones. For a four-round tournament, the gap from top to bottom gets even wider, as the scratch golfer wins 28% of the time, but the worst golfer carrying a 30 handicap only wins 6% of the time.
Playing at the Same Level of Distribution Bingham and Swartz suggested that when both golfers play at the top of their games, the weaker golfer with the higher handicap will have an advantage of up to two strokes. They estimate that the weaker golfer wins or ties anywhere from 63.5% to 65.6% of the matches when both players play well.
"In a statistically fair system, one would expect the chance of the better golfer and weaker golfer winning to be the same (below 50%, due to possible ties). However, better golfers suggest the system should be designed so the stronger golfer has an advantage." To verify this claim, D-scores were generated for the present set of golfers, for each handicap range for deciles ranging from one to nine (10th percentile to 90th percentile) from the distributions generated with the MLE parameter estimates. Scores for each decile were compared between pairs of handicap groups to determine who would win when both players shoot at comparable levels. The results are given in Figure 6. The white, silver, and gray bars give the percentiles at which the better golfer wins, ties, and loses. For the 0/5 matchup, the 0-handicap golfer wins when both players play at the same level for all percentiles from 30–90. They tie at the 20th percentile, and the 5-handicapper wins if both play their best golf, at the 10th percentile. When playing at the same level, the weaker player rarely wins, except for an occasional win when both play at their best 10% for matchups of zero vs. five, 10 vs. 20, and 10 vs. 25 handicaps. Additionally, the simulation results showed that for matches between bad golfers who are not playing well, the weaker golfers win at their worst level (90th percentile) for the 20 vs. 25 and 20 vs. 30 matches and the 30-handicap golfer also defeats the 25-handicap golfer at the 80th percentile. The simulation results of Figure 6 make it apparent that the better golfer has a distinct advantage when the golfers play at the same level.
CHANCE
33
Conclusions Golfers want to believe they have a chance to win and the handicap system is fair. Fair competition creates more opportunities and more excitement, which, in turn, should generate more interest in the game. This article tests that concept, modeling golfers’ play within their distributions of scores according to their handicaps. Examining the entire distribution gives a better overall assessment of the chances of the better player winning under any conditions, over different courses with different levels of difficulty. Five players were selected from each of seven handicap levels. The database includes the handicap-differential scores for all 20 rounds for all players from their handicap calculations. Most rounds in the database were not played at a player’s home course. Previous studies assume one’s golf scores to be approximately normally distributed. Model fitting shows, however, that the data do not fit a normal distribution for handicap groups 10 and 15. Indeed, for none of the groups does the normal distribution give the best fit to the 100 data points collected for that group. The distributions were thus fit for each group using maximum-likelihood parameter estimates, and 10,000 random scores for each group were generated from each distribution. In the random matchup, the better, lower-handicap golfer has a distinct advantage over a weaker golfer, an advantage that increases with the handicap gap. The weaker golfer must almost always play better within his distribution than the stronger golfer to have a substantial chance of winning. When each golfer plays at the same
percentile within her distribution, the weaker golfer seldom prevails. And, in one-, two-, and four-round tournaments adjusted for handicaps, the lower handicap players win more often than do their weaker opponents. By the same token, it is well within the realm of possibility that tournaments will be won more often by higher-handicap players, due to significantly higher numbers of mid-level and higher handicappers. An examination of handicap distributions finds that only a small percentage of players have handicaps lower than 10. The study does not account for equitable stroke control, which allows a maximum score to be reported at various handicap levels and is designed to keep golfers from inflating their handicaps from one or two bad scores in a round. Casual observation formed over more than 40 years on the links, however, suggests better golfers would have fewer adjustments for bad holes than would higher-handicap golfers. But, for competition among players in the 20–30 handicap range, one would expect more high scores on individual holes. The handicap system also assumes honest reporting of scores. Yet, in almost every amateur tournament some golfers are accused of sandbagging, in which they seem to play better in competitive events than their index would have indicated. In particular, they are accused of intentionally scoring higher in their practice rounds so that they can play with a higher handicap in tournaments. The tournament director can report this to the association and request that the player’s handicap be adjusted accordingly.
Individual hole-by-hole scores could not be obtained, so this study does not analyze potential results for match play. One might intuitively reason that if total scores produce an advantage for the better player, then an advantage would similarly exist on individual holes adjusted for the handicap. If one believes the better player should have an advantage somewhat proportional to the difference in handicap levels, the current handicap system appears to accomplish this objective.
Further Reading Bingham, D. R., and T. B. Swartz. 2000. Equitable handicapping in golf. The American Statistician 54(3):170–177. Heiny, Erik L. 2008. Today’s PGA tour pro: Long but not so straight. CHANCE, 21(1):10–21. Kelley, B. Accessed 2011. Golf handicap FAQ: What is equitable stroke control? http://golf.about.com/cs/ handicapping/a/whatisesc.htm. Pollock, S. M. 1977. A model of the USGA handicap system and ‘fairness’ of medal and match play. In Optimal Strategies in Sports, ed. S. P. Ladany and R. E. Machol, 141–150. Amsterdam, North Holland: Elsevier. Scheid, F. 1975. You’re not getting enough strokes. The Best of Golf Digest 32–33. Scheid, F. 1977. An evaluation of the handicap system of the United States Golf Association. In Optimal Strategies in Sports, ed. S. P. Ladany and R. E. Machol, 151–155. Amsterdam, North Holland: Elsevier.
Visit The Statistics Forum—a blog brought to you by the American Statistical Association and CHANCE. http://statisticsforum.wordpress.com
34
VOL. 24, NO. 2, 2011
How Good Is Your Eyeballing? David Rockoff and Heike Hofmann
S
o, how good is your eyeballing? The eyeballing game (http:// woodgears.ca/eyeball) by Matthias Wandell provides you with an opportunity to put a number to your skill. A set of seven geometrical tasks reveals to us the limitations of our graphical perception. For an overview of the tasks, see figures 1 and 3. Players are challenged to solve, as accurately as possible, tasks such as identifying the midpoint between two
points, completing a parallelogram given two of the edges, or finding the center of a circle. Based on data recorded from 23,000 game plays over three days, we are interested in evaluating how these tasks relate to statistical graphics. The main purpose in drawing a data graphic is to efficiently and effectively communicate information to an audience. We have to distinguish between graphics for exploratory
data analysis and presentation graphics. The former are graphics for ourselves used during the exploratory process of a data analysis, while the latter are polished graphics we use to present our findings to the world. Both types should convey information in a way that is easily and accurately interpreted by the reader while minimizing cognitive burden. For any display, the reader should be able
CHANCE
35
Find the mid-point of the line segment
Bisect the angle
Mark the point equidistant to the edges
Mark the center of the circle
Make a right angle
Find the point of convergence
Figure 1. Tasks of the eyeballing game, from top left: midpoint identification, angle bisection, center (of the incircle) of a triangle, right angle, convergence
to glean information that is concise, accurate, and easily digestible. William Cleveland and Robert McGill conducted experiments to determine the effectiveness of graphical elements in their 1984 JASA and 1985 Science articles. We use the eyeballing game as another source of data to assess individuals’ performance on particular perceptual tasks. Our main objectives with these data are the following: 1. To rank the set of tasks according to their level of difficulty in the overall population 2. To assess short-term learning potential. What effect does multiple game-playing have on a player’s perceptual accuracy? 3. To find strategies of the best. We use hierarchical clustering to determine whether players used particular strategies to achieve their best scores. 36
VOL. 24, NO. 2, 2011
4. To determine whether there exist subgroups in the population that exhibit similar performance on different tasks (i.e., subsets of individuals who have particular difficulties on one set of tasks, but perform well on others).
The Game The eyeballing game consists of a set of seven geometric tasks as shown in Figure 1: 1. Finishing a parallelogram given two edges [PG] 2. Finding the midpoint between two points [MD] 3. Bisecting an angle [BA] 4. Identifying the center of the incircle of a triangle [TC] 5. Finding the center of a circle [CC]
6. Making a right angle given a line [RA] 7. Finding the point of convergence [CG] of three given lines Most of these tasks are straightforward, but some are quite complex and allow for different strategies in finding a solution. For example, locating the center of a triangle can be approached in different ways (see Figure 2). The center can be found as the center of the triangle’s incircle, which requires the player to first imagine the incircle, and then find the center of this circle. This is similar to finding the center of a circle, if a step more complex. Alternatively, the task can be solved by finding the point of convergence of the three lines bisecting each of the triangle’s angles. This alternative strategy makes the task more closely related to both the angle bisection and the convergence. For its complexity alone, the triangle center should be a fairly difficult task; in
(a) Strategy 1
(b) Strategy 2
Figure 2. Sketch of two strategies for identifying the center of a triangle. On the left, the solution is found as the center of the incircle; on the right, the solution is the point of convergence of the three lines bisecting the angles.
Adjust to make a parallelogram
Adjust to make a parallelogram
Adjust to make a parallelogram
Figure 3. The parallelogram task in the eyeballing game. The player drags the point located in the box to form a parallelogram with the three other points. The two lines that are joined at the box move with the mouse pointer to aid the player, while the other two lines remain fixed. The image on the right shows the final choice of the user (the point at the center of the box in the interior) together with the correct solution.
particular, we should find support in the data that it is a task more difficult than any of the other tasks it seems related to. While playing the game, various visual guides are offered for each task to aid the player in identifying the correct location. In the parallelogram task (illustrated in Figure 3), two connected line segments are given, forming half of a parallelogram. Two additional lines are shown extending from the endpoints of the fixed segments to the mouse position, following its every move. These guides aid the player by giving instantaneous feedback of the current state of the task.
After each attempt, the correct location is shown along with the player’s final choice of location. The exact error is given at the bottom of the playing field. The score, or error, is calculated as the distance between the player’s choice and the correct location. Distance is measured in pixels for all but the two angle tasks. The error in the right angle and bisect angle tasks is measured as twice the degrees between the correct line and the player’s solution. A full game consists of cycling through each of the tasks three times. An average score across all tasks is calculated. Lower scores indicate better performance.
The “playing field” consists of a rectangle measuring approximately 500 x 520 pixels. Since error is measured as distance in pixels, the theoretical worst score should be 721. With some extra effort, however, it is possible to achieve even worse scores. The majority of games have scores in the single digits for most tasks, but a few games have scores that are extraordinarily high—over 1,000 in some cases— leaving us to wonder whether players 1) did not understand the task to be performed, 2) just clicked randomly on the playing field, or 3) attempted to get as poor a score as possible. We would
CHANCE
37
Any game with a task score above its corresponding cutoff was excluded. This leaves 22,188 games for analysis, slightly under 96% of the original data (see Table 1 for a summary of the filtered data).
What’s Difficult? What’s Easy?
like to exclude scores for these games from our analysis, since our focus is on accuracy with which tasks can be performed. The trouble lies in demarcating a clear cutoff between actual attempts and random noise. There is an additional reason to remove extreme values from the data set. For example, Figure 4 is a scatterplot of two tasks: triangle center and circle center. At first glance, this might be strong evidence toward strategy 1—there appears to be a strong linear relationship between performance on the two tasks. In fact, the correlation is 0.915. However, this relationship is artificially enhanced by high-leverage points by games with exceptionally high scores on both tasks. When the scores of these players are removed from the data set, the relationship is much less prominent. A restriction to scores under 100 removes less than 5% of the data points, but lowers the correlation to only 0.315. If we limit scores to 10 and under, the correlation becomes 0.252. 38
VOL. 24, NO. 2, 2011
While we cannot decide on players’ motivation based on their scores from a game, we employed the following filters to eliminate the worst games: Circle Center: Circles have mostly a radius of 100 2 Random Points: The average distance between two randomly placed points on the playing field is 266 (see www.math.kth.se/~johanph/habc.pdf) 3 Random Points: For the midpoint task, the average error based on random clicks for a solution is 231 (10,000 simulations). Three random clicks are also the basis for finding an average angle between two randomly placed lines. The average angle is 48.5 degrees (based on 10,000 simulations), corresponding to an error of 98. 4 Random Points: Make up three random lines, resulting in an average error of 85 for the angle bisection task (10,000 simulations)
From Table 1, we see that the severe skewness of scores is one of the most prominent features in this data set. Generally, scores were the highest for the parallelogram, while circle center and midpoint were the easiest of the non-angle tasks. More than 75% of the scores on these tasks are under 10. Convergence, parallelogram, and midpoint tasks appear to have the greatest variation in scores, while circle center and bisect angle appear to have the smallest. Convergence is the only task for which mean and median disagree to the point it has an effect on the ordering of task difficulty. The mean score on convergence is much higher than the median, indicating this task has even more large outliers than the others. Figure 5 shows histograms of scores for each task. Plots have been restricted to scores under 15 to better compare score distributions by task. Among the most difficult tasks, more players did extremely well with convergence than with parallelogram (38% scored under five, compared to 26%), but more players also did poorly (15% scored above 20 on convergence, compared to 12% on parallelogram), suggesting the existence of subpopulations in the data: those players who are good at finding the point of convergence and those who have difficulty with it. Wilcoxon rank tests show that all pairwise differences between tasks are significant at well below the 0.0001 level, validating the above ordering of tasks. One way to eliminate the impact of a player’s overall ability and emphasize relative difficulty of tasks is to switch focus from absolute task scores to task rankings: For each game, the task on which the player scored the best is assigned a 1 and the task on which the player scored the worst is assigned a 7. In almost one out of every three games, the player’s best task—or tied for best—was the right angle; a similar number had convergence as their poorest. Figure 6 illustrates the distribution of ranks by task.
Table 1—Descriptive Statistics by Task Task Bisect Angle Right Angle Circle Center Midpoint Convergence Triangle Center Parallelogram Overall Average
Min 0.1 0.0 0.0 0.0 0.3 0.6 0.7 0.9
1st Qu 2.1 2.2 3.1 2.9 3.9 4.2 4.9 4.2
Median 3.3 3.7 4.3 4.4 6.1 6.2 7.5 5.8
Mean 4.6 6.7 5.3 8.7 13.3 10.3 13.3 8.9
3rd Qu 5.0 6.3 5.8 6.9 11.5 10.1 12.3 9.3
Max 84.6 97.8 99.9 229.9 254.4 256.7 255.4 130.6
Std. Dev. 6.2 11.0 6.7 17.5 21.1 14.2 20.2 9.6
Tasks are ordered according to increasing median scores. As hypothesized earlier, finding the triangle center is one of the harder tasks.
Density
Figure 4. Scatterplots comparing scores of triangle center and circle center tasks. The strong linear relationship apparent in the left plot disappears when we zoom in on the majority of points. The plot on the left shows all games. The second plot is a zoom into games with scores of less than 100 on both tasks (95.5% of all games). The third plot further zooms into games with scores of 10 or lower on both tasks (62.5% of all games). The lines on top show a smooth line of the relationship between X and Y variable.
Score Figure 5. Multiple histograms of scores by task. Tasks are ordered according to median score. Only games with individual task scores of up to 15 are shown. CHANCE
39
Percentage
Task Figure 6. Percentage of games in which each task ranked first, second, etc. Darker shades indicate better ranks. The relative ranking of tasks overall coincides with the ordering based on absolute scores.
Table 2—Number of Games per Player Games Number of players
1 5390
2 1766
Frequent Players By matching screen names and IP addresses, we are able to identify players across different games. This shows that over the time frame of our data, 22,188 games were played by 9,111 individuals. Table 3 shows the breakdown of number of games per player. Most players only played once, but we are dealing with 3,721 players with more than one game during the time period under investigation. In fact, one player came back for 129 repetitions of the game over the course of the three days. We will use data on those ‘repeat offenders’ to gain insight on whether and how much improvement in accuracy is possible. 40
VOL. 24, NO. 2, 2011
3 698
4 375
5 230
6 137
7 103
The left side of Figure 7 shows boxplots of average score by game number. It appears there was some improvement from the first game to the second game, and a decrease in interquartile range, but only tiny changes in subsequent games. It is important to note that these games are likely not the players’ actual first (second, third, etc.) games ever, but their first games over this particular three-day period. The improvement from first to second attempt also is seen in some individual tasks, particularly parallelogram, triangle center, and convergence. There is generally no improvement visually discernible
8 70
9 57
10 41
>10 244
on further attempts. For most tasks, the variance of scores appears to decrease from the first attempt to the second. On the right side of Figure 7 are boxplots for the difference in accuracy between first and second game by task. Generally, the most gain also comes with the largest variability. For the three tasks of parallelogram, triangle center, and convergence—which we previously identified to be the most difficult—we see the most improvement from first to second attempt. At the same time, these three tasks also have the three largest interquartile ranges in differences in scores between the first two attempts, again emphasizing their difficulty.
Average Accuracy Difference Between First Two Attempts
Attempt
Task Figure 7. Boxplots of scores by attempt. On the top, average scores are shown in side-by-side boxplots for attempts 1–10. On the bottom, boxplots are shown of differences in scores from first to second attempt by task. The vertical axis of both plots is restricted to better emphasize detailed structure.
CHANCE
41
Table 3—Spearman Rank Correlations Between Pairs of Tasks and Correlations Based on Task Ranks Bisect Angle Right Angle Circle Center Midpoint Convergence Triangle Center Parallelogram
BA 1.000 -0.215 -0.161 -0.186 -0.168 -0.085 -0.117
RA 0.297 1.000 -0.247 -0.245 -0.104 -0.201 -0.123
CC 0.257 0.288 1.000 -0.127 -0.196 -0.079 -0.127
MP 0.294 0.307 0.345 1.000 -0.194 -0.168 -0.123
CG 0.340 0.420 0.354 0.400 1.000 -0.208 -0.187
TC 0.334 0.344 0.338 0.393 0.457 1.000 -0.218
PG 0.340 0.400 0.347 0.443 0.493 0.454 1.000
Avg 0.500 0.578 0.504 0.630 0.742 0.706 0.739
Spearman rank correlations between pairs of tasks. From the overall correlations, the triangle task seems to be more closely related to convergence, rather than the circle center task, supporting strategy 2 over strategy 1. Below diagonal (italics): correlations based on task ranks. Correlations are expected to be negative, since ranks are constrained to sum to 28.
Table 4—Factor Loadings for Principal Components
PC1 PC2 PC3 PC4 PC5 PC6 PC7
PG -0.420 0.168 0.239 0.462 0.724
TC -0.397
CG -0.417 0.241
0.302 0.654 -0.567
0.229 0.536 -0.651
The improvement from the first attempt to the second attempt is highly significant for all tasks and the overall average, based on paired Wilcoxon rank tests. Improvement on the third and fourth attempts is significant for triangle center, convergence, and the overall average— and borderline significant for midpoint and parallelogram. It is encouraging to see that, regardless of our level of ability, we are able to improve our performance and perceptual skills by playing the game over a short range of time.
Task Similarities and Differences Correlations between tasks reveal we are indeed dealing with a set of relatively independent tasks (see Table 3). A corresponding principal component analysis shows that only the first 42
VOL. 24, NO. 2, 2011
MP -0.377 0.317 -0.133 0.467 -0.648 -0.238 -0.212
CC -0.338 0.593 -0.372 -0.620
RA -0.359 -0.283 0.592 -0.486 -0.288 -0.346
BA -0.327 -0.682 -0.638 -0.104
principal component carries more than its own share of the variance. This component is almost a straight average of individual task scores (see Table 4). Since tasks seem to test relatively different perceptual skills, our next question focuses on whether we can identify subsets in the population that show similar patterns. To make this independent from an individual’s overall skill level, we again use rankings of scores, rather than absolute scores. There are two main questions we are interested in here: 1. Can we detect different strategies in approaching tasks? Since this will show most prominently in games where ‘nothing went wrong,’ we will only consider top games. For computational purposes, we limited this clustering to the top 10% games.
Cum. Prop. 46% 57% 67% 77% 85% 93% 100%
Description Straight average Contrast BA vs. CC BA vs. RA MP vs. CC and RA MP vs. TC TC vs. PG and CG PG vs. CG
2. Can we distinguish subsets in the population that exhibit different skill sets? To decrease variability, we only considered players with at least five games, averaged each player’s task ranks, and based the clustering on Euclidean distance between these average ranks. Strategies in game-playing: clustering top games While the overall average was considered for selecting the games, the clustering, itself, was done on task ranks using Euclidean distance and Ward’s method for linkage. The result is shown in Figure 8. The dendrogram is annotated by the tasks that are most different between splits. On the right side of the figure, parallel coordinate plots show an overview of each of the seven clusters.
Rank
Task Figure 8. Dendrogram of hierarchical clustering of top 10% games (top). The clustering result supports seven distinct clusters shown in parallel coordinate plots on the bottom. Thin lines indicate individual games. Thick lines indicate cluster means. CHANCE
43
Rank
Task Figure 9. Dendrogram of hierarchical clustering of average ranks for frequent players and parallel coordinate plots of average task ranks for four clusters
44
VOL. 24, NO. 2, 2011
Overall, clusters show a surprisingly tight structure, with the exception of clusters one and four, which are the results of—accidental?—bad performance on one of the angle tasks and look more like mishaps and less like strategic game playing. These two clusters correspond to the first two splits in the dendrogram, which are dominated by the performance on angle tasks. The first cut on the left distinguishes between performances on two tasks: identifying the midpoint and location of the circle center. If performance is high on one task, it’s low on the other, indicating that these two tasks involve opposing skill sets. Similarly, the next split contrasts performance on finding the center of a triangle and convergence. This is a group exhibiting some evidence against Strategy 2, which would suggest a positive relationship between these variables. However, looking at the pattern of cluster seven in the parallel coordinate plots, we see that finding the center of the circle has a likewise high rank, indicating neither Strategy 1 nor 2 fits this set of games. The last split we want to consider— resulting in clusters five and six—is evidence in favor of Strategy 2. While the main distinction between these clusters is performance on the parallelogram task, there is also a big difference between performances on both the triangle task and convergence. Cluster five consists of games with relatively low ranks (i.e. relatively good performance) on both convergence and the triangle tasks, whereas cluster six consists of games where both tasks have relatively higher ranks on average. However, in terms of overall performance, neither strategy seems to be more successful than the other. All clusters show similar distributional statistics for their average scores, indicating that all of these subgroups resulted in similarly good performances. Since we are only considering games with a score in the top 10%, the ‘worst’ average is 3.25. Based on the dendrogram, we decided to consider seven clusters. They vary in size between 147 (cluster six) and more than 400 (clusters one, two, and four). What we could not learn from the principal component analysis, we now learn from the clustering by following the spikes and gaps. Generally, the angle tasks seem to be easier, while completing the parallelogram seems—with the exception of cluster six—to be among the hardest tasks for everybody.
Clustering frequent players To find subsets of different players in the data, clustering was done on average task ranks for each player with at least five games, again using Euclidean distance and Ward’s method. The resulting dendrogram is shown in Figure 9, along with parallel coordinate plots of the four main clusters. The dendrogram’s splits are annotated by the main tasks involved. You can find a movie at http:// hofroe.net/eyeballing demonstrating the use of the data visualization program GGobi for an interactive investigation of the clustering result. The first split in the clustering is, again, driven by the performance on the right angle task. Cluster three consists of 200 players for whom this task falls among the more difficult ones. Interestingly, out of the four clusters, the players of cluster three have the least difficulty finding the center of a circle. The second most-important split features again the combination of the triangle and convergence tasks, supporting Strategy 2. Here, the two tasks are positively correlated. Players of cluster one have more difficulty with both tasks. Players of cluster four have particular problems with the parallelogram task, but this task is—as seen before in individual games—consistently hard for everybody. Overall, the clusters show similar average performance. Cluster three’s average score is slightly increased (8.34), an indication that good performance on the right angle is crucial to do well overall in the eyeballing game. The best players can be found in cluster two (average score is 5.41), which captures players who are doing well on both angle tasks and relatively well on the convergence task.
Conclusions and Further Work Again, we are asking ourselves how the eyeballing game relates to statistical graphics, and there are some direct conclusions that we can draw from some of our findings in this data set. What might be the most encouraging finding is that players, independently from ability, are able to improve accuracy on these perceptual tasks significantly by practicing. The presence of different subgroups in the population reinforces the fact that different people pick up different signals in the data depending on how they are encoded visually.
As designers of statistical graphics, we need to make sure to encode crucial signals in charts in multiple ways. Completing a parallelogram based on two given line segments is among the hardest tasks for everybody—yet another strike against three-dimensional graphics. Or what else is involved in deciphering 3-D charts but reading and completing parallelograms? Obviously, the task of reading a 3-D chart is slightly different. Given a point, we are asked to complete the parallelogram with two given axes. It would be informative to be able to also include this as one of the eyeballing tasks and have an audience play with it, as well as identifying a set of tasks more closely related to reading statistical charts than woodworking. As statisticians, we also deplore not having the data collected in a controlled environment. It was, indeed, a surprise of the rather unpleasant kind to find out scores for some of the games were so high that we suspected players of trying to achieve the worst possible score. In a controlled environment, we would be able to monitor difficulty of tasks. It seems obvious, for example, that finding a circle center of a larger circle is more difficult than for a smaller one. Keeping track of these gaming parameters would allow us to more precisely quantify the relationship between circle size and accuracy of finding its center. Testing for beneficial effect of visual aids such as gridlines or rulers could also be based on an experiment of this form. One aspect that has been introduced recently to the game is the speed of answering; this might give additional insight into readers’ perceptual abilities.
Further Reading Cleveland, W. S., and R. McGill. 1984. Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association 79(387):531–554. Cleveland, W. S., and R. McGill. 1985. Graphical perception and graphical methods for analyzing scientific data. Science 229:828–833. Cook, D., and D. F. Swayne. 2007. Interactive and dynamic graphics for data analysis, with examples using R and GGobi. Springer. Philip, J. 2007. The probability distribution of the distance between two random points in a box. www.math.kth. se/~johanph/habc.pdf. CHANCE
45
There Will Be Blood: On the Risk-Return Characteristics of a Blackjack Counting System W. J. Hurley and Andrey Pavlov
A
number of recent books, media reports, and movies have described the exploits of groups who have sought to make money by card counting at casino blackjack tables. Among these, the MIT blackjack Team is perhaps the most famous. The media tend to focus on the upside of the endeavor. No doubt this attention fuels the popular imagination that the road to blackjack wealth has only two requirements: learning a card-counting system and traveling to Vegas on a regular basis to reap the rewards. It turns out this primrose path is likely to have some ugly turns and detours. To make the point, we examine how a card-counting player’s wealth would change over time if he could vary bet size with impunity. To do this, we estimate the distribution of dollar returns as time passes assuming our player uses Humble and Cooper’s Hi-Opt I system with an infinite stake. Not surprisingly, we find that there is money to be made in the long run. But the stochastic nature of the game can give rise to periods of sustained heavy losses. While blackjack card counting is not against the law in most jurisdictions, U.S. casinos can and do refuse a gambler the right to play if they suspect he or she is a card-counter. To avoid detection, card-counters must organize in teams. The basic idea is that a “spotter,” a player who counts, but never varies his bet, signals a teammate to sit at the table when the count is favorable to a player. This “big man” or “gorilla” then makes large bets relative to the spotter’s. As we will demonstrate, the organization of these teams is difficult because it’s hard to set up team incentives that overcome short-run periods of heavy losses. This analysis also has some implications for the recreational blackjack player who is thinking about learning a counting system to reverse the tide of his gradual losses. Such a change in course doesn’t make a lot of sense given the immediate downside risk and the volume of betting required to give a reasonably high probability of positive profits.
The Hi-Opt I System A significant number of blackjack players use what is termed the Basic Strategy: The player considers his two cards and the dealer’s up card and only these three cards when determining how to play the hand. The limitation of this approach is that
Table 1—The Gwynn-Seri Simulation Results

Card    Percent Change in Average Value
2       +0.37
3       +0.44
4       +0.52
5       +0.64
6       +0.45
7       +0.30
8       –
9       –0.13
10      –0.53
A       –0.48

This table shows the percentage change in value to the player when a particular card is removed from the deck.
it ignores all the cards that have been played up to that point. Edward Thorp showed it's advantageous to the player if the deck is relatively rich in aces and 10s (kings, queens, jacks, and 10s) and advantageous to the dealer if the deck is relatively rich in 2s, 3s, 4s, 5s, 6s, and 7s. John Gwynn and Armand Seri showed this was the case in "Experimental Comparisons of Blackjack Betting Systems" by performing the following Monte Carlo experiment for 10,000,000 iterations:

1. The dealer shuffles one deck without the x's. For example, x could be the 4s.

2. The dealer follows standard Reno/Tahoe rules.

3. The player follows the Basic Strategy to play his cards.

4. The player bets 1 unit every hand.

5. After each hand is played, we record what the player wins or loses.

Table 1 shows the percentage change in the player's average return when various cards are removed from the deck. Consider the case when the 5s are removed. The player's return goes up by 0.64%. In other words, the player would like to play without the 5s if possible. Consequently, a deck rich in 5s is favorable to the dealer. On the other hand, when the 10s are removed, the player's return goes down by 0.53%. Therefore, 10s are valuable to the player. When experienced blackjack players are being taught the details of counting systems, some will ask why 10s and aces are more valuable to the player when, to them, they seem just as valuable to the dealer. This experiment demonstrates that a deck rich in 10s is indeed more valuable to the player. But more importantly, it demonstrates that it would be good if the player could keep track of the cards already played to determine when the shoe is favorable to him. Again, Thorp was the first to point this out. His book, Beat the Dealer, is a
gambling classic. In it, he develops a system he calls the Hi-Lo system. It is a simple system, and for this reason a number of researchers have tried to improve it. It is interesting that the MIT blackjack team, in its early stages, tried variants of Hi-Lo, but eventually came back to it because the more complex systems were more difficult to apply. The team just could not generate the superior returns these systems promised. For our consideration here, we employ the Hi-Opt I system because it is as easy as the Hi-Lo system to use and offers a slightly higher theoretical payout. Keeping track of what cards have been played and making a calculation to determine whether the deck is rich in 10s and aces is, on the face of it, a difficult thing to do. One way around it is to use a counting system. These systems associate a value with each card, and as play proceeds, the player keeps a running sum of the values of all cards played. The Hi-Opt I system uses the following mapping of cards to values:
Card    Value
2       0
3       +1
4       +1
5       +1
6       +1
7       0
8       0
9       0
10s     –1
A       0
Let us suppose that, after the first hand of a shoe (the "shoe" is the plastic device that holds the 4–8 decks of cards the dealer dispenses as play proceeds), the cards played are 5, 2, 10, 10, 6, A, 7, 10, 3, A, 10. Then the count is:

1 + 0 + (–1) + (–1) + 1 + 0 + 0 + (–1) + 1 + 0 + (–1) = –1.

This negative count indicates the deck is relatively rich in cards advantageous to the dealer, and therefore it would be reasonable for the player to decrease his bet. In the language of the Hi-Opt I system, this count is termed a running count. Any counting system that associates values with the various cards and then uses a running sum of these values to judge the likelihood that a bet wins is subject to error, as the following example demonstrates. Imagine a shoe comprises eight decks (416 cards). Suppose one deck has been played and the count is +6 based on the Hi-Opt I counting system. Compare this to the case in which four decks have been played and the count is +6. Clearly, the bet after four decks is more advantageous to the player than the one after one deck. To adjust for this, the Hi-Opt I system employs a true count (TC), which is just the running count divided by the number of decks remaining in the shoe. In Figure 1, we present a graph of how the TC changes as cards are played. The horizontal axis measures Card#. It begins at 1 and goes to 350 (the approximate point where the dealer reshuffles the shoe). The vertical axis measures the TC.
Figure 1. An example of how the TC changes as the shoe is dealt. TC is measured along the vertical axis. The horizontal axis is the sequence of cards played numbered one through 320. The curve shows how the TC changes as the deck is dealt. In this case, the TC is positive over most of the shoe.
Figure 2. Another example of how TC changes. In this case, the TC is negative over most of the shoe.
Note that the TC is positive for most of the shoe. Another example is shown in Figure 2. In this case, the TC is negative for just about the whole shoe.
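To make this bookkeeping concrete, here is a minimal sketch in Python of the Hi-Opt I running count and true count. The card values and the eight-deck shoe come from the text; the card labels, the function names, and the use of a fractional count of remaining decks (in practice players estimate decks remaining by eye) are illustrative assumptions rather than part of the original analysis.

# Hi-Opt I card values, as given in the table above.
HI_OPT_I = {'2': 0, '3': 1, '4': 1, '5': 1, '6': 1,
            '7': 0, '8': 0, '9': 0, '10': -1, 'A': 0}
# Face cards count as 10s.
for face in ('J', 'Q', 'K'):
    HI_OPT_I[face] = -1

def running_count(cards):
    # Sum of Hi-Opt I values over all cards seen so far.
    return sum(HI_OPT_I[c] for c in cards)

def true_count(cards, total_decks=8):
    # Running count divided by the (fractional) number of decks remaining in the shoe.
    decks_remaining = total_decks - len(cards) / 52
    return running_count(cards) / decks_remaining

hand = ['5', '2', '10', '10', '6', 'A', '7', '10', '3', 'A', '10']
print(running_count(hand))         # -1, as in the worked example above
print(round(true_count(hand), 2))  # about -0.13 with nearly eight decks left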
Hi-Opt I System Performance

We first assume a player understands that the correct play of the cards depends on the TC at the start of the hand. These strategies are outlined in Lance Humble and Carl Cooper's The World's Greatest Blackjack Book.
Table 2—The Simulation Results

a = 0.80
B    Sample Mean    Sample St. Dev.
1    -0.05298        8.93928
2     0.18993       13.78078
4     0.67574       25.10876
6     1.16155       36.94651
8     1.64736       48.92539

a = 0.70
B    Sample Mean    Sample St. Dev.
1    -0.08042        8.41066
2     0.05879       12.91073
4     0.33722       23.45447
6     0.61566       34.48059
8     0.89409       45.64081

a = 0.60
B    Sample Mean    Sample St. Dev.
1    -0.10412        7.80158
2    -0.03129       11.98124
4     0.11436       21.77914
6     0.26001       32.02590
8     0.40566       42.39732

This table shows sample means and standard deviations when the shoe is reshuffled at various percentages of the shoe and for various ratios of the maximum to minimum bet.
The next step is to determine the optimal betting strategy. To do this, we first assumed the optimal betting strategy took the following form: Bet Bmax units if TC > TC*; otherwise, bet Bmin < Bmax units. That is, if the TC is sufficiently high, bet the maximum; otherwise, bet the minimum. We used a Monte Carlo simulation to determine a TC* that maximized return: The optimal TC* was between 0 and 1 for all parameter values. Since the expected return was virtually the same at TC* = 0, TC* = 1, and the interior optimum, we chose TC* = 0 for the optimal betting rule.

To measure how player wealth changes over time, we chose the shoe as our basic unit of analysis and undertook a Monte Carlo simulation with the following assumptions:

1. A single player plays against the dealer.

2. The player uses the Hi-Opt I system without error. That is, both the running count and the number of decks remaining are estimated perfectly to get the TC. Moreover, he plays the cards perfectly based on the TC and the Humble and Cooper rules.

3. The player wagers 1 unit when TC ≤ 0 and B units when TC > 0.

4. The shoe is reshuffled after a% of it has been played.

We executed 100,000 shoe iterations to develop the distribution of dollar returns for a number of parameter values. We chose three values of a%: 0.60, 0.70, and 0.80; B took the values 1 (with no counting), 2, 4, 6, and 8. Not surprisingly, the resulting histograms are consistent with a normal distribution. The resulting sample means and sample standard deviations are shown in Table 2. Note that, for all reshuffling percentages, there are positive expected dollar means. Consequently, players can make money in the long run and, in principle, ought to be able to make lots of money by scaling their bets. Nonetheless, the standard deviations are quite high relative to their means, which suggests the players would have to play for a long time to assure themselves of positive dollar profits. It is also worth noting the significant effect of the reshuffling point. The expected dollar values are halved for each 10% decrease in the reshuffling point.
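As a small aid to reading assumption 3 and the dollar figures in the next section, the sketch below (Python) spells out the threshold betting rule and the conversion of Table 2's unit-based results to dollars for a $100 minimum bet; the function name and the scaling example are ours, not the authors'.

def bet(true_count, B, unit=1):
    # Threshold rule from assumption 3: wager 1 unit when TC <= 0, B units when TC > 0.
    return B * unit if true_count > 0 else unit

unit_mean, unit_sd = 1.64736, 48.92539   # per-shoe results in betting units (a = 0.80, B = 8)
dollar_unit = 100                        # a $100 minimum bet, as in the next section
print(dollar_unit * unit_mean)           # about 164.74 dollars per shoe on average
print(dollar_unit * unit_sd)             # about 4,892.54 dollars standard deviation per shoe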
Hi-Opt I System Performance Over Time

Suppose a player makes a minimum bet Bmin = $100 and a maximum bet Bmax = $800 and the dealer reshuffles at 80% of the shoe. The dollar return on shoe i, Si, is normally distributed with mean

100(1.64736) ≈ $164.74    (1)

and standard deviation

100(48.92539) ≈ $4,892.54.    (2)

Now, suppose our bettor plays for a weekend under these assumptions. We'll assume it requires a weekend to play 50 shoes. Then his dollar return for the weekend is

W = S1 + S2 + … + S50.    (3)

W is normally distributed with mean
μW = 50 × 164.74 ≈ $8,236.80    (4)

and standard deviation

σW = √50 × 4,892.54 ≈ $34,596.    (5)

This is a stunningly high standard deviation. To give it some perspective, even though the mean winnings are $8,237, there is still a 40% chance of losing money. To calculate how bad this loss could be, consider the 95% confidence interval

[−$59,570, $76,044].    (6)
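These figures follow directly from the normal approximation, and a reader can check them with a few lines of code. The sketch below (Python, standard library only) reproduces the weekend mean, standard deviation, chance of a losing weekend, and 95% interval under the stated assumptions; small differences from the printed values are just rounding.

from math import sqrt
from statistics import NormalDist

mu_shoe, sd_shoe = 164.736, 4892.54     # dollar mean and sd per shoe (a = 0.80, B = 8)
n_shoes = 50                            # one weekend of play, as assumed above

mu_w = n_shoes * mu_shoe                # about 8,236.8
sd_w = sqrt(n_shoes) * sd_shoe          # about 34,596

weekend = NormalDist(mu_w, sd_w)
print(weekend.cdf(0))                            # about 0.41: chance of a losing weekend
print(mu_w - 1.96 * sd_w, mu_w + 1.96 * sd_w)    # roughly -59,600 and 76,000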
Interpreting the lower limit, there is a 2.5% chance a player could lose at least $60,000 in a weekend. This certainly goes some distance in explaining why the MIT blackjack team was going to Las Vegas with suitcases full of cash and chips. The upper limit suggests our player could win more than $76,044 with 2.5% probability. Again, this gives some credence to
Table 3—Mean Winnings and Confidence Intervals

                                                       95% Confidence Interval
#Shoes Played   Average Winnings   Pr(Winnings > 0)   Lower Bound    Upper Bound
1,000              164,736              0.857           -138,507        469,979
2,000              329,472              0.934            -99,378        758,322
3,000              494,208              0.967            -31,024      1,019,440
4,000              658,944              0.983             52,459      1,265,429
5,000              823,680              0.991            145,609      1,501,751

This table shows the average winnings, probability of making positive winnings, and 95% confidence interval for various numbers of shoes played.
Figure 3. The histogram of gross winnings for a team betting a $100 minimum and an $800 maximum over the first 100 weekends of play (horizontal axis in thousands)
reports that the MIT blackjack team was sometimes traveling back to Boston with more than $1 million. But the point is that, over a weekend, there can be wide swings in earnings. Again, keep in mind, that we’re allowing our player to vary bet size without drawing the attention of the casino. This would be difficult to do in practice. Now let’s consider the longer term. Table 3 shows average winnings and 95% confidence intervals for various numbers of shoes played ranging from 1,000 to 5,000. There are no surprises here. If you play a positive expected value game long enough, you’re going to make money. So if our bettor decides to play at least 5,000 shoes, his average winnings will
be $823,680, there is just about a zero chance of losing money, and top end winnings could exceed $1.5 million.
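The rows of Table 3 can be regenerated the same way. Here is a sketch (Python, standard library only); because the per-shoe parameters are rounded, the reproduced bounds may differ slightly from the printed ones.

from math import sqrt
from statistics import NormalDist

mu_shoe, sd_shoe = 164.736, 4892.54          # per-shoe dollar mean and sd

for n in (1000, 2000, 3000, 4000, 5000):
    mu, sd = n * mu_shoe, sqrt(n) * sd_shoe
    p_win = 1 - NormalDist(mu, sd).cdf(0)    # probability of positive winnings
    lo, hi = mu - 1.96 * sd, mu + 1.96 * sd  # 95% confidence interval
    print(f"{n:>5,}  {mu:>11,.0f}  {p_win:.3f}  {lo:>12,.0f}  {hi:>12,.0f}")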
Gambler's Ruin

Of some importance for the point of this paper are the capital requirements for a blackjack team. Continuing with the same example ($100 minimum bet and $800 maximum bet), we undertook a simulation to calculate the stake required to ensure a team is able to withstand early losses. Figure 3 shows the histogram of the lowest value of cumulative winnings over the first 100 weekends. The mean of this
Figure 4. Histogram of the time (in weekends) when team capital was lowest
statistic is a loss of $56,745. But note there is a 10% chance that the loss could be as high as $200,000. Hence, the stake required to ensure a blackjack team can withstand the possibility of significant early losses is substantial. We also looked at the timing of when the cumulative winnings would be lowest. In Figure 4, we present a histogram of the week when the cumulative loss is largest. The exponential shape indicates clearly it happens early in the endeavor. The average time is 9.8 weekends.
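A rough version of this capital-requirement exercise can be sketched as follows (Python). It treats each weekend's result as a single normal draw with the mean and standard deviation derived earlier, which is only an approximation of the authors' shoe-level simulation, so the summary statistics it produces should be of the same order as, but need not match, the figures quoted above.

import random
from statistics import mean, quantiles

mu_w, sd_w = 8236.8, 34596.0          # weekend mean and sd from the normal approximation
n_weekends, n_teams = 100, 20000      # horizon and number of simulated team histories

random.seed(1)
worst = []                            # lowest cumulative winnings reached by each simulated team
for _ in range(n_teams):
    bank, low = 0.0, 0.0
    for _ in range(n_weekends):
        bank += random.gauss(mu_w, sd_w)
        low = min(low, bank)
    worst.append(low)

print(mean(worst))                    # mean of the worst drawdown
print(quantiles(worst, n=10)[0])      # roughly the 10th percentile of the drawdown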
Discussion

That there's gold at casino blackjack tables for card-counters is not really news. What this analysis makes clear is that a team has to bet a significant amount of money over a long period of time to ensure themselves a pot of gold. What's more, there are surely going to be some pretty nasty bumps along the way. Card-counters don't make money every time they go to the casino. And on some occasions, they suffer devastating losses. It is these periods of loss that make the organization of blackjack teams difficult. One of the early MIT blackjack team members, John Chang, said the following:

The way the MIT blackjack team changed over the years has reflected those concerns. Initially, we set a time target of say six months, and paid people only if we won. We found that if we were unlucky at the beginning, we would only have a few diehards left playing. Typically, those diehards were people with big investments, so they were playing to protect their investments. Then, we thought we could solve that if everyone was equally invested. In reality, you can't do that. We tried the socialist approach and tried to get people to commit to trips at the beginning of the team bank. People would still drag their feet, so we had higher shares of win if
you committed to more. But a higher share of nothing is still nothing. We tried a team compensation system based on getting your maximum bet out. It is important to encourage people to put the money out. But some people are just afraid.

The point is that it's difficult to organize and motivate a blackjack team in the face of these short-run setbacks, which are surely inevitable. Keep in mind that all our calculations assume a team can effectively vary their bet size without being caught by the casino. Casinos work hard to discover counters and sometimes remove even noncounters they suspect of counting. To avoid detection, a team must have a sophisticated and disciplined approach. Finally, we remark that the prospects aren't good for the recreational blackjack player. Playing five nights a year with a nightly stake of $2,000 has some upside, but also plenty of downside. For such players, the long run does not admit assured profits, and therefore learning a counting system like Hi-Opt I, while an interesting exercise, is of limited monetary value.
Further Reading

Gwynn, John M. Jr., and Armand Seri. 1978. Experimental comparisons of blackjack betting systems. Presented at the 4th National Conference on Gambling and Risk Taking, Reno.

Humble, Lance, and Carl Cooper. 1980. The world's greatest blackjack book. New York: Broadway Books.

Munchkin, Richard W. 2002. The MIT blackjack team: Interview with team manager Johnny C. Blackjack Forum 22(3), http://blackjackforumonline.com/content/interviewJC.htm.

Thorp, Edward. 1962. Beat the dealer. New York: Vintage Books.
Here’s to Your Health Mark Glickman, Column Editor
Assessing the Seasonality of Influenza Meena Doshi
Seasonal influenza, also known as the flu, is a significant public health problem worldwide. One of the key aspects of the study of influenza is its seasonal variation. Like clockwork, influenza strikes (infects) the majority of its victims during the winter months in both the Northern and Southern hemispheres. How are these annual winter epidemics assessed? How are statisticians contributing? What are their challenges? The answers to these questions are of more than academic interest. While commonly underestimated by health practitioners and the general public alike, influenza represents a major public health problem in the United States each year. The Centers for Disease Control and Prevention (CDC) estimates that 5%–20% of U.S. residents become symptomatic every year. Influenza causes an average of 36,000 deaths and 114,000 hospitalizations annually in the United States. More than 90% of these deaths and approximately 50% of hospitalizations occur among persons ≥ 65 years of age, with persons ≥ 85 years of age having substantially higher rates of death than any other age group. Influenza is the seventh-leading cause of death for people older than 65 and for children younger than four years of age, and among the top 10 in almost all age groups. Every year, influenza costs the United States $2–$5 billion in physician visits, lost productivity, and lost wages. This cost could increase to $160 billion in the event of an influenza pandemic. More than 2,400 years after influenza-like epidemics were first documented by Hippocrates, influenza viruses continue to plague us with annual epidemics and sporadic pandemics. The influenza pandemic of 1918, also known as the "Spanish Flu," was the largest in recent history, causing approximately 500,000 deaths in the United States and 20 million deaths worldwide. In the 20th century, there were several pandemics following the Spanish flu: 1946, 1957 (Asian flu), 1968 (Hong Kong flu), and 1977. Historically, the periodicity of pandemics has ranged between nine and 39 years, so a new pandemic in the next few years is considered a looming, if not inevitable, threat. The impact of influenza epidemics and the threat of future pandemics provide the impetus for comprehensive year-round influenza surveillance activities at international, national, and state levels.
What Is Influenza?

The flu is a viral infection caused by the influenza virus, affecting the respiratory tract.
“To everything there is a season, a time for every purpose under heaven. A time to be born; a time to die.”
~ Ecclesiastes 3:1–2
The Italians in the 17th century ascribed it to the "influence of the stars," hence the name "influenza." Symptoms of influenza include fever (often higher in children), chills, cough, sore throat, runny or stuffy nose, headache, muscle aches, and often extreme fatigue. Its spread is primarily airborne, especially in crowded enclosed spaces. Most people recover completely within 1–2 weeks; however, severe complications can occur, particularly in children, elderly people, and other vulnerable groups. Bacterial pneumonia is the most common potentially fatal complication. Viral pneumonia, bronchitis, and sinus and ear infections are some of the other complications. The cold and flu are both respiratory illnesses, but they are caused by different types of viruses. The influenza virus is much more potent than the virus that causes the common cold. That explains why symptoms of the two are very different, even though they bear similarities. For starters, colds center themselves in the head, while influenza affects many areas of the body. Flu symptoms usually come on quickly (within 3–6 hours) and consist of a fever, body aches, dry cough, and extreme tiredness. Cold symptoms are less severe, and people experience a stuffy nose, productive cough, slight tiredness, and limited body aches. A clinical case definition of influenza based on symptoms and observable signs is important for the appropriate management of individual influenza patients. However, there is no standardized clinical case definition for influenza, and various case definitions are used in different countries. The variability of case definitions makes identifying the exact cases of influenza a perpetually complex problem. Influenza is often defined by influenza-like illness (ILI). ILI, also known as acute respiratory infection (ARI) and flu-like syndrome, is a medical diagnosis of possible influenza or other illness causing a set of common symptoms. In the United States, the CDC defines an ILI case as an individual with the following symptoms: fever of 100 degrees Fahrenheit or higher, and cough and/or sore throat. But, even with this definition in place, any clinical diagnosis of influenza is actually a diagnosis of ILI.
Influenza Virus Taxonomy

The influenza virus is a complex, constantly changing virus. Belonging to the family Orthomyxoviridae, there are three types of influenza viruses: A, B, and C. Type A causes severe to moderate illness in people of all ages; type B causes milder epidemics. Type C infection usually causes either a mild respiratory illness or no symptoms at all; it does not cause epidemics and does not have the severe public health impact that influenza types A and B do. Efforts to control the impact of influenza are aimed mainly at types A and B. Influenza type A viruses are divided into subtypes based on differences in two viral proteins called the hemagglutinin (H) and the neuraminidase (N). Across species, there are 16 H subtypes (H1–H16) and nine N subtypes (N1–N9). Currently, only two subtypes, A (H1N1) and A (H3N2), are found to be circulating in humans. Unlike type A, influenza type B does not have subtypes. As per the World Health Organization (WHO), influenza viruses are named by the virus type, geographic origin of the first isolate, specimen number of that isolate, year it was isolated, and virus subtype (e.g., influenza A/Fujian/411/2002 (H3N2)). The two currently circulating influenza A subtypes and influenza B strains are included in each year's influenza vaccine.
An Ever-Changing Enemy

Why, as with other diseases, don't we become immune to the flu after one infection? Why is it that, having had the flu, we can catch the flu again and again? The answer can be found in the way our immune system operates, the speed at which these viruses evolve, and the host species they infect. It is said that influenza is a master of disguise. With many diseases, one infection is enough: We become immune and are never infected by that disease again. However, influenza has the capacity to gradually change its appearance over time so our immune systems can no longer recognize it. This is why we can be infected again the next time around,
and also why the vaccine must be changed from year to year. The influenza virus is constantly changing in two ways: antigenic drift and shift. Antigenic drift occurs through small changes in the virus that happen continually over time. Antigenic drift produces new virus strains that may not be recognized by antibodies to earlier influenza strains. For example, a person infected with a particular flu virus strain develops antibodies against that virus. As newer virus strains appear, the antibodies against the older strains no longer recognize the “newer” virus and infection with a new strain occurs. This is one of the main reasons people can get the flu more than once. In most years, one or two of the three virus strains in the influenza vaccine are updated to keep up with the changes in the circulating flu viruses. For this reason, people who want to be immunized against influenza need to receive a flu vaccination every year. Antigenic shift is an abrupt, major change in the influenza A viruses, resulting in a new influenza virus that can infect humans and has a hemagglutinin protein or hemagglutinin/ neuraminidase protein combination that has not been seen in humans for many years. Antigenic shift results in a new influenza A subtype. If a new subtype of influenza A virus is introduced into the human population, most people have little or no protection against it and the virus can spread easily from person to person so that a pandemic may occur. Influenza viruses are changing by antigenic drift all the time, but antigenic shift happens only occasionally. Influenza type A viruses undergo both kinds of changes; influenza type B viruses change only by the more gradual process of antigenic drift. Influenza A viruses infect several animals, including pigs, horses, other mammals, and aquatic birds, as well as humans, whereas influenza B virus only infects humans. From time to time, influenza A viruses in animals and birds jump species and infect humans. The virus that caused the pandemic of 1918 is believed to have originated in pigs, while the pandemics of 1957 and 1968 are believed to have originated in birds. Places in which birds, pigs, and humans live in close proximity are thought to play a particularly important role in creating a favorable environment for antigenic shifts and drifts.
Identifying Seasonal Trends

The flu is a winter disease. So, applying public health resources in a selective manner (immunization, health care workers, hospital beds, etc.) before the onset of the peak in infections has many advantages, including efficiency, cost savings, and, most importantly, reductions in morbidity and mortality. If the vaccine can be manufactured efficiently, just in advance of the flu season, it is more likely to protect against the most likely strains. Manufacturing can be ramped up to provide the maximum supply just as it is needed. A prominent failure of this system was seen in the vaccine shortage of the 2004 to 2005 season. By maximizing the number of persons immunized, the rapid spread of a disease can be limited. By applying resources during only a short period of time, before the season, cost savings can be obtained. Public health and even individual practitioners can then focus on other problems in the off-season. For each of these processes to take place in time, more accurate and reliable methods of disease surveillance are crucial to forecasting outbreaks and implementing warning systems. Newer mathematical models are required that will more accurately assess seasonality in an effort to better predict when an outbreak will peak and how many people may fall ill. In case of epidemic and pandemic events, this could aid in better public health planning. Further research in this area could enhance our understanding of influenza transmission and seasonality, which are influenced by factors such as indoor crowding during cold weather, seasonal fluctuations in host immune responses, demographic differences, virus virulence, and environmental factors (relative humidity, temperature, and UV radiation). This, in turn, could lead to better early warning of impending epidemics and enhanced prevention. Accurately determining the seasonal impact of influenza may increase the commitment of resources to lessen this morbidity.

Assessing the Seasonality of Influenza: Serfling Regression

Influenza epidemics show strong seasonality: typically, November to March in the Northern Hemisphere and June to September in the Southern Hemisphere. The laboratory-confirmed influenza activity in the United States is typically between week 49 of one year and week nine of the next (November to March). Although seasonality is one of the most familiar features of influenza, it is also one of the least understood. To assess the severity of influenza epidemics, excess mortality, the number of deaths actually recorded in excess of the number expected on the basis of past seasonal experience, has been used as a major index. This is because the number of deaths caused by acute respiratory diseases sharply rises during an influenza epidemic. In 1963, Robert E. Serfling presented expected numbers of seasonal pneumonia-influenza mortality with a regression method. The approach became known as Serfling methodology. Since then, the CDC has been implementing the Serfling methodology, which indicates an epidemic whenever the observed time series data exceed a threshold. Serfling regression models the weekly number of deaths due to pneumonia and influenza by the cyclic regression equation

Yt = β0 + β1t + β2 sin(2πt/52) + β3 cos(2πt/52) + εt,    (1)
where Yt is the number of deaths from pneumonia and influenza in week t when there is no epidemic. The parameter β0 is the baseline weekly number of deaths, without seasonal and secular trends, and εt is noise that is assumed to have mean 0 and variance σ2. The parameter β1 is the linear effect of time t, and the sine-wave component β2 sin(2πt/52) + β3 cos(2πt/52) captures the underlying sinusoidal behavior of seasonal influenza. Assuming the errors are uncorrelated and normally distributed, the standard least squares method is used to estimate the model parameters β0, β1, β2, β3, and σ2 from nonepidemic data and compute confidence bounds about the predicted values. The Serfling model is flexible and can be applied to data in which Yt are counts, proportions, or rates. As an example, Figure 1 displays the epidemic threshold computed by using Serfling's method on the proportion of deaths from pneumonia and influenza (P&I) recorded in the United States from 122 sentinel cities between January 2006 and
Figure 1. Weekly proportion of deaths from pneumonia and influenza in 122 sentinel cities in the United States between January 2006 and May 2011 after normalization with the number of deaths for all causes. (Plot title: Pneumonia and Influenza Mortality for 122 U.S. Cities; vertical axis: % of all deaths due to P&I; horizontal axis: weeks.) Source: CDC, Morbidity and Mortality Weekly Report
May 2011. The 122 cities mortality program is one component of influenza surveillance in the United States performed by the CDC. Each week, the vital statistics offices of 122 cities across the United States report the total number of death certificates received and the number of those for which pneumonia or influenza was listed as the underlying or contributing cause of death, by age group. This data represents approximately 25% of the U.S. population. The percentage of deaths due to P&I is compared with a seasonal baseline, and an epidemic threshold value is calculated for each week. The seasonal baseline of P&I deaths is calculated using Serfling regression applied to data from the previous five years. An increase of 1.645 standard deviations above the seasonal baseline of P&I deaths is considered the "epidemic threshold" (i.e., the point at which the observed proportion of deaths attributed to pneumonia or influenza was significantly higher than would be expected at that time of the year in the absence of substantial influenza-related mortality). The model classifies weekly mortality data as epidemic when the observed data exceed the epidemic threshold for two or more consecutive weeks. Since the mid-1960s, this cyclic regression model has been adopted by the CDC's influenza surveillance programs to determine epidemic influenza activity and excess mortality attributed to influenza. However, this model has its limitations. The Serfling regression model is formulated as a simple linear regression containing terms of intercept, linear trend, and a pair of harmonic terms. Here, the response variable in the model, Yt, the number of deaths, is arguably a Poisson variable, so the error term is not symmetrically distributed. While this shortcoming can be addressed through a log-linear Poisson model instead of a least-squares regression, Serfling represented mortality as a function of time,
assuming that predicted mortality would follow a regular seasonal pattern with a gradual long-term trend. The assumption of a regular seasonal pattern, or a symmetric and/or cyclical pattern within a season, may not be true, as every season will have its own behavior in terms of its peak time, amplitude, intensity, and perhaps even the number of cyclical patterns. Also, the assumption about linear trend may not necessarily hold for long durations, and certainly not across seasons. Another limiting feature of the Serfling approach is that at least three years of data are required. Like some other diseases, influenza is not a reportable disease. This means that during the nonepidemic phases, only sparse data are available. The lack of good baseline data presents a serious problem for measuring excess mortality, which is calculated as additional deaths above the baseline. This problem is worsened by the standard practice of excluding epidemic periods when estimating the model parameters. This is necessary with the Serfling method to avoid biased parameter estimates. A more accurate estimate of excess deaths must overcome these deficiencies and use all available data, not just nonepidemic mortality.
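For concreteness, a Serfling-type baseline and epidemic threshold can be fit by ordinary least squares in a few lines. The sketch below (Python with NumPy) is illustrative only: the synthetic weekly series, the 52.25-week periodicity (borrowed from the harmonic model described later in this article), and the crude two-consecutive-week flagging rule are stand-ins for the CDC's actual data and implementation.

import numpy as np

def serfling_baseline(y, periods_per_year=52.25, z=1.645):
    # Least-squares fit of a Serfling-type cyclic regression to nonepidemic weekly data y.
    # Returns the fitted baseline and the epidemic threshold (baseline + z * residual sd).
    t = np.arange(len(y), dtype=float)
    X = np.column_stack([
        np.ones_like(t),                           # intercept (beta_0)
        t,                                         # linear trend (beta_1)
        np.sin(2 * np.pi * t / periods_per_year),  # harmonic pair (beta_2, beta_3)
        np.cos(2 * np.pi * t / periods_per_year),
    ])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    baseline = X @ beta
    threshold = baseline + z * np.std(y - baseline, ddof=X.shape[1])
    return baseline, threshold

# Hypothetical usage on a synthetic five-year weekly series, flagging weeks that sit
# above the threshold for two or more consecutive weeks.
rng = np.random.default_rng(0)
t = np.arange(261)
y = 7 + 0.001 * t + 0.8 * np.cos(2 * np.pi * t / 52.25) + rng.normal(0, 0.3, t.size)
baseline, threshold = serfling_baseline(y)
above = y > threshold
epidemic = above & (np.roll(above, 1) | np.roll(above, -1))   # crude two-consecutive-week rule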
Estimating Influenza-Associated Morbidity and Mortality

Over the years, several types of models have been developed to estimate the burden of influenza. Conceptually, influenza-related morbidity or mortality is defined as the difference between the observed number of deaths, hospitalizations, absenteeism, or other health outcomes related to influenza during influenza epidemics and the expected number of these health outcomes in the absence of influenza.
In the earlier days, the excess mortality rate due to influenza was the most-used index to quantify the impact of influenza; later, the excess number of deaths due to influenza became more frequently used. Serfling's landmark study in estimating the excess deaths associated with influenza was published in 1963. For many years, rate difference models, which are one of the simplest statistical methods, have been used to systematically estimate the burden of influenza. These models use virology data to identify the influenza-active period or the baseline periods with no or little influenza activity, and then calculate the influenza-associated mortality or morbidity by taking the difference in mortality or morbidity rates between the two categories of pre-defined periods. The estimation of influenza-associated burden derived from this method strongly depends on the choice of baseline period. To improve the accuracy of estimation, some studies used two kinds of baseline periods: a summer baseline period and a peri-seasonal baseline period. The peri-seasonal baseline period is defined as a period with little influenza activity (e.g., < 5% of the season's total positive isolates for the virus) in winter months, and the summer baseline period is defined based on summer months. The estimation was found to be lower when the peri-seasonal baseline period was used as the reference. This is mainly due to other respiratory viruses, such as respiratory syncytial virus (RSV) and rhinoviruses, also circulating during the period. Additionally, using a peri-seasonal period as the reference can be useful for adjusting for other important seasonal factors, such as temperature, humidity, and other environmental characteristics with well-defined seasonality. The Serfling model was further refined in a 2006 paper by William Thompson and colleagues. To estimate influenza-associated mortality, they used the Poisson regression approach and included additional long-term trends. For this article, two sources were used: National Center for Health Statistics data and World Health Organization influenza virus surveillance data. Thompson's model was fit to three death categories: underlying pneumonia and influenza deaths, underlying respiratory and circulatory deaths, and all-cause deaths. Two Fourier terms were included in the model, along with indicators of influenza and RSV transmission. These indicator variables were defined by the proportions of specimens testing positive for influenza A (H1N1), influenza A (H3N2), influenza B, and RSV. These models provided specific estimates of the outcomes associated with each of the commonly circulating influenza strains each season. Thompson's model controlled for confounding that could be caused by seasonally correlated causes of mortality, such as influenza and RSV transmission. Although this was an improvement over the Serfling model, it had several limitations. The proportion of positive-tested specimens was likely to be a poor measure of excess mortality. While a high proportion of positive-tested specimens was compatible with high levels of influenza transmission (and excess mortality), this was not necessarily true.
Annual Harmonic Regression

A group of researchers at Tufts University has developed an improved model, a supplementary methodology called Annual Harmonic Regression (AHR). The salient feature of this methodology is that it accurately describes disease seasonality. This is achieved by building seasonality characteristics based
on regression parameters. These parameters have interpretations that help assess the seasonality of influenza. Although the Serfling regression provides information about whether a particular season exceeds a constant epidemic threshold, it does not provide further description about a particular season or information about variations between seasons. Serfling regression and other variations of the models developed earlier are based on the assumption that the essential nature of seasonality is "static," bearing no variation from year to year. However, this assumption may not be true. Each influenza season varies in terms of the timing, magnitude, and intensity of the seasonal outbreak. Also, earlier researchers developed models that were confined to the removal of the seasonal trend to detect any unusual trend (e.g., local outbreak). In other words, they simply ignored seasonality. The AHR can capture the seasonal fluctuations and measure changes in timing, severity, and amplitude of seasonal peaks in a particular year as well as over different years. The AHR model

Y(t)i = exp{β0,i + β1,i sin(2πωt) + β2,i cos(2πωt) + εi,t}    (2)

is a generalized linear model (GLM) with a Poisson distribution and a logarithmic link function to model the weekly number of deaths. Y(t)i is the disease incidence at time t (in weeks) within a particular flu season i (i = 1 to N). The error term εi,t is assumed to be i.i.d., uncorrelated, and additive. β0,i is the intercept of the yearly epidemic curve, or a baseline level. β1,i and β2,i are the respective coefficients of the harmonic variables. The two harmonic variables in the equation are sin(2πωt) and cos(2πωt), where ω is the weekly periodicity and, in this case, ω = 1/52.25. The periodicity could be of varying length and hence have a different unit of temporal aggregation (e.g., monthly, weekly, or daily within a year); however, a week is the most common time unit for surveillance systems. In this methodology, weeks are numbered so that the first week of an influenza season begins on July 1 of a year and ends June 30 the following year. This convention ensures that the flu peak occurs near the center of the influenza season. The parameters β1,i and β2,i estimated from Model 2 are used to obtain the phase angle parameter and amplitude for each season. These parameters are further used to calculate the peak timing and intensity. The intensity can be thought of as the severity or the strength of the influenza outbreak in that particular season. The standard deviations for each of the seasonality characteristics also can be calculated. In spite of the limitations of Serfling's regression, it can be used to predict subsequent years' mortality rates. The Annual Harmonic model is limited in that sense. It is descriptive and not usable for predictions because seasons are modeled independently. The AHR can give a more accurate description of a season and facilitate comparison between seasons, which the Serfling regression is unable to do. The application of the Annual Harmonic model in the 2010 paper by Julia Wenger and Elena Naumova that was published in PLoS ONE demonstrates there is information that could be known in advance about a particular season. For example, if there is a change in the influenza strain for a season, then the peak timing for that season might be earlier, or if the peak timing is earlier than the previous year, then it is likely that the season might be severe in comparison to the previous year.
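To illustrate how the per-season coefficients translate into peak timing and amplitude, here is a sketch in Python using NumPy and statsmodels. The Poisson GLM with one harmonic pair per season and the July-start week numbering follow the description above, but the synthetic data, the variable names, and the particular peak-week and amplitude formulas are illustrative choices; the published analysis may define intensity somewhat differently.

import numpy as np
import statsmodels.api as sm

def fit_season(counts, omega=1 / 52.25):
    # Fit the annual harmonic Poisson model to one season of weekly counts and return
    # the estimated peak week and amplitude (one common way of defining them).
    t = np.arange(1, len(counts) + 1, dtype=float)   # weeks within the season (July 1 start)
    X = np.column_stack([
        np.ones_like(t),
        np.sin(2 * np.pi * omega * t),
        np.cos(2 * np.pi * omega * t),
    ])
    fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
    b0, b1, b2 = fit.params
    amplitude = np.hypot(b1, b2)        # fitted mean at the peak is exp(b0 + amplitude)
    phase = np.arctan2(b1, b2)          # b1*sin + b2*cos = amplitude * cos(2*pi*omega*t - phase)
    peak_week = (phase % (2 * np.pi)) / (2 * np.pi * omega)
    return peak_week, amplitude

# Hypothetical usage with a made-up season of 52 weekly case counts peaking near week 27:
rng = np.random.default_rng(0)
weeks = np.arange(1, 53)
season = rng.poisson(np.exp(1.0 + 2.0 * np.cos(2 * np.pi * (weeks - 27) / 52.25)))
print(fit_season(season))   # peak week near 27, amplitude near 2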
Table 1—Seasonality Characteristics for 13 Influenza Seasons in MA

Influenza Season   No. of Cases   Peak Timing in Weeks (95% CI)   Intensity (95% CI)
1991-92                409           29.69 (± 3.40)                 18.22 (± 0.77)
1992-93                338           36.97 (± 4.70)                 12.53 (± 1.07)
1993-94                188           28.14 (± 4.30)                  4.10 (± 0.11)
1994-95                193           32.84 (± 3.86)                  3.64 (± 0.29)
1995-96                181           25.96 (± 2.94)                  4.78 (± 0.04)
1996-97                363           23.07 (± 3.28)                 10.25 (± 0.14)
1997-98                414           30.27 (± 2.63)                 12.98 (± 2.47)
1998-99                437           27.73 (± 4.48)                 15.03 (± 0.17)
1999-00                527           19.86 (± 3.45)                 25.51 (± 0.02)
2000-01                171           22.5  (± 2.89)                  2.96 (± 0.05)
2001-02                264           24.3  (± 4.12)                  7.29 (± 1.09)
2002-03                124           26.64 (± 3.38)                  2.05 (± 0.89)
2003-04                989           16.47 (± 4.05)                 49.88 (± 0.12)
Pneumonia and Flu Hospitalizations in Massachusetts, 1991–2004

Here, we provide an illustration of the use of the annual harmonic regression. The data for this analysis come from the U.S. Census Bureau and the Centers for Medicare and Medicaid Services (CMS). The U.S. census data consisted of the state-level age-specific population for Massachusetts from the 1990 and 2000 censuses (each abstracted from summary file 1 for the age group of 65 years and older). The CMS data set contains all hospitalization records from 1991 through 2004. This data set covers the U.S. population eligible for social security benefits from January 1, 1991, to December 31, 2004, and represents more than 98% of the U.S. population 65 years or older and those with end-stage renal disease who are entitled to Medicare Part A benefits. Data were abstracted from the deidentified CMS data set. Each record contained information about age, race, gender, and ZIP code of residence. Personal medical information is represented by up to 10 diagnoses coded by ICD-9-CM, the date of admission, the length of stay at the hospital, destination of discharge, and a death indicator. Influenza (ICD-9-CM 487) was selected as the outcome for this analysis, so the inclusion criteria were as follows: (a) any of the 10 original diagnostic codes beginning with "487"; (b) age greater than or equal to 65; (c) hospital admission date in 1991–2004; and (d) residence in the state of Massachusetts.
Application of Annual Harmonic Regression

Large variability in the seasonality characteristics was observed across the 13 flu seasons in Massachusetts. The main seasonality characteristics estimated from the annual harmonic Poisson regression models are summarized in Table 1. The table shows that the influenza seasons of 2003–2004 and 1999–2000 had an
earlier peak timing and higher intensity as compared to the other seasons. Also, the influenza seasons of 1994–1995 and 1992–1993 had a peak timing that was delayed as compared to other seasons. In addition, the table shows that most of the influenza seasons had peak timing between 22–30 weeks and intensity that ranged from 2–18. The 2002–2003 season was the mildest, with only 124 influenza cases, while the 2003–2004 season was the most severe, with 989 cases. The peak timing for influenza in Massachusetts varied between week 16 (3rd week in October from a regular calendar year) for the 2003–2004 season to week 37 in the 1992–1993 season (1st week of March from a regular calendar year). The peak timing seems to follow an increasing trend for 1999–2000, 2000–2001, 2001–2002, and 2002–2003. The intensity ranged from 2.05 in season 2002–2003 to 49.88 in season 2003–2004. The one season that is striking is 2003–2004. It had the earliest peak timing and the largest number of laboratory-confirmed cases of influenza. According to the CDC, the flu season of 2003–2004 highlighted the need for early vaccination. In 2003, the season started early and influenza was especially prevalent. Over half the states indicated widespread infection in all ages, primarily with the influenza A (H3N2) strain. Visits to health care providers were fivefold the national baseline and morbidity/ mortality figures were close to epidemic proportions, including 143 influenza-related deaths in children under the age of four. The 2003–2004 flu season provided evidence that young children are hospitalized at a higher rate than older children during flu season. 143 children died from complications related to the flu, with 41% of those under the age of two. Therefore, the Advisory Committee on Immunization Practices (ACIP) recommended that healthy children aged 6–23 months and 24–59 months and people close to them should be included in the high-priority group for vaccination in subsequent influenza
seasons. Additionally, two new surveillance activities regarding children were introduced by the CDC. These include reports of laboratory-confirmed influenza hospitalizations among children