Debia$ed Aggregated Poll$ and Prediction Market Price$ David Rothschild
Do your hopes and fears over an upcoming election shift with each new poll released? If so, consider Figure 1 the antidote to your daily roller coaster. Other than being an enjoyable pastime for many people, accurately documenting the underlying probability of victory on any given day before an election is critical for enabling academics to determine the impact of shocks to campaigns. Further, the more accurate and real-time the probabilities, the more efficient the public can be in choosing how to invest their time and money (in terms of having the greatest marginal impact for them and/or their cause) and the more efficient political organizations and campaigns can be in spending those resources. Shifts in the underlying probability of victory cannot justify the volatility illustrated by the daily poll-based probability in Figure 1, which is standard in the previous literature. Aggregated poll-based probabilities provide a more realistic progression of probabilities near the trend of the daily movement.
As a practical implication, daily polls are so volatile it is hard to grasp anything else on a chart that includes their probabilities.

The aggregated poll-based probability is derived with coefficients from a linear regression of results on the daily poll snapshot in past election cycles. The poll snapshot is the linear trend of all the polls released up to that day. It is projected onto Election Day by regressing the final vote share on the poll snapshot for each day before the election in previous election cycles (2000 and 2004 data are used for the presidential races): V_{y,r} = α + β P_{y,r} + e_{y,r}, where y is a given year and r is a given race. The daily projections for 2008 are created using the unique alpha and beta derived for each day before the election (T): V̂_{2008,T} = α_T + β_T P_{2008,T}. The alpha corrects for anti-incumbency bias (i.e., the depression of incumbent party numbers in the late summer/early fall), and the beta corrects for reversion to the mean (i.e., elections narrow as Election Day approaches). Both α_T and β_T are statistically significant, confirming that debiasing polls does make more accurate forecasts. A probability of victory is calculated by assuming the projections are the mean of a normal distribution. After assigning a variance based on the historic accuracy of the projections on a given day, one can determine the percentage of possible outcomes in which each candidate has the higher number of votes.

Do you debate (or bet) with your friends and family the probability of a candidate winning an upcoming election and how that probability shifts after a controversial vice presidential choice or televised debate? The chart in Figure 2 is the same as Figure 1, but excludes the daily poll-based probability and adds prediction market–based forecasts and
annotations of the major events from the election cycle. The prediction market prices are from Intrade (www.Intrade.com). Intrade trades binary options that pay, for example, $10 if the chosen candidate wins and $0 otherwise. Thus, if there are no transaction costs, an investor who pays $6 for a "Democrat to win" stock and holds the stock through Election Day earns $4 if the Democrat wins and loses $6 if the Democrat loses. In that scenario, the investor should be willing to pay up to the price equaling the estimated probability of the Democrat winning the election.

Figure 2 illustrates why prediction market prices need to be debiased to correct for the favorite-longshot bias (i.e., the restriction of market prices as they approach 0 and 1 due to transaction costs, liquidation concerns, and non-risk-neutral investing), where the previous literature only debiased polls. This bias is a major problem for the accuracy of prediction markets, as 44 of 51 Electoral College races had consensus predictions over 90% for one of the candidates on the eve of the 2008 election. The cost of not debiasing is seen in the difference between the prediction market price and the debiased prediction market–based probability as Obama's victory becomes increasingly certain. The favorite-longshot bias is ameliorated with the transformation suggested by Andrew Leigh in "Is There a Favorite-Longshot Bias in Election Markets?" presented at the 2007 University of California, Riverside Conference on Prediction Markets, which is estimated (and suggested) prior to my sample using data from presidential prediction markets from 1880 to 2004: Pr = Φ(1.64·Φ⁻¹(price)), where Φ is the cumulative normal distribution function. The transformation converts price to a normal Z-score (standardized to mean 0 and variance 1), multiplies by 1.64 (thereby inflating the value), and then computes a probability.

Figure 2 also demonstrates why extending forecast research to state-level races is essential to gathering the data necessary to determine some causality or, at minimum, a fuller description of correlations between events and electoral outcomes. First, the figure shows that while there is strong correlation between the polling and prediction market–based forecasts, there is still considerable variation at points during the cycle. Both sets of probabilities have Republican candidate John McCain moving up after the Republican National Convention and the announcement of Sarah Palin as his running mate, but only the market has him crossing the 50% threshold (i.e., predicting he wins the election). Second, even if there were a consensus on the underlying national values, it is impossible to determine causality of events on outcomes using national data calibrated daily; there are too many overlapping events in too few races. Finally, there is evidence that the national popular vote prediction markets may suffer from manipulation by people motivated to gain publicity for their chosen candidate, but this evidence does not extend to the state-level markets. Moreover, the national popular vote does not determine the winner of the U.S. presidential election, since the election outcome hinges on the results in 51 individual sovereignties through the Electoral College.

Figure 1. Probability of victory in the national popular vote for the incumbent party candidate (John McCain) in the 2008 presidential election based on debiased daily and aggregated polls

Figure 2. Probability of victory in the national popular vote for the incumbent party candidate (John McCain) in the 2008 presidential election based on debiased aggregated polls, prediction market prices, and debiased prediction market prices
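To make the two debiasing steps concrete, here is a minimal Python sketch using SciPy. The coefficient, variance, and price values are invented placeholders, not the article's estimates; only the functional forms (the day-specific projection, the normal probability-of-victory calculation, and the Leigh transformation) come from the description above.

# Hypothetical illustration of the debiasing steps described above.
# alpha_T, beta_T, sigma_T, and the prices are made-up values, not the article's estimates.
from scipy.stats import norm

# Poll debiasing: project the day-T poll snapshot onto Election Day.
alpha_T, beta_T = 0.02, 0.90      # assumed day-specific regression coefficients
P_T = 0.48                        # incumbent-party share in the poll snapshot
V_hat = alpha_T + beta_T * P_T    # projected Election Day vote share

# Probability of victory: treat the projection as the mean of a normal distribution
# whose spread reflects the historic accuracy of day-T projections.
sigma_T = 0.03                    # assumed historic root-mean-square projection error
prob_poll = 1 - norm.cdf(0.5, loc=V_hat, scale=sigma_T)

# Market debiasing: correct the favorite-longshot bias with Pr = Phi(1.64 * Phi^-1(price)).
price = 0.90                      # market price expressed as a probability (e.g., $9 on a $10 contract)
prob_market = norm.cdf(1.64 * norm.ppf(price))

print(f"poll-based probability of victory: {prob_poll:.3f}")
print(f"debiased market probability:       {prob_market:.3f}")

With these made-up inputs, the transformation pushes a 0.90 market price toward 1, which is the direction of correction the article describes for heavy favorites.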
Further Reading

Erikson, Robert S., and Christopher Wlezien. 2008. Are political markets really superior to polls as election predictors? Public Opinion Quarterly 72:190–215.

Leigh, Andrew, Justin Wolfers, and Eric Zitzewitz. 2007. Is there a favorite-longshot bias in election markets? Paper presented at the 2007 UC Riverside Conference on Prediction Markets, Riverside, California.

Rothschild, David. 2009. Forecasting elections: Comparing prediction markets, polls, and their biases. Public Opinion Quarterly 73:895–916.
Voting by Education in 2008
Andrew Gelman and Yu-Sung Su

Traditionally, the Republican Party has been made up of the elite, while the Democratic Party has received the votes of common people. The 2008 election, however, pitted military veteran John McCain against Harvard law graduate (and University of Chicago lecturer) Barack Obama. And, especially after the bitter Hillary Clinton/Obama primary election campaign, there was interest in the political attitudes of voters with different levels of education. We decided to go beyond raw crosstabs and look at vote by education, age, and ethnicity. Here is what we found.

In Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do, we used multilevel modeling to estimate opinion for small groups, but, here, we are simply plotting weighted survey data (with the Pew Research survey weighting augmented by further weighting adjustments to match state-level demographic and voting breakdowns) and letting the reader smooth the graphs by eye in some of the low-sample-size categories among the ethnic minorities (Figure 1). We designed the graphs to speak for themselves, but we offer a few comments, first on the technical level of statistical graphics and, second, on the political background.

Our grid of 16 plots is an example of the "small multiples" idea advocated by pioneering graphics researchers Jacques Bertin and Edward Tufte. We used a common scale (scaling the bounds of the y-axis from 0% to 100% and carefully bypassing the default options in R, which would have resulted in difficult-to-interpret axes with far too many labels on the scale) and added a guide line at 50% for each. We labeled the outside axes carefully—the defaults in R won't do the job here—and added enough notes at the bottom to make the entire package self-contained, a must for graphs that will be cut and copied on the web, sometimes without attribution or links back to the original source.

As for the political context, the generally nonmonotonic relation between education and vote preference is not new: For the past several national elections, Democrats have done well among the voters with the lowest and highest levels of education. It's a good idea to break up the estimates by age (because of trends in education levels over time and because many people in the youngest age category are still in the middle of their educations) and ethnicity (because of the high correlations between low education, minority status, and Democratic voting). Despite the patterns we see for education levels, voting by income remains strongly patterned along traditional lines, with Democrats continuing to win the vast majority of low-income voters and Republicans doing best among the top 10%–20%. Much more can be done in this area with regression models and further graphs. The plots shown here simply illustrate how much can be learned by a simple (but not trivial) grid of line plots.

Figure 1. Republican vote in 2008 by education, among age/ethnic groups (panel columns: Whites, Blacks, Hispanics, Other). Education categories: no high school, high school, some college, college grad, postgrad. Estimates based on Pew Research pre-election polls. Error bars show +/- two standard errors. Area of each circle is approximately proportional to number of voters in the category. Graph courtesy of the National Science Foundation, Department of Energy, and Institute for Education Sciences.
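For readers who want to try this kind of display themselves, the sketch below builds a comparable small-multiples grid in Python with matplotlib rather than R. The plotted numbers are simulated placeholders, not the Pew-based estimates in Figure 1; the point is only to show the display choices discussed above: a shared 0%–100% scale, a 50% guide line in each panel, and labels kept to the outside axes.

# Small-multiples sketch with a common 0-100% scale and a 50% guide line.
# The values plotted are simulated placeholders, not the Pew estimates in Figure 1.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2008)
ethnic_groups = ["Whites", "Blacks", "Hispanics", "Other"]
age_groups = ["18-29", "30-44", "45-64", "65+"]
education = ["No HS", "HS", "Some coll.", "Coll. grad", "Postgrad"]

fig, axes = plt.subplots(len(age_groups), len(ethnic_groups),
                         figsize=(8, 8), sharex=True, sharey=True)
for i, age in enumerate(age_groups):
    for j, eth in enumerate(ethnic_groups):
        ax = axes[i, j]
        share = rng.uniform(20, 80, size=len(education))  # placeholder Republican vote share
        ax.plot(range(len(education)), share, "o-")
        ax.axhline(50, linestyle="dotted", linewidth=0.8)  # 50% guide line
        ax.set_ylim(0, 100)                                # common scale in every panel
        if i == 0:
            ax.set_title(eth)                              # column labels: ethnic group
        if j == 0:
            ax.set_ylabel(f"{age}\n% Republican")          # row labels: age group
        if i == len(age_groups) - 1:
            ax.set_xticks(range(len(education)))
            ax.set_xticklabels(education, rotation=45, fontsize=7)

fig.suptitle("Republican vote by education within age/ethnic groups (simulated data)")
fig.tight_layout()
plt.show()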
Further Reading

Bertin, J. 1967, 1983. Semiology of graphics. Trans. W. J. Berg. Madison: University of Wisconsin Press.

Gelman, A., and Y. Ghitza. 2010. Who votes? How did they vote? And what were they thinking? New York: Columbia University.

Gelman, A., D. Park, B. Shor, and J. Cortina. 2009. Red state, blue state, rich state, poor state: Why Americans vote the way they do. 2nd ed. Princeton, New Jersey: Princeton University Press.

Tufte, E. R. 1990. Envisioning information. Cheshire, Connecticut: Graphics Press.
Risk-Limiting Vote-Tabulation Audits: The Importance of Cluster Size Philip B. Stark
Post-election vote-tabulation audits compare hand counts of votes in a collection of groups (clusters) of paper records (voter-verified paper audit trail or VVPAT) to reported machine counts of the votes in the same clusters. Vote-tabulation audits can serve a variety of roles, including process monitoring, quality improvement, fraud deterrence, and bolstering public confidence. All of these raise statistical issues. This article focuses on audits that check whether the machine-count outcome is correct. The outcome is the set of winners, not the numerical vote totals. The machine-count outcome is correct if it agrees with the outcome that a full hand count of the paper audit trail would show. Hand counts can have errors, but many jurisdictions define the correct outcome to be the outcome a hand count shows. Moreover, when the hand count of a cluster of ballots disagrees with the machine count, jurisdictions typically repeat the hand count until they are satisfied that the problem is with the machine count, not the hand count. Generally, the only legally acceptable way to prove a machine-count outcome is wrong—and to repair it—is to count the entire audit trail by hand.

An audit that has a prespecified chance of requiring a full hand count if the machine-count outcome is wrong—no matter what caused the outcome to be wrong—is called a risk-limiting audit. The risk is the maximum chance that there won't be a full hand count when the machine-count outcome is wrong.
The Role of Statistics in Risk-Limiting Audits

Statistics lets us reduce the amount of counting when the machine-count outcome is right, while ensuring there is still a big chance of counting the entire audit trail if that outcome is wrong. Risk-limiting audits can be couched as hypothesis tests. The null hypothesis is that the machine-count outcome is incorrect. To reject the null is to conclude that the machine-count outcome is correct. A type I error occurs if we conclude that the machine-count outcome is correct when a full hand count would show that it is wrong. The significance level is the risk.

It is natural and convenient to test the null hypothesis sequentially: Draw a random sample of clusters and audit them. If the sample gives strong evidence that the null hypothesis is false, stop auditing. Otherwise, expand the sample and evaluate the evidence again. Eventually, either we have counted all the clusters by hand and know the correct outcome or we stopped auditing without a full hand count. We can limit the risk to level α by designing the audit so the chance it stops short of a full hand count is at most α in every scenario in which the machine-count outcome is wrong.

The amount of hand counting needed to confirm a correct outcome is indeed correct depends on the sampling design,
margin, number of ballots cast, number and nature of the differences the audit finds, and number of votes for each candidate in each of the clusters from which the sample is drawn. Clusters typically correspond to precincts or precincts divided by mode of voting (e.g., votes cast in person vs. votes cast by mail). We shall see that using smaller clusters can dramatically reduce the amount of hand counting when the machine-count outcome is right.
Heuristic Examples

Jellybeans

We have 100 4 oz. bags of various flavors of jellybeans. Some bags have assorted flavors, some only a single flavor. Each 4 oz. bag contains 100 jellybeans, so there are 10,000 in all. I love coconut jellybeans and want to estimate how many there are in the 100 bags. The canonical flavor assay for jellybeans is destructive tasting, so the more we test, the fewer there are left to share. Consider two approaches.

1. Pour the 100 bags into a large pot and stir well. Then, draw 100 beans without looking. Estimate the total number of coconut jellybeans to be the number of coconut jellybeans in the sample multiplied by 100.

2. Select one of the 4 oz. bags at random. Estimate the total number of coconut jellybeans to be the number of coconut jellybeans in that bag of 100 multiplied by 100.

Both estimates are statistically unbiased, but the first has much lower variability. Mixing disperses the coconut jellybeans pretty evenly. The sample is likely to contain coconut jellybeans in roughly the same proportion as the 100 bags do overall, so multiplying the number in the sample by 100 gives a reasonably reliable estimate of the total. In contrast, a single bag of 100 selected at random could contain only coconut jellybeans (if any of the bags has only coconut) or no coconut jellybeans (if any of the bags has none). Since the bags can have different proportions of coconut jellybeans, 100 selected the second way can be likely to contain coconut jellybeans in a proportion rather different from the overall proportion, and multiplying the number of coconut jellybeans in that bag by 100 could have a large chance of being far from the total number of coconut jellybeans among the 10,000. To get a reliable estimate by counting the coconut jellybeans in randomly selected bags, we would need to test quite a few bags (i.e., quite a few clusters). It's more efficient to mix the beans before selecting 4 ounces. Then, 4 ounces suffices to get a reasonably reliable estimate.
Table 1—Chance That a Sample of 500 Ballots Contains at Least One Misinterpreted Ballot in Various Scenarios

Misinterpreted ballots by precinct       Randomly selected    10 randomly selected    Simple random
                                         precinct of 500      clusters of 50          sample of 500
10 in every precinct                     100%                 100%                    99.996%
10 in 98 precincts, 20 in 1 precinct     99%                  ~100%                   99.996%
20 in 50 precincts                       50%                  99.9%                   99.996%
250 in 4 precincts                       4%                   33.6%                   99.996%
500 in 2 precincts                       2%                   18.4%                   99.996%

Note: There are 100 precincts containing 500 ballots each, and 1,000 of the 50,000 ballots (2%) are misinterpreted. Column 1: the way in which the 1,000 misinterpreted ballots are spread across precincts. Columns 2–4: the way in which the sample is drawn. Column 2: one precinct of 500 ballots drawn at random. Column 3: 10 clusters of 50 ballots drawn at random without replacement. Column 4: a simple random sample of 500 ballots. When a precinct is subdivided into 10 clusters, the number of misinterpreted ballots in those clusters is assumed to be equal.
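The entries in Table 1 can be reproduced with the hypergeometric distribution, which governs sampling without replacement. As a check, here is a short Python snippet for the "500 in 2 precincts" row; the population counts are taken from the table's note.

# Chance that a sample contains at least one misinterpreted ballot,
# for the "500 in 2 precincts" row of Table 1 (1,000 bad ballots among 50,000).
from scipy.stats import hypergeom

def p_detect(population, bad, sample_size):
    """P(at least one bad unit in a simple random sample drawn without replacement)."""
    return 1 - hypergeom.pmf(0, population, bad, sample_size)

# Simple random sample of 500 ballots from 50,000, of which 1,000 are misinterpreted:
print(p_detect(50_000, 1_000, 500))   # about 0.99996

# One precinct drawn from 100, where the bad ballots fill 2 entire precincts:
print(p_detect(100, 2, 1))            # 0.02

# 10 clusters of 50 drawn from 1,000 clusters, 20 of which hold the bad ballots:
print(p_detect(1_000, 20, 10))        # about 0.184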
Conversely, suppose a sample of 100 jellybeans drawn the first way contains no coconut jellybeans. We would then have 95% confidence that there are no more than 293 coconut beans among the 10,000. In contrast, if a sample drawn the second way contains no coconut jellybeans, we would only have 95% confidence that there are no more than 9,500 coconut jellybeans among the 10,000. To have 95% confidence that there are no more than 293, we would have to test at least 63 of the 100 bags—63 times as many jellybeans as a simple random sample requires.

How Salty Is the Stock?

We have 100 12 oz. cans of stock composed of a variety of brands, styles, and types: chicken, beef, vegetable, low-sodium, regular, etc. We want to know how much salt there is in all 1,200 ounces of stock as a whole. The salt assay ruins the portion of the stock tested: The more we test, the less there is to eat. Consider two approaches:

1. Open all the cans, pour the contents into a large pot, stir well, and remove a tablespoon of the mix. Determine the amount of salt in that tablespoon and multiply by the total number of tablespoons in the 100 cans (1T = 0.5 oz, so the total number of tablespoons in the 100 cans is 12×100×2 = 2,400T).

2. Select a can at random, determine the amount of salt in that can, and multiply by 100.

Both estimates are statistically unbiased, but the first estimate has much lower variability: That single tablespoon is extremely likely to contain salt in roughly the same concentration the 100 cans have on the whole. In contrast, a can selected the second way can be likely to contain salt in a concentration rather different from the 1,200 ounces of stock as a whole, unless all the cans have nearly identical concentrations of salt. For the first approach, we can get a reliable estimate of the total salt from a single tablespoon (0.5 oz) of stock. But, for the second approach, even 12 ounces of stock is not enough to
get a reliable estimate. The first approach gives a more reliable result at lower cost: It spoils less stock. To get a reliable estimate by sampling cans, we would need to assay quite a few cans selected at random. A single can is not enough, even though it contains 24 tablespoons of stock—far more than we need in the first approach. It’s more efficient and cheaper to mix the stock before selecting the sample.
Connection to Election Auditing

A vote-tabulation error that causes the machine-count margin to appear larger than the true margin is like a coconut jellybean or fixed quantity of salt. A precinct or other cluster of ballots is like a bag of jellybeans or a can of stock. Drawing the audit sample is like selecting a scoop or a bag of jellybeans or a tablespoon or can of stock. Counting ballots by hand has a cost: The more we have to count, the greater the cost. Hence, we want to count as few ballots as possible as long as we can still determine whether the electoral outcome is correct—whether the number of errors is insufficient to account for the margin of victory. Similarly, testing the flavor of jellybeans or assaying the salt in the soup also has a cost. (Although I'd volunteer to determine the flavor of jellybeans, gratis.) There are also costs for reporting votes in small clusters and organizing ballots so those clusters can be retrieved, just as there are costs involved in opening all the bags of jellybeans and mixing them together and in opening all the cans of soup and mixing them together. Reporting votes for small clusters of ballots also can reduce voter privacy.

In the food examples, the first approach is like auditing individual ballots or small clusters. All the ballots are mixed together well. A relatively small sample can give a reliable estimate of the difference between the machine counts and what a full hand count would show for the entire contest. The second approach is like auditing using precincts or other large clusters of ballots. Many errors that increased the apparent margin could be concentrated in a small number of clusters, because there is no mixing across clusters.
Table 2—Chance That the Percentage of Misinterpreted Ballots in a Sample of 500 Is at Least 1% in Various Scenarios

Misinterpreted ballots                   Randomly selected    10 randomly selected        Simple random
                                         precinct of 500      precinct clusters of 50     sample of 500
10 in every precinct                     100%                 100%                        97.2%
10 in 98 precincts, 20 in 1 precinct     99%                  ~100%                       97.2%
20 in 50 precincts                       50%                  62.4%                       97.2%
250 in 4 precincts                       4%                   5.7%                        97.2%
500 in 2 precincts                       2%                   18.4%                       97.2%

Note: There are 100 precincts containing 500 ballots each, and 1,000 of the 50,000 ballots (2%) are misinterpreted. Column 1: the way in which the 1,000 misinterpreted ballots are spread across precincts. Columns 2–4: the way in which the sample is drawn. Column 2: one precinct of 500 ballots drawn at random. Column 3: 10 clusters of 50 ballots drawn at random without replacement. Column 4: a simple random sample of 500 ballots. When a precinct is subdivided into 10 clusters, the number of misinterpreted ballots in those clusters is assumed to be equal.
A single cluster drawn using the second approach doesn’t tell us much about the overall rate of vote-tabulation errors, no matter how large the cluster is (within reason). To compensate for the lack of mixing across clusters of ballots, we need to audit many clusters, just as we need to count the coconut jellybeans in many bags or assay many cans of soup if we don’t mix their contents across clusters before drawing the sample.
Numerical Examples

Suppose we have 50,000 ballots in all, 500 ballots cast in each of 100 precincts. We will draw a random sample of 500 ballots to tally by hand to check against machine subtotals. Consider the three ways of selecting 500 ballots: (i) drawing a precinct at random, (ii) drawing 10 clusters of 50 ballots at random without replacement, and (iii) drawing 500 individual ballots at random without replacement (a simple random sample). Method (i) gives the least information about the whole contest; method (iii) gives the most, as we shall see. The smaller the clusters are, the harder it is to hide error from the random sample.

Suppose that for 1,000 (i.e., 2%) of the ballots, the machine interpreted the vote to be for the machine-count winner, but a manual count would show a vote for the apparent loser. What is the chance the hand count of the votes in the sample finds any of those 1,000 ballots? For method (iii), the chance does not depend on how the misinterpreted ballots are spread across precincts: It is about 99.996%, no matter what. But for methods (i) and (ii), the chance does depend on how many incorrectly interpreted ballots there are in each cluster. For simplicity, assume that when a precinct is divided into 10 clusters, the number of misinterpreted ballots in each of those 10 clusters is the same. For instance, if the precinct has 20 misinterpreted ballots, each of the 10 clusters has two misinterpreted ballots.

Table 1 gives the resulting probabilities. They vary widely. In the case most favorable to precinct-based sampling, hand counting a single randomly selected precinct is guaranteed to find a misinterpreted ballot (10, in fact). But the chance falls quickly as the misinterpreted ballots are concentrated into
fewer precincts. In the scenario least favorable to precinct-based sampling, the chance is only 2% for a randomly selected precinct and 18.4% for 10 randomly selected clusters of 50—but remains 99.996% for simple random sampling.

If misinterpretations are caused by equipment failures in precincts, that might concentrate errors in only a few precincts. If misinterpretations occur because poll workers accidentally provided voters pens with the wrong color or type of ink, that might concentrate errors in only a few precincts. If a fraudster tries to manipulate the outcome, he or she might target the ballots in only a few precincts to avoid detection or for logistical simplicity. In these three hypothetical situations, if the sample is drawn by selecting an entire precinct, it could easily be squeaky clean. But, with the same counting effort, the chance of finding at least one error if the 500 ballots are drawn as a simple random sample remains extremely high, 99.996%, whether the misinterpreted ballots are concentrated in only a few precincts or spread throughout all 100.

Even when the sample does find misinterpreted ballots, the percentage of such ballots in the sample can be much lower than the percentage in the contest as a whole. As before, suppose that for 1,000 (i.e., 2%) of the ballots, the machine interpreted the vote to be for the machine-count winner, but a manual count would show them to be for the apparent loser. What is the chance that the percentage of misinterpreted ballots in the sample is at least 1%? Table 2 gives the answers for the same set of scenarios. In the situation most favorable to precinct-based sampling, hand counting a single randomly selected precinct is guaranteed to reveal that at least 1% of the ballots were misinterpreted (in fact, it will show that 2% were). But, the chance falls quickly as the misinterpreted ballots are concentrated into fewer precincts. In the case least favorable to precinct-based sampling, the chance is only 2% for a randomly selected precinct and 18.4% for 10 randomly selected clusters of 50, but remains 97.2% for simple random sampling. Using smaller clusters increases the chance that the percentage of misinterpreted ballots in
the sample will be close to the percentage of misinterpreted ballots in the contest as a whole. Smaller clusters yield more reliable estimates.

Suppose the hand counts and machine counts match perfectly for a sample drawn in one of the three ways—no errors are observed. What could we conclude about the percentage of misinterpreted ballots in the contest as a whole, at 95% confidence? Table 3 gives the answers. For the same counting effort, the simple random sample tells us far more about the rate of misinterpreted ballots in the contest as a whole.

Table 3—Upper 95% Confidence Bounds for the Number of Misinterpreted Ballots for Three Ways of Drawing 500 Ballots at Random When the Sample Contains No Misinterpreted Ballots

Randomly selected      10 randomly selected     Simple random
precinct of 500        clusters of 50           sample of 500
95.0%                  25.7%                    0.58%

Note: There are 100 precincts containing 500 ballots each. Columns 1–3: the way in which the sample is drawn. Column 1: One precinct of 500 ballots is drawn at random. Column 2: 10 clusters of 50 ballots drawn at random without replacement. Column 3: A simple random sample of 500 ballots is drawn. The bounds are obtained by inverting hypergeometric tests.
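The bounds in Table 3, like the jellybean bound of 293 above, can be reproduced by inverting the hypergeometric test, as the table's note states: find the largest number of bad units that still has at least a 5% chance of producing an error-free sample. A minimal Python sketch:

# Upper 95% confidence bound on the number of bad units when a sample drawn
# without replacement contains none, found by inverting a hypergeometric test.
from scipy.stats import hypergeom

def upper_bound_when_clean(population, sample_size, alpha=0.05):
    """Largest bad-unit count whose chance of yielding an all-clean sample is still >= alpha."""
    bad = 0
    while hypergeom.pmf(0, population, bad + 1, sample_size) >= alpha:
        bad += 1
    return bad

# Jellybeans: 100 beans drawn from 10,000 with no coconut beans found.
print(upper_bound_when_clean(10_000, 100))      # about 293 beans

# Table 3, simple random sample: 500 ballots drawn from 50,000, none misinterpreted.
b = upper_bound_when_clean(50_000, 500)
print(b / 50_000)                               # about 0.58% of the ballots

# Table 3, cluster sample: 10 clusters drawn from 1,000, none with errors.
# Each unsampled bad cluster could hide up to 50 misinterpreted ballots.
c = upper_bound_when_clean(1_000, 10)
print(c * 50 / 50_000)                          # about 25.7% of the ballots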
Discussion

Audits that have a guaranteed minimum chance of leading to a full hand count whenever the machine-count outcome is incorrect—thereby repairing the outcome—are called risk-limiting audits. The risk is the largest chance that the audit will not proceed to a full hand count when the machine-count outcome is incorrect. Risk-limiting audits can be implemented as sequential tests of the null hypothesis that the machine-count outcome is incorrect. The significance level is the risk.

I introduced risk-limiting audits in 2007 and conducted six field pilots of risk-limiting audits in 2008 and 2009. In April 2010, the American Statistical Association (ASA) Board of Directors endorsed risk-limiting audits and called for risk-limiting audits to be conducted for all federal and state contests and a sample of smaller contests. While 22 states have laws that require some kind of post-election vote-tabulation audit, none currently requires risk-limiting audits. California bill AB 2023, signed into law in July after unanimous bipartisan support in the California Assembly and Senate, calls for an official pilot of risk-limiting audits in 2011. The ASA, California Common Cause, Verified Voting Foundation, and Citizens for Election Integrity Minnesota endorsed AB 2023.

Reducing cluster size can dramatically reduce the hand counting required for risk-limiting audits. For instance, a 2009 risk-limiting audit in Yolo County, California, audited a cluster sample of 1,437 ballots to attain a risk limit of 10%. Clusters were precincts, split by mode of voting (in person vs. by mail). A simple random sample of just six ballots—about 240 times fewer—would have sufficed, if no errors were found.

There are tradeoffs. Using smaller clusters requires vote-tabulation systems and procedures that report subtotals for
smaller clusters, and it requires elections officials to be able to locate and retrieve the paper trail for those clusters. There is also a tradeoff between cluster size and voter privacy. If a group of voters can be linked to a cluster of ballots with similar voting patterns, one can determine how those voters voted. The biggest impediment to efficient risk-limiting audits is the inability of current commercial vote-tabulation systems to report the machine interpretation of small clusters of ballots or individual ballots. The next generation of vote-tabulation systems should be designed with auditing in mind.
Further Reading

American Statistical Association. 2010. American Statistical Association statement on risk-limiting post-election audits. www.amstat.org/outreach/pdfs/Risk-Limiting_Endorsement.pdf.

Lindeman, M., M. Halvorson, P. Smith, L. Garland, V. Addona, and D. McCrea. 2008. Principles and best practices for post-election audits. www.electionaudits.org/files/best%20practices%20final_0.pdf.

Saldaña, L. 2010. California assembly bill 2023. www.leginfo.ca.gov/pub/09-10bill/asm/ab_2001-2050/ab_2023_bill_20100325_amended_asm_v98.html.

Stark, P. B. 2008. Conservative statistical post-election audits. Ann. Appl. Stat. 2:550–581.

Stark, P. B. 2008. A sharper discrepancy measure for post-election audits. Ann. Appl. Stat. 2:982–985.

Stark, P. B. 2009. CAST: Canvass audits by sampling and testing. IEEE Transactions on Information Forensics and Security 4:708–717.

Stark, P. B. 2009. Efficient post-election audits of multiple contests: 2009 California tests. Presented at the Conference on Empirical Legal Studies.

Stark, P. B. 2009. Risk-limiting post-election audits: P-values from common probability inequalities. IEEE Transactions on Information Forensics and Security 4:1005–1014.

Stark, P. B. 2010. Super-simple simultaneous single-ballot risk-limiting audits. In 2010 Electronic Voting Technology Workshop/Workshop on Trustworthy Elections (EVT/WOTE '10), D. Jones, J. J. Quisquater, and E. K. Rescorla, eds.
Fixed-Cost vs. Fixed-Risk Post-Election Audits in Iowa Jonathan Hobbs, Luke Fostvedt, Adam Pintar, David Rockoff, Eunice Kim, and Randy Griffiths
Electoral integrity has been a forefront issue during the past decade. Policy developments such as voter-verified paper records and post-election audits bolster transparency and voter confidence. A post-election audit is a manual recount of some or all of the voting units, generally precincts, involved in a particular race. Machine-recorded tallies are typically reported as precinct totals; an audit includes a manual recount of actual ballots to compare to the reported totals. States have a variety of audit laws, many of which mandate recounts of a fixed percentage of randomly selected precincts.

An audit procedure should detect an outcome-altering error with high probability in a cost-effective manner. Audits that limit risk, which is the probability of certifying an incorrect election outcome, achieve these objectives, but can involve complicated procedures. From a statistical perspective, risk can be thought of as the probability of a type II error: failing to detect an incorrect outcome. There is no type I error, or declaring an incorrect outcome when the reported result is correct, because a correct outcome cannot be overturned based on audit results alone. In races with large margins of victory or many precincts, risk-limiting audits are more cost effective than their fixed-percentage counterparts, and risk-limiting audits can have substantially reduced risk in close races or small populations.

Across the nation, states must balance a number of considerations, including efficiency (cost), statistical effectiveness, public perception, and logistical constraints. Two important logistical constraints governing elections are:

1. Elections are usually administered at the county level. In Iowa, county auditors oversee elections.

2. Vote counts are reported in batches, usually by precinct or polling location. These are the totals audited in a recount.
Iowa’s Proposed Legislation Optical scan ballots are used in all 99 Iowa counties. As an example, sample ballots from the 2008 general election for Johnson County—Iowa’s fourthlargest county—are available at www. johnson-county.com/auditor/sampbalgeneral/ sampbal2008general.htm. A paper record is a prerequisite for a post-election audit, but Iowa is one of 25 states that do not require a statewide manual audit of election results. The state code does include provisions for candidates to request recounts. In addition, county auditors may request administrative recounts in their jurisdictions. Responding to the concerns of engaged citizens, the Iowa secretary of state formed a post-election task force in 2008. The task force included state legislators, county auditors, and electoral experts. Over the course of a year, the group drafted a bill that was introduced in the 2009 legislative session. Members of the Iowa task force tackled the challenges for the state, most notably the substantially varying populations across Iowa’s counties. The task
force based Iowa's bill on a post-election audit regulation passed in Minnesota in 2006. The approach attempts to formulate a relatively simple audit procedure that accounts for the state's uneven spatial distribution of voters. Specifically, Iowa's proposed legislation calls for a post-election audit after each general election. Depending on the election year, either the results for president or governor would be audited with another randomly selected state or federal office. Each county would perform its audit by randomly choosing a specified number of precincts and recounting the votes for those precincts. Counties would select one precinct if the county has seven or fewer precincts. Counties with more than seven precincts would select the following:

Two precincts if the county has 50,000 or fewer registered voters

Three precincts if the county has 50,001–100,000 registered voters

Four precincts if the county has more than 100,000 registered voters
Assessing Iowa’s Proposed Legislation
Figure 1. Number of registered voters (in thousands) by county in Iowa. White text indicates counties that would be required to audit four precincts.
Figure 1 displays Iowa’s electoral makeup in 2010. Polk County, which includes Des Moines, has nearly 300,000 registered voters and almost twice as many as the next-largest county. Just four counties would fall into the bill’s largest tier, and only five would audit three precincts. Eighty of Iowa’s 99 counties would audit two precincts. The bill calls for escalation within a county if the recount reveals a discrepancy of at least 0.5% from the reported result. After this initial round of escalation, a statewide recount could be declared at the discretion of the Iowa secretary of state. The bill did not meet approval of both legislative houses by the end of the 2010 legislative session.
Characteristics of Audit Methodology

The American Statistical Association (ASA) has increasingly advocated post-election audits in recent years. In April 2010, the ASA Board of Directors recommended routine risk-limiting audits for all federal and state elections. Previously, the organization endorsed three of the principles and best practices for post-election audits developed by ElectionAudits.org:

Principle 5, Risk-Limiting Audits: Post-election audits reduce the risk of confirming an incorrect outcome. Audits designed explicitly to limit such risk (risk-limiting audits) have advantages
over fixed-percentage or tiered audits, which often count fewer or more ballots than necessary to confirm the outcome.

Principle 6, Addressing Discrepancies and Continuing the Audit: When discrepancies are found, additional counting and/or other investigation may be necessary to determine the election outcome and find the cause of the discrepancies.

Principle 7, Comprehensive: All jurisdictions and ballot types—absentee, mail-in, and accepted provisional—should be subject to the selection process.

Principle 7 is satisfied in the proposed legislation because Iowa has optically scanned paper ballots and the absentee ballots in each county are considered a precinct. Therefore, all ballots would have a chance of being selected in the audit. Principle 6 is also satisfied because the audit would escalate within a county when a discrepancy of at least 0.5% is found. Principle 5 is not completely satisfied for two reasons. First, the risk and efficiency depend on the margin of victory. In addition, the decision to escalate to a full recount is ultimately left to the secretary of state, which makes the computation of the true risk impossible, as an individual's judgment cannot be quantified.
Possible Miscount Situations

There are potentially many situations—from apparently benign machine malfunctions to fraudulent actions—that can cause miscounts or discrepancies between a precinct's reported vote totals and the true election outcome. In the instance of malicious election fraud, a surreptitious strategy for altering the election results is to switch many votes in as few precincts as possible. This vote-switching is perhaps a conservative alternative to potentially more realistic situations.

A common issue in U.S. elections is an undercount, which occurs when the total votes tabulated for a particular race fall short of the total ballots cast. A famous undercount example was in Florida during the 2000 presidential election, when a vote for a candidate was not recorded on 178,145 ballots. While a small number of these ballots reflected voters who simply chose not to vote for a candidate, there was a large undercount of votes due to the poorly designed butterfly ballot. Further, a Scripps Howard study found seven states, as well as 544 individual counties, had a larger percentage of undercounts than Florida's 2.9%. An undercount would effectively have half the impact of a switched vote, so more votes and/or precincts would need to be changed by undercounts than actual vote-switching to alter an election outcome.

Numerical Assessment Procedure

To convey the efficacy of Iowa's proposed legislation, two mechanisms for generating miscounts in a precinct are combined with two dispersal scenarios of miscounted precincts throughout the state. The first dispersal scenario is that of random dispersal. Under this scenario, the precincts in which miscounts occur are selected randomly with equal probability and without replacement from the state's precinct population. Such a scenario corresponds to a plausible situation in which the vote-tabulating machines malfunction at random precincts. The second scenario involves a systematic dispersal of the miscounts, where they are concentrated in the largest precincts. This represents a worst-case scenario, since it would take the fewest miscounted precincts
to alter the results. This scenario corresponds to a situation in which the election results are being altered by a dishonest entity that wishes to avoid detection (i.e., election fraud).

The two miscount-generating mechanisms are vote-switching and undercounts. To simplify assessments, only two-candidate statewide races are considered—Candidate A vs. Candidate B, where Candidate B is incorrectly reported to have won. The reported election results from the 2006 Iowa general election are used in this assessment.

To assess risk, the first step is to determine the precincts in which miscounts occur. This is accomplished under all combinations of the two miscount-generating mechanisms and two dispersal scenarios by using the idea of maximum within-precinct miscount (WPM), which is taken to be 20%. WPM is seen as the maximum level of discrepancy that would go unnoticed by observation. Assuming that the observed, but incorrect, margin of victory (MoV) is the same in all miscounted precincts as in the overall election, a precinct is selected (according to one of the dispersal scenarios) and either 20% of the precinct's total votes are switched from Candidate B to Candidate A or 20% of the precinct's total are assumed to be uncounted and should have gone to Candidate A. This is repeated in additional precincts until enough votes have been changed to give Candidate A more than 50% of the total vote.

Under any combination of miscount mechanism and dispersal scenario, there are M counties statewide. For each county i = 1, …, M, let k_i denote the number of precincts the county is required to audit, n_i the total number of precincts in the county, and x_i the number of miscounted precincts in the county. Once the precincts in which miscounts occur have been determined, risk can be computed as the following:

Risk = P[all miscounts undetected]
     = ∏_{i=1}^{M} P[all miscounted precincts in the ith county undetected | x_i miscounted precincts]
     = ∏_{i=1}^{M} f(0 | n_i, x_i, k_i),
where f is the hypergeometric probability mass function (pmf). The precinct selection process is a case of sampling without replacement from a dichotomous population, a situation that the hypergeometric distribution describes. This hypergeometric probability will be unity when there are no miscounted precincts in a county, and the probability decreases with an increasing number of precincts with miscounts; more miscounted precincts translates to lower risk. The hypergeometric result is a consequence of the proposed sampling procedure, regardless of how miscounts are dispersed.

Figure 2. Risk for Iowa's proposed election audit procedure, for various observed margins of victory. The horizontal dotted line indicates a risk of 0.01.

Results

In the case of random miscount dispersal, there is a distribution of risk because a different random selection of precincts will lead to a different value of risk. The 99th percentile of that distribution is shown in Figure 2, which also shows that the proposed legislation varies in its effectiveness with large discrepancies as the observed MoV decreases. In fact, when the MoV is 0.5% and switches occur in the largest precincts, the risk is more than 70%, versus 20%
for randomly occurring switches. At this MoV, the undercount mechanism with random dispersal has a risk of 2%. For increasing MoV, there is better agreement among all but the most extreme scenario. When MoV exceeds 2%, the risk in the remaining three scenarios is below 0.01 (marked by the horizontal dotted line). However, when random miscounts occur and MoV increases, the sampling procedure becomes increasingly inefficient (in terms of the number of votes counted), since the risk begins to fall far below 0.01. The procedure is inefficient, but has low risk for the undercount scenarios when MoV exceeds 2.0%. As Figure 2 shows, the vote-switching scenario executed in the largest precincts represents a worst-case scenario.
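As a concrete illustration of the risk calculation above, the Python sketch below multiplies county-level hypergeometric probabilities of missing every miscounted precinct. The county sizes and miscount placements are invented for illustration; they are not the 2006 Iowa data used in the assessment.

# Risk = product over counties of P(the county's audit sample contains no miscounted precinct),
# i.e., the product of hypergeometric pmf values f(0 | n_i, x_i, k_i).
# The county configurations below are invented, not the 2006 Iowa election data.
from scipy.stats import hypergeom

# (n_i precincts in county, x_i miscounted precincts, k_i precincts audited)
counties = [
    (180, 3, 4),   # hypothetical large county auditing four precincts
    (60, 1, 3),    # hypothetical mid-sized county auditing three
    (20, 0, 2),    # small county with no miscounts
    (12, 1, 2),
]

risk = 1.0
for n_i, x_i, k_i in counties:
    risk *= hypergeom.pmf(0, n_i, x_i, k_i)   # chance the audit misses all x_i bad precincts

print(f"risk of detecting no miscounted precinct anywhere: {risk:.3f}")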
A Risk-Limiting Procedure

As the previous section's numerical assessment illustrates, Iowa's proposed audit procedure would have large risk for small MoV, and the procedure becomes inefficient when the MoV is large. An alternative procedure is a statistical audit with greater efficiency (SAGE). The procedure, which is also known as the negative exponential (NegExp) method, accounts for varying precinct sizes by assigning a unique probability of audit selection for each precinct. For a race with N total votes cast and n_i votes cast in precinct i, the probability p_i of selecting precinct i for audit is

p_i = 1 - r^(e_i / (N·MoV)).

The desired risk r is specified beforehand for the entire contest. The quantity e_i is an error bound specific to each precinct and reflects the assumed miscount mechanism. The vote-switching and undercount mechanisms are again investigated. Appropriate error-bound values for these two mechanisms, respectively, are

e_{i,s} = 2n_i(WPM) and e_{i,u} = n_i(WPM).

The NegExp procedure's potential is evaluated with another numerical assessment. For varying MoVs, NegExp is applied for a statewide race with turnout like that of the 2006 general election and WPM of 20%. Figure 3 displays the expected number of precincts sampled under NegExp, with the two miscount mechanisms compared to that of the fixed count under the proposed legislation. The expected number of precincts can serve as a proxy for the cost of the audit; thus, the NegExp procedure has a cost that depends on MoV. The actual NegExp sampling procedure also was simulated county by county for MoVs of 0.5% and 2% under the vote-switching mechanism. As Figure 3 shows, the expected number of precincts sampled statewide is 322 for a 0.5% MoV and 89 for a 2% MoV. Since elections are administered by county, the breakdown of precincts audited by county is also important. Figures 4 and 5 display the median number of precincts sampled in each county in these two situations. The NegExp procedure would place a larger burden than the proposed legislation on the state's largest counties, most notably Polk County. At the same time, smaller counties would have reduced cost, especially for larger MoV.

Figure 3. Expected precincts sampled in negative exponential. Horizontal dotted line represents number of precincts sampled under Iowa's proposed legislation.

Figure 4. Median precincts sampled in negative exponential with margin of victory 0.5%

Figure 5. Median precincts sampled in negative exponential with margin of victory 2%
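To make the NegExp selection rule concrete, the sketch below computes p_i = 1 - r^(e_i/(N·MoV)) under the vote-switching error bound for a handful of precincts. The precinct sizes and parameter values are invented for illustration; only the formulas come from the description above.

# NegExp selection probabilities p_i = 1 - r ** (e_i / (N * MoV)),
# with the vote-switching error bound e_i = 2 * n_i * WPM.
# Precinct sizes and parameter values are invented for illustration.
precinct_sizes = [1200, 800, 450, 450, 300, 150]   # votes cast n_i in each precinct
N = sum(precinct_sizes)                            # total votes in the contest
MoV = 0.02                                         # observed margin of victory (2%)
WPM = 0.20                                         # maximum within-precinct miscount
r = 0.10                                           # desired risk for the whole contest

selection_probs = []
for n_i in precinct_sizes:
    e_i = 2 * n_i * WPM                            # vote-switching error bound
    p_i = 1 - r ** (e_i / (N * MoV))
    selection_probs.append(p_i)
    print(f"precinct with {n_i:4d} votes: selection probability {p_i:.3f}")

print(f"expected number of precincts audited: {sum(selection_probs):.2f}")

In this toy example the largest precincts are selected with near certainty while the smallest face a much lower selection probability, which is the behavior behind the heavier burden NegExp would place on large counties.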
Summary Iowa’s proposed post-election audit legislation includes provisions for escalation if a county’s recount reveals a substantial discrepancy. In addition, the procedure is comprehensive in including all ballots in the selection process. A numerical assessment finds that the risk may be substantial for close races and the procedure would be inefficient, but have low risk, if the race is not close. The introduction of comprehensive post-election audit legislation in Iowa is an encouraging development for electoral integrity, as is the recent ASA Board of Directors’ recommendation for widespread implementation of risk-limiting audits. We advocate the further development of risk-limiting audits in Iowa and interaction between statisticians and election officials.
Further Reading

American Statistical Association. 2010. Statement on risk-limiting post-election audits. www.amstat.org/outreach/pdfs/Risk-Limiting_Endorsement.pdf.

Ash, Arlene, and John Lamperti. 2008. Florida 2006: Can statistics tell us who won Congressional District 13? CHANCE 21(2):18–24.

Ash, Arlene, and Philip B. Stark. 2009. Thinking outside the urn: Statisticians make their marks on U.S. ballots. Amstat News 384:37–40.

Aslam, Javed A., Raluca A. Popa, and Ronald L. Rivest. 2008. On auditing elections when precincts have different sizes. In USENIX/ACCURATE Electronic Voting Technology Workshop.

ElectionAudits.org. Principles and best practices for post-election audits. www.electionaudits.org/principles.

Hargrove, Thomas. Too many votes go uncounted. www.ejfi.org/Voting/Voting-8.htm.

McCarthy, John, Howard Stanislevic, Mark Lindeman, Arlene S. Ash, Vittorio Addona, and Mary Batcher. 2008. Percentage-based versus statistical-power-based vote tabulation audits. The American Statistician 62:11–16.
Election Audits in Florida Linda J. Young and Dan McCrea
Florida's first election audit was in response to a crisis. On March 7, 2006, a small election involving just three municipalities was conducted in Pinellas County. When counting the votes, memory cartridges failed; they indicated they were "already read" when they were not. In spite of multiple attempts to restart the count, the system continued to fail for more than two hours. One candidate said he had been unable to vote. On March 10, the election was certified.

Subsequently, in response to concerns, an election audit was conducted April 11–12 by David Drury, chief of the Florida Bureau of Voting Systems Certification (BVSC); Richard Harvey of BVSC; and Paul Lux, assistant supervisor of elections from Okaloosa County, Florida. When asked what the protocol would be, Drury responded, "There are no auditing standards for Florida elections. … We are on a learning curve here." Consequently, this first audit was impromptu and lacked transparency. Its goal was to determine how many votes were counted, not whether machines accurately recorded votes. It demonstrated the need for standards for election audit goals, procedures, and transparency.

Eight months later, in the November 2006 general election, Sarasota County, Florida, recorded 18,000 undervotes in the Congressional District 13 race. An undervote occurs when a voter makes no selection in a given race. The 18,000 votes represented almost 15% of the voters who went to the polls and cast ballots in the Democratic-leaning county. The rest of District 13 contributed about half the vote, but had about 5,000 undervotes. The Republican candidate had 369 more votes than the Democratic candidate. Arlene Ash and John Lamperti conducted a statistical analysis of the election and concluded in a CHANCE article "beyond any reasonable doubt"
that the voters wanted the Democrat, not the Republican, to represent the 13th District. However, no precedent exists for using statistical analyses to overturn elections. Despite there being two separate contest-of-election lawsuits filed in the District 13 race and a year-long investigation conducted by Congress, no definitive cause for the anomaly could be found. The lack of paper ballots and post-election audit procedures made a meaningful audit impossible and led to the Republican representing District 13 in Congress. Following these and other irregularities, Florida passed the “paper-trail” legislation in 2007. The law requires that, with the exception of voters with disabilities, all votes be cast using paper ballots by July 2008. In the same bill, it passed a modest audit provision. Audits are legislated to be conducted in each county at least seven days after the county election results are certified to the state. Each county randomly selects one race from all races that appeared on any of its ballots. For the selected race, 2% of the county’s precincts in that race are randomly selected to be hand counted and compared to their machine counted counterparts. Each county’s audit attempts to resolve any discrepancies found between the manual and machine counts and reports the final nonbinding results to the state. The Florida audit has several strengths. First, it exists, making Florida one of only 18 states that conduct manual audits. The Florida audit provides for substantial transparency, allowing the public to closely observe and verify the proceedings. The audit includes all ballots—including those cast on Election Day and in early, absentee, and provisional voting—in the selected race and precincts. Ballots cast on both
optical scan and direct recording electronic machines (DRE touch screens, still allowed for persons with disabilities until 2012) are audited. Finally, random selection of both races and precincts is required. The Florida election audit is far from perfect. Each county is compartmentalized in the audit. No race beyond the size of a county race is effectively audited. Even though a larger race, such as the presidential, gubernatorial, or U.S. House of Representative district election, may be selected by one or more counties, they will not be audited across their full jurisdictional boundaries, making it impossible to audit the race effectively. The audit is post-certification. It must be reported by seven days after the county certifies its results to the state. The contest period expires the 10th day post-certification. Thus, only a small window of time is available for an aggrieved party to contest the election but, because the election is already certified, court is the only option. The sample sizes are fixed at 2% of the county’s precincts in that race. If the race is close, the sample size is smaller than needed to guard against certifying an incorrect outcome. If the race is not close, the sample size is larger than needed to ensure certifying a correct outcome, resulting in a waste of resources. The Florida election audit lacks fundamental integrity and uniformity. The state provides audit forms that separate overvotes, undervotes, and questionably marked ballots from properly voted ballots. However, it is not clear whether county canvassing boards should adjudicate the questionably marked ballots at
the time of the audit. In many cases, the questionably marked ballots are simply reported with no further action. The procedures prescribed in the Florida election audit lack a number of important elements associated with a rigorous audit. The ballot accounting and ballot chain of custody security are not adequately checked. No criteria defining what level of discrepancy should trigger an escalation of the audit or terms for that escalation are specified. No provisions are provided for targeted samples, sometimes called challenge audits, initiated either by election officials, candidates, or members of the public who have standing. Nothing specific is done with the results; no protocol is present for correcting results, reporting problems, or making recommendations. Finally, costs are neither tracked nor analyzed for efficiency. There is fairly wide recognition among the state legislators that the Florida audit bill needs to be improved. During March 2008, a meeting was held to discuss changes with Florida’s secretary of state, Kurt Browning. During the discussion, Browning mentioned he had two statistics courses during his graduate study and then declared, “I hate statistics,” a sentiment he repeated throughout the meeting. This illustrates how important it is to have modern, engaging statistical service courses at both the graduate and undergraduate levels. If future policymakers come to dislike statistics during the course of their education, it is much more difficult to get their support for programs with a strong statistical foundation when they are in an influential position.
Following the 2008 presidential election, Florida participated with other states in observing the election audit. With the help of the League of Women Voters, 55 volunteers observed the election audits in 37 of Florida's 67 counties. For the 51,921 recorded machine votes in the counties observed, the manual count was 51,871, an absolute change of 50. Undervotes ranged from 2 to 58, with a net change of 7, representing 14% of the absolute change in the count. Overvotes ranged from 2 to 58, with a net of 8 overvotes. The absolute total of all changes was 184.

In summary, the good news is that Florida has an election audit law. The bad news is that Florida's election audits are not based on a scientific foundation and are too weak to be effective. If, as in the 2000 election, a presidential race is close, the audits associated with that race will not provide assurance to the citizens of Florida or the nation that the certified outcome reflects the intent of the voters. When the audit law was passed in 2007, there was general recognition that it needed to be improved. The current secretary of state opposes using any statistical foundation for such a bill, and legislators have been reluctant to move forward without his support.
Further Reading

Ash, Arlene, and John Lamperti. 2008. Florida 2006: Can statistics tell us who won Congressional District 13? CHANCE 21(2):18–27.

Voting Integrity Alliance of Tampa Bay. 2006. Florida: Report of Pinellas County 'audit.' www.votetrustusa.org/index.php?option=com_content&task=view&id=1200&Itemid=113.
Road Crashes and the Next U.S. Presidential Election Donald A. Redelmeier and Robert J. Tibshirani
The United States contains some of the world's most dangerous roads, accounting for about 4% of the total global road deaths and 4% of the total global population. This pattern differs from other industrial countries such as Australia (0.16% of road deaths, 0.32% of global population), Canada (0.27% of road deaths, 0.46% of global population), and Germany (0.65% of road deaths, 1.21% of global population). The shortfall in U.S. road safety is a new issue, since American roads were considered the safest in the world 50 years ago. The shortfall is also distinctive (given the country's leadership in drug and aviation safety) and not always recognized (sometimes obscured by ratios that frame risks as deaths per vehicle or other unusual metrics). The large American losses from road deaths cannot be blamed on any one person or group. Each crash naturally involves drivers, and driver error contributes to about 93% of events. The U.S. legal profession also finds fault with vehicle manufacturers, tire companies, road engineers, and other large corporate industries. Activist groups emphasize anomalous individual behavior and the need for better regulations and
enforcement. The medical profession generally focuses on acute resuscitation and directs less attention to crash prevention. Motor vehicle enthusiasts, in contrast, downplay the entire issue, arguing that the benefits from travel greatly outweigh the risks and that motor vehicles are designed much safer than horses and wagons of earlier times. We sought to call attention to the need for more U.S. road safety using a recurring event of national importance. Because road deaths are a daily occurrence, one paradox is that this epidemic seems invisible, unless a small surge allows people to realize the enormous baseline numbers (unlike marathon deaths, where a single fatality garners national attention due to a near-zero baseline). The banality of commuter driving also contributes to the misconception that most drivers are knowledgeable and would gain nothing from science. Together, these attitudes create a sense of complacency that might permeate to the highest levels of government. The purpose of our research was to identify some road deaths that could be linked to the president of the United States. The U.S. president, of course, has a huge agenda and multiple competing
issues, so road safety may not always be a high priority. The American Constitution, furthermore, often assigns transportation policy to state, regional, and municipal governments. The adversarial relations between manufacturers, insurers, and other private sectors sometimes lead to policy stalemates, despite all good intentions. Road safety changes, as well, are almost impossible to introduce as randomized experiments, so that determining success or failure is difficult, even with an infinite sample size and endless follow-up. The average American president, therefore, has no way of knowing how much more attention to direct toward road safety. We sought to offer a new rationale for the next American president to address the shortfall in U.S. road safety. Doing so required adapting our prior methods for studying road deaths related to championship football games and organized marathon races. The strength of this approach was to provide rigorous national data relatively free of confounding, despite a before-and-after design. We published our results as a brief article in the Journal of the American Medical Association, showing how more traffic can lead to more traffic deaths.
Figure 1. Line graph showing count of total number of individuals in fatal road crashes during U.S. presidential elections over a 32-year time span Note: Sample depicts nine total occasions (spaced every four years), along with corresponding control days (spaced one week immediately before and after Election Day). Line segments join data for Election Day (solid line), control day before election (dotted line), and control day after election (dashed line). Results show single most dangerous day in 1992 (an Election Day), single safest day in 2008 (a control day), before control day and after control day tied on one occasion in 1996, and a large range of variability over time. Overall, the before control is safer than the after control on three of eight occasions and more dangerous than the after control day on five of eight occasions. In total, Election Day is the most dangerous day on six of nine occasions and the least dangerous day on one of nine occasions.
The goal of this article is to describe the methods in greater detail, update the analyses with the latest election, and review our experiences dealing with popular media following publication.
Analysis

President Barack Obama was elected on November 4, 2008, with the highest absolute voter turnout in history (131 million in 2008 compared to 126 million in 2004). Hereafter, we call this date "Election Day." For comparison, we also examined October 28, 2008 (one week before), and November 11, 2008 (one week after). Hereafter, we call these dates "control days." The total U.S. population at the time was about 304 million. The exact hours of polling varied in different regions, were not recorded in a simple manner, and typically ran from about 8:00 a.m. to 7:59 p.m. local time. Hereafter, we call this 12-hour interval "polling hours." Note that the same 12 hours can be designated for control days to ensure identical time intervals appear in all analyses.
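As a concrete illustration of how the study days could be enumerated, the sketch below (plain Python; an illustration under our reading of the design, not the authors' code) generates the nine Election Days covered by the FARS era, 1976 through 2008, along with their control days and the shared polling-hour window.

from datetime import date, timedelta

def election_day(year: int) -> date:
    """U.S. presidential Election Day: the first Tuesday after the first Monday in November."""
    d = date(year, 11, 1)
    while d.weekday() != 0:              # 0 = Monday; find the first Monday of November
        d += timedelta(days=1)
    return d + timedelta(days=1)         # the Tuesday that follows it

# Nine elections fall within the FARS era (1976 through 2008, every fourth year).
# Control days are the same weekday exactly one week before and after, and the same
# 12-hour polling window (8:00 a.m. to 7:59 p.m. local time) is applied to every day.
POLLING_HOURS = range(8, 20)             # crash hours 8 through 19, inclusive

for year in range(1976, 2009, 4):
    e = election_day(year)
    print(year, e, e - timedelta(days=7), e + timedelta(days=7))

For 2008 this yields November 4, October 28, and November 11, matching the dates above.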
We next retrieved road safety information for the entire United States from the National Highway Traffic Safety Administration (NHTSA). This continues to be a major resource, dating back to 1975 (President Gerald Ford), and provides rigorous data for all fatal crashes on public roadways throughout the nation. In our research, we set the unit of analysis as the person, even though this violates the statistical assumption of fully independent observations (since people are nested within vehicles and vehicles nested within crashes). Information on the day and hour of each crash is considered extremely accurate, with little missing data ( 0.20). No single U.S. state showed a statistically significant decrease in risk. The relative risk was similar during morning, afternoon,
and evening polling hours, as well as crashes that did and did not involve alcohol. The risk was similar, regardless of whether a Democrat or Republican was elected. The risk was increased, regardless of whether the person was a driver, passenger, or pedestrian. Finally, the risk was similar for those known and not known to be dead. The NHTSA data set also makes it easy to explore other time intervals outside of polling hours. Analyzing the eight hours before polling began yielded a total of 352 individuals in fatal crashes over the nine election days, 767 over the
18 control days, an odds ratio of 0.91 (95% confidence interval 0.81 to 1.04, p = 0.18), and an absolute saving of about three individuals. Analyzing the four hours after polling ended yielded a total of 275 individuals in fatal crashes over the nine election days, 705 over the 18 control days, an odds ratio of 0.78 (95% confidence interval 0.68 to 0.90, p = 0.002), and an absolute saving of about nine individuals. Analyzing the full days immediately before and after the election showed no significant savings or losses compared to controls.
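These subsidiary figures can be reproduced almost exactly with a simple rate-ratio calculation and a normal approximation on the log scale (nine election days versus 18 control days). The sketch below is only a check of the arithmetic, not the authors' exact odds-ratio computation.

import math

def rate_ratio(cases_election, cases_control, days_election=9, days_control=18, z=1.96):
    """Ratio of per-day crash-involvement rates with a log-scale normal 95% CI."""
    rr = (cases_election / days_election) / (cases_control / days_control)
    se_log = math.sqrt(1 / cases_election + 1 / cases_control)
    ci = (rr * math.exp(-z * se_log), rr * math.exp(z * se_log))
    saved_per_election = cases_control / days_control - cases_election / days_election
    return rr, ci, saved_per_election

# Eight hours before polls opened: reported 0.91 (0.81 to 1.04), about three saved.
print(rate_ratio(352, 767))
# Four hours after polls closed: reported 0.78 (0.68 to 0.90), about nine saved.
print(rate_ratio(275, 705))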
Aftermath

Since publishing our results, we have been approached by several journalists who posed some common questions. One of the most personal has been to explain our fascination with the United States given that we were each born and raised in Canada. When we published our article, however, both Canada and the United States were organizing national elections within four weeks of each other. Moreover, we believe U.S. culture has a major influence globally, so that policy efforts in America might also persuade world leaders in other countries toward more road safety. No other country, as well, has shown the initiative to share raw data about road safety freely for global science. A few curious journalists checked our backgrounds and asked probing questions about ulterior motives that might lurk behind our interest in road crashes. This caused us to reflect on more potential reasons why the general public often neglects road dangers: denial, misperception, and myopia. Denial occurs because driving is often essential for employment, recreation, and many other elements of quality of life in America. Misperceptions arise around the meaning of small probabilities since the average vehicle trip does not result in death for the driver. Myopia is a third reason since the daily toll of road deaths is dispersed over a huge country, so few people see the big picture (unlike a single airplane crash occurring in one location). Another frequent question was about who funded our research. Overall, the work cost minimal amounts, aside from our time (which was already covered by our academic positions). In contrast, the last U.S. presidential election involved about $1 billion in campaign financing, as counted by the Federal Election Commission. An unknown proportion of this influx of money was dispersed to nonmedical journalists and other members of the media through advertising revenues that support their businesses. In
a world sometimes cynical about financial conflicts of interest, we worried that campaign finances create tension with a media striving for objectivity, yet receiving revenues and other perks from campaign spending. Many questions called for speculation about what explained the increase in risk. The obvious possibilities included more driving, mobilization of unfit drivers, heightened emotions, and unfamiliar pathways. A more subtle contributor could be decreased enforcement, since the police, themselves, are also busy voting and do not want to be seen as tampering with the democratic process. Another contributor might be rushing, as responsible citizens try to squeeze one more duty into a hectic schedule and make up for time lost in long voting queues. The last two contributors are distinctly different from factors observed during holidays and explain why the surge in deaths during U.S. presidential elections exceeds the risk on New Year's Eve in the United States over the same years. Several countermeasures are available to mitigate the increase in risk. Three simple measures are public safety messaging by electioneers who encourage people to vote, subsidized public transit, and tamper-proof remote voting. Photoradar, red-light cameras, and other automatic enforcement technologies are also an option popular in other countries. Changes in roadway design, vehicle technology, and additional engineering improvements have further potential to reduce roadway deaths. These latter two countermeasures could also lead to more safety gains every day. None of this would require a radical change in the democratic process. Some critics questioned whether our data indicated that rational adults should not vote in a U.S. presidential election. The observed increased risk of death suggests that perhaps 25 potential fatalities might be blamed on the next U.S. presidential election. This absolute risk (about 1 in 10 million) is 10 times greater than
the probability of casting a pivotal vote in the election (about 1 in 100 million). Moreover, the increased risks likely extend to multiple other crashes that result in permanent disability, property damage, and lost time. Thus, a crude analysis might suggest the risks exceed the benefits for an average voter. Evidently, the reason why people vote must reflect a more enlightened perspective. All U.S. presidents during our study gave victory speeches marked by gratitude and humility. We are not aware of a time, however, when the president mentioned the road deaths caused by the vote. Our study indicates that an elected U.S. president owes a greater debt to the American people than generally recognized. The data indicate this is a bipartisan failure, since we found similar risks for both Democratic and Republican victories. Our research does not indicate which candidate could best help save the 100 lives lost each day in fatal crashes in the United States. Instead, the findings suggest that the U.S. president could give more thought to the 150,000 American road deaths anticipated during the next presidential term.
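A quick back-of-envelope check of that comparison, using only the round figures quoted in the text (the 25 excess deaths, the roughly 304 million residents, and a pivotal-vote probability of about one in 100 million):

extra_deaths_per_election = 25            # excess fatalities attributed to Election Day (from the text)
us_population = 304e6                     # approximate 2008 U.S. population (from the text)
pivotal_vote_probability = 1e-8           # about 1 in 100 million (Mulligan and Hunter)

added_risk = extra_deaths_per_election / us_population   # about 8e-8, i.e., roughly 1 in 10 million
print(f"added fatality risk per resident: {added_risk:.1e}")
print(f"ratio to pivotal-vote probability: {added_risk / pivotal_vote_probability:.0f}x")  # order of 10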
Further Reading

Evans, L. 2004. Traffic safety. Bloomfield Hills, Michigan: Science Serving Society.

Fatality Analysis Reporting System. Washington, DC: National Highway Traffic Safety Administration. www-fars.nhtsa.dot.gov.

Mulligan, C. B., and C. G. Hunter. 2003. The empirical frequency of a pivotal vote. Public Choice 116:31–54.

Redelmeier, D. A., and R. J. Tibshirani. 2008. Driving fatalities on U.S. presidential election days. Journal of the American Medical Association 300:1518–1520.

World Health Organization. 2004. World report on road traffic injury prevention. www.who.int/violence_injury_prevention/publications/road_traffic/world_report/en.
Career Pitching Statistics and the Probability of Throwing a No-Hitter in MLB: A Case-Control Study David McCarthy, David Groggel, and A. John Bailer
In the game of baseball, events such as hitting for the cycle, triple plays, and no-hitters rarely occur. When they do, those present witness something special. Such events have been studied frequently and in a variety of ways. The no-hitter is achieved when the pitcher pitches for a whole game and the opposing team's members do not record a single hit. This occurs, on average, in about two of the more than 2,430 games per season. There is great variability in the quality of pitchers who have managed to throw a no-hitter. Bud Smith had a brief and unimpressive career, yet managed to throw one. Greg Maddux, who is destined for the Baseball Hall of Fame in Cooperstown, New York, was unable to throw a no-hitter during his brilliant career. Such results raise the question: Are those pitchers with superior skills more likely to throw a no-hitter, or is the event a "fluke" that is not skill-dependent? Also, can factors be identified that are associated with throwing no-hitters?
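To give a rough sense of the scale implied by those figures (about two no-hitters in roughly 2,430 games each season, with two starting pitchers per game), a quick calculation using only the numbers quoted above:

no_hitters_per_season = 2          # approximate long-run average cited in the text
games_per_season = 2430            # regular-season MLB games
starter_opportunities = 2 * games_per_season   # each game gives two starters a chance

print(f"per game:         {no_hitters_per_season / games_per_season:.5f}")       # about 0.00082
print(f"per starter-game: {no_hitters_per_season / starter_opportunities:.5f}")  # about 0.00041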
Characteristics That Might Influence Pitching a No-Hitter

Nine career pitching statistics were used as predictors for the probability of throwing a no-hitter. All variable names, along with a brief description, are given in Table 1. Most variables are defined in terms of percentages or ratios, which tends to allow younger and older pitchers to be compared. Ideally, the number of innings pitched could be replaced by the number of innings pitched per start, but information concerning the number of innings pitched for starting and relief appearances is not separately available in an easily accessed data source.
Table 1—Name and Description of the Nine Pitching Variables

Number of Games Started: Total number of games in which a pitcher was the starting pitcher for his side.
Winning Percentage: Total number of games won, divided by the total number of games in which a pitcher received a decision (either a win or a loss).
Earned Run Average (ERA): Total number of earned runs allowed, multiplied by 9 and divided by the total number of innings pitched; small values are best.
Complete Game Percentage: Total number of games completed (nine innings or more), divided by the total number of games started; a complete game is required to have a no-hitter.
Shutout Percentage: Total number of complete games in which no runs (earned or unearned) were allowed, divided by the total number of games started.
Number of Innings Pitched: Total number of innings a pitcher was given credit for completing, with the completion of an inning represented by recording three outs.
Opponents' Batting Average Against: Total number of hits allowed, divided by the total number of opponent at-bats.
Number of Strikeouts per 9 Innings Pitched: Total number of strikeouts, multiplied by 9 and divided by the total number of innings pitched.
Number of Hits per 9 Innings Pitched: Total number of hits allowed, multiplied by 9 and divided by the total number of innings pitched.
Building an Analysis Data Set

All 97 pitchers who threw at least one no-hitter from the start of the 1960 season through the end of the 2008 season were included as cases. Pitchers were divided into three eras: 1960–1968, 1969–1989, and 1990–2008. Each era was different either in terms of offensive production, construction and use of pitching staffs, or both. For instance, after the 1968 season, the pitching mound was lowered due to dominant performances by pitchers such as Bob Gibson. This change ultimately helped increase offensive production. Also, beginning in the 1980s, it became more common to use relief pitchers (many of whom began to develop special titles such as "long reliever" or "closer")
and less common for starting pitchers to complete games, a necessary condition for a no-hitter to be thrown by a single pitcher. Finally, the 1990s and the new millennium gave rise to steroids and a greater emphasis on limiting pitch counts for starters, which increased offensive production even more while decreasing the average number of innings pitched by starters. For each era, a pitcher who had thrown a no-hitter (a case) was matched with four pitchers from that same era who had not (controls). Those pitchers who had not thrown no-hitters were selected from the list of the league leaders in number of games started for the year in which the pitcher they were being matched with had thrown his first (and often only) no-hitter, beginning with the league leader
and going down. For example, the pitcher who threw the first no-hitter in 1960 was matched with the top four pitchers from the list of the league leaders in number of games started for 1960 who had not thrown a no-hitter, and then the pitcher who threw the next no-hitter in 1960 was matched with the next four pitchers from this same list who had not thrown a no-hitter, etc. Also, no pitcher was used more than once in the data set. This matching scheme was used to help control for changes over time in offensive production, construction and use of pitching staffs, and other factors that have not remained constant in Major League Baseball throughout the last 50 years. This scheme also meant the control group was composed primarily of starting pitchers, many of whom are considered to be some of the best pitchers of the last 50 years. Finally, this particular scheme guaranteed that many current pitchers were included in the data set, some of whom may throw a no-hitter before their career is over. The data set includes 485 pitchers, 97 of whom have thrown a no-hitter. All pitchers who threw at least one no-hitter from 1960–2008 were identified using Wikipedia.com (accessed on July 8, 2009), and its accuracy was checked by cross-referencing it with information from MLB.com (accessed on July 8, 2009). No discrepancies were found, thus all relevant pitchers were identified. The career statistics for all pitchers were obtained from MLB.com. A table listing the pitchers in the data set by year and match status can be viewed at www.amstat.org/publications/chance/supplemental.
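The matching rule just described can be sketched as a short routine. The data structures here (dictionaries keyed by "name", "first_no_hitter_date", and "threw_no_hitter", and a per-year league_leaders_in_starts list ordered by games started) are hypothetical stand-ins, not the authors' actual data or code.

def match_controls(no_hitter_pitchers, league_leaders_in_starts, k=4):
    """For each case (a pitcher's first no-hitter), take the next k unused pitchers
    from that year's league leaders in games started who never threw a no-hitter;
    no pitcher appears more than once in the data set."""
    used = {case["name"] for case in no_hitter_pitchers}
    matches = {}
    for case in sorted(no_hitter_pitchers, key=lambda p: p["first_no_hitter_date"]):
        year = case["first_no_hitter_date"].year
        controls = []
        for candidate in league_leaders_in_starts[year]:   # ordered by games started
            if candidate["name"] in used or candidate["threw_no_hitter"]:
                continue
            controls.append(candidate)
            used.add(candidate["name"])
            if len(controls) == k:
                break
        matches[case["name"]] = controls
    return matches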
Modeling the Odds of Pitching a No-Hitter

Conditional logistic regression, as described in Modelling Binary Data, was used to model the odds of throwing a no-hitter given particular pitcher attributes. This was implemented using PROC PHREG in SAS. A 1:4 matching scheme was employed in this case-control study design, since each pitcher who has thrown a no-hitter is matched with four pitchers who have not. All nine predictor variables were standardized to put them on the same scale and make the results comparable.
Table 2—Career Pitching Statistics by Whether a No-Hitter Was Thrown
(Q1 = lower quartile, Q3 = upper quartile)

Variable (threw no-hitter?)                  Mean     SD      Min     Q1      Median   Q3      Max
Starts (Yes)                                 308      171     18      170     300      412     773
Starts (No)                                  246      134     28      145     231      314     756
Winning Percentage (Yes)                     0.53     0.07    0.33    0.50    0.53     0.58    0.77
Winning Percentage (No)                      0.50     0.07    0.31    0.46    0.51     0.54    0.69
Earned Run Average (Yes)                     3.71     0.51    2.76    3.37    3.65     3.96    5.56
Earned Run Average (No)                      3.94     0.49    2.75    3.62    3.92     4.22    5.54
Complete Game Percentage (Yes)               0.20     0.12    0.01    0.11    0.18     0.29    0.57
Complete Game Percentage (No)                0.17     0.11    0.00    0.08    0.17     0.25    0.50
Shutout Percentage (Yes)                     0.06     0.03    0.01    0.04    0.05     0.08    0.13
Shutout Percentage (No)                      0.04     0.03    0.00    0.02    0.04     0.06    0.11
Number of Innings Pitched (Yes)              2206     1249    99      1325    2069     2899    5404
Number of Innings Pitched (No)               1745     922     205     1055    1611     2190    5282
Opponents' Batting Average Against (Yes)     0.251    0.015   0.204   0.241   0.251    0.261   0.284
Opponents' Batting Average Against (No)      0.259    0.015   0.212   0.250   0.259    0.269   0.299
Strikeouts per Nine Innings Pitched (Yes)    6.04     1.27    3.15    5.15    5.96     6.63    10.67
Strikeouts per Nine Innings Pitched (No)     5.56     1.22    2.83    4.76    5.57     6.24    10.08
Hits per Nine Innings Pitched (Yes)          8.55     0.66    6.56    8.10    8.55     9.02    9.90
Hits per Nine Innings Pitched (No)           8.90     0.67    6.88    8.46    8.88     9.32    10.75

Note: For each career pitching variable, the "Yes" rows summarize the 97 pitchers who threw a no-hitter, whereas the "No" rows summarize the 388 pitchers who did not throw a no-hitter.
Thus, a one-unit change in a particular variable is now a change of one standard deviation for that particular variable. Additional models corresponding to the three eras were run along with the "full" model that included all 485 pitchers and all three eras. This was done to investigate whether there were differences between the three eras in which variables were important in predicting the probability of throwing a no-hitter. Since the eras were different in terms of offensive production and the way in which pitching staffs were used, it seemed reasonable to think of these as three distinct groups of pitchers. Specifically, single predictor models were used initially to model the effect of each of the variables separately. These single predictor models were run using
the entire data set and for each era. Multiple predictor models were then used to investigate the effect of several variables on the odds of throwing a no-hitter. Ultimately, the “final” multiple predictor models for the entire data set and each era were determined by using the technique of backward elimination.
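The modeling step can be sketched in Python as follows. This is an illustrative translation, not the authors' code: it assumes statsmodels' ConditionalLogit (a conditional logit for matched sets) as a stand-in for SAS PROC PHREG, and the column names (no_hitter, match_set, and the nine predictor columns) are hypothetical.

import numpy as np
import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

PREDICTORS = ["starts", "win_pct", "era", "cg_pct", "sho_pct",
              "innings", "opp_avg", "k_per9", "h_per9"]

def fit_matched_logit(df: pd.DataFrame, cols=PREDICTORS):
    """Fit a conditional logistic regression on the 1:4 matched sets.

    df is assumed to contain one row per pitcher, a 0/1 'no_hitter' indicator,
    a 'match_set' identifier shared by each case and its four controls, and the
    career statistics listed in PREDICTORS.
    """
    # Standardize so a one-unit change equals a one-SD change, as in the article.
    X = df[cols].apply(lambda s: (s - s.mean()) / s.std())
    result = ConditionalLogit(df["no_hitter"], X, groups=df["match_set"]).fit()
    summary = pd.DataFrame({
        "coef": result.params,
        "odds_ratio": np.exp(result.params),
        "ci_low": np.exp(result.params - 1.96 * result.bse),
        "ci_high": np.exp(result.params + 1.96 * result.bse),
        "p_value": result.pvalues,
    })
    return result, summary

Fitting the era-specific models amounts to calling the same function on the subset of rows belonging to each era.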
Comparing Pitchers Who Threw No-Hitters with Their Peers

Table 2 provides a summary of the career pitching statistics for each of the nine pitching variables for all 485 pitchers in the data set, subdivided by whether they threw a no-hitter during their career. For each career pitching variable, the "Yes" summary statistics apply to the 97 pitchers in the data set who threw at least one no-hitter during their career, while the "No" summary statistics apply to the 388 pitchers in the data set who did not throw a no-hitter. Those pitchers in the data set who threw no-hitters were superior, on average, to those pitchers who did not. For all career pitching variables where a larger value indicates a superior pitcher (games started, winning percentage, complete game percentage, shutout percentage, number of innings pitched, and strikeouts per nine innings pitched), the values for the means and quartiles are larger for pitchers who threw no-hitters. Also, for all career pitching variables where a smaller value indicates a superior pitcher
Table 3—Career Pitching Statistics by Era
(Q1 = lower quartile, Q3 = upper quartile)

Variable (era)                                     Mean     SD      Min     Q1      Median   Q3      Max
Starts ('60–'68)                                   285      153     57      167     263      356     756
Starts ('69–'89)                                   249      149     27      138     224      323     773
Starts ('90–'08)                                   251      131     18      156     242      334     740
Winning Percentage ('60–'68)                       0.51     0.06    0.32    0.49    0.52     0.55    0.69
Winning Percentage ('69–'89)                       0.49     0.07    0.31    0.46    0.50     0.54    0.69
Winning Percentage ('90–'08)                       0.52     0.07    0.33    0.47    0.53     0.56    0.77
Earned Run Average ('60–'68)                       3.58     0.38    2.75    3.33    3.58     3.77    5.23
Earned Run Average ('69–'89)                       3.81     0.39    2.86    3.54    3.80     4.03    5.06
Earned Run Average ('90–'08)                       4.19     0.52    2.91    3.85    4.15     4.50    5.56
Complete Game Percentage ('60–'68)                 0.28     0.09    0.11    0.22    0.27     0.31    0.57
Complete Game Percentage ('69–'89)                 0.21     0.08    0.06    0.15    0.20     0.27    0.53
Complete Game Percentage ('90–'08)                 0.08     0.05    0.00    0.04    0.06     0.11    0.27
Shutout Percentage ('60–'68)                       0.07     0.02    0.02    0.05    0.07     0.08    0.13
Shutout Percentage ('69–'89)                       0.05     0.02    0.00    0.04    0.05     0.07    0.12
Shutout Percentage ('90–'08)                       0.02     0.02    0.00    0.01    0.02     0.03    0.07
Number of Innings Pitched ('60–'68)                2122     1117    610     1263    1951     2569    5350
Number of Innings Pitched ('69–'89)                1808     1035    205     1025    1688     2231    5404
Number of Innings Pitched ('90–'08)                1686     875     99      1071    1598     2153    5008
Opponents' Batting Average Against ('60–'68)       0.249    0.013   0.205   0.241   0.249    0.258   0.276
Opponents' Batting Average Against ('69–'89)       0.258    0.015   0.204   0.250   0.260    0.266   0.298
Opponents' Batting Average Against ('90–'08)       0.262    0.016   0.213   0.252   0.263    0.273   0.299
Strikeouts per Nine Innings Pitched ('60–'68)      5.55     1.01    3.15    4.89    5.42     6.15    9.28
Strikeouts per Nine Innings Pitched ('69–'89)      5.08     1.11    2.83    4.31    5.00     5.83    9.55
Strikeouts per Nine Innings Pitched ('90–'08)      6.34     1.19    3.71    5.60    6.24     7.13    10.67
Hits per Nine Innings Pitched ('60–'68)            8.48     0.55    6.79    8.12    8.50     8.84    9.66
Hits per Nine Innings Pitched ('69–'89)            8.83     0.63    6.56    8.45    8.84     9.20    10.61
Hits per Nine Innings Pitched ('90–'08)            9.05     0.71    7.03    8.60    9.04     9.51    10.75

Note: For each career pitching variable, the 1960–1968 rows summarize the 115 pitchers who played during that era, the 1969–1989 rows the 190 pitchers from that era, and the 1990–2008 rows the 180 pitchers from that era.
(earned run average, opponents' batting average against, and hits per nine innings pitched), the values for the means and quartiles are smaller for pitchers who threw no-hitters. There is also significant variation between the two groups for the number of games started and the number of innings pitched variables.
Changes in Pitching Performance Since 1960

The differences in career pitching statistics between pitchers from the three eras also were investigated by examining the summary statistics for each variable. This time, however, there was no partitioning based on whether a pitcher threw
a no-hitter within each era. The results are given in Table 3. For each career pitching variable, the summary statistics apply to the 115 pitchers who played during 1960–1968, the 190 pitchers from 1969–1989, and the 180 pitchers from 1990–2008, respectively. Several pitching variables have changed over time. One of the most notable changes is the rapid decrease in complete game percentage, which is significant given it is the only one of the nine variables that is absolutely necessary (along with starting at least one game) for a no-hitter to occur. Not surprisingly, shutout percentage also has decreased significantly over time. The changes in these two variables are attributable to pitchers now being given specific pitch limits much more often and to a greater reliance on relief pitchers. It is also noteworthy that variables such as earned run average and hits per nine innings have risen over time, which is an indication that offensive production also has risen. This is not surprising, given the emergence of steroids, league expansion, smaller ballparks, etc.
Table 4—Parameter Estimates for the Nine Single Predictor Variable Conditional Logistic Regression Models Using All Pitchers to Predict Throwing a No-Hitter

Variable (SD)                                    Coefficient   Std. Error   P-Value    Odds Ratio   95% CI
Starts (SD = 144)                                 0.44          0.11         0.0001     1.55         (1.24, 1.94)
Winning Percentage (SD = 0.07)                    0.52          0.13         < 0.0001   1.68         (1.30, 2.18)
Earned Run Average (SD = 0.50)                   -0.72          0.16         < 0.0001   0.49         (0.36, 0.66)
Complete Game Percentage (SD = 0.11)              0.75          0.18         < 0.0001   2.13         (1.48, 3.04)
Shutout Percentage (SD = 0.03)                    0.95          0.17         < 0.0001   2.59         (1.87, 3.59)
Innings Pitched (SD = 1012)                       0.47          0.12         < 0.0001   1.60         (1.27, 2.01)
Opponents' Batting Average (SD = 0.02)           -0.67          0.14         < 0.0001   0.51         (0.39, 0.67)
Strikeouts per 9 Innings Pitched (SD = 1.25)      0.49          0.13         0.0002     1.63         (1.26, 2.10)
Hits per 9 Innings Pitched (SD = 0.68)           -0.66          0.14         < 0.0001   0.52         (0.40, 0.68)

Note: Data from 97 pitchers who threw no-hitters and 388 who did not are used. The standard deviation (SD) of each variable is given next to the variable name; coefficients and odds ratios correspond to a one-SD increase.
There also has been a decrease in the number of innings pitched over time, although this is likely due to those pitchers matched to the pitchers who threw no-hitters from 1990–2008 being relatively young and having long careers ahead of them. It is clear that the three eras are different in terms of reliance on starting pitching and offensive production. These changes also give evidence that throwing a no-hitter appears to have become more difficult over time, and therefore it is sensible to investigate each era separately, as well as the data set as a whole.
Predicting the Odds of Pitching a No-Hitter: Single Predictor Variable Results

To provide an initial look at the effect of each of the nine predictor variables on the odds of throwing a no-hitter, separate models were run for each variable. This was done using the overall data set and each of the three eras. The results are given in Tables 4 and 5. As can be seen in Table 4, each of the single predictor variable conditional logistic regression models containing all pitchers produced highly statistically
significant results with the expected signs on the parameter estimates. For example, we would expect a variable such as complete game percentage to have a positive parameter estimate because, intuitively, it is reasonable to believe a pitcher is more likely to throw a no-hitter if he completes a higher percentage of his starts. Most important in the results are the odds ratio estimates for each of the variables. Specifically, the point estimate for the odds ratio indicates the effect that increasing a specific predictor variable one standard deviation has on the odds of throwing a no-hitter. The standard deviations for each variable are given next to the variable name. Thus, the odds ratio for the number of games started variable indicates that if a pitcher has 144 more career starts, then the odds of throwing a no-hitter increase by a factor of 1.55 (or 55%). The 95% confidence interval for the odds ratio associated with the number of games started variable indicates the odds increase by anywhere from a factor of 1.24 to 1.94. Of these nine variables,
the most important sole predictor of the odds of throwing a no-hitter appears to be shutout percentage. When the single predictor variable conditional logistic regression models were run for each variable during each of the eras, several differences between the eras became apparent. As Table 5 illustrates, only three variables (earned run average, opponents' batting average, and hits per nine innings pitched) were significant at the 5% level during 1960–1968. All variables were significant at the 5% level during 1969–1989. All variables except strikeouts per nine innings pitched were significant at the 5% level during 1990–2008. Once again, all signs on the parameter estimates are as expected. The most important sole predictor variables during the last two eras appear to be shutout percentage and complete game percentage, while earned run average appears to be the most important sole predictor variable during the first era.
Table 5—Parameter Estimates for All Single Predictor Variable Conditional Logistic Regression Models for the Three Eras to Predict a No-Hitter

Variable (era, SD)                                         Coefficient   Std. Error   P-Value    Odds Ratio   95% CI
Starts ('60–'68, SD = 153)                                  0.10          0.21         0.64       1.10         (0.73, 1.66)
Starts ('69–'89, SD = 149)                                  0.68          0.19         0.0003     1.98         (1.37, 2.86)
Starts ('90–'08, SD = 131)                                  0.48          0.24         0.045      1.61         (1.01, 2.57)
Winning Percentage ('60–'68, SD = 0.06)                     0.19          0.29         0.51       1.21         (0.69, 2.13)
Winning Percentage ('69–'89, SD = 0.07)                     0.75          0.22         0.0008     2.11         (1.36, 3.26)
Winning Percentage ('90–'08, SD = 0.07)                     0.46          0.20         0.02       1.59         (1.07, 2.37)
Earned Run Average ('60–'68, SD = 0.38)                    -0.95          0.39         0.02       0.39         (0.18, 0.83)
Earned Run Average ('69–'89, SD = 0.39)                    -0.95          0.27         0.0006     0.39         (0.23, 0.67)
Earned Run Average ('90–'08, SD = 0.52)                    -0.48          0.21         0.02       0.62         (0.41, 0.93)
Complete Game Percentage ('60–'68, SD = 0.09)               0.11          0.29         0.69       1.12         (0.64, 1.98)
Complete Game Percentage ('69–'89, SD = 0.08)               0.92          0.28         0.001      2.52         (1.44, 4.39)
Complete Game Percentage ('90–'08, SD = 0.05)               2.09          0.53         < 0.0001   8.08         (2.84, 22.98)
Shutout Percentage ('60–'68, SD = 0.02)                     0.38          0.26         0.14       1.46         (0.88, 2.42)
Shutout Percentage ('69–'89, SD = 0.02)                     1.07          0.28         0.0001     2.91         (1.68, 5.04)
Shutout Percentage ('90–'08, SD = 0.02)                     1.58          0.35         < 0.0001   4.87         (2.44, 9.70)
Innings Pitched ('60–'68, SD = 1117)                        0.10          0.20         0.63       1.10         (0.74, 1.64)
Innings Pitched ('69–'89, SD = 1035)                        0.72          0.19         0.0002     2.05         (1.40, 3.00)
Innings Pitched ('90–'08, SD = 875)                         0.61          0.26         0.02       1.84         (1.11, 3.04)
Opponents' Batting Average ('60–'68, SD = 0.01)            -0.73          0.32         0.02       0.48         (0.26, 0.91)
Opponents' Batting Average ('69–'89, SD = 0.01)            -0.75          0.21         0.0004     0.47         (0.31, 0.71)
Opponents' Batting Average ('90–'08, SD = 0.02)            -0.55          0.21         0.01       0.58         (0.38, 0.88)
Strikeouts per 9 Innings Pitched ('60–'68, SD = 1.01)       0.50          0.29         0.09       1.65         (0.93, 2.93)
Strikeouts per 9 Innings Pitched ('69–'89, SD = 1.11)       0.75          0.21         0.0004     2.12         (1.40, 3.23)
Strikeouts per 9 Innings Pitched ('90–'08, SD = 1.19)       0.21          0.20         0.29       1.24         (0.83, 1.84)
Hits per 9 Innings Pitched ('60–'68, SD = 0.55)            -0.72          0.33         0.03       0.49         (0.26, 0.94)
Hits per 9 Innings Pitched ('69–'89, SD = 0.63)            -0.72          0.21         0.0006     0.49         (0.32, 0.73)
Hits per 9 Innings Pitched ('90–'08, SD = 0.71)            -0.56          0.21         0.009      0.57         (0.38, 0.87)

Note: The model for the era from 1960–1968 includes 115 pitchers (23 who threw no-hitters and 92 who did not). The model for the era from 1969–1989 includes 190 pitchers (38 who threw no-hitters and 152 who did not). The model for the era from 1990–2008 includes 180 pitchers (36 who threw no-hitters and 144 who did not). The standard deviation (SD) of each variable during each era is given next to the variable name.
Note that all point estimates and 95% confidence intervals for the odds ratios are interpreted in the same fashion as was previously discussed.
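As a concrete check of that interpretation, each odds ratio and interval follows directly from the coefficient and its standard error. Using the games-started row of Table 4 (coefficient 0.44, standard error 0.11) reproduces the reported values up to rounding of the published estimates.

import math

beta, se, z = 0.44, 0.11, 1.96          # games started, Table 4 (one unit = one SD = 144 starts)
odds_ratio = math.exp(beta)             # about 1.55
ci_low = math.exp(beta - z * se)        # about 1.25 (Table 4 reports 1.24)
ci_high = math.exp(beta + z * se)       # about 1.93 (Table 4 reports 1.94)
print(odds_ratio, ci_low, ci_high)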
Predicting the Odds of Pitching a No-Hitter: Multiple Predictor Variable Results

To investigate the effects of several variables on the odds of throwing a no-hitter, multiple predictor variable models were developed using all pitchers and for each era. In every model, the variable selection technique of backward elimination was employed, which involves beginning with all variables
in the model and eliminating the one with the highest p-value (i.e., least significant). The “reduced” model is run with all the remaining variables, and, once again, the variable with the highest p-value is eliminated. This process is repeated until all remaining variables meet some criterion for inclusion in the model. In this case, the inclusion criterion is that all remaining variables must be significant at an alpha level of 0.05. In addition to the use of backward elimination, each model was run with two sets of initial variables. The first set used all nine predictor variables, whereas
the second set used all predictor variables except for innings pitched and hits per nine innings pitched. These two variables were dropped from the second set of initial variables because innings pitched is highly correlated with number of games started (r = 0.979, p-value < 0.0001), and hits per nine innings pitched is highly correlated with opponents’ batting average against (r = 0.995, p-value < 0.0001). A visual representation of the correlations between all nine predictor variables can be seen in the scatter plot matrix in Figure 1.
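The backward-elimination procedure described above can be written as a short loop. The sketch below reuses the same assumed statsmodels ConditionalLogit fit and hypothetical column names as the earlier modeling sketch; the 0.05 threshold is the article's retention criterion.

import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

def backward_eliminate(df: pd.DataFrame, candidate_cols, alpha=0.05):
    """Drop the least significant predictor and refit until every remaining
    predictor has a p-value below alpha (or none are left)."""
    cols = list(candidate_cols)
    while cols:
        X = df[cols].apply(lambda s: (s - s.mean()) / s.std())   # standardize, as in the article
        res = ConditionalLogit(df["no_hitter"], X, groups=df["match_set"]).fit()
        worst = res.pvalues.idxmax()          # least significant remaining variable
        if res.pvalues[worst] <= alpha:
            return cols, res                  # every remaining predictor is retained
        cols.remove(worst)
    return [], None                           # nothing survived the criterion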
Figure 1. Scatter plot matrix for the nine pitcher performance predictor variables. All n = 485 observations, including pitchers with no-hitters and their matched controls, are combined. A LOESS smooth is superimposed on each scatter plot.
Table 6 gives the estimated coefficients for all “final” models using backward elimination and beginning with all variables except for innings pitched and hits per nine innings pitched. Note that any variables that did not appear in any of the “final” models are not included in the table. Also, those results that were different when all nine variables were used instead will be reported in the text. Once again, the most important parts of the results are the odds ratio estimates. In the multiple predictor variable case, the point estimate for the odds ratio indicates the effect that increasing a specific predictor variable one standard deviation, while holding all other variables constant, has on the odds of throwing a no-hitter. Thus, the odds ratio for the winning percentage variable in the model from 1969–1989 indicates that if a pitcher’s winning percentage increases by 0.07 (or 7%), holding all other variables constant, then the odds of throwing a no-hitter increase by a factor of 1.72 (or 72%) with a 95% confidence interval of 1.05 to 2.83. For the final model that includes the entire data set (denoted as “overall” in the table), the most important
variable is shutout percentage. The 95% confidence interval for the odds ratio indicates that if shutout percentage is increased by 0.03 (or 3%), holding opponents' batting average constant (as this is the only other variable in the model), then the odds of throwing a no-hitter increase by a factor of between 1.51 and 2.98. In the context of current starting pitchers, such an increase is equivalent to throwing one more shutout during a full season, since starting pitchers typically make between 30 and 35 starts during the year. As previously indicated, the only other variable included in the model using the entire data set is the opponents' batting average against, whose negative sign makes sense: If hitters' batting averages against a pitcher decrease, that indicates a pitcher is giving up fewer hits on average. The interpretation of the odds ratio and 95% confidence interval is similar to that for variables with positive parameter estimates, except that the factor is now less than one. Therefore, if a pitcher's opponents' batting average against is increased by 20 points (i.e., 0.020 or 2%), holding shutout percentage constant, then the odds of throwing a no-hitter decrease by a factor of 0.64 (or 36%). This indicates, as expected, that a pitcher is less likely to throw a no-hitter as his opponents' batting average against increases. Note that this procedure can be inverted to interpret the odds ratio as increasing by a certain amount given a decrease in a particular variable. In the current example, if a pitcher's opponents' batting average against is decreased by 20 points, holding shutout percentage constant, then the odds of throwing a no-hitter increase by a factor of 1.56, which equals 1/0.64. The era-stratified conditional logistic regression model fits explored which career statistics were important during different time periods. As is apparent in Table 6, the final models differed greatly between each era. As with the overall model, backward elimination was applied to the initial model that included all predictor variables except innings pitched and hits per nine innings pitched. From 1960–1968, the final model included winning percentage and earned run average.
Table 6—Parameter Estimates from the Conditional Logistic Regression Models Using All Pitchers and the Three Eras

Variable (model, SD)                                 Coefficient   Std. Error   P-Value    Odds Ratio   95% CI
Winning Percentage ('60–'68, SD = 0.06)              -0.99          0.50         0.05       0.37         (0.14, 1.00)
Winning Percentage ('69–'89, SD = 0.07)               0.54          0.25         0.03       1.72         (1.05, 2.83)
Earned Run Average ('60–'68, SD = 0.38)              -1.89          0.67         0.005      0.15         (0.04, 0.56)
Shutout Percentage (overall, SD = 0.03)               0.75          0.17         < 0.0001   2.12         (1.51, 2.98)
Shutout Percentage ('69–'89, SD = 0.02)               0.72          0.29         0.02       2.05         (1.15, 3.65)
Shutout Percentage ('90–'08, SD = 0.02)               1.58          0.35         < 0.0001   4.87         (2.44, 9.70)
Opponents' Batting Average (overall, SD = 0.02)      -0.45          0.15         0.002      0.64         (0.48, 0.85)
Strikeouts per Nine Innings ('69–'89, SD = 1.11)      0.49          0.24         0.04       1.63         (1.02, 2.61)

Note: Models were selected using backward elimination.
Of these, earned run average is clearly the most important, indicating an increase (decrease) in ERA of 0.38 is associated with a decrease (increase) in the odds of throwing a no-hitter by anywhere from a factor of 0.04 to 0.56 (1.79 to 25.0). Also note that the point estimate of the regression coefficient for the winning percentage variable is the opposite of what would be expected. From 1969–1989, the final model included winning percentage, shutout percentage, and strikeouts per nine innings pitched. Just as with the overall model, shutout percentage appears to be the most important variable during this era. Also, this was the only situation in which the final model changed depending on which set of variables was included in the initial model. When all nine predictor variables were included in the initial model, the final model was dramatically different from the model in Table 6. In this case, the final model
included the number of innings pitched (OR: 1.76, 95% CI: [1.18, 2.64]) and opponents’ batting average (OR: 0.55, 95% CI: [0.35, 0.86]). These differences indicate the effect that highly correlated variables can have on the outcome of the final model. From 1990–2008, shutout percentage was the only variable included in the final model. Shutout percentage appears to be even more important during this era, as an increase of 0.02 (or 2%) in shutout percentage is associated with an increase in the odds of throwing a no-hitter of between a factor of 2.44 and 9.70. Although it does not appear in any of the final models, it is believed complete game percentage may have a significant effect on the odds of throwing a no-hitter.
The reason for its absence may be that it is highly correlated with shutout percentage (r = 0.802, p-value < 0.0001). The high correlation between complete game percentage and shutout percentage likely helps to mask the true effect of complete game percentage on the odds of throwing a no-hitter, resulting in complete game percentage being highly insignificant and quickly eliminated from the model. For example, if shutout percentage is dropped from the initial model during the era from 1990–2008, the "final" model includes only the complete game percentage variable (OR: 8.08, 95% CI: [2.84, 22.98]). This was also the case for the overall model, in which complete game percentage replaced shutout percentage as the other significant variable in the
final model if shutout percentage was dropped from the initial model. Clearly, these two variables have become particularly important during the current era, where complete games and shutouts have become increasingly rare.
Conclusion

Overall, there are several conclusions that can be drawn from these analyses. First, shutout percentage appears to be the most important predictor of the probability of throwing a no-hitter. Second, the factors that are potentially predictive of throwing a no-hitter have changed over recent decades. For example, a pitcher's career pitching statistics appear to have had the least significant effect on the probability of throwing a no-hitter during the first era, from 1960–1968. Finally, and not surprisingly, there is some evidence that better pitchers are more likely to throw no-hitters than weaker pitchers, which implies that throwing a no-hitter is not simply a fluke event. Shutout percentage appeared to be the most significant variable in all single and multiple predictor variable models except for the era from 1960–1968. The high correlation between shutout percentage and complete game percentage is a likely reason why complete game percentage did not appear in any of the final multiple predictor variable models. However, when shutout percentage was not included in some of the initial multiple predictor variable models, the estimated effect of complete game percentage was often significant. Since throwing complete games and shutouts has become somewhat rare in today's game, it was also not surprising that these variables were more important during the current era than previous eras. Earned run average appeared to be the only important variable for the era from 1960–1968 when multiple predictor variable models were run for that era. Such results indicate that perhaps the pitchers in the data set during that era were homogeneous. The smaller sample size relative to the other two eras may have been a contributing factor, as well. Were this era to be lengthened to cover as many years as the other two, there is a possibility that more variables would stand out as being significant predictors of the probability of throwing a no-hitter.
It is interesting that, with the possible exception of shutout percentage, none of the various pitcher characteristics stood out across all eras as important for predicting no-hitters. There are two possible explanations that come to mind. First, the results observed here may be an artifact of the formation of the matched controls. By selecting the matched controls from the lists of league leaders in number of games started, we may have washed out some of the effects of pitcher characteristics. It may be that pitching a no-hitter could be equally likely among these league leaders. A future analysis might build an analysis data set by randomly selecting a control for a no-hitter pitcher from the pool of all pitchers who did not throw a no-hitter in a particular season. Second, to a lesser extent, the defensive ability of the fielders behind the pitcher may be critical for achieving a no-hitter. Anecdotally, there is often a spectacular defensive play that robs a sure hit and preserves a no-hitter. It would be interesting to include team defensive summaries as part of a future analysis. One other issue with the matching scheme employed here is that each pitcher's entire career statistics are examined, rather than their career statistics up to the time the no-hitter was thrown. It may be that a pitcher who threw a no-hitter was just beginning his career and that one or more of the pitchers matched with this particular pitcher were near the end of their careers. Thus, a future analysis could either use a pitcher's career statistics up to the time the no-hitter was thrown (for both the pitcher who threw the no-hitter and the matched controls) or use a pitcher's entire career statistics, but match all no-hitter pitchers with other pitchers who played over essentially the same time period. Finally, the analyses gave some evidence that better pitchers are more likely to throw a no-hitter than weaker pitchers. Those pitchers who threw no-hitters are a more skilled group, on average, than those pitchers who did not, as seen in Table 2. The significant effect of shutout percentage on the probability of throwing a no-hitter in three of the final multiple predictor variable models also provides evidence that higher quality pitchers are more likely to throw a no-hitter.
Further Reading

Albert, Jim. 2009. Is Roger Clemens' WHIP trajectory unusual? CHANCE 22(2):8–20.

Bradlow, Eric T., Shane T. Jensen, Justin Wolfers, and Abraham J. Wyner. 2008. A statistical look at Roger Clemens' pitching career. CHANCE 21(3):24–30.

Checkoway, H., N. E. Pearce, and D. J. Crawford-Brown. 1989. Research methods in occupational epidemiology. New York: Oxford University Press.

Collett, D. 2002. Modelling binary data. New York: John Wiley & Sons.

Frohlich, Cliff. 1994. Baseball: Pitching no-hitters. CHANCE 7(3):24–30.

Huber, M., and A. Glen. 2007. Modeling rare baseball events: Are they memoryless? Journal of Statistics Education 15(1).

Sommers, P. M., D. L. Campbell, B. O. Hanna, and C. A. Lyons. 2007. A Poisson model for no-hitters in major league baseball. Middlebury College Economics Discussion Paper No. 07–17.

Stokes, M. E., C. S. Davis, and G. G. Koch. 2000. Categorical data analysis using the SAS system, 2nd ed. Cary, NC: SAS Institute, Inc.

Wikipedia contributors. No-hitter. Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=No-hitter&oldid=336736776.
Best Hand Wins: How Poker Is Governed by Chance Vincent Berthet
During the last several years, poker has grown in popularity so much that some might even say it's become a social phenomenon. Whereas poker was not played much by nonspecialists a few years ago, it has since been democratized and become one of the most widely played card games. Accordingly, the stereotype of the poker player has moved from the old guy who smokes a cigar and hides an ace in his sleeve to the 21-year-old who plays online. While five-card draw used to be the most famous variant of poker, the variant widely played today is Texas hold 'em. In Texas hold 'em, each player is dealt two cards face down. This level is called preflop, and play begins with a first betting round in which each player can check (i.e., stay in the play without having to bet), bet, raise (i.e., bet over a bet), or fold (i.e., leave the play). The dealer then deals a flop, three face-up community cards. The flop is followed by a second betting round. After the flop betting round ends, a single community card called the turn is dealt, followed by a third betting round. A final single community card called the river is then dealt, followed by a fourth betting round. A play can be completed in two ways. First, at least two players have reached the river having bet the same amount. In this case, the remaining players reveal their hands and the winner is the one who holds the best hand (i.e., showdown).
Second, at one of the four levels of play, all players have folded their hands except one, who is then the winner. In such cases, the player who wins the pot does not have to reveal his cards (i.e., no showdown). Because of the growth of poker's popularity and its economic importance, the issue of whether poker is governed by skill or chance has become a recurring question and turned into a social debate, as evidenced by Celeste Biever's recent paper in The New Scientist. As a card game, poker would be governed by chance if the outcome of most games were determined by the value of the cards—the distribution of the cards being random. In other words, the player who wins the pot would also be the one who holds the best hand in most cases. Poker would be governed by skill if the outcome of most games were determined by something other than the value of the cards, such as the betting technique, reading of opponents, or psychology, all of which are grouped under the generic term "skill." The poker industry strongly promotes the idea that poker involves skill more than chance. The purpose of our study was to provide new empirical evidence to this debate. This work was done in response to a recent study by Cigital, Inc., a consulting firm, that reported two major findings over 103 million hands: games did not end in a showdown 75.7% of the time and, of the 24.3% of games that did end in a showdown, only 50.3% were won by the best hand. Both results suggest the outcome of games is determined by something other than the value of the cards. Thus, Cigital concluded that poker is governed by skill. In our study, we show that this interpretation is not relevant when data are further analyzed. Moreover, our research applied inferential statistics, which allowed a clear test of the theoretical hypothesis that poker is governed by chance, while the Cigital research provided only descriptive statistics. Our findings provide clear statistical evidence corroborating this hypothesis.
Method

Data Acquisition

Given the nature of the analysis to be performed, we needed a database that met two criteria: it had to cover cash-game poker, rather than tournament poker, and it had to include the hands of all players in each game, even folded hands. Even though such data are difficult to obtain, they can be found in poker television programs. Two met our criteria: "High Stake Poker" (season five only) and "Poker After Dark" (season four, "Cash Game #1" and "Cash Game #2"; season five, "Nets vs. Vets Cash Game," "Hellmuth Bash Cash Game I," and "Hellmuth Bash Cash Game II"; season six, "Top Guns Cash Game #1" and "Top Guns Cash Game #2"). The coding of these programs resulted in a 678-game database.
Data Coding

For each game, 13 variables were coded for each hand, together with the two cards of each player and the cards on the board. "Showdown" indicates whether a given game ended in a showdown, whereas "Level" indicates the level reached in the game (i.e., preflop, flop, turn, river). "Progress" indicates whether a hand was still in progress when the game ended. To illustrate this variable, let's consider a game in which two players went to the flop. One of the players bet; the other player folded. In this simple case, these two players were in progress when the game ended, while players who folded their hands before the flop were not in progress. There are more subtle cases, however, in which the coding of the progress variable is less obvious. For example, consider a game in which three players—A, B, and C—saw the flop. A was the first player to act and he or she bet. Then B folded, C raised, and A folded. In this case, despite there being three players to the flop, there were only two involved when the game ended. Indeed, B was no longer in progress when C raised, which resulted in A folding and the game ending. "Status" indicates whether a hand was the best among the hands in progress when the game ended. "Result" indicates the objective outcome of a hand (i.e., won or lost). The next four variables refer to which hand was the best at each level of the games among the hands of all players. For instance, for a game that reached the turn, we coded which hand was the best at the preflop, which hand was best at the flop, and which hand was best at the turn. The last four variables refer to the strength of hands at each level of the games. The strength of a hand was defined as an ordinal variable with three levels: weak, marginal, and strong (see supplemental material at www.amstat.org/publications/chance/supplemental). Contrary to classical poker databases, which allow computing of global statistics only (e.g., proportion of games won at showdown, proportion of games won at preflop), our database offered real possibilities for further analysis. Indeed, knowing the hand of each player and which hand was the best at each level of the games allowed us to probe the underlying mechanisms of poker.
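One possible in-memory representation of the coded record for a single player's hand is sketched below. The class and field names are illustrative choices, not the study's actual coding sheet, but the fields correspond to the 13 variables just described (showdown, level, progress, status, result, plus best-hand and strength indicators at each of the four levels).

from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Tuple

class Level(Enum):
    PREFLOP = 1
    FLOP = 2
    TURN = 3
    RIVER = 4

class Strength(Enum):
    WEAK = 1
    MARGINAL = 2
    STRONG = 3

@dataclass
class HandRecord:
    """Coded variables for one player's hand within one game (illustrative)."""
    hole_cards: Tuple[str, str]          # the player's two face-down cards
    showdown: bool                       # did the game end in a showdown?
    level_reached: Level                 # level reached when the game ended
    in_progress: bool                    # still involved when the game ended?
    best_in_progress: bool               # "Status": best among hands in progress at the end?
    won: bool                            # "Result": objective outcome of the hand
    best_at_level: Dict[Level, bool] = field(default_factory=dict)          # best hand among all players, per level
    strength_at_level: Dict[Level, Strength] = field(default_factory=dict)  # weak / marginal / strong, per level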
Results

Three kinds of games were excluded from the analysis: games that ended in a split pot (1.47%); those that did not end in a showdown, but in which players held the same hand (0.15%); and those in which the board was run more than one time (1.03%). This resulted in a 660-game sample. We basically estimated the same three proportions as Cigital, Inc. Our estimations (and 95% confidence intervals) were:

73.5% (70.2, 76.8) of games without showdown
15.3% (12.6, 18.0) of games that ended in a showdown in which the best absolute hand did not win
11.2% (8.8, 13.6) of games that ended in a showdown in which the best absolute hand won

Based on their 103-million hand sample, Cigital, Inc. reported 75.7%, 12.1%, and 12.2% for the first, second, and third proportions, respectively.
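The intervals above are consistent with simple normal-approximation (Wald) confidence limits for a proportion over the 660 analyzed games. The check below is only an approximation, since the article does not state which interval method was used.

import math

def wald_ci(p_hat, n=660, z=1.96):
    """Normal-approximation 95% confidence limits for a proportion, in percent."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return 100 * (p_hat - half_width), 100 * (p_hat + half_width)

for label, p in [("no showdown", 0.735),
                 ("showdown, best absolute hand lost", 0.153),
                 ("showdown, best absolute hand won", 0.112)]:
    print(label, wald_ci(p))    # about (70.2, 76.8), (12.6, 18.0), (8.8, 13.6)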
"Phil Hellmuth said: 'Poker is 100% skill and 50% luck.' That quote is 100% dubious!"
- Anonymous (Two Plus Two poker forum)
Although both sets of estimations are similar, the two samples differ slightly [χ2 = 6.6, df = 2, p < .05], especially regarding the proportion of games that ended in a showdown in which the best absolute hand did not win. This difference could be because Cigital's sample included hands played online by a majority of amateur poker players, whereas our sample included hands played live by professional poker players. Since professional players tend to bluff more than amateur players, it seems normal that best absolute hands won less often in the latter sample. More fundamentally, we do not claim that our sample is representative of Texas hold 'em as a whole. Indeed, one should first define all the relevant features of the population to construct a representative sample of Texas hold 'em, and it seems Cigital did not. As a result, one cannot claim that Cigital's sample is more representative of Texas hold 'em than ours, because online poker played by amateurs is not more representative of Texas hold 'em than live poker played by professionals.

Games with Showdown

One of the arguments advanced by Cigital to support the hypothesis that poker involves skill more than chance is that the best hand wins only 50.3% of the time at showdown. At first glance, such a proportion seems surprising, as one would logically expect the best hand to win every time at showdown. Actually, this finding relies on a particular meaning of the "best hand." In each game, while the best "absolute" hand is the one held by the player who would have made the best five-card hand at showdown, the best "relative" hand is the best hand among players who were in progress when the game ended. In the former, the best hand is determined by taking into account all players, regardless of whether they folded. In the latter, the best hand is determined with respect to players who were in progress when the game ended. Based on these considerations, it seems Cigital's finding relies on the notion of best absolute hand (i.e., best relative hand wins every time at showdown by definition). To illustrate the difference between best absolute hand and best relative hand, let's consider the following example. Three players—A, B, and C—start a play. A holds the five of diamonds and five of spades (a pair of fives), B holds the ace of hearts and the king of spades (two high cards), and C holds the ace of clubs and the 10 of diamonds. The flop is the ace of diamonds, the king of hearts, and the queen of clubs.
35
The value of the hands is now a pair of fives for A, two pairs (aces and kings) for B, and a pair of aces for C. In terms of hand strength, B is better than C, which is better than A. In the first betting round, A checks, B bets, C calls, and A folds. The card at the turn is the five of clubs. If A had remained in the game, A would have had the strongest hand (three fives). B bets and C calls. The card at the river is the two of hearts, which does not help any of the players. B bets and C calls, so the game ends in a showdown. B holds the best relative hand (top two pairs), whereas A holds the best absolute hand (three of a kind). In this play, the best relative hand (B) wins because A folded. Player C loses the most money. Given that the best absolute hand does not win at showdown in half the cases, Cigital suggested that this hand was beaten by skill. This rationale is valid only under the assumption that the best absolute hand is the best hand at all levels of a game. Instead, if the best absolute hand is not the best hand from stem to stern in most cases, then this hand is dominated at a certain level in the play. Therefore, it could be that the best absolute hand does not win at showdown half the time because it is dominated before showdown half of the time. To test this alternative interpretation, we analyzed the statistical properties of best absolute hands that did not win at showdown. Table 1 presents the contingency table of best absolute hand strength by level categories at time of fold.

Table 1—Contingency Table of Best Absolute Hand Strength by Level at Time of Fold

Hand Strength                 Preflop    Flop     Turn    River    Row total
Weak           Observed          78        3        0       0         81
               Column %        97.5     33.3      0.0     0.0      (80.2%)
Marginal       Observed           2        5        1       1          9
               Column %         2.5     55.6    100.0     9.1       (8.9%)
Strong         Observed           0        1        0      10         11
               Column %         0.0     11.1      0.0    90.9      (10.9%)
Column total                     80        9        1      11        101
                              (79.2%)   (8.9%)   (1.0%) (10.9%)

Two main observations can be drawn from this table. First, best absolute hands that did not win at showdown were folded preflop 79% of the time. Second, even though these hands would have won at showdown, they were actually weak hands when they were folded (80% of the time). Moreover, regarding their status, these hands were inferior when they folded 82% of the time. Taken together, these
three observations lead to a clear conclusion: Best absolute hands that did not win at showdown were actually largely dominated when they were folded. Accordingly, the best absolute hand does not win at showdown half the time because it was beaten before showdown half the time—not because it faced a skilled player who bluffed. In other words, it is the value of the cards in this set of games, rather than skill, that determines the outcome. Moreover, these findings highlight the importance of how the best hand is defined. We claim that when determining the best hand, only considering the best relative hand makes sense. Games Without Showdown The second argument advanced by Cigital is that games that do not end in a showdown are governed by skill, and that such a scenario occurs in the large majority of cases (75.7% of the time). At first glance, this argument makes sense. Indeed, a card game is considered to be governed by chance if the outcome of most games is directly related to the value of the cards. Therefore, one could conclude that games without showdown cannot be governed by chance, since no private card is revealed. It is worth noting that games without showdown are the core of poker. In fact, the possibility of winning a game without having to reveal one’s cards is poker’s trademark. This feature seems to eliminate chance for the benefit of skill and opens the door to bluff. Metaphorically, games without showdown are the kingdom of skill and the bluff is king. To illustrate this idea, let’s consider the summary of a promotional video distributed by “Poker After Dark.” The narrator is Doyle Brunson, a poker living legend. Here is what he says: “It all begins with a raise before the flop, then a re-raise. You fold, then a call. Jack,
four, 10, with two hearts at the flop. Check, bet, and call. The Ace of heart[s] is the turn. Bet, and a big raise. Now the real game begins.” Then, the other player folds his hand—a pair of aces—which was probably the best hand. The remaining player wins the pot without showing his cards. “That’s poker, folks,” says Brunson. This promotional video shows what advocates of poker believe to be the core of the game: When cards are not revealed, one can put pressure on the player who holds the best hand and force him to fold. This reasoning is valid at an explicit level, but one also could consider an implicit level. Indeed, the fact that cards are not revealed is not sufficient to discard their influence on the outcome. One has to consider the possibility that most games without showdown are nevertheless won by the player who actually holds the best hand. We therefore analyzed games without showdown (485 games) in more detail. We tested the hypothesis that poker is governed by chance by examining the distribution of winning hands in those games. According to what we suggested previously regarding best hand determination, we considered best relative hands (i.e., the best hand among players who were in progress when the game ended), rather than best absolute hands (i.e., the best potential five-card hand that would have won when the game ended). If poker is governed by skill, winning hands should be uniformly distributed into the inferior hand and best (relative) hand categories (i.e., null hypothesis). Indeed, claiming that poker is governed by skill means winning or losing with a particular hand does not depend on its objective value. In other words, one could win as much with inferior hands as with best hands.
However, if poker is governed by chance, winning hands would be best relative hands most of the time (i.e., alternative hypothesis). This means the outcome of a particular hand would be directly related to its objective value: one would win with best hands and lose with inferior hands. A one-way chi-square test revealed that winning hands were best hands in most cases [χ2 = 99.8, df = 1, p < .001, after Yates’s correction]. Indeed, of the 485 winning hands, 353 were best hands, corresponding to 72.8% (69.5,76.1). Thus, even when hands were not revealed, best hand won almost 75% of the time. Moreover, this tendency was observed at each level of the games. Though the dominance of best hands was slightly modulated by level [χ2 = 13.9, df = 3, p < .005], the best hand tended to win whichever level was reached. Indeed, best hand won 62.7% (59.1,66.3) of the time at the preflop, 74.2% (70.9,77.5) of the time at the flop, 68.9% (65.4,72.4) of the time at the turn, and 84.8% (82.1,87.5) of the time at the river. Table 2 presents the contingency table of winning hand status by level categories.

Table 2—Contingency Table of Winning Hand Status by Level

Winning Hand Status              Preflop    Flop     Turn    River    Row total
Inferior        Observed            38       41       37      16        132
                Column %          37.3     25.8     31.1    15.2      (27.2%)
Best            Observed            64      118       82      89        353
                Column %          62.7     74.2     68.9    84.8      (72.8%)
Column total                       102      159      119     105        485
                                 (21.0%)  (32.8%)  (24.5%) (21.6%)

Considering the process of getting a winner as a process of survival with each level being a new obstacle, these findings show that the best existing hand tends to survive to the winning stage most of the time. One could argue that winning hands being best hands almost 75% of the time does not demonstrate that poker is governed by chance. In fact, when hands are not revealed, holding the best hand is not sufficient to win. Holding the best hand is related to chance, but knowing that one holds the best hand is related to skill. However, knowing that one holds the best hand might not involve much skill, since it could be that best hands are strong hands in most cases. Indeed, the probability of holding the best hand is directly related to the strength of one’s hand, and there is no need to be a skilled
player to know that. Thus, we further tested the hypothesis that poker is governed by chance by analyzing the distribution of winning hands as a function of hand strength. While the skill hypothesis states that winning hands should be uniformly distributed into the weak, marginal, and strong categories (i.e., one could win whatever the hand strength), the chance hypothesis assumes winning hands should be strong hands most of the time. A one-way chi-square test confirmed that winning hands were strong in most cases [χ2 = 72.9, df = 2, p < .001]. In fact, of the 485 winning hands, 27.0% (23.7,30.3) were weak, 21.6% (18.5,24.7) were marginal, and 51.3% (47.6,55.0) were strong. This supports our claim that knowing one holds the best hand does not require much skill, as winning hands are strong most of the time.
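The chi-square statistics reported above can be checked with a few lines of Python; the strength counts 131, 105, and 249 are our back-calculation from the reported percentages of the 485 games.

    import numpy as np
    from scipy import stats

    # Best relative hand vs. inferior hand among the 485 games without showdown
    best, inferior = 353, 132
    expected = (best + inferior) / 2
    # One-way chi-square with Yates's continuity correction (df = 1)
    chi2_yates = sum((abs(o - expected) - 0.5) ** 2 / expected for o in (best, inferior))
    print(chi2_yates)                      # about 99.8

    # Dominance of the best hand by level (2 x 4 contingency table from Table 2)
    table = np.array([[38, 41, 37, 16],    # inferior hands that won, by level
                      [64, 118, 82, 89]])  # best hands that won, by level
    chi2, p, df, _ = stats.chi2_contingency(table)
    print(chi2, df, p)                     # about 13.9 with df = 3

    # Winning-hand strength: weak / marginal / strong
    strength = np.array([131, 105, 249])
    chi2_s, p_s = stats.chisquare(strength)
    print(chi2_s, p_s)                     # about 72.9 with df = 2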
Conclusion Our study might suffer from two limitations. The first concerns our sample, which included games played by top professionals only. Since such players have roughly the same amount of skill, it could be that luck determines the outcome of games in such cases. It is more likely that skill would play a significant role when skilled players play against beginners. Further research is needed to investigate how properties of winning hands depend on the level of players. The second limitation is that our analysis did not take into account betting history. Betting influences the outcome of games along with the value of the cards. While we showed
that the latter plays a massive role in the process of getting a winner in each game, it seems likely that betting also plays a significant role. For instance, a weak hand is more likely to survive when accompanied by proper betting. In the same way, a strong hand is more likely to make money if the betting pattern is adequate. Betting could be used to measure the amount of skill with which a hand was played, so that one could determine which of the chance (i.e., the value of the cards) or skill (i.e., betting) factors best predicts the outcome of hands. This would further address the chance vs. skill issue and extend our findings. We plan to construct a database that includes betting history for the games. When addressing the chance vs. skill issue, one should always keep two guidelines in mind. First, such an issue is unlikely to be resolved by a single study. As a result, legal decisions regarding the status of poker should be made by examining all available scientific evidence. Second, games of pure chance and games of pure skill are located at the extremities of a continuum, and poker is located at a certain point on this continuum. No one can deny that poker involves a certain degree of chance. The mere existence of “bad beats” (i.e., the favorite hand finally loses) is a direct consequence of this random component of the game. However, some studies reported that skill also plays a significant role. For example, in a recent CHANCE article, Rachel Croson, Peter Fishman, and Devin Pope showed that poker is similar to golf, which is a typical game of skill. Moreover, Michael DeDonno and Douglas Detterman reported findings of learning effects in poker in an article published in the Gaming Law Review. Such effects would not be observed if poker were a game of pure chance. As a result, poker is neither a game of pure chance nor a game of pure skill. Hence, the problem is to determine whether poker is dominated by chance or by skill. By showing that it is the value of the cards that mostly determines the outcome, our findings demonstrate that poker is truly governed by chance. In most cases, the player who wins the pot is the one for whom the cards were the most favorable. Best hand wins with and without showdown. That’s poker, folks.
Further Reading
Biever, C. 2009. Poker skills could sway gaming laws. The New Scientist 202(2702):10.
Cigital, Inc. 2009. Statistical analysis of Texas hold ’em. www.cigital.com/resources/gaming/poker/100M-Hand-AnalysisReport.pdf
Croson, R., P. Fishman, and D. G. Pope. 2008. Poker superstars: Skill or luck? Similarities between golf—thought to be a game of skill—and poker. CHANCE 21(4):25–28.
DeDonno, M. A., and D. K. Detterman. 2008. Poker is a skill. Gaming Law Review 12(1):31–36.
Sklansky, D., and M. Malmuth. 1999. Hold ’em poker for advanced players. Henderson, NV: Two Plus Two Publishing.
The Contribution of Star Players in the NBA: A Story of Misbehaving Coefficients
Martin L. Jones and Ryan J. Parker
At the end of the 2008–2009 National Basketball Association (NBA) season, the Los Angeles Lakers were crowned champions. Since winning basketball games and championships is the goal, Lakers’ superstar Kobe Bryant made a strong case for being the league’s most valuable player (MVP). That distinction, however, was given to LeBron James, who led his team—the Cleveland Cavaliers—to the best regular season record in the league. Fortunately for James, voting for the MVP award occurred before the playoffs began, since the Orlando Magic beat the Cavaliers in the Eastern Conference finals. A third candidate for the MVP award was Dwyane Wade from the Miami Heat. Wade made a strong case for himself late in the regular season by upping his level of play to get his team into the playoffs, averaging 33.6 points per game in March and April compared to 29.0 points per game through February. Is there a way to quantify how much each player contributes to his team’s success? We will consider statistical analyses of the regular season using logistic regression models to determine how much each player’s performance contributes to his team’s probability of winning.
The Logistic Regression Models A logistic regression model is used to predict the probability of success π for a binary response variable Y based on observed variables x1, x2, . . . , xk. The link function for the model is the logistic link logit(π) = log(π/(1−π)), that is, the natural logarithm of the odds of success. The form of the model is

logit(π) = α + ß1x1 + · · · + ßkxk    (1)
where the coefficients α, ß1, . . . , ßk are estimated from the data using maximum likelihood estimation. In this application, let Y be the binary random variable representing a win in a given game of the regular season. That is, Y takes the value one for a win and zero for a loss.
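To make the model concrete, here is a minimal Python sketch that fits Equation 1 by maximum likelihood using statsmodels; the game-level data are simulated for illustration and are not the 2008–2009 game logs.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n_games = 82                          # length of an NBA regular season

    # Hypothetical game-level data for one player
    points = rng.normal(28, 6, n_games)
    home = rng.integers(0, 2, n_games)
    true_logit = -3.0 + 0.12 * points + 1.0 * home
    win = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

    # Fit logit(pi) = alpha + beta1*points + beta2*home
    X = sm.add_constant(np.column_stack([points, home]))
    fit = sm.Logit(win, X).fit(disp=0)
    print(fit.params)           # estimated alpha, beta1, beta2
    print(np.exp(fit.params))   # odds ratios for each predictor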
Table 1—Estimated Logistic Regression Model Coefficients for Predicting a Win Based on Player Points and Playing at Home for Three Players

Player   Coefficients        Wald Test p-Value   Odds Ratios & Confidence Interval   Pearson Goodness of Fit p-Value
Bryant   2.302 (constant)
         −0.046 (points)     0.147               0.95 (0.89, 1.02)
         0.948 (home)        0.114               2.58 (0.80, 8.36)                   0.470
James    −3.311 (constant)
         0.132 (points)      0.023               1.14 (1.02, 1.28)
         4.039 (home)        0.001               56.8 (5.30, 608.3)                  0.321
Wade     −2.072 (constant)
         0.055 (points)      0.060               1.06 (1.00, 1.12)
         1.083 (home)        0.025               2.95 (1.14, 7.63)                   0.398
Figure 1. Left: Logistic regression models for predicting the probability of a win based on Bryant’s points, with sample proportions included. Right: Bryant’s distribution of points for home versus away games.
We assume the results of regular season games are independent and that, with no additional information available (such as the quality of the opponent), the probability of a win in a given game is π. The variables used to predict the response Y are a subcollection of the player’s points scored, rebounds, and assists, as well as combined measures of performance. We construct individual logistic regression models for Bryant, James, and Wade to determine if any of these variables are good predictors for the success of their teams. Based on these models, we attempt to determine the impact of each player on his team’s success.
Analyses Based on Points Alone All three of these players are known for scoring many individual points. In fact, during the 2008–2009 regular season, Wade led the league in scoring by averaging 30.2 points per game, followed by James with 28.4 and Bryant with 26.8. Because home court advantage is known to have an effect on the outcome of NBA games, we will include an indicator variable for home court. The models for the three players are summarized in Table 1.
Figure 2. Left: Logistic regression models for predicting the probability of a win based on James’ points, with sample proportions included. Right: James’ distribution of points for home versus away games.
Figure 3. Left: Logistic regression models for predicting the probability of a win based on Wade’s points, with sample proportions included. Right: Wade’s distribution of points for home versus away games.
In addition to the model coefficients, we display the Wald p-values for the statistical significance of the variables, an estimated odds ratio (constructed by exponentiating the coefficient) with a 95% confidence interval, and a p-value for a Pearson goodness-of-fit test. For Bryant, the p-values for the coefficients give suggestive, but not conclusive, evidence that these variables are good predictors for the probability that the Lakers win. Moreover, the estimated odds that the Lakers win decrease by a factor of 0.95 for each additional point scored by Bryant, indicating that increased point production by Bryant is actually detrimental to his team’s success. For both James and Wade, the models indicate that for each additional point they score, the odds of a win for their teams increase by an estimated factor of 1.14 for James and
1.06 for Wade. Moreover, home court advantage is a statistically significant predictor of a win for both the Cavaliers and the Heat. For the Cavaliers, this is not a surprising result, as they boasted one of the best home court records during the 2008–2009 season. The Pearson p-values indicate the models fit reasonably well. In Figures 1, 2, and 3, a graphic of each model is displayed with boxplots showing the point production of each player in home and away games. Also included are points representing the sample proportion of wins for consecutive five-point intervals of points scored.
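The odds ratios in Table 1 can be recovered by exponentiating the reported coefficients, as the short Python sketch below shows; the confidence intervals cannot be reproduced this way because they also require the standard errors, which are not printed.

    import numpy as np

    # Points and home-court coefficients from Table 1
    coefs = {
        "Bryant": {"points": -0.046, "home": 0.948},
        "James":  {"points": 0.132,  "home": 4.039},
        "Wade":   {"points": 0.055,  "home": 1.083},
    }

    for player, c in coefs.items():
        # exp(coefficient) = multiplicative change in the odds of a win
        odds_ratios = {k: round(float(np.exp(v)), 2) for k, v in c.items()}
        print(player, odds_ratios)   # e.g., James -> points 1.14, home about 56.8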
Figure 4. Box plots showing the point distributions of the three players in games in which they win versus games in which they lose
It is well known that winning on the road is more difficult in the NBA than winning at home. Players need to perform at a high level during away games to propel their teams to victory. It is interesting to note that while the above analysis shows Wade’s point production to be important to his team’s success, he tends to score more points during home games, whereas both Bryant and James appear to score more points in away games. A pooled two-sample t test gives moderate evidence (p-value = 0.10) of a difference in Wade’s mean point production at home versus away. His home average is 31.8, while his away average is 28.5. James averages 25.4 points per game at home and 31.5 points per game away. A pooled two-sample t test gives strong evidence of a difference in his point production with a p-value of 0.001. Bryant also provides more offense on the road than he does at home, as his home average is 25.1 points per game while his away average is 28.6 points per game. A pooled two-sample t test for this difference has a p-value of 0.064. One possible explanation for the drop in point production for Bryant and James during home games is that both the Lakers and the Cavaliers had strong home records during the regular season. Often, these teams were well ahead of their opponents going into the fourth quarter, thus Bryant and James saw less action in many home games and, therefore, may have seen a drop in their scoring averages at home. To take this into account, the scoring average can be adjusted for the number of minutes a player plays in a game. We use the number of points scored per 40 minutes, instead of the total number of points scored in each game. Bryant averages 28.4 points per 40 minutes at home and 30.9 away. James averages 28.3 at home and 32.0 away. Wade averages 32.9 at home and 29.5 away. Two-sample tests give modest evidence that James and Wade have a difference in the amount of points they score on the road versus at home. Similar logistic regression models were constructed using the number of points per 40 minutes as a predictor; however, the coefficients for these models do not differ much from those of the original, and the signs of the coefficients are the same. These models suggest that decreased playing time during home games does not reverse the overall trend that Bryant and
James score more points during away games and Wade scores more points during home games, nor does it eliminate the apparent paradox that increased point production by Bryant is detrimental to his team’s chances of winning. Further insight into the Bryant paradox can be obtained from Figure 4, which shows the point distributions for the three players in winning games versus losing games. Wade scores more points in winning efforts (p-value = 0.022). There is no clear evidence of a difference in point production from James for wins versus losses (p-value = 0.372), while there is moderately strong evidence that Bryant scores more in losing outings (p-value = 0.066). Perhaps Bryant, more than James and Wade, tends to take over games in which his team is behind, and thus ends up scoring more points in the Lakers’ losses.
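For readers who want to replicate this kind of comparison, a pooled two-sample t test takes only a few lines in Python; the per-game scores below are simulated around James' reported home and away averages, since the actual game logs are not reproduced here.

    import numpy as np
    from scipy import stats

    # Hypothetical per-game scoring for one player, split by venue
    # (illustrative numbers only; the spread of 7 points per game is an assumption)
    rng = np.random.default_rng(1)
    home_pts = rng.normal(25.4, 7.0, 41)
    away_pts = rng.normal(31.5, 7.0, 41)

    # Pooled (equal-variance) two-sample t test of home vs. away means
    t, p = stats.ttest_ind(home_pts, away_pts, equal_var=True)
    print(t, p)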
Analyses Based on Points, Rebounds, and Assists Scoring a lot of individual points is one way to win basketball games, but getting the rest of your teammates involved is typically a more successful strategy. Superstars attract a lot of defensive attention from the opposing team and can use this to their advantage by passing to open teammates. Also, players who are able to obtain offensive and defensive rebounds help their team attempt more shots than their opponent, thus giving their team more chances to score. We will consider models using points, rebounds, and assists as predictor variables. There are other ways in which players can affect a game’s outcome, but we chose these three because they are available and intuitively plausible as summaries of a player’s performance. As before, we include an indicator variable for home games. Table 2 summarizes logistic regression models for the three players using these predictors. Interestingly, the coefficients and estimated odds ratios for the points and home predictors do not change appreciably from the earlier models that used only these two predictors. The model fits are reasonable, but with four predictors in the models, many of the predictors are not statistically significant. James is the only player with estimated odds ratios above one for all three of the performance predictors.
Table 2—Estimated Logistic Regression Model Coefficients for Predicting a Win Based on Player Points, Rebounds, Assists, and Playing at Home for Three Players

Player   Coefficients         Wald Test p-Value   Odds Ratios & Confidence Interval   Pearson Goodness of Fit p-Value
Bryant   2.302 (constant)
         −0.043 (points)      0.216               0.96 (0.90, 1.03)
         −0.188 (rebounds)    0.102               0.83 (0.66, 1.04)
         0.109 (assists)      0.393               1.11 (0.87, 1.43)
         0.907 (home)         0.139               2.48 (0.75, 8.22)                   0.470
James    −3.65 (constant)
         0.128 (points)       0.026               1.14 (1.02, 1.27)
         0.056 (rebounds)     0.614               1.06 (0.85, 1.31)
         0.006 (assists)      0.957               1.01 (0.81, 1.25)
         4.00 (home)          0.001               54.84 (5.05, 595.9)                 0.684
Wade     −2.71 (constant)
         0.059 (points)       0.049               1.06 (1.00, 1.12)
         −0.077 (rebounds)    0.523               0.93 (0.73, 1.17)
         0.113 (assists)      0.177               1.12 (0.95, 1.32)
         1.24 (home)          0.015               3.47 (1.27, 9.56)                   0.263
Table 3—Probabilities of Winning Games at the Quartiles of the Points Scored, Rebounds, and Assists for the Three Players in Both Home and Away Games

Quartile          Bryant              James               Wade
Q1 = (22, 4, 5)   0.76 (A) 0.89 (H)   0.36 (A) 0.97 (H)   0.24 (A) 0.52 (H)
Q2 = (27, 5, 7)   0.72 (A) 0.87 (H)   0.53 (A) 0.98 (H)   0.33 (A) 0.63 (H)
Q3 = (33, 7, 8)   0.61 (A) 0.79 (H)   0.73 (A) 0.99 (H)   0.40 (A) 0.70 (H)
Wade has an estimated odds ratio below one for the rebounds predictor, and Bryant has estimated odds ratios below one for both points and rebounds. It should be noted that none of the predictors are statistically significant for Bryant, and only the points predictor among the performance measures is statistically significant for Wade. To summarize the points-rebounds-assists models, we consider the probability of winning at approximately the first, second, and third quartiles of the number of points, rebounds, and assists for all three players’ combined data. These comparisons are displayed in Table 3, where it is interesting to note that, as these forms of production increase, the probability of a win increases for both James and Wade, but decreases for Bryant. Bryant’s probabilities of winning are all well above 0.5, whereas James’ probabilities are lower for away games and Wade’s are all below 0.5 for away games.
The strength of the Cavaliers at home is clearly apparent in the probability of a home win for James at all three quartiles, as his team only lost one game at home all season. This was a game in which James did not participate.
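The entries in Table 3 follow directly from the Table 2 coefficients: plug each quartile of (points, rebounds, assists) into the fitted model, adding the home coefficient for home games. A short Python sketch of that calculation:

    import numpy as np

    # Coefficients from Table 2: (constant, points, rebounds, assists, home)
    models = {
        "Bryant": (2.302, -0.043, -0.188, 0.109, 0.907),
        "James":  (-3.65, 0.128, 0.056, 0.006, 4.00),
        "Wade":   (-2.71, 0.059, -0.077, 0.113, 1.24),
    }
    quartiles = {"Q1": (22, 4, 5), "Q2": (27, 5, 7), "Q3": (33, 7, 8)}

    def win_prob(model, pts, reb, ast, home):
        const, b_pts, b_reb, b_ast, b_home = model
        logit = const + b_pts * pts + b_reb * reb + b_ast * ast + b_home * home
        return 1.0 / (1.0 + np.exp(-logit))

    for name, model in models.items():
        for q, (pts, reb, ast) in quartiles.items():
            p_away = win_prob(model, pts, reb, ast, home=0)
            p_home = win_prob(model, pts, reb, ast, home=1)
            print(f"{name} {q}: {p_away:.2f} (A) {p_home:.2f} (H)")
    # Reproduces Table 3, e.g., Bryant Q1 -> 0.76 (A) 0.89 (H)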
Further Analyses One problem with the analyses here is that there should be roughly 10 wins and 10 losses for each variable in the logistic regression model. For Bryant and James, the number of losses for their teams is fewer than 20. Due to the restriction on the reasonable number of predictors we can have in these models, we would like to combine players’ statistics in such a way that a player’s performance can be summarized using fewer predictor variables. In addition, we would like for this combination of player statistics to reflect the teamwork involved in producing points on offense and allowing points on defense.
Table 4—Estimated Logistic Regression Model Coefficients for Predicting a Win Based on Player’s Offensive Rating, Defensive Rating, and Playing at Home for Three Players

Player   Coefficients        Wald Test p-Value
Bryant   9.490 (constant)    0.020
         0.067 (ortg)        0.003
         −0.145 (drtg)       0.001
         0.963 (home)        0.157
James    1.379 (constant)    0.781
         0.119 (ortg)        0.001
         −0.139 (drtg)       0.005
         3.393 (home)        0.013
Wade     −0.140 (constant)   0.965
         0.098 (ortg)        0.000
         −0.106 (drtg)       0.002
         0.545 (home)        0.371
Table 5—Probabilities of a Win at the Quartiles of Offense Ratings and Defensive Ratings for the Three Players in Both Home and Away Games

Quartile               Bryant              James               Wade
Q1 = (103.3, 111.8)    0.55 (A) 0.76 (H)   0.13 (A) 0.82 (H)   0.13 (A) 0.21 (H)
Q2 = (117.1, 104.0)    0.91 (A) 0.96 (H)   0.70 (A) 0.99 (H)   0.58 (A) 0.70 (H)
Q3 = (130.2, 96.2)     0.99 (A) 0.99 (H)   0.97 (A) 1.00 (H)   0.92 (A) 0.95 (H)

Note: High defensive ratings correspond to poor defensive performances.
In his book Basketball on Paper, Dean Oliver provides a method for doing just that. In this analysis, we use Oliver’s offensive and defensive player rating combined with a home court indicator as predictor variables. These offensive ratings estimate the number of points per hundred possessions the player produced, and these defensive ratings estimate the number of points the player allows per hundred possessions. In short, the offensive ratings are created by giving players credit for producing points based on the shots and free throws they take, shots they assist on, and offensive rebounds they obtain. The defensive ratings are created by considering when players force missed shots from their opponents, when they foul opponents resulting in made free throws, the turnovers they force, and the defensive rebounds they obtain. The logistic models for each player are summarized in Table 4. Note that high values for offensive ratings are better than low values, whereas low values for defensive ratings are better than high values. This explains the negative coefficients in the models for the defensive ratings. In Table 5, we compare the probabilities of winning at the average of the quartiles of the offensive and defensive ratings for the three players. Bryant fares better than James or Wade in the away probabilities of winning, but James edges Bryant in the home probabilities. Wade is uniformly dominated by both Bryant and James for both home and away categories. Notice that the apparent paradox that Bryant does not help his team win games disappears when using these combined measures of offensive and defensive performance. It is quite conceivable that these combined performance ratings are better measures of a player’s contribution than are points, rebounds, and assists as raw counts. It should be noted, however, that these ratings depend on the player’s teammates. This mostly applies to defensive ratings, as, other than blocked shots, data are not collected that tell us which shots players force to be missed or allow to be made. For this reason, we expect Wade to be at a disadvantage in this analysis because his team is weaker than James’ and Bryant’s teams.
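Under the same approach as before, the Table 5 probabilities follow from the Table 4 coefficients; the brief Python sketch below uses only the published coefficients and quartiles and also makes the reversal of the Bryant paradox visible, since his predicted probability now rises from Q1 to Q3.

    import numpy as np

    # Coefficients from Table 4: (constant, offensive rating, defensive rating, home)
    models = {
        "Bryant": (9.490, 0.067, -0.145, 0.963),
        "James":  (1.379, 0.119, -0.139, 3.393),
        "Wade":   (-0.140, 0.098, -0.106, 0.545),
    }
    quartiles = {"Q1": (103.3, 111.8), "Q2": (117.1, 104.0), "Q3": (130.2, 96.2)}

    for name, (const, b_o, b_d, b_home) in models.items():
        for q, (ortg, drtg) in quartiles.items():
            logit_away = const + b_o * ortg + b_d * drtg
            p_away = 1 / (1 + np.exp(-logit_away))
            p_home = 1 / (1 + np.exp(-(logit_away + b_home)))
            print(f"{name} {q}: {p_away:.2f} (A) {p_home:.2f} (H)")
    # Matches Table 5, e.g., Bryant rises from 0.55 (A) at Q1 to 0.99 (A) at Q3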
The Case for Each Player Which of these players can best make the case that their game performances contribute to the success of their teams? Using the logistic models based on points (Table 1) and the models
based on points, rebounds, and assists (Table 2), James appears to have the strongest case. The coefficient on points is the largest of the three players in the points-only model, and all three coefficients are positive in the points, rebounds, and assists model. Wade seems to make the second-strongest case using these models; however, the coefficient on rebounds is negative in the multiple predictor model. Wade does have the largest coefficient on assists among the three players, and assists is the only performance coefficient that is positive in Bryant’s model. When looking at the predicted probability of a win at the quartiles for the multiple predictor models (Table 3), Bryant fares much better. All his probabilities are well above 0.5, but these probabilities tend to decrease with increased production from Bryant in each category. While James’ probabilities are slightly lower than Bryant’s for away games for two of the quartiles, his home probabilities are high and all of his probabilities increase with increased production. Wade’s probabilities also show an increasing trend, but are much lower due largely to his team having many more losses than the other two teams. When considering the models based on offensive and defensive ratings (Table 4), James still seems to come out on top. His offensive rating coefficient is the highest among the three players, and his defensive rating is second-highest in absolute value. The paradox of Bryant’s apparent negative effect on his team from the earlier models is reversed in this model, even though his offensive rating coefficient is still the lowest among the three. Moreover, the probabilities of a win at the quartiles (Table 5) show an increasing trend for all three players as the performance measures increase. From these probabilities, perhaps Bryant’s case could be argued to be the best of the three, but only slightly better than that of James. For all of the models, a clear home-court advantage exists, particularly for the Cavaliers. It appears the selection committee made a good decision in selecting James as the MVP of the 2008–2009 season. The logistic models constructed clearly show that he makes an important contribution to his team and a strong case that he contributes at least as much as, if not more than, either Bryant or Wade to the success of their respective teams.
Further Reading
Agresti, Alan. 2007. An introduction to categorical data analysis, 2nd ed. Hoboken, New Jersey: Wiley-Interscience.
NBA.com, www.nba.com.
Oliver, Dean. 2004. Basketball on paper. Dulles, Virginia: Potomac Books Inc.
R Development Core Team. 2008. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Overview of the 2010 Census Coverage Measurement Program and Its Evaluations
Mary H. Mulry and Patrick J. Cantwell
The 2010 Census is part of a long tradition of U.S. censuses that dates back to 1790. The Constitution requires a census be taken every 10 years, making the 2010 Census the 23rd. By law, the U.S. Census Bureau will deliver the census counts for states by December 31, 2010, for use in apportioning the seats in Congress. For use in redistricting, states will receive detailed data about the population within their state no later than March 31, 2011. The U.S. government also distributes more than $400 billion in federal funds to states and communities based in part on the census population data.
Historical Legal Perspective The accuracy of a geographic area’s proportion of the population, called its population share, is important in determining the area’s number of representatives in Congress and state legislatures, as well as the amount of congressional funds it receives. Adjustment of the census for coverage error became prominent when Newark, New Jersey, filed a lawsuit against the U.S. Department of the Treasury alleging that biases in fund allocation resulted from the Secretary of the Treasury’s use of unadjusted 1970 Census numbers, despite having notice of the estimated undercount for that census. The court ruled in favor of the Department of the Treasury, and the unadjusted 1970 Census numbers continued to be used in fund allocation. The controversy continued when New York City led a lawsuit asking for the results of the 1980 Post-Enumeration Program to be used in an adjustment of the 1980 Census. The U.S. Census Bureau took the position that the 1980 Post-Enumeration Program and demographic analysis did not produce sufficiently reliable results to ensure that adjustment would enhance the accuracy of the census count. The courts agreed and upheld the Census Bureau’s decision to not adjust the 1980 Census count. During the 1980s, the U.S. Census Bureau extensively reviewed its coverage measurement methodology to determine whether it was possible to adjust the 1990 Census. In 1987, the Department of Commerce decided not to proceed with plans to produce adjusted census counts, prompting a lawsuit by various states, cities, citizen groups, and individuals against the Department of Commerce. Ultimately, the parties entered into a stipulation prior to the 1990 Census. Under this stipulation, the U.S. Census Bureau agreed to undertake a post-enumeration survey (PES) after the 1990 Census and publish proposed and final guidelines for a
decision on whether to adjust the 1990 Census counts. These final guidelines promulgated pursuant to this stipulation, provided the Census Bureau would announce a decision on whether the results of that survey would be used to adjust the previously announced 1990 Census results by July 15, 1991. The U.S. Census Bureau conducted an extensive evaluation of the 1990 PES estimates to prepare for the decision. On July 15, 1991, the Secretary of Commerce announced that the 1990 Census would not be statistically adjusted. The decision resulted in additional litigation and culminated in the 1996 Supreme Court decision upholding the secretary’s decision against adjustment. For the 2000 Census, the U.S. Census Bureau initially planned to incorporate sampling and estimation into two aspects of the census. One application of sampling was to select a sample of nonrespondents to the mail questionnaires and follow up on only those selected. The other application was to integrate the net coverage error estimates into the census numbers. The motivation for sampling for nonresponse follow-up was to reduce cost, while the aim of integrated coverage measurement was to reduce the differential undercount. However, in 1999, the Supreme Court barred the use of statistical sampling to calculate the population for the purpose of reapportioning the House of Representatives. After this 1999 Supreme Court decision, the U.S. Census Bureau redesigned its plan to eliminate the use of sampling to arrive at apportionment counts. Instead, it approached the issue of census adjustment for redistricting and other uses by constructing a plan for making a decision about correcting the census numbers. The Census Bureau announced in advance the criteria it would use to decide whether estimates from the 2000 post-enumeration survey, called the Accuracy and Coverage Evaluation (A.C.E.) Survey, were sufficiently reliable to use in an adjustment of the census. The U.S. Census Bureau considered the operational data to validate whether the conduct of A.C.E. was successful, assessed whether the A.C.E. measurements of undercount were consistent with historical patterns of undercount and independent demographic analysis benchmarks, and reviewed measures of quality that were available in March 2001. In the end, the inconsistency between the A.C.E. and demographic analysis estimates led to the Census Bureau’s decision not to recommend adjusting the census numbers for redistricting. In October 2001, the U.S. Census Bureau decided not to adjust Census 2000 for other purposes because newly available
evaluation data indicated problems with the A.C.E. detection of erroneous enumerations. After producing the A.C.E. Revision II estimates in March 2003, the U.S. Census Bureau decided not to adjust the census base for the Intercensal Population Estimates Program. The reasons for this decision included inconsistencies with demographic analysis for children 0–9 years of age and problematic estimates for some small areas.
Measures of Census Coverage Since 1940 Taking a census is so difficult that every census has probably had some coverage error, but, generally, censuses were thought to be accurate enough for their purpose. Even Thomas Jefferson suspected there was coverage error in the first U.S. census, which he directed in 1790. However, measuring the coverage error requires data and methodology that did not exist until more recent times. Consideration of measuring the census coverage error sprang from a match between the 1940 Census and draft registration records. Surprisingly, there were more young adult males registered for the draft than were counted in the census. This prompted work to begin on developing methods to measure the census coverage error. The U.S. Census Bureau now applies two methods to evaluate censuses: post-enumeration survey and demographic analysis. The 2010 CCM is a post-enumeration survey. Demographic analysis uses vital records to form an estimate of the size of the population. The basic approach is to start with births, subtract deaths, and add the net migration. Net migration is the difference between immigrants and emigrants. Demographic analysis uses administrative statistics on births, deaths, authorized international immigration, and Medicare enrollments, as well as estimates of legal emigration and net unauthorized immigration. The first coverage evaluation results for the 2010 Census will be the demographic analysis estimates of the total population size on April 1, 2010, available late in 2010 or early in 2011. The estimates of percent net undercount shown in Figure 1 demonstrate that, at the national level, the estimates from the 1980, 1990, and 2000 implementations of post-enumeration surveys are comparable to those from demographic analysis. An advantage of the post-enumeration survey is that it provides estimates for levels of geography below the national level and for racial and ethnic groups, although the estimates have sampling error and a vulnerability to violations of underlying assumptions. The advantage of demographic analysis is that it uses vital records, which are independent of the census and may be less susceptible to reporting error. However, problems exist with estimation of net migration into the U.S. In addition, demographic analysis estimates are not available below the national level or for racial and ethnic groups other than black and non-black, because historical records contain only those groups. Figure 2 shows demographic analysis estimates of the percent net undercount for the United States, as well as separately for blacks and non-blacks for censuses from 1940–2000. Census 2000 was able to lower the difference between the percent net undercounts for blacks and the entire United States, called the percent differential undercount for blacks, to 2.7% (2.8% minus 0.1%). From 1940–1990, the percent differential undercount for blacks ranged from 3.0% to 3.9%. Reduction in the differential
Figure 1. Estimates of percent net undercount from post-enumeration surveys and demographic analysis for the 1980–2000 censuses (All estimates are final after revisions.) Source: Long, J. F., J. G. Robinson, and C. Gibson. 2003. Setting the standard for comparison: census accuracy from 1940 to 2000. 2003 ASA proceedings. Alexandria, Virginia: American Statistical Association.
Figure 2. Demographic analysis estimates of percent net undercount for the United States, blacks, and non-blacks for the 1940–2000 censuses Source: Long, J. F., J. G. Robinson, and C. Gibson. 2003. Setting the standard for comparison: census accuracy from 1940 to 2000. 2003 ASA proceedings. Alexandria, Virginia: American Statistical Association.
undercount for a group improves the accuracy of the group’s population share.
Strategy for 2010: Census The design of the 2010 Census builds on past censuses, but also incorporates improved methodologies, partially in response to problems identified in Census 2000. An example of a change is an additional question on the census questionnaire that helps avoid duplicate enumerations. Another example is in the use of technology in creating the master address file for the census. In most areas of the United States, the U.S. Census Bureau conducted the 2010 Census by mailing questionnaires in mid-March to addresses on its master address file and hoped respondents returned the questionnaires by mail as soon as possible. When a questionnaire was not returned, the nonresponse follow-up operation sent an enumerator to the address during May or June to attempt an interview. The 2010 Census used only a “short form” questionnaire with 10 questions. The need for the “long form” used in prior censuses has been obviated by the American Community Survey, which now collects analogous information. By not having a long form, the census operations were able to focus efforts on the accuracy of the count.
The 2010 Census form had two questions regarding coverage. Whenever certain categories of either question were checked, the coverage follow-up operation contacted the household to clarify. This operation ran from mid-April to mid-August. The first question—similar to one used on Census 2000—tried to avoid undercounting. This question asked, “Were there any additional people staying here that you did not include?” Check boxes followed the question with categories such as “children, such as newborn babies or foster children” and “people staying here temporarily.” The second question was new and focused on avoiding overcounting. This question asked, “Does [person] sometimes live or stay somewhere else?” and was followed by a list of check boxes that included “in college housing,” “for child custody,” and “at a seasonal or second home.” Responses to the new question were used to lessen the extent of census duplication. Evaluations of Census 2000 found that duplicate enumerations occurred in the census much more frequently than previously observed or suspected. The estimated number of duplicate enumerations in the Census 2000 count of 281.4 million persons was 5.8 million. This number of duplicate enumerations occurred in the final census count, although the U.S. Census Bureau discovered and removed 1.4 million duplicate listings of housing units during the Census 2000 data collection operations. The causes of duplicate enumerations of persons within a block appear to have arisen from operational errors, such as a dwelling having two addresses on the address list. A re-interview study, limited because it was conducted long after Census Day, found that the causes of duplicate enumerations in different states or counties included moving situations, people visiting family or friends, people with vacation or seasonal homes, college students, and children in shared custody situations. For the first time, the address list for the 2010 Census included GPS coordinates to aid in avoiding duplicate listings of housing units. The U.S. Census Bureau collected GPS coordinates from April through August 2009. Other programs aimed at producing a high-quality census comprised a host of coverage improvement operations and an integrated communications campaign that included advertising and partnerships with organizations that served as advocates for the 2010 Census.
Strategy for 2010: Components of Census Coverage The net coverage error of a census is defined as the difference between the true population size and the census count. The coverage measurement programs for the 1980, 1990, and 2000 censuses used dual system estimation (DSE) to produce estimates of the true population size that combine the proportions of omissions from the census and erroneous enumerations, which are census records that should not have been counted, such as duplicates, fictitious persons, and people who died before Census Day or were born after Census Day. From the dual system estimate of the population size, one subtracts the census count to obtain a measure of the net error. Although the net coverage error of a demographic group or geographic domain provides valuable information about its census count, it is a single number that summarizes the results of various census operations and actual events. To improve how the census is taken and to plan the next census or census test,
it would be worthwhile to examine the components that make up this number. A major goal in the 2010 Census was to produce estimates of the components of census coverage, specifically to divide the number of census records into estimates of its three components—correct enumerations, erroneous enumerations, and whole-person imputations—and to separately estimate the number of omissions from the census. The DSE approach to estimating the components is to first obtain estimates of the population size and the erroneous enumerations. This is not as simple as it may seem, though relatively recent improvements in computer capacity and algorithms have made collecting and processing the data for the components of census coverage more practical. This is especially important when searching all census enumerations in a national file for duplicates. Even the definition of a correct or erroneous enumeration is not obvious, as circumstances produce many records in which the distinction is not clear. For some records, there is not enough information to determine whether the person recorded actually lived there on census day. Further, the concept of geographical correctness can complicate matters. For example, a person may be enumerated in the census at one address, yet it is determined later that he or she actually lived on campus at a university in another county in the same state. Is this enumeration correct? For purposes of apportioning Congress, this record doesn’t change the state’s count adversely. However, the census counts also are used to form legislative districts within the state and for other purposes of interest to local governments. Similarly, estimating the number of omissions from the census is complicated by practical and conceptual issues. As the missing component of the true population size, omissions cannot be collected and analyzed directly, but can be estimated by deduction through DSE. One possible method to derive census omissions is through the relationship: omissions = true population size – correct census enumerations,
where the true population size is estimated by the dual system estimate. As with correct and erroneous enumerations, defining what should qualify as an omission can be debated. For example, for some housing units captured in the census, the records of the entire household are imputed—treated as missing data and filled in statistically from another housing unit. Should such records be considered omissions from the census? Valid arguments on each side render it a difficult decision to make, with no solution that satisfies every intended use of the data. Through the CCM program, the U.S. Census Bureau will measure the coverage of the population living in housing units and that of housing units. Although the methods to measure coverage on the two universes are similar, there are differences. For example, the concept of whole-person imputations does not apply to housing units. For another, when measuring the coverage of persons, those who move between census day and the time of the CCM interview create additional challenges. There is no analogous concept for housing units. For reasons of implementation, group quarters units—such as college dormitories and prisons—and what is called “remote Alaska” are not part of the CCM evaluation.
CCM Data Collection and Processing The post-enumeration survey method may be designed to measure the separate components of census coverage and the net coverage error. The Census Bureau’s implementation uses two samples: the population sample (P sample) and the enumeration sample (E sample). The former is a sample of housing units and people selected independently of the census and designed to support the estimation of omissions—people missed in the census. Members of P-sample households are interviewed and matched to the census on a case-by-case basis to determine whether they were enumerated or missed. The E sample is a sample of census enumerations (records) in the same areas as the P sample, designed to support the estimation of erroneous enumerations. In most respects, the sample design for the 2010 CCM is similar to that used for the 2000 coverage survey, although smaller in size. The 2010 survey is a probability sample of approximately 170,000 housing units in the United States and 7,500 housing units in Puerto Rico. The CCM primary sampling unit is a block cluster, which contains one or more geographically contiguous census blocks. Block clusters are formed to balance statistical and operational efficiencies. A stratified sample of block clusters is selected for each state; Washington, DC; and Puerto Rico. For estimates of housing units, the operations are largely analogous to those described below for people. Under the CCM survey approach to measuring census coverage, it is important to maintain independence between census and CCM operations. Therefore, from August through December of 2009, CCM staff independently listed all addresses in P-sample block clusters with no information, maps, or materials used to form the address list for the 2010 Census—and without participation or assistance from staff working on census operations. Later, interviewing and other operations were scheduled to minimize interaction between the census and CCM staff members. Next, CCM staff members will attempt to conduct a CCM person interview of all P-sample households in each sample block cluster. Field interviewers will collect information about the current residents of the sample housing unit who also lived in the block on census day, in-movers living at the housing unit at the time of interview who did not live there on census day, and certain people who moved out of the unit between census day and the time of the CCM interview. The demographic information collected for each person includes the name, sex, age, date of birth, race, Hispanic origin, relationship to the householder, and whether the household owns or rents. Beginning in November 2010, an extensive computer matching operation will be conducted on the CCM records. P-sample persons are matched to other P-sample persons within the “local search area”—the same sample block cluster and one ring of surrounding blocks—to identify duplicate enumerations in the P sample. A computerized search of census records in the local area and entire census file attempts to locate census enumerations for the P-sample persons. Records in the E sample are matched to the entire census in an effort to identify duplicate census records. After the computer matching, the matching staff review many cases and attempt to clerically match those cases the computer
did not match and to resolve those the computer identified as possible matches. In addition, matchers conduct clerical searches for duplicate persons. Early in 2011, following the clerical match, cases that remain unresolved with respect to census day residence status, enumeration status, match status, or person duplication are sent to the person follow-up operation.
CCM Estimation Once the data have been collected, statistical procedures must be conducted to address missing or inconsistent data, cases that may be extremely influential, and other situations. Several enhancements to the data collection are expected to lessen the occurrence of these problems. From these data, the CCM survey will provide measures of census coverage for both housing units and persons in housing units. The result, available in 2012, will be estimates of net coverage error, as well as estimates of omissions, and correct and erroneous enumerations. Like the coverage measurement survey programs in 1980, 1990, and 2000, the 2010 CCM will measure net coverage error by using dual system estimation (DSE). The U.S. Census Bureau’s use of DSE is based on capture-recapture methodology. As an example, to estimate the number of fish in a pond using this approach, one captures a set of fish, tags them for later identification, and throws them back into the pond. After the fish have had time to disperse sufficiently, one captures a second set of fish. This second set of fish would be the “recapture.” Then, one counts the number of recaptured fish and how many of them are tagged. All fish in the pond can be placed into one of four categories, as depicted in Table 1.

Table 1. The Capture-Recapture Approach to Estimating the Size of a Population

                                                  Recaptured? (Enumerated in Independent or P Sample?)
Captured? (Enumerated in Census or E Sample?)     Yes      No      Total
Yes                                               N11      N12     N1+
No                                                N21      N22     N2+
Total                                             N+1      N+2     N

Note: The entries N11, N12, N1+, N21, and N+1 are known or can be estimated from the samples.

Note that the numbers in the table, N11, N12, N1+, N21, and N+1, can all be observed by counting the fish. The entry N22, and thus the true size of the population, N, are unknown. If we can assume that the capture and recapture are independent, each fish has the same chance of being captured, and each fish has the same chance of being recaptured, then the ratios N11/N+1 and N1+/N should be approximately equal. Hence, an estimator for the unknown total, N, is:

N̂ = N1+ × (N+1 / N11)    (1)
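As a quick numerical illustration of Equation 1, here is a minimal Python sketch with made-up capture-recapture counts (they are not census or CCM figures):

    # Hypothetical capture-recapture counts
    n_tagged = 500      # N1+: fish captured and tagged in the first pass
    n_recaptured = 400  # N+1: fish captured in the second pass
    n_both = 100        # N11: recaptured fish that carry a tag

    # Equation 1: N-hat = N1+ * (N+1 / N11)
    n_hat = n_tagged * (n_recaptured / n_both)
    print(n_hat)        # 2000.0 estimated fish in the pond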
Turning from estimating fish to estimating people, the census enumeration is the initial capture, while the independent sample represents the recapture. Equation 1 becomes

N̂ = (number of correct census enumerations) × (number of P-sample enumerations / number of matches)    (2)
N11 represents the number of people enumerated in the census and then later in the P sample. They can be “matched” across the files for the two enumerations. Thus, the inverse of the ratio in Equation 2,

number of matches / number of P-sample enumerations    (3)
is the sample ratio of P-sample enumerations that match to a census record, that is, that were captured or enumerated (not missed) in the census. The “correct” census enumerations in Equation 2 do not include erroneous enumerations. One can think of an erroneous enumeration as a captured “frog”; this frog does not contribute to the size of the fish population in the pond. The “correct” enumerations also do not include what are called “whole-person imputations,” people who are not “data defined.” People are not “data defined” when their census records have so little valid information that we impute (i.e., insert) all of the person’s characteristics based on another person’s good record from a similar household. Although we have records from the entire census, determining whether a census enumeration is correct must be done on a sample basis because of time and expense. Based on the E sample, we estimate the term N1+ by multiplying the weighted number of census enumerations by the probability a record is data defined, and by the probability an enumeration is correct. The estimator in Equation 2 becomes, effectively,

N̂ = (number of census records) × P̂DD × P̂CE × (1 / P̂match)    (4)
where the subscripts DD, CE, and match correspond to data defined, correct enumeration, and match, respectively. One may recall, among the assumptions leading to Equation 1, that each fish has the same chance of being captured and recaptured. Similarly, the effectiveness of the estimator in Equation 4 generally increases if the underlying probabilities being estimated are the same for all those in the E sample or P sample. As the probabilities become more heterogeneous, an error labeled “correlation bias” can increase. In the estimation approach used in the 1980, 1990, and 2000 coverage measurement programs, this issue was addressed by
defining a large number of mutually exclusive groups, or post-strata, such that the probability of an enumeration being correct was more homogeneous among the sample people within the same post-stratum—and similarly for the probability of a match. To estimate N, one then computed the estimated total within each post-stratum and added the totals across all post-strata. The post-strata were defined by variables that were available (observed or imputed in the samples) and highly correlated with correct enumeration and match status. These variables included demographic characteristics such as age, sex, race, and owner-renter status, as well as operational variables such as type of census enumeration area and rate of mail return for the block cluster.

Unlike the previous programs, the CCM will use more general logistic regression modeling—instead of post-stratification—to estimate the probabilities of match, correct enumeration, and being data defined. The logistic regression modeling allows the U.S. Census Bureau to reduce the correlation bias in its total population estimates without having to include nonsignificant higher-order interactions, as when forming post-stratification cells. Not having the unnecessary interactions allows the Census Bureau to include additional variables in the model that can potentially help reduce error for subpopulation estimates. In this approach, the DSE takes the form:

N̂ = Σ_i P̂_DD,i × P̂_CE,i × (1 / P̂_match,i)    (5)

In Equation 5, the sum is over all records in the census. For record i, one inserts the modeled values of the three probabilities according to the person's characteristics.

We will produce estimates for several types of erroneous enumerations (e.g., duplicates and other types), by various census operational categories, and for specified levels of geography. Examples of census operational categories are mail-out/mail-back enumeration and nonresponse follow-up enumeration. Under the current plan for the geographic level of tabulations, the U.S. Census Bureau will release estimates of coverage for the nation, each of the four major census regions, each state, the District of Columbia, Puerto Rico, and those counties and places in which the census population count exceeds a threshold to be specified. The threshold(s) will be determined so the estimates satisfy certain reliability constraints.
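As a small illustration of Equation 5, the Python sketch below plugs record-level probabilities into the dual system estimator. The probabilities are simulated stand-ins for the fitted logistic regression values the Census Bureau would actually use, so the numbers carry no official meaning.

import numpy as np

rng = np.random.default_rng(2010)
n_records = 100_000   # hypothetical number of census person records

# Placeholders for the modeled probabilities P_DD,i, P_CE,i, and P_match,i that
# would come from logistic regressions on each record's characteristics.
p_dd = rng.uniform(0.97, 1.00, n_records)      # record is data defined
p_ce = rng.uniform(0.90, 0.99, n_records)      # enumeration is correct
p_match = rng.uniform(0.85, 0.98, n_records)   # person matches the P sample

# Equation 5: sum over census records of P_DD * P_CE / P_match.
n_hat = np.sum(p_dd * p_ce / p_match)
print(f"dual system estimate of the population: {n_hat:,.0f}")

# Roughly speaking, n_hat exceeding the census count suggests a net undercount.
print(f"difference from the census count: {n_hat - n_records:,.0f}")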
Evaluation of 2010 CCM

Assessing the effectiveness of the 2010 CCM design is important to provide indications of the quality of the 2010 estimates of coverage error and identify where improvements can be made. The CCM evaluation program will assess the accuracy of the CCM determinations of correct versus erroneous enumerations, census day residency, and matches from the CCM independent sample to the census. The CCM interviewing and matching operations are more complicated than for previous coverage measurement surveys, which makes an evaluation of their effectiveness even more crucial. The results of the evaluations of the 2010 CCM will be available in late 2012.

Previous studies have examined the error in the U.S. Census Bureau's implementation of coverage measurement
surveys and the dual system estimator for past censuses. Recent studies examined the error structure in components of census coverage error.

Several evaluations will consider aspects of errors that may occur during data collection and processing. Three evaluations examine possible data collection error and assess the effect of the level of error on the quality of the 2010 CCM estimates. In one of these evaluations, experts on residence rules and CCM procedures will accompany interviewers and debrief respondents after their interview regarding the roster of residents, alternate addresses, and moves. Another evaluation will investigate recall errors, focusing on respondents' reports of moves. A third evaluation will use an extended search to examine the level of error in identifying geocoding errors in the CCM.

Another three evaluations will examine errors that may occur during the data processing and assess their effect on the CCM estimates. One evaluation will investigate the level of matching error in the clerical matching operation through an independent rematch of a subsample of the block clusters selected for the CCM sample. The second evaluation will refine the methodology that used administrative records to evaluate the 2000 duplicate identification and test whether some duplicates can be confirmed using the administrative records database, instead of fieldwork. A third evaluation will explore how errors occur in the census by taking a detailed look at the sequence of census operations and comparing the results from each operation to the CCM results. The comparison will be attempted on a small scale and provide information about different types of E-sample errors with a view to improving operations in the 2020 Census.

The results of the evaluations of errors in data collection and processing will be used in an evaluation that will provide an overall picture of the quality of the 2010 CCM estimates. This evaluation will synthesize the evaluations of data collection and processing error and information from other sources about sampling error, random data error, and random error in imputations for missing data in the 2010 CCM. Obtaining an understanding of the error structure in estimates of the components of census coverage will aid the design of coverage measurement programs for the 2020 Census.

Another study has the goal of providing direction on the possibility of successfully measuring coverage of group quarters (GQ) facilities and residents of GQs in the 2020 Census, at least
for some types of GQs. The 1980 and 1990 census coverage measurement programs included residents of noninstitutional GQs in their universe, but the methodology was not effective in resolving the census day residence for this group. As a result, the 2000 and 2010 coverage measurement programs excluded all GQs and GQ residents. The study will investigate whether enumeration problems vary by type of GQ and require different methods for assessing census coverage. For example, methods that are effective for measuring census coverage within assisted-living facilities may not be effective for measuring coverage within college dormitories.

Although the evaluations focus on enhancing the understanding of the accuracy of the estimates of net census coverage error and components of census coverage, they also will provide insight into the quality of the 2010 Census. In so doing, the outcome of the evaluations will aid in forecasting costs and errors in the 2020 Census and be helpful in optimizing tradeoffs that will have to be made during its design. The evaluation results also will influence the design of the research during the upcoming decade on methodologies for census-taking and coverage measurement for the 2020 Census.

Editor's Note: Any views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.
Further Reading

Anderson, M., and S. E. Fienberg. 2001. Who counts? The politics of census-taking in contemporary America. New York: Russell Sage Foundation.

Citro, C. F., D. L. Cork, and J. L. Norwood. 2004. The 2000 census: Counting under adversity. Washington, DC: The National Academies Press.

Hogan, H. 1993. The 1990 post-enumeration survey: Operations and results. Journal of the American Statistical Association 88:1047–1060.

Mulry, M. H. 2007. Summary of accuracy and coverage evaluation for census 2000. Journal of Official Statistics 23:345–370.

U.S. Census Bureau. 2004. Accuracy and coverage evaluation of census 2000: Design and methodology. Washington, DC: U.S. Census Bureau. www.census.gov/prod/2004pubs/dssd03-dm.pdf.
Here's to Your Health
Mark Glickman, Column Editor

A Story of Evidence-Based Medicine: Hormone Replacement Therapy and Coronary Heart Disease in Postmenopausal Women

Arlene S. Swern
One aspect of the current discussion regarding health care reform is the use of evidence-based medicine to determine the standard of care. The essence of evidence-based medicine, a concept first credited to David Sackett in 1997, is the idea that clinical decisions and the determination of standard of care need to be based on a careful and systematic scientific review of the medical and scientific literature. But what is considered standard of care or current evidence is a function of the studies that have been conducted at the time and ends up being an ever-changing story as new trials and meta-analyses are conducted, rather than an absolute "truth." Even when the "true" effect of a medical treatment remains constant over time, statistical inference about that medical treatment changes as evidence becomes more refined and focused. This phenomenon is illustrated through the evolution of our knowledge about the association between hormone replacement therapy (HRT) and the risk of coronary heart disease (CHD) in postmenopausal women.
HRT: Our Story Begins

The use of hormone replacement therapy, now sometimes called hormone therapy, to counteract the symptoms of menopause was documented as early as 1959. Treatment with low doses of estrogen, as it was used initially, and, later, estrogen plus progesterone artificially boosted hormone levels in postmenopausal women in whom endogenous hormones were dropping and thereby counteracted some of the uncomfortable side effects of menopause, such as hot flashes and difficulty sleeping through the night.
Arlene S. Swern is a director in charge of medical affairs statistics at Celgene Corporation in Summit, New Jersey. Her current research interests include extensions of the zero-inflated Poisson distribution, all aspects of meta-analysis, and good clinical study design. She earned her PhD in biostatistics from Yale University.
There had been much speculation that the increase in mortality from coronary heart disease in women around the age of menopause was also due to a drop in endogenous estrogen and that HRT would counteract that effect. Studies examining this hypothesis were published from 1971 to 1991. In 1991, Meir Stampfer published a meta-analysis of 31 studies that sought to evaluate and summarize this body of evidence. ("Meta-Analysis: Points to Consider" discusses the method of meta-analysis and issues related to its use.) The meta-analysis found a 44% reduction in the relative risk (RR) of CHD for women who had taken HRT compared to those who had not (RR: 0.56, 95% CI: 0.50, 0.61). ("Odds Ratio, Relative Risk" discusses measures of risk.) These results remained significant after adjusting for cardiovascular risk factors such as age, smoking, and blood pressure, suggesting that women taking postmenopausal estrogen therapy are at decreased risk for CHD.

This meta-analysis, as Stampfer cautioned when he published it, had severe limitations. It was conducted based on retrospective observational studies, which are subject to biases and confounding. Moreover, the decision to give HRT was
not based on randomization. Was there certain information available at the time that would make a physician put a certain kind of person on HRT? Women who took HRT at that time were often upper middle class and relatively well educated. Such women were more likely to pay attention to cardiovascular risk factors and probably saw their physicians more often. As a result, what was seen as a reduction in CHD due to HRT was possibly due to selection bias.

Furthermore, there were issues in how exposure was defined. Although most of these studies followed women with and without use of estrogen and therefore had an internal control group, women were often classified as estrogen users or nonusers at baseline and assumed to remain in these categories. In fact, women often took estrogen for a limited period of time.

In 1997, being aware of the limitations of these studies and the meta-analysis, Elina Hemminki and Klim McPherson published a pooled analysis of 22 randomized trials with 4,124 women that did not support the notion that postmenopausal hormone therapy prevents cardiovascular events. In fact, the odds ratio (OR) for all cardiovascular events was 1.64 (95% CI: 0.65 to 4.18), suggesting a harmful effect for HRT, but one that was not statistically significant. However, most of these studies were small, cardiovascular events were usually not the primary endpoint and were often reported only in passing, the number of women with events was small, and the follow-up was fairly short. Hemminki stated, "Because of the varying lengths of treatment, regimens used, and the vagueness of describing health outcomes, a formal meta-analysis was not carried out." Interestingly, their method was a pooled analysis, sometimes called a marginal model, that treats the results of individual studies as if they come from one study and is more sensitive to differences among studies than a meta-analysis is.
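One way to picture the distinction is sketched below in Python: a pooled, or marginal, analysis stacks the 2x2 tables from all trials into a single table before computing one odds ratio, whereas a meta-analysis summarizes each trial first. The trial counts are invented for illustration and are not a reconstruction of the Hemminki and McPherson analysis.

# Hypothetical trials: (events on treatment, treatment total, events on control, control total)
studies = [
    (4, 150, 2, 148),
    (3, 210, 2, 205),
    (5, 180, 3, 176),
]

# Pooled ("marginal") analysis: collapse all trials into one 2x2 table.
a = sum(e_t for e_t, n_t, e_c, n_c in studies)            # treatment events
b = sum(n_t - e_t for e_t, n_t, e_c, n_c in studies)      # treatment non-events
c = sum(e_c for e_t, n_t, e_c, n_c in studies)            # control events
d = sum(n_c - e_c for e_t, n_t, e_c, n_c in studies)      # control non-events
pooled_or = (a * d) / (b * c)

# A meta-analysis would instead keep a separate estimate for each trial.
study_ors = [(e_t * (n_c - e_c)) / ((n_t - e_t) * e_c)
             for e_t, n_t, e_c, n_c in studies]

print(f"pooled (marginal) odds ratio: {pooled_or:.2f}")
print("per-study odds ratios:", [round(x, 2) for x in study_ors])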
HRT in Postmenopausal Women with CHD: An Answer

In 1998, the Heart and Estrogen/Progestin Replacement Study (HERS) of 2,763 women with documented coronary heart disease was published. Postmenopausal women with an intact uterus who were under the age of 80 were randomized to receive either estrogen and progesterone or placebo and followed for an average of 4.1 years. The average age on study was 66.7 years; 82% of those assigned to hormone treatment were taking it at the end of one year, and 75% were taking it at the end of three years. This prospective randomized study found an increased risk of cardiac events with HRT in this special group of women with coronary heart disease. As a result, the initiation of HRT was no longer recommended in postmenopausal women with cardiovascular disease. Its use, however, was still encouraged in women without cardiovascular disease.
HRT in Postmenopausal Women Without CHD: We Think We Have the Answer

Between 1993 and 1998, the Women's Health Initiative (WHI) enrolled 161,809 postmenopausal women between the ages of 50 and 79 into a series of trials. One of the subtrials was a
Meta-Analysis: Points to Consider

Meta-analysis is commonly used as a tool of evidence-based medicine to scientifically synthesize the medical literature and help determine standard of care. A meta-analysis will have more statistical power to detect a treatment effect than individual studies. It is particularly useful in combining data from several undersized studies or in evaluating conflicting results from individual studies. Conducting a meta-analysis requires both a clear understanding of the methodology and the subject matter. And a meta-analysis is only as good as the individual studies that go into it.

Like any scientific study, a meta-analysis needs a clearly formulated question and protocol. The exact population to be studied and inclusion/exclusion criteria for studies need to be carefully defined. Should observational studies be included with randomized studies? What kinds of biases are being introduced with each choice? Search strategies for identifying relevant studies and methods for identifying and assessing publication bias need to be considered. It is well known that studies with strong, significant conclusions are easier to publish, and relying only on such studies can sometimes lead to a mistaken conclusion.

The choice of the outcome measure depends on the formulation of the hypothesis. How we decide to summarize the data from the studies depends on the degree of heterogeneity among the studies. In a fixed effects model, the assumption is that the only differences among the studies are due to chance, and each study's estimate is weighted by the inverse of its variance. This process can be adapted to most outcome measures (e.g., difference in means, odds ratio, relative risk). The random effects model assumes that, in addition, there is a between-study component of the variance of the treatment effect that depends on the heterogeneity of the data.

Assessing heterogeneity among the studies involves more than deciding whether to use a fixed effects or random effects model. If the studies are different, are they different in a critical way? Heterogeneity can sometimes point to a critical flaw in combining studies. Is there a factor that distinguishes one study from another that needs to be included in the meta-analysis? Other considerations are the quality of the studies included and sensitivity analyses, which assess the extent to which the results of a meta-analysis are driven by any one study. If that is the case, what is the reason? Are there many more patients in one study? Are the results different because this study is different from the others in some way? Should the study not be included in the meta-analysis?
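As a rough illustration of the fixed effects and random effects weighting described above, the Python sketch below pools hypothetical study-level log odds ratios, first with inverse-variance weights and then with DerSimonian-Laird random effects weights. The study results are invented, and DerSimonian-Laird is only one common choice for estimating the between-study variance.

import numpy as np

# Hypothetical log odds ratios and standard errors for five studies.
log_or = np.array([-0.65, -0.40, 0.10, -0.55, 0.25])
se = np.array([0.30, 0.25, 0.40, 0.35, 0.45])

# Fixed effects: each study weighted by the inverse of its variance.
w = 1.0 / se**2
theta_f = np.sum(w * log_or) / np.sum(w)
se_f = np.sqrt(1.0 / np.sum(w))

# Random effects (DerSimonian-Laird): estimate a between-study variance tau^2
# from Cochran's Q, then re-weight.
Q = np.sum(w * (log_or - theta_f)**2)
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (len(log_or) - 1)) / C)
w_r = 1.0 / (se**2 + tau2)
theta_r = np.sum(w_r * log_or) / np.sum(w_r)
se_r = np.sqrt(1.0 / np.sum(w_r))

for label, est, s in [("fixed effects", theta_f, se_f), ("random effects", theta_r, se_r)]:
    lo, hi = est - 1.96 * s, est + 1.96 * s
    print(f"{label}: pooled OR {np.exp(est):.2f} (95% CI {np.exp(lo):.2f} to {np.exp(hi):.2f})")

The more heterogeneous the studies, the larger tau^2 becomes and the wider the random effects interval is relative to the fixed effects interval.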
Odds Ratio, Relative Risk

The odds ratio (OR) in the context of this article is the ratio of the odds that CHD will occur in women taking HRT compared to the odds of CHD occurring in women not taking HRT. It is a summary measure commonly reported from a logistic regression analysis. The risk ratio (RR), or hazard ratio, is the probability of CHD occurring in women taking HRT compared to the probability of CHD occurring in women not taking HRT. It is most often reported from a Cox regression analysis. In a case such as ours, in which a rare disease is being studied, the odds of the disease is approximately the risk of the disease. Therefore, the two can be used interchangeably.

A risk ratio of 1 can be interpreted as no effect for the characteristic being studied. A risk ratio greater than 1 indicates an increased risk for the characteristic; for example, a risk ratio of 1.24 for CHD within this context indicates a 24% increase in CHD for women taking HRT compared to women not taking HRT, calculated as ((1.24 - 1)/1) * 100. The risk ratio of 0.56 in the Stampfer meta-analysis represents a 44% reduction in the relative risk of CHD, calculated as ((1.00 - 0.56)/1.00) * 100, for women who had taken HRT compared to those who had not.
study of 16,608 postmenopausal, primarily healthy women free of cardiovascular disease who had been randomly assigned to receive placebo or estrogen-progestin and followed for a mean of 5.2 years. In 2002, we were given what we thought was the final answer when this trial was stopped because women receiving HRT had an increased risk of invasive breast cancer. In addition to the findings on breast cancer, this study found that estrogen plus progestin did not confer cardiac protection and, in fact, possibly increased the risk of CHD with a hazard ratio of 1.29 (95% CI, 1.02 to 1.63) (i.e., a statistically significant 29% increase in the incidence of CHD). See "A Reanalysis of the Women's Health Initiative Data."

The results of this study were taken seriously. Here was a carefully designed, well-conducted randomized trial of healthy postmenopausal women who were actually randomized between HRT and placebo and therefore not subject to the biases in the previously published studies. According to this study, HRT did not have the cardio-protective effect it was thought to have and it increased the risk of breast cancer. The results of the WHI conflicted with what had been published previously, but many of those earlier studies were either small or observational studies. And the HERS study was conducted within a special population of women with coronary heart disease. It was widely believed that the discrepancy between the previous observational studies and this randomized clinical trial was likely due to a confounding variable that was not adequately adjusted for in the observational studies.
                    CHD
HRT         Yes        No         Total
Yes         A          B          A+B
No          C          D          C+D
Total       A+C        B+D        A+B+C+D

Odds of CHD in women taking HRT is A/B
Odds of CHD in women not taking HRT is C/D
Odds ratio of CHD is (A/B)/(C/D) = (A*D)/(B*C)    (1)

Risk of CHD in women taking HRT is A/(A+B)
Risk of CHD in women not taking HRT is C/(C+D)
Risk ratio of CHD is (A/(A+B))/(C/(C+D))    (2)

And in the case of a rare event, risk ratio ≈ (A/B)/(C/D) = odds ratio.
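A short Python sketch of these formulas, with invented counts rather than data from any study discussed in the article, shows how close the odds ratio and risk ratio are when the outcome is rare.

# Hypothetical 2x2 table; the counts are made up for illustration.
A, B = 30, 970    # women on HRT: with CHD, without CHD
C, D = 40, 960    # women not on HRT: with CHD, without CHD

odds_ratio = (A * D) / (B * C)                 # equivalent to (A/B)/(C/D)
risk_ratio = (A / (A + B)) / (C / (C + D))

print(f"odds ratio: {odds_ratio:.3f}")   # about 0.74 for these counts
print(f"risk ratio: {risk_ratio:.3f}")   # 0.75, nearly the same because CHD is rare here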
Postmenopausal women all over the world discontinued HRT, and, in September of that year, the United Kingdom's Telegraph projected that U.K. prescriptions for hormone replacement therapy would fall from 6.2 million in 2001 to 2.6 million in 2002. If this was not definitive enough, Rafael Gabriel Sanchez of the Cochrane Collaboration published a meta-analysis in 2003 including 10 clinical trials, two involving healthy women and eight involving women with heart disease. This meta-analysis found "no evidence that hormone therapy provides heart-related benefits to postmenopausal women with or without heart disease for any of the cardiovascular outcomes assessed."

This was not the end of the story, however. In 2004, Shelley Salpeter published a meta-analysis including data from 30 trials with 26,708 participants, including the WHI trial. To balance the effect of risk of breast cancer against CHD, the researchers looked at mortality, both total and cause-specific, associated with HRT. This meta-analysis found that, for all-cause mortality, the OR was 0.98 (95% CI, 0.87 to 1.12). Overall, there was no significant evidence that HRT reduced or increased mortality. However, these researchers did something not done before: They tried to assess the relationship between the effect of HRT and the age at which patients received it. As is often the case in meta-analysis, the researchers had access only to study-level results published in the literature, not to individual patient-level data. They did have the average age of the patients in each individual study included in the meta-analysis,
however. When the results of these trials were stratified by the mean age of the participants—