About the Authors

Heather Ardery graduated from the University of Kansas with a Bachelor of Science in accounting in 2008. She is working toward her Master of Accountancy at the University of Kansas School of Business.

John Bailer is distinguished professor and chair in the department of statistics, affiliate member of the departments of sociology and gerontology and zoology, and a senior researcher in the Scripps Gerontology Center at Miami University in Oxford, Ohio. His research interests are in quantitative risk estimation, dose-response modeling, and the design and analysis of occupational and environmental health studies.

Shenghua Kelly Fan is an assistant professor of statistics at California State University, East Bay. Her research and statistical practice include clinical trial design and multivariate optimal design.

Mark G. Haug is a school of business teaching fellow at the University of Kansas. His research interests include interdisciplinary matters of law and the sciences.

W. J. Hurley is a professor in the department of business administration at the Royal Military College of Canada. His research interests are in game theory, decision analysis, operations research practice, statistics in sports, and wireless communication networks.

Zacariah Labby is a third-year PhD student in the graduate program in medical physics at The University of Chicago. He works on novel approaches to assessing the response of mesothelioma patients to their cancer therapies.
Virginia Vimpeny Lewis is a lecturer of mathematics education at Longwood University, where she also earned her BS in mathematics. She earned her MIS in interdisciplinary mathematics and science from Virginia Commonwealth University and is currently a PhD candidate in mathematics education at the University of Virginia.
Gang Luo is a research staff member at IBM T.J. Watson Research Center. He earned his BS in computer science from Shanghai Jiaotong University in China and his PhD in computer science from the University of Wisconsin-Madison.
Shahla Mehdizadeh is a senior research scholar at the Scripps Gerontology Center and adjunct associate professor in sociology and gerontology at Miami University in Oxford, Ohio. Her research interests include predicting long-term care needs for the older disabled population, long-term care utilization, and the overall cost of caring for persons with disability in the community.
Douglas A. Noe is assistant professor in the department of statistics at Miami University in Oxford, Ohio. His main research interests are in Bayesian methods and the statistical aspects of data mining, with a current focus on tree-structured models.
Wanli Min is a research staff member at IBM T.J. Watson Research Center. He earned his BS in physics from the University of Science and Technology of China and his PhD in statistics from The University of Chicago.

Mark Schilling is professor of mathematics at California State University, Northridge. His research interests include the theory of longest runs, nearest neighbor methods, mathematical and statistical problems involving simulated annealing, and applications of probability and statistics to sports and games.
Ian M. Nelson is a senior research associate in the Scripps Gerontology Center and a doctoral student in social gerontology at Miami University in Oxford, Ohio. His research interests include long-term care utilization, innovations in home care, and mental health policy for older adults.
Andreas Nguyen recently earned his master’s degree in statistics from California State University, East Bay. He works in marketing research.
Jürgen Symanzik is an associate professor in the department of mathematics and statistics at Utah State University. His research interests include all forms of statistical graphics and visualization. He serves as the regional editor of Computational Statistics.
Fredrick Vars is assistant professor at the University of Alabama School of Law. He is a life-long soccer player and fan. He is also a member of the American Law and Economics Association and earned his JD in 1999 from Yale Law School.
Editor’s Letter
Mike Larsen, Executive Editor
Dear Readers,

This issue is the final issue of the 22nd year for CHANCE. Included are two puzzles and entries on a variety of subjects: medicine (EEG wave classification, phase II clinical trials, and home care services), sports (soccer, volleyball, and golf), insurance (viatical settlements), history (Pascal and Fermat’s letters), graphics, and probability (Weldon’s dice).

This issue also has what I believe to be a first: associated with the article on Weldon’s dice experiment by Zac Labby is a YouTube movie. Go to www.youtube.com/watch?v=95EErdouO2w and view Labby’s dice-throwing machine in action. Weldon’s original data were used by Karl Pearson when developing the chi-square statistic. Labby developed the machine as a project for Steve Stigler’s History of Statistics course at The University of Chicago. Stigler recommended the article for CHANCE, and I am glad he did. Persi Diaconis of Stanford University commented that this is the experiment he has wanted to do for 40 years.

Wanli Min and Gang Luo present methods for classifying EEG waves and applications, particularly in sleep research. As you may recall, two quite different articles about sleep research appeared in issue No. 1 this year. If you want to read the trio together, you can find those (as well as the current) articles at www.amstat.org/publications/chance.

Andreas Nguyen and Kelly Fan discuss ethics associated with phase II clinical trials. Their particular focus is stopping rules, or when to terminate a trial due to early success or failure of a new treatment.

In Mark Glickman’s Here’s to Your Health column, Douglas Noe, Ian Nelson, Shahla Mehdizadeh, and John Bailer look at classification tree methods for predicting disenrollment of patients from home-care services to nursing homes. There are significant costs, both personal and monetary, involved.

Fred Vars posits probability models for scoring a goal in soccer. The process of shooting for a goal in soccer is so complex that simplifying assumptions must be made when estimating chances of success. Vars compares his results to data and suggests richer data sets.

Mark Schilling asks whether streaks exist in competitive volleyball. The existence of streaks is challenging to prove, and Schilling discusses why. Meanwhile, Bill Hurley looks at the odds of the outcomes of a golf tournament and whether a victory by the United States in the Ryder Cup was really amazing.

Mark Haug and Heather Ardery tell us about viatical settlements, or the sale of life insurance policies to third parties. Has anyone ever asked you if you would bet your life? Well, in this case, people do.

Virginia Vimpeny Lewis relates and explains the content of letters between Pascal and Fermat. This historic correspondence played a key role in the development of probability. Lewis provides detailed tables that would be useful in the classroom.

Howard Wainer, in his Visual Revelations column, critiques a graphic that appeared in The New York Times in May of 2009. Illustrations such as the one discussed are appealing for their color and context, but it is a real challenge to accurately communicate information with a fancy graphic.

Jonathan Berkowitz, in Goodness of Wit Test, brings us a variety cryptic in the bar-type style. Also, Jürgen Symanzik submitted a statistically based puzzle. See if you can decode the data and provide an explanation and graphic. Winners will be selected from among the submissions received by January 28.

In other news, CHANCE will add new editors in 2010: Michelle Dunn (National Cancer Institute), Jo Hardin (Pomona College), Yulei He (Harvard Medical School), Jackie Miller (The Ohio State University), and Kary Myers (Los Alamos National Laboratory). Additional editors will help keep reviews of article submissions quick and effective and enable editors to take time to write articles and recruit articles on special topics. The editors will bring fresh perspectives and ideas to CHANCE. I welcome them and thank them for agreeing to serve. Additional information about the new editors is available at www.amstat.org/publications/chance. Of course, I am grateful to the current editors; they deserve ongoing thanks for their good work.

We look forward to your submissions and suggestions. Enjoy the issue!

Mike Larsen
Weldon’s Dice, Automated
Zacariah Labby
Walter Frank Raphael Weldon’s data on 26,306 rolls of 12 dice have been a source of fascination since their publication in Karl Pearson’s seminal paper introducing the χ² goodness-of-fit statistic in 1900. A. W. Kemp and C. D. Kemp also wrote about the historical data in 1991, including methods of analysis beyond Pearson’s goodness-of-fit test. Although modern random number generators have come a long way in terms of periodicity and correlation, there is still a certain cachet in the apparent
randomness of rolling dice, even if this appearance is ill-founded. As Pierre-Simon Laplace said, “The word ‘chance’ then expresses only our ignorance of the causes of the phenomena that we observe.”
Weldon’s Dice Data

In a letter to Francis Galton—dated February 2, 1894—Weldon reported the results of 26,306 rolls of 12 dice, where he considered five or six dots (pips) showing to be a success and all other pip counts as failures. The data were presented in tabular form, with the number of successes per roll tallied as in Table 1.
Table 1—Weldon’s Data on Dice: 26,306 Throws of 12 Dice

Number of Successes   Observed Frequency   Theoretical Frequency, p = 1/3   Deviation
0                     185                  203                              -18
1                     1149                 1216                             -67
2                     3265                 3345                             -80
3                     5475                 5576                             -101
4                     6114                 6273                             -159
5                     5194                 5018                             176
6                     3067                 2927                             140
7                     1331                 1255                             76
8                     403                  392                              11
9                     105                  87                               18
10                    14                   13                               1
11                    4                    1                                3
12                    0                    0                                0
Total                 26,306               26,306
χ² = 32.7

Note: A die was considered a success if five or six pips were showing.

[Photo: Walter F. R. Weldon (1860–1906), an English biologist and biometrician]
Weldon was motivated to collect the data, in part, to “judge whether the differences between a series of group frequencies and a theoretical law, taken as a whole, were or were not more than might be attributed to the chance fluctuations of random sampling.” The simplest assumption about dice as random-number generators is that each face is equally likely, and therefore the event “five or six” will occur with probability 1/3 and the number of successes out of 12 will be distributed according to the binomial distribution. When the data are compared to this “fair binomial” hypothesis using Pearson’s χ² test without any binning, Pearson found a p-value of 0.000016, or “the odds are 62,499 to 1 against such a system of deviations on a random selection.” The modern application of the goodness-of-fit test requires binning such that each theoretical bin has at least approximately four counts, and for the data in Table 1, this results in the bins 10, 11, and 12 being grouped into one “10+” bin. With the appropriate binning, the p-value for the original data becomes 0.00030, a larger but still significant result.
The conclusion is that the dice show a clear bias toward fives and sixes, which Pearson estimated was probably due to the construction of the dice. Most inexpensive dice have hollowed-out pips, and since opposite sides add to seven, the face with six pips is lighter than its opposing face, which has only one pip. While the dice may not follow the fair binomial hypothesis, they still may follow a binomial hypothesis with bias toward fives and sixes. The overall probability of a five or six is estimated as 0.3377 from the data, and Pearson outlines the comparison of the dice data to this alternate theoretical distribution in his illustration II of the 1900 paper. Correcting errors in his arithmetic, χ² = 17.0 for the unbinned data and χ² = 8.20 for the binned data (with binning performed as outlined above). As many university students learn in introductory statistics courses, the estimation of one parameter by maximum likelihood must be ‘repaid’ by dropping one degree of freedom in the goodness-of-fit test, hence the nine degrees of freedom for the “biased binomial” test.
The resulting p-value is 0.51, meaning there is not sufficient evidence to refute the claim that, although biased, the dice still follow the binomial distribution. These two applications of the original dice data have served as examples, introducing the χ² goodness-of-fit statistic to many.
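For readers who want to check these numbers, here is a minimal sketch of the two binned tests using scipy. This is not the author's code, and the exact chi-square value for the fair hypothesis depends on the binning convention, so only the qualitative conclusions should be expected to match.

```python
# Sketch of the two binned goodness-of-fit tests on Weldon's data.
import numpy as np
from scipy import stats

observed = np.array([185, 1149, 3265, 5475, 6114, 5194, 3067,
                     1331, 403, 105, 14, 4, 0])          # successes 0 through 12
k = np.arange(13)

def binned_gof(p, ddof=0):
    expected = observed.sum() * stats.binom.pmf(k, 12, p)
    # Group the sparse 10, 11, and 12 categories into a single "10+" bin.
    obs = np.append(observed[:10], observed[10:].sum())
    exp = np.append(expected[:10], expected[10:].sum())
    return stats.chisquare(obs, exp, ddof=ddof)

print(binned_gof(1/3))               # fair binomial: decisively rejected
print(binned_gof(0.3377, ddof=1))    # biased binomial, one estimated parameter:
                                     # chi-square near 8.2, p near 0.5
```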
Design of Apparatus

While it is possible to repeat Weldon’s experiment by hand, such an endeavor would be dull and prone to error. Here, we will use an automatic process consisting of a physical box that rolls the dice, electronics that control the timing of the dice-rolling, a webcam that captures an image of the dice, and a laptop that coordinates the processes and analyzes the images. The idea behind the dice-rolling is as follows: A thin plate of metal is covered in felt and placed between metal U-channel brackets on opposing faces of a plastic box. Solenoids with return springs (like electronic pinball plungers) are mounted under the four corners of the metal plate and the 12
Figure 1. Apparatus for the rolling of dice. (a) shows the locations of the solenoids in the box without the metal plate in place, along with the return spring plungers in place. (b) is a close-up view of the metal plate in bracket design. (c) shows the apparatus fully assembled in the midst of a rolling sequence.
dice are placed on top of it. The dice are inexpensive, standard white plastic dice with hollowed-out pips and a drop of black paint inside each pip. The
dice have rounded corners and edges. The front panel of the plastic box is removable for easy access to the solenoids and dice. Once assembled, the
Zacariah Labby beside his homemade dice-throwing, pip-counting machine
inside of the box is lined with black felt to suppress reflections from the inside surface of the plastic (Figure 1). The solenoids are controlled through an Arduino USB board, which can be programmed using a computer. The USB board listens on the serial port, and when it receives an appropriate signal, it sends a series of digital on/off pulses to four independent relay switches, which control the solenoids’ access to electricity. When the relay is placed in the on position, current flows to the solenoid, thereby depressing the plunger against the force of the return spring, due to magnetic inductance. When the relay returns to the off position, the plunger is allowed to freely accelerate under the force of the return spring until it hits the metal plate, transferring its momentum to the (limited) motion of the plate and the (unlimited) motion of the dice. The solenoids operate independently, and their power is supplied from a standard DC power supply. If the four solenoids are numbered clockwise one through four, three solenoids at a time are depressed, initially leaving out number four. Then, the solenoids spring back, and 0.25 seconds later,
Figure 2. Image analysis procedure. The raw grayscale image in (a) is thresholded in a consistent manner with the aid of controlled lighting, leading to (b). After the holes in the image have been filled, (c), the edges are identified and dilated to ensure complete separation of touching dice (d). Once the edges have been removed from (c), the centers of the dice, (e), are identified and a unique mask is placed over the center of each die (f). The pips are identified from the subtraction of (c) and (b), and the results are eroded to ensure the pips on a six-face do not bleed together (g). Finally, the number of unique pips beneath each mask is counted and stored in memory. The results are displayed for the user to monitor (h).
three solenoids are depressed, leaving out number three. The pattern continues, leaving out two, then leaving out one, then repeating the entire pattern. When only one solenoid is left unaltered at a time, the metal plate tilts away from that solenoid, and the dice roll on the plate. When the other three solenoids return to their standard positions, the dice pop into the air. After the dice have come to rest for a few seconds, a laptop connected to a webcam captures a grayscale image of the dice. The acquisition and processing of the image, along with the entire automation process, is controlled from the computer. The lighting in the experiment room is carefully controlled using radiography light boxes (for viewing X-ray films), which provide uniform and diffuse lighting to reduce glare on the surfaces of the dice. Because the lighting is carefully controlled, a simple threshold can be applied to the grayscale image, resulting in an array of ones and zeros. The ‘holes’ in the image, which represent the pips of the dice, are filled in, and the resulting image shows the dice as white squares on a black background. When two dice are in close contact, the thresholding process often fails to completely separate the dice and an edge-finding algorithm is used to find any transition regions in the original image. The edges are dilated (inflated) to ensure the shared border between any two touching dice is entirely encompassed within the detected edges, and any regions labeled as edges are subsequently removed from the thresholded image. This leaves the center of each die as an independent region. The uniquely connected components are identified and a mask is placed at the centroid of each to serve as a search space for pips. Pips are identified from the original thresholded image, again through the identification of uniquely connected components. These initial components are eroded (i.e., a few pixels along the identified perimeters are removed) to prevent any pips from ‘bleeding’ together, which is common with dice showing six pips. The number of pips under each die mask is counted and the results are both stored in an array in the computer and displayed to the user to ensure proper operation. The image analysis steps are shown in Figure 2.
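The chain in Figure 2 can be sketched with standard OpenCV operations. The code below is only an illustration under assumed settings: the threshold level, kernel sizes, and mask half-width are guesses, not the values tuned for the actual apparatus.

```python
# Illustrative OpenCV version of the pipeline in Figure 2 (assumed parameters).
import cv2
import numpy as np

def count_pips(gray):
    # (b) threshold the controlled-lighting image: white dice on black felt
    _, bw = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)
    # (c) fill the dark pip "holes" so each die becomes a solid white square
    filled = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, np.ones((15, 15), np.uint8))
    # (d) find and dilate edges so touching dice are completely separated
    edges = cv2.dilate(cv2.Canny(gray, 50, 150), np.ones((5, 5), np.uint8))
    # (e) remove the edge regions, leaving an independent center per die
    centers = cv2.subtract(filled, edges)
    n_die, _, _, die_centroids = cv2.connectedComponentsWithStats(centers)
    # (g) pips = filled minus thresholded image, eroded so sixes do not bleed
    pips = cv2.erode(cv2.subtract(filled, bw), np.ones((3, 3), np.uint8))
    _, pip_labels = cv2.connectedComponents(pips)
    # (f), (h) place a mask at each die center and count the unique pips under it
    half = 30                                   # mask half-width in pixels
    counts = []
    for cx, cy in die_centroids[1:n_die]:       # label 0 is the background
        y0, x0 = max(0, int(cy) - half), max(0, int(cx) - half)
        window = pip_labels[y0:int(cy) + half, x0:int(cx) + half]
        counts.append(len(set(np.unique(window)) - {0}))
    if len(counts) != 12:                       # error image: handled manually
        raise ValueError("did not find exactly 12 dice")
    return counts
```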
Figure 3. Images leading to errors during analysis. In (a), a die has landed perfectly on one corner. In (b), one die has landed atop another, leading to the software only identifying 11 dice. Finally, in (c), one die has come to rest against the felt-covered wall of the apparatus, leading to an improper lighting situation identified as an error.
Figure 4. Lack of correlation between dice-rolling iterations. (a) shows the autocorrelation of the sequence of the number of successes per roll (central portion). The only point above the noise floor is at zero lag. (b) shows the corresponding power spectrum, which has the characteristic appearance of white noise. (Horizontal axes: gap between sequence positions in (a); frequency, the inverse of positions between rolls, in (b).)
There are numerous opportunities for error in this sequence, and error-catching steps are taken during the image analysis. For instance, if any uniquely labeled die is not within a tight size range—as is possible if two dice are perfectly flush and not separated—the image is considered to be an error. If 12 dice are not found in the image, which happens when one die lands on top of
another during the tossing sequence, the image is an error. If any die does not land on a face, but rather on an edge or corner, the lighting is such that the whole die will not be found and the image is an error. Also, whenever any pip is too oblong, as is the case when pips bleed together, the image is an error. A few images that lead to errors are shown in Figure 3.
Any time an error occurs (approximately 4% of the time), the image is saved externally and, when possible, the numbers on the 12 dice are entered manually. Unfortunately, some images are impossible to count manually (see Figure 3a). Those that are possible to count manually have a bias toward showing a large number of sixes, as the pips on sixes can bleed together, and therefore
Table 2—Current Data on Dice: 26,306 Throws of 12 Dice

Number of Successes   Observed Frequency   Theoretical Frequency, p = 1/3   Theoretical Frequency, p = 0.3343
0                     216                  203                              199
1                     1194                 1216                             1201
2                     3292                 3345                             3316
3                     5624                 5576                             5551
4                     6186                 6273                             6272
5                     5047                 5018                             5040
6                     2953                 2927                             2953
7                     1288                 1255                             1271
8                     406                  392                              399
9                     85                   87                               89
10                    13                   13                               13
11                    2                    1                                1
12                    0                    0                                0
Total                 26,306               χ² = 5.62                        χ² = 4.32

Note: A die was considered a success if five or six pips were showing.
ignoring error images would possibly lead to bias in the results. With the entire rolling-imaging process repeating every 20 seconds, there are just more than 150 error images to process manually each complete day of operation. At the previously mentioned rate of operation, Weldon’s experiment can be repeated in a little more than six full days.
Results

After all 26,306 runs were completed, all error images were processed to remove potential bias from the following analysis. (There were 27 ‘uncountable’ error images.) Initially, a “success” will hold the same meaning as it did for Weldon: a five or six showing on the up-face of a die. To assess any correlation (or inadequacy) in the dice-rolling procedure, it is useful to look at the autocorrelation of the sequence of successes per iteration. If a high number of successes in one iteration leads, for example, to a correlated number of successes in the next iteration or the iteration following, this will appear as an identifiable peak in the sequence autocorrelation. However, if the sequence is largely uncorrelated, the only identifiable peak in the autocorrelation will be at zero lag between iterations, and the Fourier transform of the autocorrelation will be a uniform (white noise) spectrum. These are precisely the results seen in Figure 4, leading to the conclusion that the number of successes per roll forms an uncorrelated sequence.

The distribution of successes per iteration is shown in Table 2, where it can be seen that the χ² values are not large enough to reliably reject either the fair or biased binomial hypotheses.
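A correlation check of this kind takes only a few lines. The sketch below uses simulated per-roll counts as a stand-in, since the raw sequence is not reproduced here.

```python
# Sketch of the correlation check shown in Figure 4, on simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(12, 1/3, size=26306).astype(float)   # successes per roll
x -= x.mean()

spectrum = np.abs(np.fft.rfft(x)) ** 2          # power spectrum (Figure 4b)
acf = np.fft.irfft(spectrum, n=x.size)          # circular autocorrelation (4a)
acf /= acf[0]                                   # normalize to 1 at zero lag

# For an uncorrelated sequence the only peak in acf is at zero lag and the
# spectrum is flat white noise, which is what the actual machine produced.
print(acf[:5])
```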
Figure 5. Probability of observing each number of pips out of 12 times 26,306 total rolls. The error bars are 95% confidence intervals according to binomial sampling, where σ² = p(1 – p)/315,672, and the dashed line shows the fair probability of 1/6 for each face.
Binning is performed on the data in Table 2 as outlined under “Weldon’s Dice Data.” The overall probability of a five or six showing is estimated to be 0.3343. From these results, the dice seem to be in accordance with the fair binomial hypothesis, unlike Weldon’s dice. This is as far as Weldon (or Pearson) could have gone with the original 1894 data, but this is by no means the end of the story. Besides the automation, which is a time-saving step, the unique aspect of this experiment is that the individual number of pips on each die is recorded with each iteration, and not just whether the die was a success or failure. This allows a much deeper analysis of the data. For instance, instead of jointly analyzing fives and sixes as a success, we find some interesting results if a success is considered to be only one face. First, the probabilities for the individual faces are estimated to be Pr(1) = 0.1686, Pr(2) = 0.1651, Pr(3) = 0.1662, Pr(4) = 0.1658, Pr(5) = 0.1655, and Pr(6) = 0.1688,
whereas the fair hypothesis would indicate that each face should have probability Pr(i) = 1/6 ≈ 0.1667. These probabilities, along with their uncertainties in a binomial model, are shown in Figure 5. The tossing apparatus does not seem to alter the probabilities of the individual die faces over time. Comparing the first and final thirds of dice rolls, and adjusting for multiple comparisons, reveals that none of the face probabilities significantly changed over time. When comparing the observed number of counts for each pip face with the expected fair value (12 times 26,306/6 = 52,612) in a χ² test, the resulting χ² = 25.0 and p = 0.00014 leave little doubt that the dice results are biased. If the dice were biased in the manner Pearson assumed—namely, due to pip-weight imbalance—we would expect the probabilities of the individual faces to follow a linear trend, with deviations from 1/6 proportional to -5, -3, -1, 1, 3, 5 for faces one through six, respectively. Fitting and testing for this pattern yields a p-value of
0.00005, allowing reliable rejection of the pip-weight-trend hypothesis.
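The face-frequency test can be approximately reconstructed from the rounded probabilities quoted above, so the chi-square value below lands near, but not exactly on, the reported 25.0.

```python
# Approximate reconstruction of the face-frequency test from rounded values
# (12 x 26,306 = 315,672 individual dice).
import numpy as np
from scipy import stats

total = 12 * 26306
p_hat = np.array([0.1686, 0.1651, 0.1662, 0.1658, 0.1655, 0.1688])
observed = p_hat * total                        # approximate face counts
expected = np.full(6, total / 6)                # 52,612 per face if fair

print(stats.chisquare(observed, expected))      # article: chi-square = 25.0
# Half-width of the 95% binomial error bars in Figure 5, about +/- 0.0013:
print(1.96 * np.sqrt((1 / 6) * (5 / 6) / total))
```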
Discussion and Conclusion

One interesting point from the data obtained here is the revelation of a non sequitur by Pearson in his biography of Galton. In a footnote, he writes:

“Ordinary dice do not follow the rules usually laid down for them in treatises on probability, because the pips are cut out on the faces, and the fives and sixes are thus more frequent than aces or deuces. This point was demonstrated by W. F. R. Weldon in 25,000 throws of 12 ordinary dice. Galton had true cubes of hard ebony made as accurate dice, and these still exist in the Galtoniana.”

Weldon’s dice were most likely made from wood, ivory, or bone with carved pips, but that fives and sixes jointly occurred more often than one would expect under the fair hypothesis
Figure 6. Measurement of axis length, in millimeters, for all 12 dice on all three axes. (The main axes of a standard die are the 1–6 axis, the 2–5 axis, and the 3–4 axis.) The 1–6 axis is consistently and significantly shorter than the other two axes.
does not automatically imply the cause Pearson suggests. The new data presented here show that, even though fives and sixes jointly appear slightly more often than would be expected under the fair hypothesis, fives and sixes do not both have an individual probability larger than 1/6. The number of throws needed to observe these probability departures from fair is high and a testament to Weldon’s perseverance. In Weldon’s original data, the observed probability of a five or six was 33.77%, and at least 100,073 throws (or 8,340 throws of 12 dice) are needed to detect this departure from fair with 90% power at the 5% significance level. Here, the probability of throwing a six was determined to be 16.88%. At 90% power and the 5% significance level, 270,939 throws (or 22,579 throws of 12 dice) were needed to detect a departure as extreme. The estimated probabilities for the six faces seen in “Results” might be explained by a mold for the plastic dice that is not perfectly cubic, with the one- and six-pip faces slightly larger than the faces with two and five pips. To further investigate this possibility, the dimensions of each of the axes of all 12 dice (i.e., the 1–6 axis, the 2–5 axis, and the 3–4 axis) were measured with an
accurate digital micrometer. The results are shown in Figure 6, where it is seen that the 1–6 axis is consistently shorter than the other two, thereby supporting the hypothesis that the faces with one and six pips are larger than the other faces. A two-way ANOVA model (axis length modeled on axis number and die number) adjusted for multiple comparisons also showed that the 1–6 axis was significantly shorter than both of the other axes (by around 0.2%). Pearson’s suggestion for the cause of biased dice would also indicate that if a die were considered a success when four, five, or six pips were showing, that event should have a measurably higher probability than the complementary event (one, two, or three pips showing). However, the data obtained here indicate almost perfect balance, with Pr(4, 5, or 6) = 0.5001. Perhaps with further investigation, Pearson may have unearthed evidence to support his claim that the pip-weight imbalance led to Weldon’s data, but current observations suggest minor imperfections in the individual cubes may overshadow any effect due to carved-out pips. Interestingly, dice used in casinos have flush faces, where the pips are filled in with a plastic of the same density as the surrounding material and are precisely balanced. It would be
reasonable to assume these dice would produce results in accordance with all fair hypotheses.
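The sample-size claims above can be roughly checked with the normal-approximation formula for a single proportion. The article does not state its exact convention (one- versus two-sided), so the one-sided version below lands near, but not exactly on, the quoted counts.

```python
# Rough check of the throws-needed figures using a one-sided normal
# approximation for a single proportion (convention assumed, not stated).
from scipy.stats import norm

def throws_needed(p1, p0, power=0.90, alpha=0.05):
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    return ((z_a * (p0 * (1 - p0)) ** 0.5 + z_b * (p1 * (1 - p1)) ** 0.5)
            / (p1 - p0)) ** 2

print(throws_needed(0.3377, 1 / 3))   # ~1.0e5 single-die throws (article: 100,073)
print(throws_needed(0.1688, 1 / 6))   # ~2.6e5 single-die throws (article: 270,939)
```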
Further Reading

Iversen, Gudmund R., Willard H. Longcor, Frederick Mosteller, John P. Gilbert, and Cleo Youtz. 1971. Bias and runs in dice throwing and recording: A few million throws. Psychometrika 36(1):1–19.

Kemp, A.W., and C.D. Kemp. 1991. Weldon’s dice data revisited. The American Statistician 45(3):216–222.

Nagler, J., and P. Richter. 2008. How random is dice tossing? Physical Review E 78(3):036207.

Pearson, Karl. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine 5(50):157–175.

van der Heijdt, L. 2003. Face to face with dice: 5,000 years of dice and dicing. Groningen, The Netherlands: Gopher Publishers.

Weldon’s Dice, Automated, www.youtube.com/watch?v=95EErdouO2w.
Medical Applications of EEG Wave Classification
Wanli Min and Gang Luo
Medical Background
Did you know your brain continuously emits electric waves, even while you sleep? Based on a sample of wave measurements, physicians specializing in sleep medicine can use statistical tools to classify your sleep pattern as normal or problematic. Brain-computer interfaces (BCIs) now being developed can classify a disabled person’s thinking based on wave measurements and automatically execute necessary instructions. This type of research is exciting, but conducting it requires knowledge of medicine, biology, statistics, physics, and computer science.

Electroencephalogram (EEG) is the recording of electrical activity through electrode sensors placed on the scalp. The electricity is recorded as waves that can be classified as normal or abnormal. Measuring EEG signals is not an intrusive procedure; it causes no pain and has been used routinely for several decades. Different types of normal waves can indicate various states or activity levels. Abnormal waves can indicate medical problems. Two important applications of EEG wave classification are diagnosis of sleep disorders and construction of BCIs to assist disabled people with daily living tasks.
Sleep, which takes up roughly one-third of a person’s life, is indispensable for health and well-being. Nevertheless, one-third of Americans suffer from a sleep problem. For example, one in five American adults has some degree of sleep apnea, which is a disorder characterized by pauses of 10 seconds or longer in breathing during sleep. A person with sleep apnea cannot self-diagnose the disorder. To make diagnoses for sleep disorders, physicians usually need to record patients’ sleep patterns. A typical sleep recording has multiple channels of EEG waves coming from the electrodes placed on the subject’s head. Sample sleep recordings are shown in Figure 1. In the left panel, the waves from a healthy subject are stable at about zero and show relatively high variability and low correlation. In the right panel, the waves from a person with sleep difficulty show less variability and higher correlation.

Sleep staging is the pattern recognition task of classifying the recordings into sleep stages continuously over time; the task is performed by a sleep stager. The stages include rapid-eye movement (REM) sleep, four levels of non-REM sleep, and being awake. Sleep staging is crucial for the diagnosis and treatment of sleep disorders. It also relates closely to the study of brain function. In an intensive care unit, for example, EEG wave classification is used to continuously monitor patients’ brain activities. For newborn infants at risk of developmental disabilities, sleep staging is used to assess brain maturation. Many other applications adapt the EEG wave classification techniques originally developed for sleep staging to their purposes. Besides being used to study human activities, sleep staging also has been used to study avian song systems and evolutionary theories about mammalian sleep.

To make many EEG-based applications practical enough for routine use, it is necessary for the wave classification to be accurate. The more accurately sleep stages are classified, the faster patterns can be recognized. Because different sleep disorders have different sleep stage patterns, more accurate sleep stage classification allows physicians specializing in such disorders to diagnose problems better and faster. In fact, such specialists often spend several years in residency programs for special training in recognizing sleep patterns before obtaining board certification in sleep medicine. Expediting sleep disorder diagnoses also can help reduce the costs, which have surged in recent years, of treating sleep problems.
Statistical Analysis of Sleep EEG Data
Figure 1. EEG signals from a healthy person and a person with sleep difficulty (horizontal axis: time in 10-ms samples)
Figure 2. Logarithmic power spectral density of EEG signals from a healthy person and a person with sleep difficulty
Among popular statistical methods for performing sleep staging are autoregression, Kullback-Leibler divergence-based nearest-neighbor classification, and statistical modeling using a hidden Markov model (HMM). These methods typically consist of two steps: signal processing, which extracts useful feature variables from EEG signals at each time stamp of the EEG recording, and statistical classification based on the extracted features.
Although some researchers have used nonlinear characteristics (e.g., fractal dimensionality), the prevailing technique for EEG signal processing is spectral analysis. Spectral analysis, which converts the original time series to the frequency domain, is a natural choice for EEG signal processing because EEG signals are often described by α, β, θ, and δ waves, whose
Figure 3. Stage-specific average logarithmic power spectral density of EEG signals from a healthy person
frequency ranges are 8–12 Hz, 12–30 Hz, 4–7 Hz, and 0–3 Hz, respectively. As shown in figures 2 and 3, the frequency content of EEG signals is characterized by power spectral densities (PSDs). In Figure 2, the log PSD plot is substantially higher, especially at higher frequencies, for the subject with sleep difficulty. Figure 3 shows that one major difference between sleep stages in a healthy subject occurs at high frequencies, where the curve representing the wake stage is much higher than those representing the other stages.

During signal processing, a spectral density estimator is typically applied to each epoch—a time window of fixed length—of the raw EEG data. To reduce variance, adjacent epochs have overlapping segments. The size of the epoch depends largely on the sampling rate (number of measurements per second) of the EEG signal. On the one hand, each epoch needs to contain enough raw signals for any sampling-based spectral density estimation method to work well. On the other, the time window for each epoch cannot be too wide or classification will become difficult, as information from later sleep stages is mixed with information from the current stage in the extracted feature variables. In testing on a subject, for example, our sleep stager can achieve a classification accuracy of 80% when three-second epochs are used, but accuracy drops to 49% when 10-second epochs are used. Thus, the tradeoff needs to be evaluated statistically. PSD is usually estimated through a periodogram using the fast Fourier transform. It is known that the raw periodogram estimator is biased for a process with continuous spectrum. To address this problem, we use kernel-type methods, including Parzen window, Bartlett window, and multi-taper.

In general, a sleep stager needs enough training data to achieve good classification accuracy. Training data includes both EEG signals and the corresponding sleep stages, which require time-consuming manual labeling. To obtain sufficient training data on a subject, a specially trained technician needs to spend several days, or weeks, on the labeling process.
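The epoch-based spectral feature extraction described above can be sketched as follows. Everything concrete here is an assumption for illustration: the 100 Hz sampling rate (matching the 10-ms samples of Figure 1), the 50% epoch overlap, the synthetic signal, and the use of Welch's averaged periodogram in place of the Parzen, Bartlett, or multi-taper estimators.

```python
# Hedged sketch of epoch-based spectral feature extraction for sleep staging.
import numpy as np
from scipy import signal

fs = 100                                   # samples per second (assumed)
epoch_len = 3 * fs                         # three-second epochs
step = epoch_len // 2                      # adjacent epochs overlap by half
rng = np.random.default_rng(0)
eeg = rng.standard_normal(60 * fs)         # stand-in for one EEG channel

bands = {"delta": (0, 3), "theta": (4, 7), "alpha": (8, 12), "beta": (12, 30)}

features = []
for start in range(0, eeg.size - epoch_len + 1, step):
    freqs, psd = signal.welch(eeg[start:start + epoch_len], fs=fs, nperseg=fs)
    log_psd = np.log(psd + 1e-12)
    # One feature per clinical band: the mean log PSD inside that band.
    features.append([log_psd[(freqs >= lo) & (freqs <= hi)].mean()
                     for lo, hi in bands.values()])
features = np.asarray(features)            # one row per epoch, one column per band
```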
As a result, there is usually sufficient training data, D_old, on several old subjects, s_old, but limited training data, D_new, on a new subject, s_new. For the sleep stager, extracting feature variables and classifying stages are equally important. If either is not done well, stager performance will deteriorate.

Before performing EEG wave classification, it is helpful to quickly assess the quality of extracted feature variables. This assessment does not require correlation of feature variables extracted at different time stamps. Instead, it can treat these feature variable vectors as independent because the major patterns of sleep stages are described by frequency components without referring to time correlation. For this purpose, it would be sufficient to perform a straightforward discriminant analysis by redefining sleep stages to a simple structure. For instance, with one non-REM and one REM sleep stage, a reasonable classification accuracy such as 70% would suggest that the extracted feature variables have good quality. If the extracted feature variables cannot be classified reasonably accurately in the discriminant analysis, we should not continue to pursue the corresponding feature variable extraction method.

Another issue that needs to be considered is the number of frequency bands used (i.e., the dimensionality of the feature variable vectors). If the amount of training data is extremely large, using more frequency bands usually leads to better discriminating power. However, the amount of training data is often limited, so this property no longer holds. Instead, it is desirable to confine the feature vector to the subset of frequency bands that have the most significant discriminating power, as bands with low discriminating power can interfere with parameter estimation of the statistical model.

Once feature variables are extracted, there are many approaches to classifying EEG waves. Among them, the HMM and its variants are widely used.
Figure 4. Matrix used in the P300 brain-computer interface
Figure 5. Seven characters are intensified simultaneously. One row of characters is intensified in (a), and one column of characters is intensified in (b).
HMM-style methods take into account an important aspect of sleep staging: the serial correlation of sleep stages across time. In contrast, discriminant analysis treats feature variable vectors with different time stamps as completely independent. Although discriminant analysis captures the main effect, it misses the obviously important secondary effect of correlation across time, so it cannot achieve satisfactory classification accuracy. For example, it is unlikely that a transition from deep sleep (the third or fourth level of non-REM sleep) to REM sleep occurs at two consecutive time stamps. HMM-style methods consider the serial correlation of sleep stages across time and model it through the transition probabilities of the hidden Markov chain. Nevertheless, they make unnecessary assumptions about the distribution of the feature variable vectors (typically, that they follow a Gaussian distribution, which can be far from true). Unlike HMM, a linear-chain conditional random field (CRF) directly models the probabilities of possible sleep stage sequences given an observed sequence of feature variable
vectors, without making unnecessary independence assumptions about the observed vectors. Consequently, CRF overcomes the shortcoming of HMM, that it cannot represent multiple interacting features or long-range dependencies among the vectors observed. According to our experiments, the linear-chain CRF method performs much better than the HMM method for human sleep staging, improving average sleep-stager classification accuracy by about 8%. Similar results hold for bird sleep staging.
Brain-Computer Interfaces

The BCIs now being developed will facilitate the control of computers by people who are disabled. As they think about what they want the computer to do, their thinking will be classified based on their EEG waves and the computer will automatically execute the corresponding instructions. Accurate EEG wave classification is critical for computers to issue the correct instructions.
Figure 6. A Latin square of order seven
Figure 7. Seven characters are intensified simultaneously according to symbol ones (1s) in the Latin square in Figure 6.
Among the various types of BCIs, the P300 BCI using EEG signals is one of the most promising because it is noninvasive, easy to use, and portable. Additionally, the set-up cost is low. P300 refers to a neurally evoked potential component of EEG. The current P300 BCI communicates one symbol at a time and works as shown in figures 4 and 5. A matrix of characters or pictures is displayed on the computer screen. To communicate a desired character, the user focuses on the matrix cell containing it and counts the number of times it is intensified when a predetermined number of intensification rounds are performed. In each round, all the rows and columns of the matrix are intensified once in a random order—one row or column at a time. The row and column containing the desired character form the rare set (the target), and the other rows and columns form the frequent set (the nontargets). If the user is attending to the desired character, intensification of the target row or column should elicit a P300 response because it is a rare event in the context of all the other row or column intensifications. By detecting the P300 responses from the recorded EEG signals of the user, we can classify the target row and column whose intersection cell contains the classified character the user intends to communicate.

Experimental design is the term describing how characters are arranged and intensified. To maximize both the classification accuracy and communication speed of the P300 BCI
system, an appropriate experimental design is needed to obtain strong P300 responses. Nevertheless, the existing experimental design is nonoptimal due to an undesirable effect caused by neighboring characters. Amyotrophic lateral sclerosis (ALS) patients, one of the most important user groups of BCI, have eye movement problems. When neighboring characters in a row or a column are intensified simultaneously, an ALS patient’s attention can be distracted from the desired character, weakening the P300 response and reducing classification accuracy. To minimize this interference, it is better to intensify non-neighboring characters simultaneously. The larger the distances between simultaneously intensified characters, the less interference the ALS patient will receive. One approach to intensifying non-neighboring characters simultaneously is to use the mathematical structure of the Latin square. A Latin square of order n is an n×n matrix based on a set of n symbols, so that each row and column contains each symbol exactly once. Without loss of generality, the symbols are assumed to be 1, 2, …, and n. Figure 6 shows an example of a Latin square of order seven. If we intensify characters according to a Latin square (Figure 7), the simultaneously intensified characters will not be direct neighbors either horizontally or vertically. To ensure
EEG Classification
Mark L. Scheuer, Chief Medical Officer, Persyst Development Corporation
With respect to sleep, an improvement in diagnostic classification would result in some improvement in diagnostic accuracy, but improvement in this particular area would be uncertain and warrant study in actual use. The area in which it would help is in reducing the time physicians and high-level medical technologists require to establish a clear classification of the problem based on the EEG/ polysomnography data. This would improve efficiency in a costly endeavor. There are other areas in which improved classification would be useful, among them intensive care unit monitoring of brain activity and a host of other applications in which EEG signals provide a wealth of physiological information about a person (awake or sleep, sleeping well or not, brain functioning properly or not, a potential stroke in progress, a seizure occurring, a person with diabetes having lapsed into an obtunded state due to blood sugar problems, etc.). The EEG provides a direct window on brain function, one with high time resolution, great versatility, and reasonable spatial resolution. Probably the main problem with EEG classification and interpretation is that the signal is complex and noisy. The noise, itself, is complex and easily confused with the actual cerebral signal. Humans who can read EEG well have generally spent many months to years learning how to do so.
the desired character can be uniquely determined within each round of intensification, we can resort to the concept of orthogonal Latin squares. Intuitively, Latin squares L1 and L2 of the same order n are orthogonal if the cells in L1 containing the same symbol can be regarded as a conceptual row, the cells in L2 containing the same symbol can be regarded as a conceptual column, and each conceptual row and column has a unique intersection cell. For an n×n character matrix M, the following new experimental design can ensure unique character determination by mapping M to the superposition of L1 on L2. Whenever the experimental design intensifies the hth (1 ≤ h ≤n) row of M, the new experimental design intensifies the characters in M corresponding to the hth conceptual row in L1. Whenever the experimental design intensifies the kth (1 ≤ k ≤ n) column of M, the new experimental design intensifies the characters in M corresponding to the kth conceptual column in L2. By detecting the P300 responses from the recorded EEG signals of the user, we can classify the target conceptual row and column whose unique intersection cell contains the classified character the user intends to communicate. If we expand nonsquare matrices into square matrices by adding dummy rows or columns, this method also works for nonsquare character matrices.
In general, given a positive integer n, we can obtain many pairs of orthogonal Latin squares of order n. The pair of orthogonal Latin squares used to communicate a character can vary from one character to another through random selection. This provides much flexibility and makes the character intensification pattern more unexpected by the user. As mentioned by Eric W. Sellers and colleagues in their Biological Psychology article, “A P300 Event-Related Potential Brain-Computer Interface (BCI): The Effects of Matrix Size and Interstimulus Interval on Performance,” such unexpectedness can lead to stronger P300 responses and improve classification accuracy. When choosing Latin squares, we can impose various distance constraints. One is that, in the Latin square, the distance between any pair of cells containing the same symbol is no smaller than a predetermined threshold t. Using that constraint, we can ensure that, at any time, the distance between any two simultaneously intensified characters is no smaller than t, which can reduce interference for ALS patients, lead to stronger P300 responses, and improve classification accuracy. We emphasize that the orthogonality of Latin squares is a desired, but not mandatory, property. Moreover, the pair of Latin squares used can vary more frequently (e.g., from one round of intensification to another), even within the communication process of the same character, because multiple rounds of intensification are performed to communicate a character. Within each round of intensification, a separate score reflecting the likelihood of being the desired character can be computed for each character in the matrix according to some classification algorithm. The final classification is performed by combining the scores in all rounds (i.e., using an aggregation or voting schema). If the two Latin squares used are nonorthogonal, we may not be able to uniquely determine the desired character in a single round of intensification. Nevertheless, since the pair used varies from one round of intensification to another, the combination of the scores of all the rounds can uniquely determine the desired character if all the Latin squares are chosen appropriately.
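As an illustration (not the authors' implementation), the cyclic construction below builds two orthogonal Latin squares of order seven and turns each symbol into an intensification group. It also checks the two properties used above: that a detected conceptual row and column pin down a unique cell, and that simultaneously intensified cells are never horizontal or vertical neighbors.

```python
# Two orthogonal Latin squares of order 7 via the cyclic construction (n prime).
import numpy as np

n = 7
L1 = np.add.outer(np.arange(n), np.arange(n)) % n         # L1[i, j] = (i + j) mod 7
L2 = np.add.outer(2 * np.arange(n), np.arange(n)) % n     # L2[i, j] = (2i + j) mod 7

def groups(square):
    """Cells sharing a symbol form one simultaneously intensified group."""
    return {s: list(zip(*np.where(square == s))) for s in range(n)}

row_groups, col_groups = groups(L1), groups(L2)           # conceptual rows/columns

# Orthogonality: each (L1 symbol, L2 symbol) pair occurs in exactly one cell,
# so a detected P300 "row" and "column" identify a unique character.
assert len({(L1[i, j], L2[i, j]) for i in range(n) for j in range(n)}) == n * n

# No two simultaneously intensified cells are horizontal or vertical neighbors.
for cells in list(row_groups.values()) + list(col_groups.values()):
    for r1, c1 in cells:
        for r2, c2 in cells:
            assert (r1, c1) == (r2, c2) or abs(r1 - r2) + abs(c1 - c2) > 1
```

A stronger distance constraint of the kind discussed above would simply raise the threshold in the final check when selecting the Latin squares.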
Future Directions for EEG Wave Classification Research

EEG wave classification research appears to be going in the following three promising directions:

1. Subject adaptation for sleep staging. When there is extensive training data, D_old, on several old subjects, s_old, but limited training data, D_new, on a new subject, s_new, it is not desirable to train the parameter vector Θ of the classifier using only D_new. Instead, subject adaptation needs to be performed to improve classification accuracy on s_new. The high-level idea of subject adaptation is to use the knowledge on Θ that is learned from D_old to obtain a prior distribution of Θ. Using D_new and Bayes’ theorem, a posterior distribution of Θ can be computed to obtain a regulated estimate of Θ. In this way, even without any D_new, classification accuracy on s_new will be relatively acceptable. Moreover, accuracy increases with increases in D_new.

Because subject adaptation for sleep staging was proposed only recently, a few issues remain open. To ensure robustness, subject adaptation requires training data from many subjects with a wide variety of characteristics. This requires building a large, integrated, publicly available EEG sleep recording database.
Currently, such databases are owned by individual institutions and will not be released to the general public for at least several more years. One contains data from 6,400 subjects in the Sleep Heart Health Study. Moreover, different subjects (e.g., newborn babies vs. older people, healthy people vs. people with mental disorders) have different characteristics. When classifying sleep stages for a new subject, s_new, it is undesirable to train the sleep stager using data from subjects whose characteristics are dramatically different. However, training the stager using data from subjects that have similar characteristics requires categorizing all the subjects in the database into multiple clusters and having a mechanism to find clusters that match s_new. The stager should be trained on either the data from the matching clusters or on all the data in the database. In the latter case, discounts or corrections are applied to data from the nonmatching clusters.

2. Automatic identification of candidates with possible sleep disorders. A natural way to diagnose sleep disorders is to first use a sleep stager to classify patients’ sleep stages and then let physicians check the patterns. However, this might not be the only way of using stagers to diagnose sleep disorders. Suppose a sleep stager is trained using training data from healthy people and then asked to classify sleep stages for a new person. If the classification accuracy is low, we would suspect this new person of having different characteristics from healthy people and possibly a sleep disorder. It would be interesting to investigate how well this conjecture matches reality.

3. Language modeling for brain-computer interface. Besides diagnosing sleep disorders, EEG wave classification is useful for BCIs. In BCI, a person can think about a sentence one letter at a time. Individual letters are recognized by classifying the person’s EEG wave. Then, all the recognized letters are concatenated into a sentence that is automatically entered into the computer. The most straightforward way to implement this is to treat each letter as a state while using a method similar to the sleep staging method to classify EEG waves into individual letters.
But this is not the best approach. The sentence the person thinks of is natural language, with its own frequency patterns and characteristics. For instance, some letters occur more often than others (e.g., e vs. z). Some letters are more likely to follow a specific letter than other letters. Certain word pairs are invalid in English. Some words are more likely to follow a specific word than other words. All such information can be used to make EEG wave classification more accurate.

One way to capture the information provided by the structure of the natural language is to use the language modeling method, which has been widely adopted in speech recognition, information retrieval, and machine translation. A language model is a way of assigning probability measures over strings drawn from a language. Each string W has the prior probability P(W). Using the person’s EEG waves and the Bayesian framework P(W|X) ∝ P(X|W)P(W), where X represents the observed sequence of EEG feature variable vectors, the posterior distribution for all possible strings can be computed and used to obtain the most likely string. According to past experience in speech recognition, using the language modeling method can significantly improve classification accuracy. The literature already contains models for many natural languages related to speech recognition, information retrieval, and machine translation. However, to use the language modeling method to support EEG-based BCI, a large, integrated EEG database with enough training data needs to be built. Current EEG databases for BCI are too small.
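A toy sketch of the idea, with made-up numbers standing in for a real language model and a real EEG classifier, combines the two probabilities and decodes the most likely string by dynamic programming:

```python
# Toy illustration of P(W|X) proportional to P(X|W)P(W) with a bigram prior.
import numpy as np

letters = list("abcde")                          # toy alphabet
V = len(letters)
rng = np.random.default_rng(0)
bigram = rng.dirichlet(np.ones(V), size=V)       # P(letter_t | letter_{t-1})
likelihood = rng.dirichlet(np.ones(V), size=4)   # P(x_t | letter), 4 letters typed

def map_string(likelihood, bigram):
    T = likelihood.shape[0]
    logp = np.log(likelihood[0]) - np.log(V)     # uniform start prior
    back = np.zeros((T, V), dtype=int)
    for t in range(1, T):
        scores = logp[:, None] + np.log(bigram) + np.log(likelihood[t])[None, :]
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0)
    path = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):                # trace the best path backward
        path.append(int(back[t, path[-1]]))
    return "".join(letters[i] for i in reversed(path))

print(map_string(likelihood, bigram))
```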
Further Reading

Lafferty, J.D., A. McCallum, and F.C. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth international conference on machine learning. Williamstown.

Luo, G., and W. Min. 2007. Subject-adaptive real-time sleep stage classification based on conditional random field. In Proceedings of the American Medical Informatics Association annual symposium, 488–492. Chicago.

Okamoto, K., S. Hirai, M. Amari, T. Iizuka, M. Watanabe, N. Murakami, and M. Takatama. 1993. Oculomotor nuclear pathology in amyotrophic lateral sclerosis. Acta Neuropathologica 85(5):458–462.

Priestley, M.B. 1981. Spectral analysis and time series. New York: Academic Press.

Rabiner, L.R., and B.H. Juang. 1993. Fundamentals of speech recognition. Englewood Cliffs: Prentice Hall.

Sellers, E.W., D.J. Krusienski, D.J. McFarland, T.M. Vaughan, and J.R. Wolpaw. 2006. A P300 event-related potential brain-computer interface (BCI): The effects of matrix size and interstimulus interval on performance. Biological Psychology 73(3):242–252.

Thomson, D.J. 1982. Spectrum estimation and harmonic analysis. Proceedings of the IEEE 70(9):1055–1096.

Zhang, L., J. Samet, B. Caffo, and N.M. Punjabi. 2006. Cigarette smoking and nocturnal sleep architecture. American Journal of Epidemiology 164(6):529–537.
Missing Well: Optimal Targeting of Soccer Shots
Fredrick E. Vars
“Shoot for a post,” a youth soccer coach once instructed me. It seemed like good advice at the time—the goalkeeper was in the middle and I didn’t want to kick it to him—but a moment’s reflection shows that my coach was wrong. If the ball went exactly where I aimed every time, I’d have often hit the goal post, but scored few goals. In fact, I was (as all decent players are) about equally likely to miss left as right, so about half of the balls directed toward the goal post would have gone wide with no chance of scoring. Better to aim at a spot on the target, but where exactly? Not directly at the goalkeeper, surely, but beyond that, it’s tough to answer this question. Shots come from all over the pitch, with goalkeepers and other defensive players in an infinite variety of positions.
A Simple Model

No one can kick the ball exactly where he or she intends every time. If anyone could, assuming the goalkeeper was in the goal, his or her best shooting strategy would be to aim for one of the extreme edges of the goal, just inside the post, nearly every time. Mere mortals, however, must accept error. Decent players do not miss left more than they miss right, at least not by much. A player who did would, over time, adjust the point at which he or she aimed until misses left and right were about equal. Good players hit the ball in the general direction they intend more often than not. Shots are taken from various positions on the field, which could lead to skewed errors, but shots from the left and right tend to even out. These assumptions imply that the frequency distribution of shot placement, when looking at one side of the goal and in the horizontal direction, is symmetric and unimodal. To get sample probabilities, we suppose the distribution of the shot position is normal about the target.
One additional simplifying assumption is needed: The goalkeeper generally positions him or herself in the center of the goal. Assuming, realistically, that the probability of making a save is highest where the keeper stands and diminishes outward in each direction, the keeper
positioning him or herself in the center maximizes save probability. We’re now ready to illustrate my coach’s ‘shoot-for-a-post’ strategy on a simple one-dimensional half goal. Shot height is ignored for now. Figure 1 shows a normal distribution centered on a goal post.
[Figure 1 plot area: Goals 34%, Saves 16%, Misses 50%; horizontal axis is distance in shot standard deviations.]
Figure 1. Shoot for a post: The distribution of shots is centered on the right goal post, and the keeper can cover all but one standard deviation of the shot distribution from the center to the goal post.
[Figure 2 plot area: Goals 21%, Saves 74%, Misses 5%.]
Figure 2. Put 95% of shots on goal: The keeper has the same coverage and the shooter has the same shot accuracy, but the shooter aims well inside the goal post to reduce misses wide to 5%.
[Figure 3 plot area: Goals 38%, Saves 31%, Misses 31%.]
Figure 3. Shoot for equal saves and misses: Again, keeper coverage and shooter accuracy are unchanged, but the shooter aims halfway between the goal post and the outer limit of the keeper’s reach.
We can focus on just half the goal as a result of the symmetry assumption. If we further assume the goalkeeper can cover the goal perfectly to within one standard deviation of the shooter’s distribution, the shooter will score on 34% of shots (19% if we assume half a standard deviation). “Perfectly” means the keeper saves every shot within reach. To provide a specific example using concrete measurements, assume the shooter puts two-thirds of his shots within 3 feet of where he aims (a standard deviation equal to 3 feet). In the case illustrated by Figure 1, this would mean the goalkeeper starting in the center of the goal mouth, which is 24 feet wide, can cover 9 feet in the time the ball travels, thus leaving 3 feet—one standard deviation—uncovered. The shooter will score on every shot in that uncovered range and none outside it. How does the shoot-for-a-post technique compare with other possibilities? Some say putting your shot on target is most important. To never miss the goal is probably an unattainable ideal, but we
can operationalize this theory by aiming the ball to miss the goal only 5% of the time. This means the shot is centered a good distance inside the goal post (about 5 feet under the concrete assumptions above). Figure 2 illustrates how this strategy fares: generating goals on only 21% of shots at the one standard deviation level (8% at half a standard deviation), again assuming the goalkeeper is perfect within his reach. To be fair to the proponents of this theory, many shots generate rebounds that present scoring opportunities. The real objective from a team perspective is to maximize goals, not player scoring percentages. Trying never to miss wide may make more sense with young players than veterans, because goalkeeping improves with age. From these figures (if not before) it should be obvious that the optimal targeting strategy is to aim somewhere between the goal post and the outer extent of the goalkeeper’s reach. The recognition that shooting distribution is almost certainly symmetric (and
assumption of goalkeeper perfection to a point) tells us exactly where to shoot: halfway in between. Figure 3 moves the center of the distribution to this optimal position and generates a scoring percentage of 38% for a scoring window of one standard deviation (and 20% for half a standard deviation). With a window of 3 feet as assumed above, we would aim 1.5 feet inside the goal post. Notice that, unlike in the first two strategies, in the third there are equal numbers of shots wide of the goal and shots into the goalkeeper’s arms. Thus, a simple and easy-to-measure way to describe the optimal strategy is to aim the ball to equalize the probability of missing the goal and permitting a goalkeeper save (i.e., a one-to-one miss-to-save ratio). (Of course, the size of the uncovered range will change the precise targeting point.) It is important to note that this result is unaffected by relaxing our assumptions regarding the precise shape of the distribution (normal) and size of scoring window (one or one-half a standard deviation, or 3 or 1.5 feet). The only critical assumptions regarding shots are that the distribution is symmetric and unimodal.
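The percentages in Figures 1 through 3 follow directly from the normal model; the sketch below (our code, not the author's, with distances measured in shot standard deviations and the keeper reach and post position taken from the 24-foot example above) reproduces them.

from statistics import NormalDist

Z = NormalDist()          # standard normal; shot error has SD 1 in these units
REACH, POST = 3.0, 4.0    # keeper covers 3 SD from the goal center; post at 4 SD

def outcome(aim):
    """Return (goal, save, miss) probabilities for a shot aimed at `aim`,
    assuming the keeper stops everything within REACH of the goal center."""
    goal = Z.cdf(POST - aim) - Z.cdf(REACH - aim)
    save = Z.cdf(REACH - aim)
    miss = 1.0 - Z.cdf(POST - aim)
    return goal, save, miss

for label, aim in [("shoot for a post", POST),
                   ("5% misses", POST - Z.inv_cdf(0.95)),
                   ("halfway between reach and post", (REACH + POST) / 2)]:
    g, s, m = outcome(aim)
    print(f"{label:32s} goals {g:.0%}  saves {s:.0%}  misses {m:.0%}")

Shrinking the uncovered window to half a standard deviation (REACH = 3.5) reproduces the 19%, 8%, and 20% figures quoted in the text.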
[Figure 4 plot area: keeper region 41.6% of shots (saves, at 75%, 31.2%; goals, at 25%, 10.4%); goals outside the keeper's reach 36.8%; misses 21.5%.]
Figure 4. Effect of porous goalkeeping: Rather than stopping 100% of shots within his reach, the goalkeeper stops only 75%.
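The Figure 4 percentages, and the drop in the optimal miss-to-save ratio from one to roughly 0.68, can be checked numerically under the same one-dimensional assumptions; this sketch (ours, with the keeper now stopping only 75% of shots within reach) uses a simple grid search for the best aim point.

from statistics import NormalDist

Z = NormalDist()
REACH, POST, STOP = 3.0, 4.0, 0.75   # keeper stops 75% of shots within reach

def score_prob(aim):
    in_reach = Z.cdf(REACH - aim)
    window = Z.cdf(POST - aim) - Z.cdf(REACH - aim)
    return window + (1 - STOP) * in_reach      # clean goals + goals through keeper

# Crude grid search for the aim point that maximizes scoring probability.
best_aim = max((a / 1000 for a in range(2000, 4001)), key=score_prob)
miss = 1 - Z.cdf(POST - best_aim)
save = STOP * Z.cdf(REACH - best_aim)
# aim point, scoring probability, and miss-to-save ratio (close to the 0.68 in the text)
print(best_aim, score_prob(best_aim), miss / save)

The search lands about 0.8 standard deviations inside the post, with roughly 21.5% misses and 31% saves, matching the annotations in Figure 4.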
Pulling in Opposite Directions The one-dimensional model outlined above leaves out several important factors, however. First, goalkeepers are not perfect walls; goalkeepers sometimes miss or bobble balls within their reach. This can result in a goal or dangerous rebound. Focusing on goals, we can quantify the effects of porous goalkeeping. The analysis above assumed goalkeepers stop 100% of shots within their reach. Assume instead that number is 75%. How would this affect the ratio of misses to saves for the optimal scoring strategy? It would fall from one to 0.68. Graphically, one could represent this change by shifting the distribution left and changing 25% of would-be saves into goals (see Figure 4). It may seem counterintuitive that subpar goalkeeping leads to more saves, but one must look at the situation from the
shooter’s perspective. If the goalkeeper is likely to flub the play, shooters take advantage by steering more shots in his or her direction. Extra scoring space on the goalkeeper side shifts the shooter’s distribution left. Adding dimensions to the simple model reveals additional relevant factors. The second complicating factor flows from the goal mouths being two dimensional. This means shooters miss high and wide. Adding another way to miss almost necessarily increases the miss-to-save ratio—“almost” because it is theoretically possible for players to be so accurate in their shooting that they never miss high. Figure 5 illustrates this potential (the small circle), along with the more likely case in which misses high push the miss-to-save ratio toward two (the large circle, about 1.9 assuming a uniform distribution). An even larger circle would push the ratio higher still. Note that the save domain is assumed to be rectangular, which is almost certainly not the case. Unfortunately, the data here cannot illuminate the actual
limits of a goalkeeper’s reach, which varies substantially with the distance and angle of the shot. That said, the fundamental point—that misses high increase the miss-to-save ratio—almost certainly holds whatever the precise shape of the save domain. If high shots are more difficult to save, this would push up even further the miss-to-save ratio by leading shooters to aim, and miss, high. Third, because players shoot from a variety of positions on the field, two elements of a third dimension become relevant: distance from the goal and angle from dead center. Distance from the goal is closely related to missing high. The natural effect of increased distance is decreased accuracy—a higher standard deviation in shot placement. Again, this pushes the miss-to-save ratio toward misses. Angle from the center line between shooter and the center of the goal mouth changes the shape of the target from a rectangle to what would appear to the shooter more like a trapezoid. The far post shrinks as it recedes. This is more than an optical illusion, as
Figure 5. Effect of misses high on miss-to-save ratio. A shooter who aims halfway between goal post and keeper’s full reach will have more misses than saves unless the shooter is so accurate as to never miss high.
the increased distance to the far post demands greater accuracy for scoring. The acute angle at the near post likewise gives the goalkeeper less space to cover. Thus, the important effect of angled shooting is to decrease the overall size of the target, which decreases the scoring percentage and miss-to-save ratio. The ratio effect results from misses high remaining constant, while a narrowing scoring window increases both misses wide and saves by equal margins. A fourth dimension and potentially complicating factor is the time it takes the ball to reach the goal. This is a function of shot speed (or power), which exerts three effects. By reducing the distance the goalkeeper can cover, higher shot speed expands the scoring target (the opposite effect of angled shooting). This would tend to increase the scoring percentage and the miss-to-save ratio. On the other hand, increased power means decreased accuracy and, because the area of potential misses (unlike the goal) is unbounded, the miss-to-save ratio is pushed toward misses. The third effect of shot speed is to make the goalkeeper more porous. It is more difficult to catch a ball coming at high speed. As argued above, the impact of goalkeeper error is to increase saves relative to misses. Thus, the net effect of increased shot speed on the miss-to-save ratio is ambiguous: The bigger target and decreased accuracy pushes toward misses; powering through the goalkeeper pushes toward saves.
A final complication is goalkeeper behavior. Economics literature has examined the interaction between shooter and goalkeeper in some depth in the context of penalty kicks. In that setting, a critical element of shooting strategy is tricking the goalkeeper into diving the wrong way. This game theory component of shooting is likely much less important in the ordinary run of play, where most shots are taken too quickly for player reflection or feints. Still, the clear effect of deception is to put more shots on goal. As for the miss-to-save ratio, the impact is precisely the same as the porous goalkeeper effect. Indeed, fooling the goalkeeper is merely one way to score goals that were within the goalkeeper’s grasp. Because penalty kick goals are relatively rare—6.1% of goals scored in the 2008 MLS season—there is some reason to think the effects of deception and higher-order goalkeeper strategy may not be substantial. What predictions can we make about the aiming strategies of professional soccer players? The simple model shown in Figure 3 suggests players who score on a high percentage of shots will tend to miss about as frequently as their shots are saved. In other words, scoring percentage will decrease as the ratio of misses to saves deviates from one. If shot placement follows a normal distribution, we would expect the scoring percentage drop-off to be gradual near one, rapid in middle values, and slow as the deviation from one is greater.
But what about the various real-world effects described above? There are three possibilities:
1. They cancel one another perfectly and the optimal miss-to-save ratio remains one.
2. The porous goalkeeper effects dominate and saves outnumber misses at peak scoring percentage.
3. Misses high push the optimal miss-to-save ratio above one.
The geometry of Figure 4 is compelling, but high percentage shots are likely to be from close range, where misses high are unlikely. Deception is almost certainly not a dominant shooting strategy in the run of play. The simple model will probably hold sway, with a modest adjustment for shots over the crossbar, generating a miss-to-save ratio slightly above one.
Results To test these hypotheses, we will start with players from the Union of European Football Associations (UEFA) Champions League. In the 2007–2008 season, 39 players scored three or more goals. Twenty of these scored on 25% or more of their shots. Among these 20, there were the same number of saves and misses wide: 91 each. This is consistent with the most efficient scorers in the world applying the strategy depicted in Figure 3.
[Figure 6a plot area: histogram of ln(M/S), with vertical reference lines at M/S = 0.4, M/S = 1, and M/S = 2.4.]
Figure 6a. Frequency distribution of natural logarithm of miss-to-save ratio (M/S). The natural logarithm of miss-to-save ratio follows a roughly normal distribution, with 71.7% of players between 0 and 0.90, which corresponds to a miss-to-save ratio range of 1 to 2.4.
Figure 6b. Scoring percentage by log miss-to-save ratio
However, looking at one extreme tail of the distribution of scorers may miss broader trends and differences among individual players. The UEFA data also reflect few shots per player. To supplement these data, we will look at career statistics on each player active in the United States’ top professional league—Major League Soccer (MLS)—as of the all-star break of the 2008 season.
Omitted are goalkeepers and players who had zero misses or saves. The resulting data set consists of 226 players out of 381 in the league. Because miss-to-save ratio is bounded at zero, we will calculate the natural logarithm of the ratios and use that value (“log ratio”) going forward. The frequency distribution of the log ratio variable is telling. In Figure 6a, the
overwhelming majority of players (72%) had miss-to-save ratios between 1 and 2.4. There is a noticeable shift to the right, suggesting slightly more misses than saves. That professional players shoot in this way is reassuring for our predictions, but this histogram tells us little about whether a one-to-one miss-to-save ratio is the optimal shooting strategy within the league. To
Figure 7. Predicted scoring percentage by natural logarithm of miss-to-save ratio and position. Scoring percentage is maximized where misses and saves are roughly equal for all positions, forwards, and mid-fielders, but follows a different and more complicated pattern for defenders.
answer that question, we turn to scoring percentage data. The scoring percentage for all players in the data set was 11.8%. Figure 6b is a scatter plot of scoring percentage versus log ratio. Figure 7 shows weighted (by shots taken) regression lines for scoring percentage against log ratio. Included on the right-hand side of the equation is log ratio, along with squared, cubed, and fourth-power terms. The higher-order terms are to allow for three minima and maxima. Our prediction is that the line will have minima near the bottom and top ends of the log ratio range and a maximum just above 0 (equal misses and saves). The overall data, represented on the graph by the black line, generally bear out this prediction. The observed optimal log ratio is about -0.12, which corresponds to a miss-to-save ratio of 0.89. Notably, the squared log ratio coefficient is negative and highly significant (p=0.002). The other log ratio coefficients are not statistically significant. The R2 of the model is 0.165. Is this overall result masking different optimal shooting strategies for different positions? Recall that we predicted a lower optimal miss-to-save ratio for shots from close distance. A porous goalkeeper will push the miss-to-save ratio toward saves because shooters will place
more balls on the goal mouth. On the other hand, high shooter accuracy will tend to push down misses faster than saves, as there are two ways to miss (high and wide). Both effects will be most pronounced for shots from close range: Goalkeepers have less time to react and shooters become much less likely to miss, either high or wide. The data can be broken down by position to test this sub-hypothesis. Based on their position on the field, forwards seem most likely to shoot from close range, defenders from long range, and mid-fielders somewhere in between. Hence, our new prediction is that the optimal miss-to-save ratio will be lowest for forwards, higher for mid-fielders, and highest for defenders. The data provide mixed support for this hypothesis. Figure 7 shows that predicted scoring percentage peaks around a log ratio of 0.04 for both forwards and mid-fielders (1.04 misses per save). The squared log ratio coefficients in both models are at least marginally statistically significant (p=0.053 forward and p=0.036 mid-fielder), and the R2s were around 10% (0.117 forward; 0.099 mid-fielder). Thus, these models do provide additional support for our original thesis that equal misses and
saves optimizes scoring percentage, but not for the sub-hypothesis of a difference between forwards and mid-fielders. The situation for defenders is more complicated. There are maxima at the bottom of the range (the lowest log ratio for a defender was -0.22, which corresponds to 0.80 misses per save) and at a log ratio of 0.74 (2.10 misses per save). None of the coefficients in the defender-only model was statistically significant and the R2 was only 0.036.
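The fitting procedure behind Figure 7 is a weighted polynomial regression; the sketch below shows one way to set it up in Python. The column values are placeholders, not the MLS data, and the variable names are ours.

import numpy as np

# Placeholder records: (ln(miss/save), goals/shot, shots taken) for each player.
log_ratio = np.array([-0.2, 0.0, 0.1, 0.3, 0.6, 0.9, 1.2])
scoring   = np.array([0.12, 0.14, 0.13, 0.11, 0.09, 0.07, 0.05])
shots     = np.array([40, 120, 90, 60, 150, 80, 30])

# Design matrix: intercept, x, x^2, x^3, x^4.
X = np.column_stack([log_ratio ** p for p in range(5)])

# Weighted least squares: scale rows by sqrt(weight) and solve ordinary LS.
w = np.sqrt(shots)
beta, *_ = np.linalg.lstsq(X * w[:, None], scoring * w, rcond=None)

# Locate the log ratio at which the fitted curve peaks.
grid = np.linspace(log_ratio.min(), log_ratio.max(), 1001)
fitted = np.column_stack([grid ** p for p in range(5)]) @ beta
print("optimal log ratio:", grid[np.argmax(fitted)])

Fitting separate models to forwards, mid-fielders, and defenders amounts to running the same code on the corresponding subsets of players.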
CONTEST A Real Challenger of a Puzzle Jürgen Symanzik
WIN A FREE SUBSCRIPTION
The following 10 data points could have been found in a data collector’s log:
5.643517 5.721843 5.718105 5.837939 5.780754 5.851633 5.781989 5.836783 5.783540 1.863323
The challenge is to determine the origins of these data points, explain how the solution is obtained (using the hints), and make a graph to communicate the meaning of the data points.
Hints
1. Order matters.
2. The data have been transformed.
3. The first nine points may differ slightly, depending on the source.
4. The tenth observation is famous (before its transformation).
5. A major hint is hidden in Edward Tufte’s 1997 book.
6. This puzzle should have been sent on 1/28/09.
7. The data represent the entire population.
Submissions should be sent electronically to CHANCE editor, Mike Larsen, at [email protected] by January 28, 2010. From among entries that correctly identify the origins of these data points, explain how the solution is obtained (using the hints), and provide a clear graphical interpretation of the data, one winner will be selected to receive a one-year (extension of his or her) subscription to CHANCE. As an added incentive, a picture and short biography of the winner will be published in a subsequent issue. Faculty, students, and recent graduates (since 1/1/2009) from Utah State University and winners of the Goodness of Wit puzzles or graphics contest from any of the three previous issues are not eligible to win this contest.
The optimal strategy may have been to aim for the center of that space, but the post—a clearly defined target—may have been a more efficient proxy. Better players in better scoring positions, however, would do well to ignore my coach’s advice.
Limitations and Future Directions Shooting a soccer ball is a complex process. Due to limitations in the data set, we employed a range of simplifying assumptions. The most fundamental are symmetry and that the only defensive strategy is the goalkeeper reacting to the shot taken. The two are connected. For example, a goalkeeper may deliberately leave one side of the goal more open and be ready to dive in that direction. The optimal shooting strategy in such a situation may be to aim at the less-exposed side of the goal or the keeper. Such gamesmanship is not common in the ordinary run of play—there isn’t time for much thought on either side of the ball—but higher-order strategy plays a significant role in dead-ball situations such as penalty kicks. Again, penalty kicks generate a small fraction of goals, but future analysis of shot targeting would do well to isolate them (and perhaps other dead-ball situations). Beyond goalkeeping strategy, a host of other variables also are relevant to shot targeting: the location and movement of the shooter and ball, including its spin, the positions of other players, and weather and field conditions. Simple geometry and one statistic not commonly reported—the miss-to-save ratio—may go some distance in explaining which players score on a high percentage of shots. Other factors will have to await future examination with richer data.
Further Reading
Buxton, Ted. 2007. Soccer skills for young players. Ontario: Firefly Books.
Chiappori, P.A., S. Levitt, and T. Groseclose. 2002. Testing mixed-strategy equilibria when players are heterogeneous: The case of penalty kicks in soccer. American Economic Review 92(4):1138.
MLSnet.com. 2008 MLS Standings. http://web.mlsnet.com/stats/index.jsp?club=mls&year=2008.
Does Momentum Exist in Competitive Volleyball? Mark F. Schilling
Since T. Gilovich, R. Vallone, and A. Tversky published their seminal 1985 paper, “The Hot Hand in Basketball: On the Misperception of Random Sequences,” there has been considerable study of the question of momentum in sports. The term “momentum” here refers to a condition in which psychological factors cause a player or team to achieve a higher (or lower) than normal performance over a period of time due to a positive correlation between successive outcomes. Put more plainly, sports momentum—if and when it exists—can be summarized as “success breeds success” (or, perhaps, “failure breeds failure”). Gilovich and colleagues referred to it as the “hot hand.” Gilovich’s research team studied questions such as whether a basketball player who has made several shots in a row becomes more likely to make the next shot than his or her normal shooting percentage would indicate. Finding no evidence for the existence of momentum, they concluded that basketball shots behave as independent trials. Basketball, baseball, tennis, and many other sports have been analyzed for momentum, and little statistical evidence has been found, although controversy about the issue continues. The best evidence for momentum has come in sports that involve competition between individuals, rather than teams, and have little variation within the course of play, such as bowling and horseshoes. Here, we investigate whether momentum exists in volleyball. By some accounts, volleyball is the most popular sport in the world in terms of the number of participants; the Fédération Internationale de Volleyball (FIVB) estimates approximately 800 million people play the game. There are, of course, many ways in which momentum could manifest itself in volleyball. We focus on whether “runs” of consecutive points by a team give evidence of momentum or whether they are manifestations of the natural variability that appears in the outcomes of chance
events. We provide three analyses to help address the question and find the results of all three consistent. The data come from 55 games played in the course of 16 matches during the 2007 NCAA Collegiate Women’s Volleyball season between the California State University Northridge team (CSUN) and 15 opponents, most of them members of the Big West Conference (to which CSUN belongs).
Volleyball Basics To understand the possible propensity and nature of runs in competitive volleyball, we need a basic understanding of fundamental components of the sport in its modern form. A volleyball match consists of either a best-of-three or a best-of-five series of games (also called sets), where each game is played until one team reaches a certain score—typically 25 or 30 points—and is ahead by at least two points. A best of 2n–1 game match ends as soon as one team wins n games. There are six players on a side at any given time, each having nominal positions on the court—three in the front row and three in the back. Only players who begin the rally in front-row positions can jump close to the net either to attack the ball (typically with a hard-driven ball known as a “spike”) or to block the opposing team’s attack. Volleyball uses a system of rotation, so the same players do not always stay in the same positions. Play begins when a member of one team serves the ball across the net to the opposite team. The action this generates is called a “rally.” Whichever team wins the rally earns a point, regardless of which team served. (This rally-scoring system is a departure from the scoring system used many years ago, known as side-out scoring, in which only the serving team could score.) If the team that was serving wins the point, the same player serves again. If the receiving team (the team not serving) wins the point (called a “side-out”), it serves next, at which time rotation comes into play as each player on this team moves clockwise to the next of the six positions. The player who moves to the back-right position (facing the net) becomes the new server.
We can now see the structure of a run in a volleyball game: A run begins when the receiving team wins the point and continues as long as that team serves and wins subsequent rallies. The only exception is if the team that serves to begin the game wins an immediate string of points. A run therefore terminates either when the receiving team wins a rally or when the serving team reaches the score necessary to win the game and is ahead by at least two points. For this analysis, we will not allow a run to carry over from one game to the next. There are several reasons for this choice: (1) There is a significant time gap between the end of one game and the start of the next; (2) each team is free to start the new game with a rotation completely different from that in which it ended the previous game; (3) it is not uncommon for different players to be inserted into the match at this point. Each of these factors may interfere with the assessment of any momentum that may contribute to runs.
As is the case in basketball, volleyball coaches often stress the importance of limiting the runs achieved by their opponents (i.e., stopping their momentum). Of course, a run could just as easily be engendered by negative momentum, wherein a team’s poor performance during the run of points by its opponent may feed upon itself. That volleyball coaches believe in momentum is evidenced by the great majority of time-outs being called when the opposing team has scored a run of several points. In our database, 88 of the 141 time-outs (62%) were called immediately after a run of three or more points by the opponent, and 95% were called after runs of at least two opponent points.
Modeling Volleyball Play
To assess whether runs of points in volleyball games suggest a momentum effect, a probability model must be provided as a point of reference for how runs would behave in the absence of momentum. There is no unique model that perfectly represents the structure of volleyball games; rather, there is a choice between models of varying complexity and accurate representation of the game. The simplest model is a coin-toss model, in which each team is equally likely to win each point. In this model, the lengths of runs are easily seen to be geometric random variables with parameter 1/2; thus the chance any given rally (except for the first of the game) is the start of a run of length k is (1/2)^(k+1). The coin-toss model, however, does not do a good job of representing volleyball. First, one team is often significantly stronger than the other. A generalization to a biased coin-toss model can account for this by choosing one team and letting the probability that this team will win any point be p. This model, along with the special case p = 1/2, represents the points
of a volleyball game as a single sequence of independent Bernoulli trials. This sort of model, however, is still not sufficiently complex to model volleyball play well. Second, in competitions between skilled volleyball teams, the serving team is actually at a significant disadvantage because the receiving team has the first opportunity to attack. Rallies are generally won by a successful attack and only rarely as a result of a serve. In women’s collegiate matches, the proportion of times an average team is able to side-out when receiving serve (the side-out percentage) is typically about 60%; thus, the serving team “holds serve” only about 40% of the time. Therefore, the chance a team will win a point depends greatly on whether it is serving or receiving. A probability model for volleyball that accommodates both the difference in team abilities and the disadvantage of serving can be constructed using only two parameters. Calling the two teams in a match Teams A and B, let pA be the probability that Team A wins a point when it serves to begin the rally and let pB be the corresponding probability for Team B. If Team A is somewhat superior to Team B, for example, typical values for these parameters might be something like pA = .45 and pB = .35. This model is a two-state Markov chain with the transition matrix below:

                     Team A serves next    Team B serves next
Team A serves              pA                    1 – pA
Team B serves            1 – pB                    pB
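A minimal simulation of this two-state chain, using the illustrative pA = .45 and pB = .35 above and a game played to 25 points with the win-by-two rule, might look like the following sketch (the function names and the simple run tally are ours).

import random

def play_game(pA, pB, target=25, a_serves_first=True, seed=None):
    """Simulate one game under the serving-dependent model; return the
    point-by-point winners as a string of 'A'/'B' characters."""
    rng = random.Random(seed)
    a_serving, a_pts, b_pts, points = a_serves_first, 0, 0, []
    while True:
        p_server = pA if a_serving else pB
        server_wins = rng.random() < p_server
        a_wins_point = server_wins if a_serving else not server_wins
        if a_wins_point:
            a_pts += 1
        else:
            b_pts += 1
        points.append("A" if a_wins_point else "B")
        a_serving = a_wins_point          # the rally winner serves next
        if max(a_pts, b_pts) >= target and abs(a_pts - b_pts) >= 2:
            return "".join(points)

def run_lengths(points):
    """Lengths of consecutive-point runs, e.g. 'AABBBA' -> [2, 3, 1]."""
    runs, length = [], 1
    for prev, cur in zip(points, points[1:]):
        if cur == prev:
            length += 1
        else:
            runs.append(length)
            length = 1
    return runs + [length]

game = play_game(0.45, 0.35, seed=1)
print(game)
print(run_lengths(game))

Repeating play_game many times and tabulating run_lengths gives an empirical picture of how often long runs arise when there is no momentum in the model at all.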
We refer to this model as a switching Bernoulli trials (SBT) model, to contrast with the previous simpler models that treat the course of play as being comprised of a single sequence of Bernoulli trials. The SBT model incorporates team and overall serving effects, but is not flexible enough to account for rotational/ player effects. On any volleyball team, some players are more skilled than others at attacking, and some are more capable at back row defense, etc. Players also differ in serving ability. Thus, a team may be stronger in certain rotations than others—when their best attacker is in
Table 1—Match Summary Data, CSUN vs. UCLA, 09/14/07
                    Points Won
Serves          CSUN    UCLA    Total
CSUN               9      34       43
UCLA              33      26       59
Total             42      60      102
the front row and their best server is serving, for example. Normally, each time a team is in a certain rotation within a given game, the team has the same players in the game, or the situation would be even more complicated. Even so, a model that incorporates rotational/player effects requires a combination of 12 Bernoulli trials submodels, as there are six rotations for each team and either team can be serving. This 12-parameter model is unwieldy, and parameter estimation is poor due to the limited sample size for each rotation. Also, the rotation configuration (how each team lines up against the other) often differs from one game to the next, and different players are frequently substituted into the line-up when a new game begins. Fortunately, our data suggest the rotational/player effect is small. Coaches, in fact, often try to arrange their line-up according to what mathematicians call a “maximin criterion” so their weakest rotation is as strong as possible. This tends to equalize the strengths of their rotations. One aspect of the game that can potentially make one rotation particularly strong is if a team has a player with a powerful and reliable jump serve. This serve is similar to a spike and can be disruptive to the receiving team’s efforts to achieve a side-out. However, women volleyball players rarely have the strength to have a dominating jump serve, and few players used jump serves in the matches analyzed. Somewhat like Goldilocks, then, we disdain those models that are either “too cold” (simplistic and unrepresentative) or “too hot” (complex and unstable) and choose the SBT model as “just right” for the task at hand—simple enough to estimate parameters well and interpret effects easily, yet representative enough to provide an accurate baseline
for how volleyball play should progress if psychological momentum is not present. Note that in the SBT model, point runs will tend to be shorter than in a single Bernoulli trials model because most rallies in a typical match end in a side-out.
Analyzing Volleyball Runs Working with the SBT model, suppose Team B has just won a point, and let LA be the length of the subsequent run of points, if any, for Team A. Such a run must begin with a side-out by Team A, after which the number of points in the run is a geometric random variable with parameter pA. We therefore have P(LA ≥ l) = (1 – pB) pA^(l–1) for l = 1, 2, 3, …. A parallel expression gives the probability distribution of a run of points for Team B. Note that we are ignoring two boundary effects here, namely that the first run of a game need not start with a side-out and that the last run is truncated by the end of the game.
Now suppose we attempt to apply the distribution theory our model generates to an analysis of actual games. The critical information from the match is summarized in a two-by-two table such as Table 1, compiled from a best-of-three-game match between CSUN (Team A) and UCLA (Team B). (UCLA won both games, making the third game unnecessary.) What information does this table contain? We can see UCLA earned 34 side-outs when CSUN was serving, and CSUN achieved 33 side-outs when UCLA served. The first serve alternates from game to game, so each team served first once. Neither team won the first point when serving to start the game; therefore, CSUN had NA = 33 runs of points during the match and UCLA had NB = 34. (These totals include many runs of just one point.) The values of the transition probabilities pA and pB can be estimated by the sample proportions 9/43 = .209 and 26/59 = .441, respectively. With this information, we can estimate the number of runs of each length that each team would
be expected to achieve in a match with parameters NA, NB, and the estimated values of pA and pB. The expected numbers of runs of at least l points for Teams A and B are then approximately NA pA^(l–1) and NB pB^(l–1), respectively, for l = 1, 2, 3, …. Slight adjustments are needed to account for the boundary effects. Compiling the expected numbers of runs for each team in each match in the database by the above formulas (with adjustments) gives the results shown in Table 2 and Figure 1. The agreement between the actual numbers of long runs and the values the SBT model predicts is excellent. If momentum had been a significant factor, we would have expected the actual numbers of long runs to generally exceed the predicted numbers, but this is not what we see in Table 2. In fact, the total number of runs of at least four points was fewer than the predicted total. The results shown in Table 2 and Figure 1 are compatible with the hypothesis that the points scored in volleyball games depend on which team is serving, but are otherwise independent.
There is another way to test whether the SBT model will produce run patterns that are consistent with those observed in actual volleyball. Suppose we wish to assess the run patterns observed in a particular game played to 30 points that has the game summary shown in Table 3. Without loss of generality, assume Team A was the winner, so NA = max(30, NB + 2). In the SBT model, the a priori probability of any particular sequence of play that results in Table 3 and represents a possible game (one in which Team A does not reach a winning position before all the totals in the game table have been achieved) is pA^nA (1 – pA)^(NB–nB) (1 – pB)^(NA–nA) pB^nB. As this value does not depend on the particular ordering of rally outcomes, each allowable configuration that leads to the desired table is equally likely. In principle, then, one could analyze the run patterns of a given game combinatorially by considering all such configurations. For example, if the longest run in a game with a given game summary table consists of six points, one could count the proportion of configurations having the specified game
Table 2—Actual and Expected Numbers of Runs of Each Length ≥ 4, According to the SBT Model
Length    Actual    Expected
4           67        69.7
5           35        30.1
6           11        13.2
7            7         5.8
8            1         2.6
9            1         2.2
Total      122       123.6
Figure 1. Actual and expected numbers of runs of each length ≥ 4, according to the SBT model
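As a quick, purely illustrative check of the run-length approximation on the single CSUN vs. UCLA match in Table 1 (Table 2 and Figure 1 aggregate all 16 matches and include the boundary adjustments, so these numbers will not match the table exactly):

# CSUN (Team A) vs. UCLA (Team B): N_A = 33 and N_B = 34 runs,
# with estimated serve-win probabilities 9/43 and 26/59.
N_A, N_B = 33, 34
p_A, p_B = 9 / 43, 26 / 59

for l in range(1, 6):
    expected_A = N_A * p_A ** (l - 1)   # expected CSUN runs of at least l points
    expected_B = N_B * p_B ** (l - 1)   # expected UCLA runs of at least l points
    print(l, round(expected_A, 2), round(expected_B, 2))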
Table 3—General Game Summary Table
                    Points Won
Serves          Team A      Team B
Team A            nA        NB – nB
Team B          NA – nA       nB
Total             NA           NB
Table 4—Actual and Simulated Expected Numbers of Runs of Each Length ≥ 4, According to the SBT Model
Length    Actual    Simulated
4           67        67.1
5           35        28.9
6           11        12.8
7            7         5.2
8            1         1.9
9            1         1.9
Total      122       117.9
Figure 2. Actual and simulated expected numbers of runs of each length ≥ 4, according to the SBT model
table in which the longest run was six or longer. This would serve as a measure (essentially a p-value) of how unusual such a run is. However, the combinatorial approach is too complex for easy use. An alternative is to use simulation. One simulation option is to use the observed values nA/(nA + NB – nB) and nB/(nB + NA – nA) as the probabilities teams A and B will win a given point when serving and reject any outcome that does not match the game summary table. Any values in (0,1) can be used for these probabilities; however, the simulation runs faster if values are used that are at least close to these observed proportions.
Simulating a game many times produces results for a random sample of all possible configurations having the given game summary table. Comparing the run patterns for the sample to those of the actual game provides another approach to assessing the assumption that rally outcomes are independent once serving is accounted for. To obtain the results shown in Table 4, 100 simulations were enacted for each of the 55 games in the database. The number of runs of each specified length was then divided by 100 to produce an expected number of runs of each length for all games based on the SBT model. Table 4 and Figure 2 compare these values to the actual numbers of runs of each length ≥ 4. Once again there is outstanding agreement between the actual numbers of runs of various lengths and the values the SBT model predicts. Although the total number of actual runs of at least four points was higher than the average for the simulation, the difference is slight. There is only one case, runs of length five, where the actual number of runs is noticeably greater than the simulated value. To make sure this case does not give a meaningful indication of a momentum effect, we can make a formal check: Since five-point runs occur rarely, we can approximate the probability distribution of their number as a Poisson random variable with parameter λ equal to the simulated expected number, 28.9.
The observed value X = 35 then gives a p-value of P(X ≥ 35) = 15%. Thus, the excess number of runs of length five observed is not significant and most likely an artifact of chance variation. We again have solid evidence that the course of play in volleyball is well-described by the rather simple SBT model (at least as far as run frequencies and run lengths are concerned) and—much more important—behaves as if the points scored depend on which team is serving, but are otherwise independent.
There is yet a third way we can evaluate whether volleyball runs show evidence of momentum. If rally outcomes behave according to the SBT model, the chance that one team will win the next point at any time during a run of points by the other team is its (constant) probability of siding-out. Thus, we can look at each such instance during all point runs and compare the proportion of times the run ended on the next rally with the receiving team’s side-out percentage.
In the CSUN vs. Santa Barbara match (10/02/07), for example, each team had several runs of at least three points (see Table 5). Consider the CSUN run of seven points: Santa Barbara had five opportunities to terminate the run—after the third, fourth, fifth, sixth, and seventh points. They were successful once, after the seventh point. For the match as a whole, Santa Barbara had 4×1 + 4×2 + 0×3 + 1×4 + 1×5 = 21 opportunities to stop a CSUN run and was successful 4 + 4 + 0 + 1 + 1 = 10 times, for an overall proportion of 10/21 = 48%. This value is somewhat lower than Santa Barbara’s 58% side-out percentage for the match. The reverse computation, for CSUN stopping Santa Barbara runs, is slightly different. Using the same method as before, we would conclude that CSUN stopped a Santa Barbara run 11 out of 21 times. However, two of the Santa Barbara runs were not stopped by CSUN points because each ended with Santa Barbara winning the game; therefore, two opportunities and two successful stops must be subtracted. Thus, CSUN actually stopped a Santa Barbara run 9/19 = 47% of the time, compared with CSUN’s 59% side-out percentage for the match. If our conclusion were to be based on this match alone, we would report
Table 5—Numbers of Runs of Lengths ≥ 3, CSUN vs. Santa Barbara, 10/20/07
Run Length:       3    4    5    6    7
CSUN              4    4    0    1    1
Santa Barbara     6    1    3    1    0
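The opportunity-and-stop bookkeeping for Santa Barbara stopping CSUN runs can be reproduced directly from the Table 5 counts; a small sketch (our code):

CSUN_RUNS = {3: 4, 4: 4, 5: 0, 6: 1, 7: 1}   # run length -> count, from Table 5

# A run of length L offers L - 2 chances to stop it after its third point.
opportunities = sum(count * (length - 2) for length, count in CSUN_RUNS.items())
stops = sum(CSUN_RUNS.values())              # every CSUN run was eventually stopped
print(opportunities, stops, stops / opportunities)   # 21 opportunities, 10 stops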
Table 6—Proportion of Rallies in Which a Run Was Stopped vs. Match Side-Out Percentage
Side-Out %    # Opportunities    # Stops    Stop Proportion    Overall Side-Out %
59%                128               84          65.6%               65.1%
Figure 3. Proportion of rallies in which a run was stopped vs. match side-out percentage
that there is some evidence in favor of momentum, given that each team stopped runs less often than their overall side-out percentage. However, using the data from all 32 cases (16 matches x two teams each) in the database tells a different story. To reduce the effects of heterogeneity among matches, Table 6 groups the results into four categories so matches in the same category have similar side-out percentages. Category boundaries were determined so the total numbers of opportunities to stop a run after the third point were similar for each category. Within each range of match side-out percentages, the stop proportion shown in Table 6 is the proportion of all rallies after the third point of a run in which the run was stopped. Note that these values are close to the overall side-out percentages for each category. Figure 3 shows a comparison. A chi-square test of conditional independence shows no significant difference between the stop proportions and side-out percentages (p = .81). As with the previous two analyses, this approach also shows the run data to be consistent with the SBT model. All three run
analyses support the contention that runs such as those observed in volleyball behave as the natural consequence of play involving rallies whose outcomes are affected only by the abilities of the two teams and which team is serving. No evidence for momentum in collegiate women’s volleyball has been found.
Conclusion Players, coaches, and fans almost invariably focus on the perceived significance of runs in sports, whether they involve consecutive field goals in basketball, hitting streaks in baseball, runs of points in volleyball, or a team winning or losing streak in any sport. It is hard to let go of the impression that when teams or players are successful several times in succession, they are likely to continue to play at a higher-than-normal level, at least in the short run. Yet statistical analyses often fail to support the notion that these patterns are anything more than those that occur naturally in sequences of chance events. Contrary to the strongly held intuition of most observers of athletic contests, outcomes of such events are often compatible with
models that do not presume the presence of momentum.
Further Reading
Albright, S.C. 1993. A statistical analysis of hitting streaks in baseball. Journal of the American Statistical Association 88(424):1175–1183.
Ayton, P., and G. Fischer. 2004. The hot hand fallacy and the gambler’s fallacy: Two faces of subjective randomness? Memory & Cognition 32(8):1369–1378.
Gilovich, T. 1993. How we know what isn’t so: The fallibility of human reason in everyday life. New York: Free Press.
Gilovich, T., R. Vallone, and A. Tversky. 1985. The hot hand in basketball: On the misperception of random sequences. Cognitive Psychology 17:295–314.
Stern, Hal S. 1997. Judging who’s hot and who’s not. CHANCE 10:40–43.
Vergin, R. 2000. Winning streaks in sports and the misperception of momentum. Journal of Sport Behavior 23:181–197.
Calculating the Odds of the Miracle at Brookline W. J. Hurley
The Ryder Cup is a three-day golf competition played every other year between a team from the United States and another representing Europe. Over the first two days, 16 four-ball and foursomes matches are conducted. On the final day of competition, 12 singles matches are played. In total, there are 28 matches. One point is given to the team winning a match, and, in the case of a tie (draw), half of a point is given to each. Over the history of the Ryder Cup, there have been few comebacks. The most notable occurred in 1999 at Brookline. The U.S. team, captained by Ben Crenshaw, was down 10 points to six to the Europeans after the first two days of competition. In what some have called the “Miracle at Brookline,” the U.S. team won 8.5 of a possible 12 points over the Sunday singles matches to claim the cup. Brookline had two important characteristics. First, the United States had to dig a hole. Second, it had to climb out. In keeping with the fascination many statisticians bring to sports records, let’s ask and try to answer a couple of questions: What is the chance of a U.S. recovery given the significant hole the U.S. team was in that fateful Sunday morning? What is the expected time between Brookline-type miracles?
The Probability of Overcoming a Deficit
In this September 26, 1999, file photo, U.S. Ryder Cup player Justin Leonard celebrates after sinking his putt on the 17th hole of his final round match in Brookline, Massachusetts. Leonard won the hole, clinching the Ryder Cup for the United States. Nine years after the last U.S. victory, Leonard gets another shot when the Ryder Cup is played from September 19–21, 2008, at Valhalla Golf Club in Louisville, Kentucky. (AP Photo/Doug Mills)
Let us begin by looking at the outcomes of singles matches. In a Ryder Cup singles match, 18 holes are played at most. An individual hole is either won or tied (halved). Suppose one point is given to the player winning a hole and zero points are awarded for a tied hole. A player wins a match when he has accumulated enough points so it is no longer possible, over the remaining holes, for the other player to win or tie the match. For example, in the 2006 World Golf Match Play Championship, Tiger Woods was up on Stephen Ames by nine points after 10 holes. Since there were only eight holes remaining, the match was over. This win was recorded as “9 and 8,” meaning that Woods was up by nine holes with eight remaining. Let us suppose the outcomes of all singles matches are independent and identically distributed random variables. To get this distribution, we looked at the hole-by-hole results of the four Ryder Cups played between 2002 and 2008 (Table 1), the only years for which hole-by-hole data appears to be available. Letting h be the probability that a hole is won outright, we will assume a particular player wins a hole with probability h/2, loses it with the same probability, and draws (ties) it with probability 1–h. Using the data in Table 1, we estimate h to be 0.475. With these assumptions, we can work out the probabilities that a particular match is won or drawn. If a match is drawn, all 18 holes will be played. Moreover, if t of the 18 holes are halved,
Table 1—Hole-by-Hole Results for Ryder Cup, 2002–2008
Year      Holes Won    Holes Halved    Total Holes    Fraction Won
2002          92            108             200           0.460
2004          94            106             200           0.470
2006         105             91             196           0.536
2008          85            111             196           0.434
Totals       376            416             792           0.475
Note: Aggregating over all of these matches, a hole is won with frequency 376/792 = 0.475.
Table 2—Probability D That a Match Is Halved for Three Values of h, the Probability a Hole Is Won Outright
          h = .450    h = .475    h = .500
D          0.139       0.136       0.132
then t must be even (otherwise, the match could not be halved) and each golfer wins (18–t)/2 of the remaining holes. Let dt be the probability that a match is drawn, given that exactly t holes are halved. Then dt is a trinomial probability:
dt = [18! / (((18–t)/2)! ((18–t)/2)! t!)] (h/2)^(18–t) (1–h)^t    (1)
for t = 0, 2, 4, …, 18. Hence, the probability that the match is halved is:
D = d0 + d2 + d4 + … + d18.    (2)
Table 2 presents values of D for three values of h. We now have the probabilities of the outcomes for an individual Ryder Cup match. It is halved with probability D and won by a particular player with probability (1 – D)/2. To work out the probability of overcoming a particular deficit, suppose a team must generate at least z0 points over the 12 Sunday singles matches to win. Let r(W, T) be the probability that a side wins exactly W matches and halves T matches. Again, r(W, T) is a trinomial probability:
r(W, T) = [12! / (W! T! (12–W–T)!)] ((1–D)/2)^W D^T ((1–D)/2)^(12–W–T).    (3)
Table 3—Probability r(W, T) That a Side Wins Exactly W Matches and Halves T Matches to Earn 8.5 or More Points
#Wins, W    #Ties, T    Points    r(W, T)
5              7          8.5     0.000010
6              5          8.5     0.000716
6              6          9.0     0.000037
7              3          8.5     0.010390
7              4          9.0     0.001630
7              5          9.5     0.000102
8              1          8.5     0.026392
8              2          9.0     0.012420
8              3          9.5     0.002598
8              4         10.0     0.000204
9              0          9.0     0.009347
9              1          9.5     0.008797
9              2         10.0     0.002670
9              3         10.5     0.000289
10             0         10.0     0.002804
10             1         10.5     0.001759
10             2         11.0     0.000276
11             0         11.0     0.000510
11             1         11.5     0.000160
12             0         12.0     0.000042
Total                             0.081244
Now it is a question of determining which combinations of won and halved matches, (W, T), will give a total of at least z0 points. Consider the case z0 = 8.5. Table 3 presents all such combinations with the associated values of r(W, T). To get the probability of overcoming the deficit and winning the Ryder Cup, simply sum the final column. Note in this case that there was about an 8% chance at the start of the Sunday singles matches that Crenshaw and his team would overcome their deficit. More generally, in the case where at least z0 points are required, let Z0 be the set of all possible ordered pairs, (W, T), that give a side at least z0 points. Then, the probability of overcoming a deficit of z0 is
R(z0) = Σ(W, T) ∈ Z0 r(W, T).    (4)
Table 4 gives values of R(z0) for various values of z0.
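The quantities in Tables 2 through 4 can be recomputed in a few lines; the sketch below (our code, using h = 0.475 and the formulas above) is one way to do it.

from math import comb

H = 0.475  # probability a hole is won outright

def draw_prob(h=H, holes=18):
    """Probability D that an 18-hole match is halved (equations 1 and 2)."""
    total = 0.0
    for t in range(0, holes + 1, 2):              # t = number of halved holes
        wins_each = (holes - t) // 2
        ways = comb(holes, t) * comb(holes - t, wins_each)
        total += ways * (h / 2) ** (holes - t) * (1 - h) ** t
    return total

def comeback_prob(z0, matches=12, D=None):
    """R(z0): probability of earning at least z0 points from the 12 singles."""
    D = draw_prob() if D is None else D
    win = (1 - D) / 2
    total = 0.0
    for W in range(matches + 1):
        for T in range(matches - W + 1):
            if W + 0.5 * T >= z0:
                ways = comb(matches, W) * comb(matches - W, T)
                total += ways * win ** (matches - T) * D ** T
    return total

print(draw_prob())          # about 0.136, matching Table 2
print(comeback_prob(8.5))   # about 0.081, as in Table 4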
How Often Would We Expect to See a Brookline-Type Recovery? Let us now assess the joint probability of digging a hole and then climbing out. To do so, we use essentially the same kind of assumptions as above, over the first two days of competition, to compute how likely it would be for one of the teams to be ahead by 10 to 6. Over the 10 Ryder Cups played between 1987 and 2006, 160 four-ball and foursomes matches were played. Of these, 24 were drawn—a frequency of 15%. Note that this is close to the probability of drawing a singles match. We need to calculate H(z0), the probability a team will find itself in a position in which it has to win z0 points in singles play to win the competition. Again, we assume players on either side of a match are equal and the probability of drawing a match is 15%. Table 5 gives values of H(z0) for various values of z0.
Table 4—Values of R(z0) for Various Values of z0
Points Required, z0    Pr(Win), R(z0)
8.5                       0.08108
9.0                       0.04374
9.5                       0.02030
10.0                      0.00880
10.5                      0.00304
11.0                      0.00099
11.5                      0.00020
12.0                      0.00004

Table 5—Values of H(z0) for Various Values of z0
Points Required, z0    Pr(z0 Points Required), H(z0)
8.5                       0.12921
9.0                       0.08791
9.5                       0.05952
10.0                      0.03656
10.5                      0.02103
11.0                      0.01079
11.5                      0.00515
12.0                      0.00212
Now we can calculate the probability of observing something such as Brookline. It would be:
H(z0)R(z0) + H(z0 + 0.5)R(z0 + 0.5) + … + H(12)R(12).    (5)
In the case where z0 = 8.5, the chance of a Brookline-type recovery is 0.0159. Consequently, we would expect to see such a recovery, on average, every 2/0.0159 = 125.7 years. (6) The 2 on the left-hand side takes into account the Ryder Cup being played once every two years.
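A quick check of this calculation, using only the published values in Tables 4 and 5:

R = [0.08108, 0.04374, 0.02030, 0.00880, 0.00304, 0.00099, 0.00020, 0.00004]
H = [0.12921, 0.08791, 0.05952, 0.03656, 0.02103, 0.01079, 0.00515, 0.00212]

p_brookline = sum(h * r for h, r in zip(H, R))   # deficits of 8.5 points or more
print(p_brookline, 2 / p_brookline)              # about 0.0159 and about 126 years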
Summary We have made a number of simplifying assumptions that make the calculations easier. Of these, the one that stands out is the assumption of homogenous teams. This is clearly not the case, so it would be interesting to look at this problem if the heterogeneity was modeled explicitly. Intuition suggests the numbers would not change significantly, because the teams are not that heterogeneous. Moreover, the format—a series of 18-hole matches—tends to bring the better players back to the field in the sense that a lesser player has a better chance of beating Woods in an 18-hole match than in a 72-hole tournament. The Brookline miracle is not in the same ballpark as Joe DiMaggio’s 56-game hitting streak during the summer of 1941, which is considered by most to be the sports record least likely to be broken. Related to DiMaggio’s feat, Hal Stern remarked in The Iowa Stater, “The DiMaggio streak is sufficiently unusual that it shouldn’t have happened yet in baseball.” The analysis above suggests this is also true for the Brookline miracle.
U.S. Ryder Cup team captain Ben Crenshaw poses with the cup September 21, 1999, at The Country Club in Brookline, Massachusetts. (AP Photo/Elise Amendola)
Further Reading
Chance, Don. 2009. What are the odds? Another look at DiMaggio’s streak. CHANCE 22(2):33–42.
Hurley, W.J. 2002. How should team captains order golfers on the final day of the Ryder Cup matches? Interfaces 32(2):74–77.
Hurley, W.J. 2007. The Ryder Cup: Are balanced four-ball pairings optimal? Journal of Quantitative Analysis in Sports 3(4), Article 6. www.bepress.com/jqas/vol3/iss4/6.
Hurley, W.J. 2007. The 2001 Ryder Cup: Was Strange’s strategy to put Tiger Woods in the anchor match a good one? Decision Analysis 4(1):41–45.
Ethics and Stopping Rules in a Phase II Clinical Trial Andreas Nguyen and Shenghua K. Fan
Carefully designed experiments, known as (controlled) clinical trials, have been used to evaluate (competing) treatments since 1950. The first clinical trial was conducted just after World War II and involved an evaluation of streptomycin in the treatment of pulmonary tuberculosis. Statisticians play a key role in the design and analysis of modern clinical trials. Design choices—including randomization, blocking, sample size, and power—have fundamentally statistical components. Analyzing and reporting clinical trials also involve using statistical methods and insight. The Southwest Oncology Group conducted a study of mitoxantrone or floxuridine in patients with minimal residual ovarian cancer after the second-look laparotomy. Clinicians wanted to know whether the treatment is efficacious for the condition. It could be that the treatment is too toxic to patients and should not be used. In this study, if too many patients in the phase II trial could not tolerate at least two courses of treatment, the trial would be stopped. The study plan was to assign 37 patients to each treatment and stop accrual if 13 or more of the first 20 patients assigned to the treatment experienced toxicity. The sample size and stopping rule were determined by statisticians, but how did they determine them?
Ethical Challenges in Clinical Trials When evaluating competing treatments, the validity of the conclusions may depend on how patients are allocated. Patient allocation must be done at random, which has several advantages. First, it prevents personal biases from playing a role. Second, randomization balances the treatment groups with respect to the effects of extraneous factors that might influence the outcome of the treatment. Statistical ideas enter here. If influences
from extraneous factors are allocated in a random way, there will be no systematic advantage of one treatment group over another. The simplest way to make a random assignment is by flipping a coin. If the
result is heads, allocate the patient to the treatment group (new drug); if it is tails, allocate the patient to the control group (standard treatment or placebo). In practice, however, more sophisticated randomization procedures are required.
A major challenge in randomized clinical trials involves ethical issues. The main ethical dilemma is the conflict between trying to ensure each patient receives the most beneficial treatment (individual ethics) and evaluating competing treatments as efficiently as possible so all future patients can benefit from the superior treatment (collective ethics). Every clinical trial requires a balance between the two. Although collective ethics is the prime motivation of conducting a clinical trial, individual ethics must be given as much weight as possible. Stuart J. Pocock, a pioneering author in the area of clinical trials, made the point that it is unethical to conduct a trial of such poor quality that it cannot lead to any meaningful conclusion. Such a trial would involve futile risk of harm to individual patients without any prospect of collective benefit.
Sequential Testing in Phase II Trials
In phase II clinical trials, an ethical dilemma arises in deciding whether to continue a trial based on the drug toxicities observed in the enrolled patients. Drug toxicity usually refers to serious side effects or complications that may have lasting or debilitating effects. Phase I trials aim to identify the highest, and therefore presumably most efficacious, dose that causes no more than a certain proportion of patients to experience toxicity events; this dose is called the maximum tolerable dose (MTD). The MTD recommended in a phase I trial is subsequently applied in phase II trials. However, because of the small sample sizes of phase I trials, the toxicity level of the MTD may actually exceed the tolerable level. So, in phase II trials, investigators must continue to guard against delivering an overly toxic dose to patients.

Clinical Trials Phases
The stages that represent increasing knowledge in medical investigations are often called "phases." Drug trials are typically classified into the following four phases:
Phase I: Clinical pharmacology and toxicity. Phase I trials are the first experiments on human volunteers, usually following some experiments on animals. The goal of phase I trials is to study drug safety rather than efficacy. If preliminary evidence indicates that an experimental drug is safe, the primary objective of a phase I trial becomes determining an acceptable dosage of the drug. The typical size of a phase I trial is around 20 to 80 patients.
Phase II: Initial clinical investigation for treatment effect. After success in phase I trials, phase II trials are a preliminary investigation of the drug's efficacy, with continued monitoring of safety. Phase II trials are often used as a rapid way to screen a large group of drugs for those that may have a promising level of effectiveness. A typical phase II trial will not go beyond 200 patients.
Phase III: Full-scale evaluation of treatment. A phase III trial is a large-scale verification of the early findings—the step from 'some evidence' to 'proof' of effectiveness. The objective of phase III trials is to compare the efficacy and safety of a particular new drug with those of the current standard treatments under the same set of conditions. The size of a phase III trial often exceeds a couple thousand patients.
Phase IV: Post-marketing surveillance. After the investigations leading to a drug being approved for marketing, the drug is monitored to detect any undiscovered toxicities and any changes in the population of patients that would affect the use of the drug. These trials, also known as post-marketing surveillance trials, are large-scale and long-term studies.
Early Termination
As soon as there is sufficient information to reach a conclusion in a phase II trial, the trial should be terminated. For example, if nine out of 10 patients in the treatment group—but only three out of 10 in the placebo group—fully recover from colds within three days, investigators might conclude that the treatment is efficacious. Similarly, if the treatment in a phase II trial causes a large proportion of the patients to experience toxicity events, the trial should be stopped. The decision to terminate early should be made objectively; the only way to do that is for investigators to look at the data and check whether a pre-selected stopping criterion has been satisfied.
Stopping rules usually involve type I and type II errors. The usual null hypothesis is that the toxicity is at or below an acceptable level. Rejecting the null hypothesis and stopping the trial early when the underlying toxicity rate is actually acceptable is a type I error. The practical implication of a type I error is that a pharmaceutical company may fail to bring a good drug to market—or at least delay its introduction. Failing to stop a trial of an unacceptably toxic treatment is a type II error. In this case, the implication is that the investigators are unwittingly subjecting trial patients to high levels of risk and adverse reactions.
In a phase II trial, interim monitoring is necessary to prevent patients from receiving an overly toxic treatment. For instance, investigators might estimate the toxicity level of a treatment after enrolling just half the patients and continue only if toxicity is at an acceptable level. This is a group-monitoring scheme, in which investigators look at the results for one group of patients in a larger trial before continuing with the remaining group.
Continuous Toxicity Monitoring
Another interim monitoring approach is continuous monitoring; that is, the trial is evaluated after each patient enrollment. For example, one new patient is enrolled in the trial, started on the assigned course of treatment, and evaluated after a short period of time determined in advance by a supervising physician.
No other patients are enrolled during this time. If the overall rate of toxicity, including the new patient, is deemed acceptable, the trial continues with the enrollment of the next patient, and this process is iterated until the last patient is enrolled. If at any point the cumulative number of toxicities is deemed too high—as determined by comparing the number of toxicities to a stopping rule—the trial is stopped. With continuous toxicity monitoring, data are collected and weighed after each monitoring step to decide whether the null hypothesis should be rejected. The risk of unnecessarily stopping the trial too soon or letting it run too long comes with each decision.

Designing Stopping Rules
Suppose investigators are planning a phase II trial of a new cancer drug and limiting the trial to K = 30 patients. Phase II trials are typically larger than 30 patients, but a small sample size is used here for simplicity. Based on a phase I trial, the investigators expect the toxicity proportion to be θ = 0.2, the maximum they are willing to allow. They also wish to limit the type I error rate to α = 5%. How can they construct their stopping rules? Table 1 summarizes the two types of error in this case.

Table 1—Type I Versus Type II Error
Action            H0 true: toxicity acceptable (θ ≤ 0.2)    H1 true: too toxic (θ > 0.2)
Stop trial        Type I error                              OK
Continue trial    OK                                        Type II error

One way to construct a stopping rule is to define an upper limit, uk, that triggers stopping the trial if the number of toxicities observed through the kth patient enrollment, Xk, reaches or exceeds uk. In other words, the investigators would stop the trial as soon as Xk ≥ uk. The random variable Xk is binomial, with parameters k (the number of patients) and θ (the toxicity proportion). Under the null hypothesis, the investigators assume a particular value of θ, say θ = 0.2. So, if the investigators monitored the trial only once, say after patient k = 8, they could construct a stopping boundary using the binomial distribution. A suitable boundary u8 satisfies P{X8 ≥ u8 | θ = 0.2} ≤ α = 5%. In this example, because P{X8 ≥ 5 | θ = 0.2} = 0.010 ≤ α, the investigators can use the boundary u8 = 5, as shown in Figure 1. (If they chose u8 = 4, the probability would be 0.056 > α.) Recall, however, that they are not monitoring the trial only once. By monitoring continuously—after each patient enrollment—they are conducting a sequential test. Essentially, they are taking multiple looks at the data, which inflates the overall type I error rate. If there is a 5% chance of making a type I error after each patient, the type I error rate over the whole trial is much larger than 5%. A similar situation is encountered in the analysis of variance when multiple comparisons are made.

Figure 1. Binomial distribution rejection region for a single look (k = 8, K = 30, θ = 0.2). The horizontal axis is the number of toxicities and the vertical axis is the binomial probability; the tail probability at and beyond the boundary u8 = 5 is 0.010.
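The single-look numbers above are easy to reproduce in R, the language the authors used for their boundary calculations. The short sketch below first checks the tail probabilities for candidate boundaries u8 = 4 and u8 = 5, then uses a small simulation of our own devising to illustrate the multiple-looks problem: applying an unadjusted 5% binomial test after every one of K = 30 patients pushes the overall type I error well above 5%. The per-look boundaries and the number of simulated trials are illustrative choices, not part of the original study.

```r
theta0 <- 0.2   # null (acceptable) toxicity rate

# Single look after k = 8 patients: P{X8 >= u} for u = 4 and u = 5
sapply(c(4, 5), function(u) 1 - pbinom(u - 1, size = 8, prob = theta0))
# 0.056 and 0.010, so u8 = 5 keeps the one-look type I error below 5%

# Naive continuous monitoring: the smallest per-look boundary with
# P{Xk >= uk | theta0} <= 0.05 at each look k = 1, ..., K
K <- 30
u_naive <- sapply(1:K, function(k) qbinom(0.95, k, theta0) + 1)

# Estimate the overall chance that some look rejects when H0 is true
set.seed(1)
rejects <- replicate(20000, {
  x <- cumsum(rbinom(K, 1, theta0))   # running toxicity count under H0
  any(x >= u_naive)                   # did any look cross its boundary?
})
mean(rejects)   # noticeably larger than the nominal 0.05
```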
Special stopping boundaries have been designed to address this problem in a related sequential testing setting based on normal (rather than binomial) data. Two such types of boundaries are those proposed by Pocock and by P. C. O'Brien and T. R. Fleming. These boundaries, used not only in biostatistics but also in more general applications, are devised to limit the overall type I error rate to the level of one's choosing. Pocock and O'Brien-Fleming boundaries for a standard normal variable are shown in Figure 2. In both cases, as long as the value of Zk (a mean- and variance-standardized version of Xk) stays below the respective rejection boundary, the test continues. As soon as Zk reaches or exceeds the rejection boundary, the test stops and the null hypothesis is rejected. If the test reaches the maximum sample size K, the null hypothesis is not rejected. In either case, the position of the boundary at each k depends on the values of α and K.

Implicit in using these stopping boundaries is the choice to continue testing as long as there is insufficient evidence to stop. This approach is reasonable because investigators already expect an acceptable toxicity level based on the phase I results; continuous toxicity monitoring in phase II trials is more a matter of precaution than an uninformed determination of the toxicity level.

A key difference between the Pocock and O'Brien-Fleming boundaries is how soon, on average, they stop a trial. The O'Brien-Fleming boundary, shown as a solid curve in Figure 2, starts off wide and narrows toward the end; such a boundary is more likely to let a trial run longer. In contrast, the Pocock boundary, shown as a dashed line in Figure 2, is flat and more likely to stop a trial earlier.

There are other choices of sequential test boundaries, notably Wald's sequential probability ratio test (SPRT). The SPRT is sometimes applied in clinical trials because it can handle binomial data, but it is an approximate test, and it places no upper bound on the number of patients needed for a study. In phase II clinical trials, investigators want to set a limit on the number of patients to be enrolled in advance.
Figure 2. Pocock (dashed line) and O'Brien-Fleming (solid curve) boundaries for K = 20 looks with α = 5%. The horizontal axis is the look number k; the vertical axis is Zk, a standardized version of Xk. The trial represented by x's leads to rejecting H0 (acceptable toxicity) using either boundary; the trial shown by +'s crosses neither boundary by the time patient number K = 20 is observed, so H0 is not rejected.

In "Continuous Toxicity Monitoring in Phase II Trials in Oncology," published in Biometrics, Anastasia Ivanova and colleagues describe an interesting approach to creating an exact boundary for a sequential test with binomial data and a finite sample size. They take the exact Pocock and O'Brien-Fleming tests for normally distributed data and adapt them with a continuity correction for binomially distributed data. We followed the methodology presented in their article and wrote R code to reproduce their results. The method is iterative: in the initial application, the type I error rate may not be exactly equal to α = 5%, but, by trial and error, one can adjust the boundary to give the closest type I error rate less than or equal to α = 5%. Our R code (and related explanatory notes) for finding the exact boundaries derived from both the Pocock and O'Brien-Fleming tests for normal data is available at www.sci.csueastbay.edu/~sfan/SubPages/SeqClinTrialsRCode.doc.
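The authors' own R code for deriving the adjusted boundaries is at the URL above. As a rough, self-contained illustration of the kind of calculation such a derivation rests on, the sketch below computes, for any vector of integer stopping limits u1, ..., uK, the exact probability of stopping (the type I error when θ is the null value) and the expected number of patients enrolled, by stepping through the distribution of the running toxicity count. One could then adjust a candidate boundary by trial and error until the computed type I error is just below 5%. The function name and the example boundary are ours and are not taken from Ivanova and colleagues.

```r
# Exact operating characteristics of a continuous-monitoring stopping rule.
# u[k] is the smallest cumulative toxicity count that stops the trial at look k.
boundary_ocs <- function(u, theta) {
  K <- length(u)
  p <- c(1, rep(0, K))        # p[x + 1] = P{count = x, trial still running}
  p_stop_at <- numeric(K)
  for (k in 1:K) {
    newp <- numeric(K + 1)
    for (x in 0:(k - 1)) {    # patient k adds a toxicity with probability theta
      newp[x + 1] <- newp[x + 1] + p[x + 1] * (1 - theta)
      newp[x + 2] <- newp[x + 2] + p[x + 1] * theta
    }
    stopped <- (u[k] + 1):(K + 1)          # counts that reach the boundary
    p_stop_at[k] <- sum(newp[stopped])     # these paths stop (reject H0) now
    newp[stopped] <- 0
    p <- newp
  }
  p_reject <- sum(p_stop_at)
  expected_n <- sum((1:K) * p_stop_at) + K * (1 - p_reject)
  list(p_reject = p_reject, expected_n = expected_n)
}

# A made-up candidate boundary for K = 30 under H0: theta = 0.2
u_try <- pmax(4, ceiling(0.2 * (1:30)) + 3)
boundary_ocs(u_try, theta = 0.2)   # type I error of this candidate
boundary_ocs(u_try, theta = 0.4)   # power and E(N) if the drug is too toxic
```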
Applying this method to our example of a phase II clinical trial with K = 30 patients yields the adjusted Pocock and O'Brien-Fleming boundaries for binomial data shown in Figure 3; the vertical axis is the number of patients experiencing toxicity. The adjusted O'Brien-Fleming boundary is wider than the adjusted Pocock boundary in the beginning and tapers off at the end, whereas the adjusted Pocock boundary is more or less linear. As shown in Table 2, the adjusted Pocock boundary is more likely to stop the trial earlier than the adjusted O'Brien-Fleming boundary if toxicity is too high. With a 30-patient limit, however, the difference in expected sample size is only about two patients. These results for K = 30 patients match the more general results described in the article by Ivanova and colleagues.

Figure 3. Adjusted Pocock (dashed line) and O'Brien-Fleming (solid line) boundaries with K = 30, H0: θ = 0.2, and α = 0.05. The horizontal axis is the number of patients enrolled; the vertical axis is the number of patients experiencing toxicity.

Table 2—Characteristics of the Adjusted Pocock and O'Brien-Fleming Boundaries
Boundary                    Assumed toxicity θ = 0.2            Assumed toxicity θ = 0.4
Adjusted Pocock             E(N) = 29.7, P{Reject} = 0.050      E(N) = 18.9, P{Reject} = 0.695
Adjusted O'Brien-Fleming    E(N) = 29.1, P{Reject} = 0.046      E(N) = 22.0, P{Reject} = 0.767
Note: E(N) denotes the average number of patients enrolled before stopping the trial. P{Reject} is the probability of rejecting the null hypothesis of acceptable toxicity.

Figures 4 and 5 show the cumulative stopping probabilities under the null hypothesis, H0: θ = 0.2, and an alternative hypothesis, H1: θ = 0.4; the vertical axes of the two figures are on different scales. The cumulative stopping probability curve for the adjusted O'Brien-Fleming boundary stays low for the first several patients and then rises more sharply than that of the adjusted Pocock boundary. This pattern shows that the adjusted O'Brien-Fleming boundary is less likely than the adjusted Pocock boundary to reject the null hypothesis and stop the test early.

Figure 4. Cumulative stopping probabilities under the null hypothesis (θ = 0.2) for the adjusted Pocock (dashed line) and O'Brien-Fleming (solid line) boundaries. The horizontal axis is the number of patients enrolled; the vertical axis is the cumulative stopping probability (0 to 0.05).
Figure 5. Cumulative stopping probabilities under an alternative hypothesis (θ = 0.4) for the adjusted Pocock (dashed line) and O'Brien-Fleming (solid line) boundaries. The horizontal axis is the number of patients enrolled; the vertical axis is the cumulative stopping probability (0 to 0.6).

In the case of the alternative hypothesis, H1: θ = 0.4, the cumulative probability curves intersect, and the overall probability of stopping the trial is greater for the adjusted O'Brien-Fleming boundary. Figure 6 displays the power curves for these boundaries; it shows the probability of correctly deciding to stop the trial as a function of the actual toxicity. For example, if the actual toxicity were 0.4, the adjusted O'Brien-Fleming boundary would stop the trial with probability 0.77, whereas the adjusted Pocock boundary would stop it with probability 0.70. Overall, the adjusted O'Brien-Fleming boundary is somewhat more powerful than the adjusted Pocock boundary for most values of θ1 > θ0 = 0.2.
Figure 6. Power curves for the adjusted Pocock (dashed line) and O'Brien-Fleming boundaries as a function of the actual toxicity θ1 (shown for θ1 between 0.1 and 0.6, with H0: θ = 0.2). The vertical axis is the probability of stopping the trial (0 to 1).
Conclusion
There are interesting tradeoffs to be weighed in choosing stopping rules for phase II clinical trials. The stopping rules presented here are by no means the best; in fact, the quest for optimal stopping rules is an active area of statistical research. Returning to the question asked at the end of the first paragraph: the first group size is 20 and the second group size is 17. That simple stopping rule considered only the type I error rate. The study size of 37 patients was determined by the available resources and the stopping rule. Assuming a toxicity rate of 0.4 is acceptable, the rule to stop the trial once 13 or more of the first 20 patients experienced toxicity was determined by controlling the type I error rate at 5%. If the trial is not stopped after the first group, investigators continue testing with the second group. Many phase II trials are fully sequential (one patient per group), especially when the toxicity event is severe, and adopt stopping rules that consider both the type I and type II error rates. The trial reported by Ivanova and colleagues is such a trial and adopted the adjusted Pocock boundaries.
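Assuming, as stated above, that a toxicity rate of 0.4 is acceptable, the 13-of-20 first-stage rule can be checked directly from the upper tail of a Binomial(20, 0.4) distribution; the two lines of R below are our own verification, not part of the original study documents.

```r
# Stop the first stage if 13 or more of 20 patients experience toxicity,
# assuming theta = 0.4 is an acceptable toxicity rate (the null hypothesis).
1 - pbinom(12, size = 20, prob = 0.4)   # P{X >= 13} is about 0.021, below 5%
1 - pbinom(11, size = 20, prob = 0.4)   # P{X >= 12} is about 0.057, above 5%
```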
Further Reading
Ivanova, A., B. F. Qaqish, and M. J. Schell. 2005. Continuous toxicity monitoring in phase II trials in oncology. Biometrics 61:540–545.
Jennison, C., and B. W. Turnbull. 2000. Group sequential methods with applications to clinical trials. Oxford: Chapman and Hall/CRC.
Muggia, F. M., P. Y. Liu, D. S. Albert, D. L. Wallace, R. V. O'Toole, K. Y. Terada, E. W. Franklin, G. W. Herrer, D. A. Goldberg, and E. V. Hannigan. 1996. Intraperitoneal mitoxantrone or floxuridine: effects on time-to-failure and survival in patients with minimal residual ovarian cancer after second-look laparotomy—a randomized phase II study by the Southwest Oncology Group. Gynecologic Oncology 61:395–402.
O'Brien, P. C., and T. R. Fleming. 1979. A multiple testing procedure for clinical trials. Biometrics 35:549–556.
Pocock, S. J. 1983. Clinical trials: a practical approach. Hoboken: John Wiley & Sons.
Entangling Finance, Medicine, and Law Mark G. Haug and Heather Ardery
Traditionally, there were two basic ways to rid yourself of unwanted or unneeded life insurance policies: die or let the policy lapse. For some policies—permanent and whole life—you could force the insurance company to pay out the policy's current cash value. In the 1980s, however, a third option emerged, known as a viatical settlement.

A viatical settlement is an agreement between a third party—the investor—and a terminally ill holder of life insurance—the viator. The investor purchases the policy from the viator, thus permitting the investor to name himself as beneficiary. Because of the viator's terminal illness, the viator can reasonably be expected to die soon. The agreement provides the viator an immediate lump-sum payment, which is less than the total death benefit of the policy. Following a viatical settlement, the investor, who now owns the policy, continues the premium payments needed to maintain the policy. When the viator dies, the investor collects the full death benefit. The investor's return therefore depends on the viator's death date: the longer a viator lives, the lower the return for the investor. This macabre tension raises a number of issues. To relieve some of it, the National Association of Insurance Commissioners (NAIC) recommended the minimum lump-sum payout table (see Table 1), which ties the payout to the viator's individual life expectancy.

To illustrate the NAIC recommendations, consider Sara, who is 45 years old and has a life insurance policy worth $500,000. Her annual premium is $10,000, and her life expectancy is nine months. If Sara sold her policy to an investor, she would receive 70% of the face value ($500,000) under the NAIC recommendations, providing her with an immediate $350,000. If Sara dies in exactly 12 months, the investor would realize $140,000: the difference between the face value and what the investor paid out, less the one annual premium payment the investor made to maintain the policy. For simplicity, this illustration ignores transaction costs, tax liabilities, and the time value of the investment. Alternatively, Sara may negotiate with an investor for a mutually agreeable amount, as is the case in many viatical settlements.

A number of issues concerning viatical settlements may pique statisticians' interest. The actuarial angle is the most obvious. The NAIC recommendations are certainly an area of interest: Do these values favor the viator or the investor? If the NAIC values are not effectively neutral, what values would provide greater equity? Beyond the actuarial issues and proposed guidelines are medical issues. For instance, are doctors able to estimate life expectancies accurately, given their inherent biases? To what extent do these estimates influence an equitable exchange between the viator and the investor?
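The arithmetic behind the Sara example can be checked in a few lines of R; the numbers below simply restate the figures given in the text and, like the text, ignore transaction costs, taxes, and the time value of money.

```r
face    <- 500000                 # face value of Sara's policy
premium <- 10000                  # annual premium the investor keeps paying
payout  <- 0.70 * face            # NAIC minimum for a 6.00-11.99 month life expectancy
payout                            # 350,000 paid to Sara immediately
face - payout - 1 * premium       # 140,000 to the investor if Sara dies at 12 months
```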
A Catalyst for Viatical Settlements: HIV/AIDS
Viatical settlements became more common in the late 1980s. Acquired immune deficiency syndrome (AIDS) was identified during June 1981, when a rare form of cancer struck a number
of homosexual men. During the 1980s, nearly 150,000 people each year were infected with the human immunodeficiency virus (HIV), the cause of AIDS. Faced with massive medical bills and expensive medicines, patients resorted to selling whatever assets they owned, including their life insurance policies. Investors became the beneficiaries of the policies in exchange for an immediate lump-sum payment. Lump-sum payments ranged from 50% to 85% of the policy’s face value, depending on the patient’s life expectancy.
Table 1—NAIC Recommended Payout to Viators Based on Life Expectancy
Viator's Life Expectancy    Minimum Percentage of Face Value Less Outstanding Loans
Less than 6.00 months       80%
6.00–11.99 months           70%
12.00–17.99 months          65%
18.00–24.99 months          60%
25.00+ months               Greater of the cash surrender value or accelerated death benefit in the policy
Source: Viatical Settlements Model Regulation, 4 NAIC Model Laws, Regulations and Guidelines, 698-1 (1994).
At this time, the mortality rate from AIDS was relatively high, and life expectancy after diagnosis was relatively short. Still, a viatical settlement proved beneficial for both the investor and the viator. People with AIDS needed immediate cash to cover mounting hospital expenses and expensive medicines, while investors were earning high returns due to the quick mortality of viators. In 1996, Dr. Peter Piot, head of the United Nations AIDS program, announced new drugs that blocked an enzyme essential to the reproduction of HIV. Because these drugs significantly increased the life expectancy of AIDS patients, viatical settlement returns abruptly dropped for investors covering viators with AIDS. Some investors had already bought life insurance policies based on an outdated forecast of life expectancies, and the market for viatical settlements took a temporary downturn. Soon, however, investors turned their attention toward people with other terminal illnesses, such as cardiovascular disease, cancer, and respiratory disease. Today, there are many investment companies providing viatical settlements, resulting in a billion-dollar industry. Although each investment company determines eligibility for a viatical settlement transaction, the industry generally recognizes several risk-management guidelines: (1) the viator must have owned his or her policy for at least two years; (2) the viator must have his or her current beneficiary sign a release or waiver; (3) the viator must be terminally ill; and (4) the viator must sign a release allowing the investor access to his or her medical records.

Legal Issues: Speculating on the Viator's Death Date
Investors are essentially speculating on the viator's death date. In England, the Life Assurance Act of 1774 made betting on the lives of strangers illegal (Life Assurance Act, 1774 c.48 14_Geo_3). The United States, however, has ruled it legal to transfer a life insurance policy to another person, enabling viatical settlements. In 1911, U.S. Supreme Court Justice Oliver Wendell Holmes ruled in Grigsby v. Russell (222 U.S. 149) that a life insurance policy is transferable property that carries specific legal rights, including naming the policy beneficiary, assigning the policy as collateral for a loan, borrowing money against the policy, and selling the policy to another individual. A life insurance policy is like any other asset an individual has at his or her disposal.

In 1996, the Securities and Exchange Commission (SEC) argued before the U.S. Court of Appeals that the life insurance investment company Life Partners Inc. was violating the Securities Act of 1933 and the Securities Exchange Act of 1934 by purchasing and selling life insurance policies from individuals without first complying with the registration and other requirements of the acts (SEC v. Life Partners Inc., No. 95-5364, U.S. Court of Appeals for the District of Columbia Circuit, 1996). The SEC claimed that viatical settlement transactions are securities and therefore should be subject to SEC regulation. Relying on the precedent set in 1946 by SEC v. W.J. Howey (328 U.S. 293), the court found that viatical settlements are not subject to SEC regulations and requirements.

There are no federal regulations on the viatical settlement market other than tax law; regulation is a state matter. Many states have adopted the Viatical Settlements Model Regulation Act, recommended by the NAIC in 1993 (NAIC Model Laws, Regulations, and Guidelines, 698-1), which sets guidelines for avoiding fraud and ensuring sound business practices in the viatical settlement market. The act requires the investor to obtain a license from the insurance commissioner of the state of the viator's residence and to disclose all relevant information to the viator.

In the early years of viatical settlements, a viator's cash inflow was treated as ordinary income and taxed accordingly. In 1996, however, Congress passed and President Bill Clinton signed the Health Insurance Portability and Accountability Act (HIPAA), which eliminated federal income tax on a viator's cash inflow from viatical settlements and accelerated death benefits. The terminally ill individual could then take the entire bulk payment from the investment company and use the money however he or she chose. When the viator dies and the investor receives the entire death benefit of the policy, the investor's lump-sum benefit less costs is subject to capital gains tax; if the investor is in the business of buying and selling life insurance policies, however, the difference is subject to federal income tax.

With minimal legal regulation governing viatical settlements, there are opportunities for unethical practices. For example, one practice known as "clean sheeting" predates viatical settlements: A person applies for life insurance while intentionally concealing his terminally ill status. The reverse practice, "dirty sheeting," pertains to viatical
settlements: A healthy person falsifies medical records to indicate a terminal illness.

Medical Issues: Predicting Life Expectancy
Estimating life expectancies is a challenge for doctors, but a necessity for viatical settlements. Doctors' estimates tend to be overly optimistic. In "Extent and Determinants of Error in Doctors' Prognoses in Terminally Ill Patients: Prospective Cohort Study," Nicholas A. Christakis and Elizabeth B. Lamont studied a cohort of 468 patients from five outpatient hospice programs in Chicago during 1996 and found that only 20% of clinicians' predictions were accurate, defined as an estimated survival between 2/3 and 4/3 of the actual survival. Sixty-three percent of estimates were overly optimistic (a ratio greater than 4/3). One suggestion is that physicians may avoid a negative assessment and give patients the benefit of the doubt. Experienced doctors, and doctors who have not known their patients for a long time, may be better at assessing prognosis.

To improve estimates of a patient's life expectancy, health professionals have developed various methods. One popular method involves identifying specific predictors of survival and combining them into a single prognostic score. We illustrate one such score—the PaP score—a short-term forecast contemplating 30-day survival. The European Association for Palliative Care (EAPC) identified the PaP score as the best validated and most widely used method of predicting life expectancy in terminal cases. P. C. Stone and S. Lund, in "Predicting Prognosis in Patients with Advanced Cancer," describe the PaP score as a final score that ranges from 0 to 17.5, with higher scores predicting shorter life expectancies. The score is derived from the doctor's estimated life expectancy and five other criteria, each found to be predictive of survival (see Table 2). These criteria are consistent with the terminal cancer syndrome theory: Nondisease symptoms such as anorexia and dyspnoea are related more closely to the progression of the patient's illness than are disease characteristics such as type of tumor or metastatization.

Table 2—The Palliative Prognostic Score (PaP)
Criteria and partial scores:
Anorexia (medical symptom of reduced appetite; not to be confused with anorexia nervosa, the eating disorder): No = 0.0; Yes = 1.0
Dyspnoea (difficult or labored breathing; shortness of breath): No = 0.0; Yes = 1.5
Karnofsky Performance Status (scores effectively range from 10 to 100 in increments of 10*; higher scores mean the patient is better equipped to carry out daily activities): 30, 40, …, 100 = 0.0; 10, 20 = 2.5
Clinical Prediction (doctor's estimate, in weeks): > 12 = 0.0; 11–12 = 2.0; 7–10 = 2.5; 5–6 = 4.5; 3–4 = 6.0; 1–2 = 8.5
Total White Blood Cell Count (×10^9/L): < 8.6 = 0.0; 8.6–11.0 = 0.5; > 11.0 = 1.5
Lymphocyte Percentage: 20.0–40.0% = 0.0; 12.0–19.9% = 1.0; < 12.0% = 2.5
Risk groups by total score:
Risk Group A: total score 0.0–5.5, 30-day survival > 70%
Risk Group B: total score 5.6–11.0, 30-day survival 30–70%
Risk Group C: total score 11.1–17.5, 30-day survival < 30%
Source: Pirovano, M., et al. 1999. A new palliative prognostic score: A first step for the staging of terminally ill cancer patients. Journal of Pain and Symptom Management 17:231–239.
*There are only 11 possible scores: 0, 10, 20, …, 100. A value of 0 indicates death.
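To make the mechanics of Table 2 concrete, here is a small R function that adds up the partial scores and reports the risk group. The partial scores are transcribed from the table, the example patient is invented, and this sketch illustrates the arithmetic only; it is not a clinical tool.

```r
# Palliative Prognostic (PaP) score, following the partial scores in Table 2.
pap_score <- function(anorexia, dyspnoea, karnofsky, predicted_weeks,
                      wbc, lymphocyte_pct) {
  s <- 0
  s <- s + if (anorexia) 1.0 else 0.0
  s <- s + if (dyspnoea) 1.5 else 0.0
  s <- s + if (karnofsky <= 20) 2.5 else 0.0
  s <- s + if (predicted_weeks > 12) 0.0 else
           if (predicted_weeks >= 11) 2.0 else
           if (predicted_weeks >= 7)  2.5 else
           if (predicted_weeks >= 5)  4.5 else
           if (predicted_weeks >= 3)  6.0 else 8.5
  s <- s + if (wbc < 8.6) 0.0 else if (wbc <= 11.0) 0.5 else 1.5
  s <- s + if (lymphocyte_pct >= 20) 0.0 else
           if (lymphocyte_pct >= 12) 1.0 else 2.5
  group <- if (s <= 5.5) "A (>70% 30-day survival)" else
           if (s <= 11.0) "B (30-70%)" else "C (<30%)"
  list(score = s, risk_group = group)
}

# An invented example patient
pap_score(anorexia = TRUE, dyspnoea = FALSE, karnofsky = 50,
          predicted_weeks = 8, wbc = 9.4, lymphocyte_pct = 15)
# partial scores 1.0 + 0 + 0 + 2.5 + 0.5 + 1.0 = 5.0, so risk group A
```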
Statistical Issues: Answering the Questions
The motivation to research viatical settlements began when Mark Haug was asked to analyze a portfolio of 134 viatical settlements with a collective face value of $31,050,000. Originally, the investors' interest was to identify which factors predicted mortality. The investors believed that the longer of two physicians' estimates of survival time, the viator's age, diagnosis, and gender were essential to their analysis. They also believed a statistician might shed additional light on the matter. Table 3 provides detail concerning the physicians' life expectancy estimates and the viators' ages. Diagnosis was captured within four categories: cancer, cardiovascular, respiratory, and other. Diagnosis by gender is provided in Table 4, which also includes "survivorship" within each subclassification. Survivorship identifies the number of viators within each subclass who outlived the longer of the two physicians' life expectancies.
Table 3—Frequency Data for Viators: Life Expectancy Estimates and Age. Rows are the viator's age in five-year bands (50 years or younger up to older than 90); columns are mortality, measured as the actual death date minus the larger of the two life expectancy estimates, in months (from 20 or more months earlier than estimated to more than 10 months later); each ■ represents one person (N = 134).
Table 4—Cross Tabulations of Gender and Diagnosis in the Viatical Portfolio: Counts and 'Survivorship'
Diagnosis         Male           Female          Total
Cancer            13 (0, 0%)     10 (0, 0%)      23 (0, 0%)
Respiratory       26 (1, 4%)     20 (11, 55%)    46 (12, 26%)
Cardiovascular    25 (0, 0%)     20 (5, 25%)     45 (5, 11%)
Other             12 (1, 8%)     8 (3, 38%)      20 (4, 20%)
Total             76 (2, 3%)     58 (19, 33%)    134 (21, 16%)
Note: In parentheses are the number and percentage with length of survival greater than expected (ΔLE > 0).
Three facts are immediately evident from Table 4. First, viators are not likely to outlive their physicians' estimated life expectancy: Only 16% (21/134) did so. (In fact, only 28% (37/134) outlived the lesser of the two physicians' estimates.) Second, it is clear that the women viators in this portfolio fared better than the men; Christakis and Lamont similarly found that women fared better than men with respect to physicians' estimates. Third, a diagnosis of cancer appears to have a greater negative impact on viators than physicians may realize. Christakis and Lamont similarly found that cancer patients are the most likely to receive an overly optimistic estimate.

Several summary statistics in Table 5 are helpful in understanding these data. Face amount is the amount the life insurance company will pay to the beneficiary upon the death of the insured. It is not surprising to see the skew evident in the summary statistics. Most life insurance policies have relatively small face amounts: More than half (71 of 134) of the policies provided $150,000 or less of coverage. Fourteen policies provided $500,000 or more of coverage, with one providing $1.3 million in coverage.
Table 5—Summary Statistics for Three Variables Contained in the Viatical Portfolio (N = 134)
Variable       Group      Mean          Median        Standard Deviation
Face Amount    All        $232,000      $150,000      $200,000
               Female     $246,000      $175,000      $235,000
               Male       $221,000      $150,000      $170,000
Age            All        71 yrs.       72 yrs.       14 yrs.
               Female     74 yrs.       76 yrs.       15 yrs.
               Male       69 yrs.       69 yrs.       13 yrs.
ΔLE*           All        -6.6 mos.     -6.0 mos.     9.3 mos.
               Female     -5.6 mos.     -5.0 mos.     11.1 mos.
               Male       -7.5 mos.     -7.0 mos.     7.7 mos.
*ΔLE is the actual life expectancy minus the higher of the two physicians' life expectancy estimates.
When we considered age with respect to survivorship, we were surprised to find that higher ages yield higher survivorship (see Table 3). We suspect that physicians' Bayesian intuition is misguiding their estimates: Because a viator is advanced in age, the physician reasons, the viator is more ready to succumb to the illness. We believe the data support the view that advanced age is a marker for psychophysical heartiness (notwithstanding the otherwise increased hazard rate), which gives the viator the vigor to outlast the physicians' estimates.

Although most of our analyses were decidedly post hoc, with the express purpose of identifying predictive factors (to illuminate the proper discount rate for the agreement between the viator and investor), we did have one a priori hypothesis: Would the season of the diagnosis have any bearing on survivorship? In the 1972 edition of Statistics: A Guide to the Unknown, Judith Tanur and co-editors report that there may be psychological forces postponing death, while N. E. Rosenthal, in "Diagnosis and Treatment of Seasonal Affective Disorder," identified and described seasonal affective disorder (SAD). Taken together, these findings led us to hypothesize that the time of diagnosis—specifically, its seasonality—also may be predictive. We tested whether survivorship depended on the season in which the viator received a diagnosis of terminality. All the viators in the data set lived in North America and therefore shared seasons in common: winter (January, February, and March), spring (April, May, and June), summer (July, August, and September), and fall (October, November, and December). Believing that diagnoses of terminality would be better received when daylight was lengthening (January–June) and warmer weather was approaching (March–June), we tested the cross-tabulated data in Table 6 and found a significant association.

Table 6—Cross Tabulation of Season and 'Survivorship' (ΔLE > 0) in the Viatical Portfolio
Season                                   ΔLE < 0     ΔLE > 0     Total
Winter, Spring (January through June)    56          16          72
Summer, Fall (July through December)     57          5           62
Total                                    113         21          134
Note: χ² = 5.05, p = 0.0246.
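The chi-square statistic reported for Table 6 can be reproduced from the four cell counts; note that it is the uncorrected (non-Yates) Pearson statistic that matches the reported value.

```r
# Season of terminal diagnosis vs. survivorship (Table 6)
season_tab <- matrix(c(56, 16,    # Winter/Spring: deltaLE < 0, deltaLE > 0
                       57,  5),   # Summer/Fall
                     nrow = 2, byrow = TRUE,
                     dimnames = list(Season = c("Jan-Jun", "Jul-Dec"),
                                     Survivorship = c("deltaLE < 0", "deltaLE > 0")))
chisq.test(season_tab, correct = FALSE)   # X-squared is about 5.05, p about 0.025
```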
The seasonal effect, however, was mitigated by more significant predictors in a larger model. Although our variation on SAD was predictive on its own merits, it failed to be useful in the investors' analyses when mixed with all the other predictors. Taken together, these analyses led us to the decision tree in Figure 1. Among all the variables, age was the most significant predictor of survivorship—in a way contrary to intuition. As illustrated in Figure 1, gender was significant among the older viators. Not surprising, however, was that a diagnosis of cancer further discriminated survivorship among the older female viators. Also not surprisingly, we found that our forward stepwise (conditional) logistic regression model confirmed our decision tree: Age (p
Figure 1. An analysis of viatical settlement variables and their predictive power for survivorship, using SPSS CHAID, the precursor to AnswerTree. The tree splits are as follows:
All viators: survivorship = 16%, N = 134
  Split on age (p < 0.0001):
    Age 79 or younger: survivorship = 3%, N = 93
    Age 80 or older: survivorship = 44%, N = 41
      Split on gender (p < 0.0001):
        Male: survivorship = 6%, N = 16
        Female: survivorship = 68%, N = 25
          Split on diagnosis (p = 0.037):
            Cancer: survivorship = 0%, N = 3
            Not cancer: survivorship = 77%, N = 22