Editor’s Letter
Mike Larsen,
Executive Editor
Dear Readers,

CHANCE now includes all content in print and online for subscribers and libraries. This will make CHANCE more accessible and useful to subscribers and attractive to potential authors. As you experience the online version, please feel free to share your impressions and suggestions at
[email protected].

The interaction between statistics and health and medicine is well represented in this issue. Articles and columns concern measuring health disparities, studying sleep and sleep disorders, defining and measuring disability, using machine learning in medical diagnosis, comparing effectiveness of antibiotics, and the emerging fields of pharmacogenetics and pharmacogenomics.

Ken Keppel and Jeff Pearcy describe Healthy People 2010, a national initiative to coordinate health improvement and prevent disease. Of course, you have to measure disease rates if you want to judge performance. How many data sources do you think are involved? Numerical examples contrast absolute and relative measures of disparity.

Two articles concern sleep research. Despite the common theme, after introductions, the two articles differ significantly in data and methods. Brian Caffo, Bruce Swihart, Ciprian Crainiceanu, Alison Laffan, and Naresh Punjabi discuss data on transitions among sleep states gathered during overnight laboratory observations. Statistical matching plays an important role in the design. James Slaven, Michael Andrew, Anna Mnatsakanova, John Violanti, Cecil Burchfiel, and Bryan Vila report on a study involving actigraphy measurements on a large group of police officers. The officers wear monitors, which record activity levels every minute, for 15 days. Statistical measurements of data quality are important in this study.

How would you define disability? How would you measure it in a sample survey? Michele Connolly explains why statistical description of disability in the population is both complicated and important.

Michael Cherkassky examines two machine learning, or statistical learning, methods for use in medical diagnosis. Cross-validation is used to aid model selection. Methods are applied to three data sets. Cherkassky is a high-school student and award winner in the 2008 ASA Intel International Science and Engineering Fair.

In the Visual Revelations column, Howard Wainer tells us about Will Burtin and his contributions to scientific visualization. Burtin's data set was the basis of the graphics contest announced in the previous issue. Winners of the graphics contest will be announced in the next issue. Wainer presents Burtin's original, as well as his own, graphical interpretation of the data in this issue.

In Mark Glickman's Here's to Your Health column, Todd Nick and Shannon Saldaña discuss pharmacogenetics and pharmacogenomics. Besides being dream words for Scrabble players, these refer to developments at the frontier of science and medicine. These are rich and challenging areas for statistical collaboration.

Two articles have sports themes. Bill Hurley examines the first-year performance of hockey sensation Jussi Jokinen and how one can assess exceptional performance. This article could provide nice examples for instructors of regression to the mean and shrinkage estimators. Lawrence Clevenson and Jennifer Wright compute expected returns to punting or going for a first down on fourth down in professional football. Several assumptions and statistical modeling of probabilities of various events are required. This work was part of Wright's master's degree paper.

To complete the issue, Jonathan Berkowitz brings us his Goodness of Wit Test column puzzle. Berkowitz gives us the hint that some degree of pattern recognition in the answers to clues will be useful (and fun!).

I look forward to your comments, suggestions, and article submissions in 2009. Enjoy the issue!

Mike Larsen
Healthy People 2010: Measuring Disparities in Health
Kenneth G. Keppel and Jeffrey N. Pearcy
Healthy People 2010 (HP2010), led by the Office of Disease Prevention and Health Promotion of the U.S. Department of Health and Human Services, is a national initiative to promote health and prevent disease. Through a national consensus-building process from 1997 to 1999, specific objectives for improving the health of the nation were identified. This process involved participants from 350 national membership organizations; 250 state health agencies; and numerous professionals at local, state, and national meetings. Baseline values for the objectives were established and specific targets were set for improvements to be achieved by 2010. Similar initiatives set objectives for 1990 and 2000. A new set of objectives for 2020 is being developed.

The first overarching goal of this initiative is to increase the length and improve the quality of life. The second goal, which is the focus of this article, is to eliminate disparities in health among subgroups of the population. The second goal requires measurement of disparities and monitoring of changes in disparities over time. HP2010 includes 955 objectives that are being tracked using 190 data sources. Major data sources include the National Health Interview Survey (NHIS), the National Health and Nutrition Examination Survey (NHANES), and components of the National Vital Statistics System (NVSS). There are 504 objectives that call for data by demographic characteristics, including race and Hispanic origin, income, education, gender, geographic location, and disability status. These "population-based" objectives are measured in terms of the rate or proportion of individuals with a particular health attribute, such as a health condition or outcome, a known health risk, or use of a specific health care service. Disparities in health are measured between subgroups of the population based on race and Hispanic origin, income, etc.

Measuring Changes in Disparity

Conclusions about changes in disparity depend on which reference point the disparity is measured from, whether the disparity is measured in absolute or relative terms, and whether the indicator is expressed in terms of favorable or adverse events. When HP2010 began in 2000, there was little agreement on how disparities should be measured. Since 2000, consensus has been reached on the measurement of disparities in HP2010. The most favorable, or "best," group rate among the groups for each characteristic is chosen as the reference point for measuring disparities. For example, the best racial and ethnic group rate is used as the reference point for measuring disparities among racial and ethnic populations. The best rate is a logical choice for an initiative with the goal to eliminate disparities in health. The ability of one population to achieve the most favorable rate suggests that other populations could achieve this rate. Also, it would not be desirable to eliminate disparity by making the rate for any group worse than it is. The accuracy of the best group rate is a concern when subgroup rates are based on a small population or a small sample. If the best group rate does not meet the criteria for precision, the best rate is selected from among groups that meet the criteria. Disparities can be measured as either the absolute difference or relative difference between the best group rate
and the rate for another group. Conclusions about the size and direction of changes in disparity depend on whether absolute or relative measures are used. In HP2010, disparities are measured in relative terms as the percent difference between each of the other group rates and the best group rate. When disparities are measured in relative terms, they can be compared across indicators with different units of measurement (e.g., %; per 1,000; per 100,000). Reductions in relative disparities are required as evidence of progress toward eliminating disparity. Reduction in an absolute measure of disparity can occur without any
corresponding reduction in the relative measure, therefore without any progress toward eliminating disparity. In HP2010, changes in disparity are assessed in terms of the percentage point change in the percent difference from the best group rate between two points in time. When disparities are measured in relative terms, conclusions about changes in disparity also depend on whether the indicator is expressed in terms of favorable or adverse events. James Scanlan, in his 1994 CHANCE article "Divining Difference," pointed out that the relative difference in rates of survival between black and white infants decreases as the
relative difference in rates of infant mortality increases. When disparities are measured in HP2010, indicators are usually expressed in terms of adverse events so meaningful comparisons can be made across indicators. For example, objective 3-13 calls for an increase in the percent of women 40 years and older who received a mammogram within the past two years. When disparity is measured, this indicator is expressed as the percent of women who did not receive a mammogram within the past two years. Only a few indicators expressed in terms of averages cannot be expressed readily in adverse terms—the average age at first use of alcohol among adolescents or the median RBC folate level among nonpregnant women, for example.
Numerical Examples of Changes in Rates and Changes in Absolute and Relative Measures of Disparity

Suppose groups A and B are measured on an adverse event at times 1 and 2. Low values are good. In each scenario, group B has the best rate and is the reference point for measuring disparity. For this illustration, we are ignoring possible sampling variability or model uncertainty associated with estimates. In scenario 1, both groups improve, and group A improves faster than group B. In scenario 2, both groups improve, but the percent improvement in group B is larger, so relative disparity increases. In scenario 3, group A improves and disparity decreases. In scenario 4, disparity decreases, but only because the adverse event measure increases in group B.

1. A good scenario—improvement for Groups A and B and a reduction in disparity

                      Time 1              Time 2              Direction of Change
Group A               80                  60                  Rate better
Group B – best        60                  50                  Rate better
Absolute Disparity    20                  10                  Better
Relative Disparity    (80-60)/60 = 1/3    (60-50)/50 = 1/5    Better

2. Greater improvement for Group B results in an increase in disparity

                      Time 1              Time 2              Direction of Change
Group A               80                  60                  Rate better
Group B – best        60                  40                  Rate better
Absolute Disparity    20                  20                  Same
Relative Disparity    (80-60)/60 = 1/3    (60-40)/40 = 1/2    Worse

3. Improvement for Group A only results in a decrease in disparity

                      Time 1              Time 2              Direction of Change
Group A               80                  70                  Rate better
Group B – best        60                  60                  Rate same
Absolute Disparity    20                  10                  Better
Relative Disparity    (80-60)/60 = 1/3    (70-60)/60 = 1/6    Better

4. An undesirable way for disparity to decrease

                      Time 1              Time 2              Direction of Change
Group A               80                  80                  Rate same
Group B – best        60                  65                  Rate worse
Absolute Disparity    20                  15                  Better
Relative Disparity    (80-60)/60 = 1/3    (80-65)/65 = 3/13   Better
Figure 1. New cases of hepatitis A by race and Hispanic origin (log scale, new cases per 100,000 population): United States, 1997–2005. Groups shown: Hispanic, American Indian or Alaska Native, non-Hispanic white, non-Hispanic black, and Asian or Pacific Islander; annotation: rates improving, disparity decreasing, then increasing; Healthy People 2010 target, 4.3. Source: DATA2010, http://wonder.cdc.gov/data2010
Figure 2. Obesity in adults 20 years and older by race and ethnicity (log scale, age-adjusted percent): United States, 1988–1994, 1999–2002, and 2003–2006. Groups shown: non-Hispanic black, Mexican American, and non-Hispanic white; annotation: rates worsening, disparity decreasing; Healthy People 2010 target, 15. Source: DATA2010, http://wonder.cdc.gov/data2010
To summarize, in HP2010, consensus has been reached on measuring disparities from the best group rate, in relative terms, with indicators expressed in terms of adverse events. Reduction in the percent difference is required as evidence of progress toward eliminating disparity for each group. Reduction in the average percent difference from the best group is indicative of a reduction in disparity for each characteristic (e.g., race and ethnicity).

Three Examples from HP2010

Progress toward the first goal—to improve health—can be assessed in terms of changes since the baseline in the data for each objective. Progress toward the second goal can be assessed in terms of changes in the percent difference from the best group rate for specific subgroups of the population for most population-based objectives. Progress toward the first goal does not necessarily coincide with progress toward the second and vice versa. The following examples illustrate different results thus far.

Hepatitis A: The incidence of new cases of hepatitis A (HP2010 objective 14-6) provides an interesting example of how disparities can change as rates improve. Trends in the rate of new cases of hepatitis A for five racial and ethnic populations are shown in Figure 1. The percent difference between the rate for the Hispanic population and the rate for the group with the best rate (the Asian or Pacific Islander population in 1997 and both the non-Hispanic black and
white populations in 2002) declined from 432% in 1997 to 84% in 2002. The estimates in Figure 1 are shown on a log scale. The convergence in estimates from 1997 to 2002, therefore, represents reduction in relative differences from the best group rate. Not only had the racial and ethnic disparity been substantially reduced, but the HP2010 target of 4.3 new cases of hepatitis A per 100,000 population was achieved for all five racial and ethnic populations in 2002. This is a very desirable result. After 2002, new case rates continued to decline, but relative differences from the best group rate increased for each of the other groups to nearly the same level as they were in 1997. The American Indian or Alaska Native population went from having nearly the worst rate in 1997 to having the best rate in 2005. The reduction in new cases of hepatitis A was due to a combination of strategies, including geographic targeting of immunization programs. The continuing reduction in rates is encouraging, but the increase in disparities is not desirable.

Adult Obesity: The disparity among three racial and ethnic populations declined as the percent of obese adults increased. HP2010 objective 19-2 calls for a reduction in the rate of obesity in adults 20 years and older from 23% at baseline in 1988–1994 to 15% in 2010. As indicated in Figure 2, the rate of obesity increased for the three racial and ethnic populations with reliable data. In 1988–1994, the percent of obese adults for both Mexican-American and non-Hispanic black populations was substantially higher than the percent for the non-Hispanic white population, the reference group. In 1999–2002 and 2003–2006, because of the increase in obesity in the reference group, there was only one racial and ethnic group for which the percent of obese adults was substantially higher than the percent for the non-Hispanic white population. The average of the two percent differences from the best group rate was reduced despite an increase in obesity. This is not a desirable way to reduce disparities.

Prostate Cancer Death Rate: Objective 3-7 calls for a reduction in the prostate cancer death rate from 31.3 per 100,000 population in 1999 to 28.2 in 2010. Except for the American Indian or Alaska Native population, there were statistically significant declines in prostate cancer death rates between 1999 and 2004 (Figure 3). Between 1999 and 2002, the percent difference from the best group rate increased for the Hispanic, non-Hispanic black, and non-Hispanic white populations. However, when the baseline year and the most recent year are compared, there was no statistically significant change in the percent difference from the best group rate for any of the populations. Despite the improvement in rates for four of five racial and ethnic populations, relative differences from the best group rate were essentially unchanged.

Figure 3. Prostate cancer death rates by race and Hispanic origin (log scale, age-adjusted rate per 100,000 population): United States, 1999–2004. Groups shown: Hispanic, non-Hispanic black, non-Hispanic white, American Indian or Alaska Native, and Asian or Pacific Islander; annotation: rates generally improving, disparity increasing, then decreasing; Healthy People 2010 target, 28.2. Source: DATA2010, http://wonder.cdc.gov/data2010

Discussion

To monitor progress toward the elimination of disparities in HP2010, three essential choices were made. Progress toward eliminating disparities is judged to have occurred when relative differences from the best group rate in adverse events are reduced over time. These principles have been employed in measuring disparities and changes in disparity for the population-based objectives in HP2010. Disparities have been measured for race and ethnicity, income, education, gender, geographic location, and disability status. The results have been published in the Healthy People 2010 Midcourse Review. Although the review indicates disparities have been reduced for relatively few objectives and disparities increased for nearly as many objectives, these results do not imply that disparities in health cannot be reduced. Instead, these results are an indication of the difficulty that can be encountered in reducing rates of adverse outcomes for disadvantaged populations. Population groups with more unfavorable rates must improve by greater proportions than the rate for the best group if disparities are to be reduced.

The measurement of disparity has not yet become standardized. The importance of choosing a reference point, deciding to measure disparity in absolute or relative terms, and expressing indicators in terms of adverse events is not widely understood. Different choices can lead to different conclusions about changes in disparity. Consensus on the measurement of disparities in HP2010 is a significant contribution to this initiative. Despite this accomplishment, there are issues that still need to be considered. While changes in relative measures of disparity are needed as evidence of progress toward reducing disparities, there are, as yet, no criteria for determining that a disparity has been eliminated. When the rates for groups are small, a large relative difference might correspond to a tiny absolute difference. If the absolute differences among group rates are small, it is possible that no further reduction is required. Figure 1 shows an example of this type. The question of whether a difference between groups is no longer great enough to warrant public health intervention is not just a statistical one. Social, ethical, and practical factors also need to be considered when decisions are made about public health interventions.

Monitoring for HP2010 will continue through the end of the decade. Healthy People 2020 is now being planned. Lessons learned from HP2010 will inform the choice of objectives and methods for measuring health and disparities in health among subgroups of the population. Indeed, the continuing process of building consensus will make it possible to monitor health more accurately and in ways that can lead to a healthier nation.
Further Reading

www.healthypeople.gov

Department of Health and Human Services. (2000) Healthy People 2010, 2nd edition. With Understanding and Improving Health and Objectives for Improving Health. Government Printing Office: Washington, DC.

Carter-Pokras, O. and Baquet, C. (2001) "What Is a 'Health Disparity'?" Public Health Reports, 117(5):426–434.

Scanlan, J.P. (1994) "Divining Difference." CHANCE, 7(4):38–39, 48.

Keppel, K.; Pearcy, J.; Klein, R. (2004) "Measuring Progress in Healthy People 2010." Statistical Notes, No. 25. National Center for Health Statistics: Hyattsville, Maryland.

Keppel, K.; Pamuk, E.; Lynch, J.; et al. (2005) "Methodological Issues in Measuring Health Disparities." Vital and Health Statistics, 2(141). National Center for Health Statistics: Hyattsville, Maryland.

Keppel, K. and Pearcy, J. (2005) "Measuring Relative Disparities in Terms of Adverse Events." Journal of Public Health Management and Practice, 11(6):479–483.

Low, L. and Low, A. (2006) "Importance of Relative Measures in Policy on Health Inequalities." British Medical Journal, 332:967–969.

Office of Disease Prevention and Health Promotion. (2006) Healthy People 2010 Midcourse Review, www.healthypeople.gov/data/midcourse/default.htm#pubs.

Keppel, K.; Garcia, T.; Hallquist, S.; Ryskulova, A.; Agress, L. (2008) "Comparing Racial and Ethnic Populations Based on Healthy People 2010 Objectives." Statistical Notes, No. 26. National Center for Health Statistics: Hyattsville, Maryland.
An Overview of Observational Sleep Research with Application to Sleep Stage Transitioning
Brian Caffo, Bruce Swihart, Ciprian Crainiceanu, Alison Laffan, and Naresh Punjabi
Sleep is an essential component of human existence, consuming roughly one third of our lives. Fatigue, jet lag, poor sleep, and vivid dreams are frequent points of our morning discussions. We look and feel terrible after getting too little sleep; hence a $20 billion industry of beds, pillows, pills, and other tools has cropped up to help us sleep better. Correspondingly, there are plenty of products designed to keep us awake.

Despite the importance of sleep in our lives, and the lives of so many other species, a definitive answer on the specific neurobiological or physiologic purpose of sleep eludes researchers. However, substantial advances in the field are uncovering the crucial role that sleep plays in our health, behavior, and well-being. For example, studies of sleep duration have found associations with a variety of important health outcomes. Short sleep duration correlates with impaired cognitive function, hypertension, glucose intolerance, altered immune function, obesity, and even mortality. This point is driven home by the fact that sleep deprivation is a well-recognized form of torture. Consider the following quote from former Israeli Prime Minister Menachem Begin, who suffered forced sleep deprivation as a KGB prisoner:

In the head of the interrogated prisoner, a haze begins to form. His spirit is wearied to death, his legs are unsteady, and he has one sole desire: to sleep, to sleep just a little, not to get up, to lie, to rest, to forget … Anyone who has experienced this desire knows that not even hunger and thirst are comparable with it.

Quantity of sleep is only one measurable facet of sleep that is associated with health. Table 1 gives a few of the more common measurements of sleep and sleep disturbance. A common sleep disorder of particular public health interest is sleep apnea. This is a chronic condition characterized by collapses of the upper airway during sleep. Complete collapses lead to so-called "apneas"; partial collapses lead to so-called "hypopneas." Over the last decade, research has shown these events can lead to several physiologic consequences, including changes in metabolism, glucose tolerance, and cardiac function. The respiratory disturbance index (RDI), sometimes also called the apnea/hypopnea index (AHI), is the principal measure of severity of sleep apnea. This rate index is the count of the number of apneas and hypopneas divided by the total time slept in hours. A severely affected patient may have an RDI of 30 events per hour or higher. Hence, such a patient is, on average, having a disruption in their sleep and breathing
every two minutes. As one can imagine, such frequent disruptions in sleep and oxygen intake can have negative health consequences. Terry Young, Paul E. Peppard, and Daniel J. Gottlieb write in the American Journal of Respiratory and Critical Care Medicine that a high RDI has been shown to be associated with hypertension, cardiovascular disease, cerebrovascular disease, excessive daytime sleepiness, decreased cognitive function, decreased health-related quality of life, increased motor vehicle crashes and occupational accidents, and mortality.

We relate measures of sleep apnea with transitions that occur between "sleep states." Sleep states are based on visual classification of brain electroencephalogram (EEG) patterns. Two major sleep states are rapid eye movement (REM) and non-REM. Sleep states can be seen as a categorical response time series. Crude summaries of these states, such as the percentage of time spent in each one, are often used as predictors of health. Instead, we investigate the role sleep apnea has on the rate of transitioning between the states. We emphasize that the rate of transitioning contains important additional information that the crude percentage of time spent in each state omits. Notably, we use matching to account for other variables that might be related to both disease status and sleep
Table 1—Measurements of Sleep Taken During an Overnight Sleep Study and Routine Clinical Evaluation of Sleep

Sleep Measure                    Description
Arousal index                    Number of arousals per hour slept
Epworth Sleepiness Scale         Aggregate measure of general sleepiness
Respiratory disturbance index    Number of apneas and hypopneas per hour
Sleep architecture               Proportion of time spent in various sleep states
Sleep efficiency                 Time asleep as a proportion of time in bed
Sleep latency                    Time until falling asleep
Total sleep time                 Total time asleep in a night
behavior, hence comparing a severely diseased group with a matched nondiseased group.
Sleep Measurement

The gold standard of sleep measurement is based on an overnight sleep study called a "polysomnogram." During a polysomnogram, a patient has several physiologic recordings that are digitized and subsequently stored. Some of these recordings include skull surface electroencephalograms, which measure the actual electrical activity from neurons firing. Because EEGs measure aggregate electrical activity in the cortex, they have poor spatial resolution; however, they have excellent temporal resolution, with hundreds of measurements per second. Other physiologic recordings measure eye movement (an electro-oculogram), leg movement (electromyogram), oxygen saturation (pulse-oximeter), air flow, chin movement activity (electromyogram), chest and abdominal movement (via belts around the torso), and heart rate and rhythm (electrocardiogram).

A polysomnogram produces an enormous amount of data, as each of these signals is recorded nearly continuously over a night of sleep. The signals are processed by trained technicians under the supervision of sleep physicians. The technicians and sleep physicians distill this deluge of information to more manageable summaries. In clinical settings, these summaries are used to help patient care decisions. They also are used in research to investigate the causes and consequences of sleep-related phenomena. Table 1 lists examples of summaries of the polysomnogram, as well as the Epworth Sleepiness Scale—a questionnaire-based assessment of daytime sleep propensity. Often, a concern in sleep clinics is whether the subject has sleep apnea and, if so, to evaluate the severity of the disease. As previously mentioned, the primary measure of severity of sleep apnea is the number of apneas or hypopneas per hour slept.

Another important summary splits the sleep pattern into a few distinct sleep states. This is done visually, by trained and certified technicians and physicians, by grouping the data into 30-second "epochs." The states of interest are labeled Wake, Stage I, Stage II, Stage III, Stage IV, and REM. Stages I–IV are referred to as non-REM sleep. Stages I and II represent light sleeping and encompass 3%–8% and 44%–55% of total sleep time, respectively. Stages III and IV represent deeper sleep and comprise about 15%–20% of the total sleep time. In REM sleep, which comprises approximately 20%–25% of total sleep time, the body is inactive while the brain manifests EEG patterns similar to wakefulness. As described by Sudhansu Chokroverty in Sleep Disorders Medicine: Basic Science, Technical Considerations, and Clinical Aspects, most dreaming occurs in REM sleep.

A patient's "sleep architecture" is simply the person-specific percentage of time spent in each of the six states. Sleep architecture can vary between people and within a person as they age. For example, infants spend more than 80% of their sleeping time in REM. It is generally accepted that sleep staging is relevant for understanding sleep's effect on health. We are particularly interested in the impact of the rate of transitioning between the various sleep states. Note that it is not the case that a person necessarily transitions from wakefulness through Stages I to IV in sequential order, and then to REM. Instead, people pass
through the states in cycles, with transitioning from any state to another both possible and likely. Wakefulness to REM is the transition that occurs the least frequently.

In addition to measuring nighttime sleep signals, other behavioral measurements can be valuable in sleep research. One of the more widely used measures is the Epworth Sleepiness Scale. This is an aggregated score of several self-administered questions involving sleep behavior and is used as a measure of daytime sleep propensity. For example, patients are scored on whether they fall asleep when sitting and reading or watching television.
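To make the epoch-coded staging concrete, here is a minimal sketch of ours (not from the article): given a sequence of 30-second epochs labeled with sleep states, the sleep architecture summary is just a matter of counting. The state labels follow the article; the example sequence is invented.

```python
# Compute a sleep architecture summary from an epoch-coded hypnogram.
from collections import Counter

# One label per 30-second epoch (toy sequence for illustration)
epochs = ["Wake", "I", "II", "II", "III", "IV", "REM", "REM", "II", "Wake"]

sleep_epochs = [e for e in epochs if e != "Wake"]   # epochs actually asleep
counts = Counter(sleep_epochs)

# Percentage of total sleep time spent in each state
architecture = {state: 100 * n / len(sleep_epochs) for state, n in counts.items()}
print(architecture)                                  # e.g., {'I': 12.5, 'II': 37.5, ...}

total_sleep_hours = len(sleep_epochs) * 30 / 3600    # 30-second epochs -> hours
print(f"total sleep time: {total_sleep_hours:.3f} hours")
```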
Sleep Transition Rates

The Sleep Heart Health Study (SHHS) added to and combined data from large, well-established longitudinal studies: the Atherosclerosis Risk in Communities Study, the Cardiovascular Health Study, the Framingham Heart Study, the Strong Heart Study, and the Tucson Health and Environment Study. As described in the SHHS by Stuart F. Quan and colleagues, the SHHS recruited more than 6,000 subjects at enormous expense to undergo an abbreviated in-home polysomnogram. Roughly 4,000 of these subjects repeated this process about four years later. The sleep data were processed by trained SHHS technicians and included rigorous quality checks. The SHHS offers a unique data set to understand sleep and health. However, being an observational study, analysis of the data is often challenging. Any effects or absence of effects seen might be due to subtle biases from the sampling or (measured or unmeasured) variables either unaccounted for or improperly accounted for in the analysis.

Matching

We consider now the relationship between sleep transitions and sleep apnea. Hence, we compare a group of severely diseased patients with sleep apnea—as defined by an RDI greater than 22.3 events per hour—with a group of healthy controls having RDIs of less than 1.33 events per hour. The groups were chosen so each subject in the diseased group had a matching subject in the control group. This process helps control for confounding variables, not unlike methods such as regression analysis. However, matching—unlike regression adjustment—forces a discussion of how alike or unlike the diseased and control groups are. In contrast, regression adjustment will happily plod along via linearity assumptions, even if there is no overlap in the confounding variables for the diseased and control groups. Matching is not without its issues, however. Most notably, the SHHS population being studied is a subset of the population selected originally for study; hence, a matched subset may lack the generalizability of results on the original data.

Only a subset of the SHHS was eligible to be matched for our analysis. For example, to adequately define disease status, only those subjects with outstanding sleep recording quality and without any history of coronary heart disease, cardiovascular disease, hypertension, chronic obstructive pulmonary disease, asthma, or stroke were eligible. In addition, current smokers were not considered. These rigid qualification standards narrowed the original SHHS pool from more than 6,000 to 183 diseased and 458 nondiseased subjects. These groups are not representative of the population, as conditions
such as hypertension and cardiovascular disease commonly occur with sleep apnea. Hence, our "diseased" group is quite healthy in many aspects, excepting the high index of sleep apnea disease severity.

Figure 1. Hypnogram plot for a single subject (sleep state—Wake, I, II, III, IV, REM—versus time in hours). The left plot shows the full night, whereas the right plot highlights sleep between the second and third hours.

Figure 2. Transition plots for the 60 diseased (apneic) and matched control subjects (subject ID versus time in hours). Grayscale points represent transitions, where each gray scale corresponds to a different type of transition. The key is such that N represents non-REM, R represents REM, and W represents wake. Hence, NR represents non-REM to REM, RW represents REM to wakefulness, and so on.

The matching variables included body mass index (BMI, the ratio of a subject's weight to the square of their height), age, race, and sex. Exact matching was used for race and
sex, whereas BMI and age were matched within a caliper (i.e., matched within an acceptable range). The matching procedure produced 60 pairs. Apart from BMI, none of the variables were significantly different (using Student's t test and chi-square tests at the 5% level) when comparing the two groups.

Figure 3. Mean/difference plot of log base two of transition rates, with diseased minus matched controls on the vertical axis (log2 difference in rates) and pairwise average log base two transition rates on the horizontal axis (log2 average rate).

Although concern exists about the differential body mass indexes, we note that obesity is the primary cause
of sleep apnea, and both the groups—though having statistically different BMIs—were similar practically. Specifically, the average body mass index for the diseased group was 30.7 kg/m², whereas it was 29.2 kg/m² for the nondiseased. Of note is that traditional sleep architecture—though not considered for a matching parameter—was not statistically different between groups. This implies that this gold standard measurement summary of sleep states may not be affected by sleep apnea.

Figure 1 displays the time series for the six sleep states for a single subject; such a plot is called a "hypnogram." Figure 2 displays a plot comparing the 60 diseased and 60 matched control subjects. In this plot, each grayscale point corresponds to a different transition type, the horizontal index is time in hours, and the vertical index is subject. This plot simultaneously displays the information of many hypnograms. As discussed in the Journal of Clinical Sleep Medicine by Bruce Swihart and his colleagues, this plot and variations highlight the higher rate of transitioning occurring in the diseased group.

Analysis of Sleep Transition Rates

We restrict ourselves to studying the transitions starting at the first transition from wakefulness to sleep (usually to Stage 1) and ending at the last transition from a sleep state to wakefulness. That is, we discard time before sleep onset and after waking. We note that the initial time in bed before sleep onset (sleep latency) between the matched pairs showed no difference (Student's t-test p-value of 0.70). Figure 3 shows mean/difference plots for the log base two of the transition rates for the matched pairs. By a transition, we mean a change from any sleep state to another; hence, the transition rate is the number of changes divided by the total time asleep in hours. Such plots highlight whether there is a difference between the two matched groups, whereas plotting
against the average highlights whether any such difference is dependent on the magnitude of the transition rates. Log base two is used simply to work with ratios and because powers of two are easier to work with than powers of Euler's number, e, and represent smaller increments than the other option of using base 10. Recall that the transition rate is defined as the number of transitions per hour of sleep.

We note that a reasonable discussion could be had on which of the two measures, the transition rate or the raw number of transitions, is more important. It may be that a certain raw number of transitions is important for health, regardless of the rate. However, clearly a person who sleeps longer has more opportunities to transition between states, suggesting the use of rates. Regardless, we focus on only the analysis of the rates. Further analysis of the rates and transition numbers is presented in the paper by Swihart.

In Figure 3, 39 of the 60 observations lie above the horizontal line, potentially indicating that diseased subjects transition more frequently than nondiseased. For example, under a null hypothesis of no difference in transition rates between the diseased and nondiseased groups, the binomial probability of 39 or more pairs out of 60 lying above the horizontal line is only 0.014 (the two-sided p-value would double this number; a quick check of this calculation is sketched below).

A useful summary of each subject's data would be a three-by-three table that displays counts of their previous sleep state by their current sleep state. Table 2 displays the combination of such summary tables across subjects, with the non-REM sleeping states (Stages I–IV) aggregated. Shown are counts of the previous state by the current state, cross-classified by disease status. Transition counts occur in the off-diagonal cells, with the diagonal cells representing instances in which the subjects stayed in the same state from one epoch to the next. For example, in the diseased subjects, there were 346 transitions from non-REM (N) to REM (R), whereas there were 175 transitions from wake (W) to REM. Column totals in this data are special, representing the time at risk for various kinds of transitions. Therefore, of the 346 transitions of type N → R, there were 281.78 total hours spent in non-REM where this type of transition could be made. The most frequent transition in both groups is W → N, with rates of 1,733/66.56 = 26.0 and 1,376/60.54 = 22.7 transitions per hour, respectively. The next most frequent transitions are N → W and R → W. The data paint the picture that a person, diseased or not, spends the majority of their time in the non-REM state. From there, they often transition to REM, but then spend a little time in REM before transitioning to wakefulness (more likely) or back to non-REM (less likely). In addition, from non-REM, they often wake up briefly, then transition (more likely) back to non-REM.

It is of interest to compare whether these rates differ across disease groups. This is a somewhat challenging task, given that one must account for the correlation induced by matching and the correlation of the various rates within a particular subject. For example, in a subject with a high rate of non-REM to wake transitions, it is reasonable to believe there would be a correspondingly high rate of wake to non-REM transitions; hence, these two rates would be correlated.
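The sign-test probability quoted above is easy to verify. This is our sketch, not the authors' code; it assumes SciPy is available. Under the null, the number of pairs above the line is Binomial(60, 0.5).

```python
# Verify the one-sided sign-test probability of 39 or more of 60 pairs
# lying above the zero-difference line under a fair-coin null.
from scipy.stats import binom

n_pairs, n_above = 60, 39
p_one_sided = binom.sf(n_above - 1, n_pairs, 0.5)   # P(X >= 39)
print(round(p_one_sided, 3))                        # ~0.014, matching the text
```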
We fit a model that assumes a constant rate of transitioning over the time at risk for transitioning, a so-called exponential hazard model, accounting for these kinds of correlations.
Table 2—Disease and Control Counts of Transitions

                     Diseased                       Controls
                     Previous State                 Previous State
Current State        N        R      W              N        R      W
Non-REM (N)          31,880   160    1,733          32,592   134    1,376
REM (R)              346      7,609  175            351      8,784  114
Wake (W)             1,588    358    6,079          1,210    324    5,775
Total epochs         33,814   8,127  7,987          34,153   9,242  7,265
Total in hours       281.78   67.73  66.56          284.61   77.02  60.54

Note: The columns denote the previous sleep state, whereas the rows denote the current. The times in state are measured in 30-second "epochs." The column totals, which are counts of epochs, are converted into hours for convenience.
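As a rough check, crude transition rates (and their ratios) can be recomputed directly from the Table 2 counts. This sketch is ours; note that these crude ratios ignore the matching and within-subject correlation, so they need not agree exactly with the model-based estimates reported in Table 3.

```python
# Crude transition rates from Table 2: rate = transitions / hours at risk
# in the previous state, plus a crude diseased-to-control rate ratio.
counts = {  # (previous, current): (diseased, controls)
    ("W", "N"): (1733, 1376),
    ("N", "W"): (1588, 1210),
    ("R", "W"): (358, 324),
    ("N", "R"): (346, 351),
    ("R", "N"): (160, 134),
    ("W", "R"): (175, 114),
}
hours = {"N": (281.78, 284.61), "R": (67.73, 77.02), "W": (66.56, 60.54)}

for (prev, cur), (d, c) in counts.items():
    rate_d = d / hours[prev][0]     # diseased hours at risk in previous state
    rate_c = c / hours[prev][1]     # control hours at risk in previous state
    print(f"{prev}->{cur}: diseased {rate_d:.1f}/hr, control {rate_c:.1f}/hr, "
          f"crude ratio {rate_d / rate_c:.2f}")
```

For example, the W → N line reproduces the 26.0 and 22.7 transitions per hour quoted in the text.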
Table 3—Relative Rates Comparing Diseased (Numerator) to Nondiseased (Denominator) Subjects

Transition Type   Relative Rate   95% Credible Interval
N → R             1.04            [0.84, 1.29]
N → W             1.39            [1.17, 1.65]
R → N             1.46            [1.11, 1.93]
R → W             1.34            [1.08, 1.67]
W → N             1.04            [0.87, 1.22]
W → R              1.27            [0.95, 1.68]
Table 3 shows the estimated relative rates (i.e., the estimated rate of transitioning for the diseased divided by that of the controls) and 95% credible intervals. A credible interval is a Bayesian analog to the confidence interval; readers unfamiliar with Bayesian analysis can simply think of them as confidence intervals. The data suggest that the rates of transitioning from non-REM to wakefulness, REM to non-REM, and REM to wakefulness all differ between the two groups. Notably, all the estimates represent increases in the transition rates for the diseased subjects. This suggests that the disruption of sleep continuity in response to the airway collapses during sleep may cause increased transitions between wakefulness and the other states.
Discussion

We briefly reviewed an area of observational sleep research with a particular emphasis on analyzing sleep transitions and their relationship with sleep apnea. The analysis showed some potential differences between the diseased and nondiseased group with respect to the amount of transitioning, though significant work remains to be done to fully understand this problem. It is especially important to consider how the transition rates differ over the night, relaxing the assumptions of the exponential model we presumed; improving the matching algorithm; and applying the methods to the large, unmatched data using other adjustments. Hence, the study presented herein represents a small snippet of understanding how sleep is influenced by a specific disease.

We emphasize, however, that the overarching focus of our research is to better exploit the full information contained in polysomnograms from large-scale observational studies of sleep. This includes functional data analysis of the sleep state and EEG signals. We believe there is important information omitted by considering only the standard epidemiological summaries, which in most cases were designed as simple clinical indexes and may be improved upon for research purposes. This point is driven home by the example provided here, where several relevant differences in sleep transition behavior were illustrated between a diseased and nondiseased group while the standard index of sleep staging showed none.
Further Reading

The book by Chokroverty contains an excellent summary of sleep medicine. The manuscripts by Swihart et al. provide introductions to the display and analysis of sleep transitions. The article by Young et al. gives an overview of sleep disordered breathing.

Chokroverty, S. (1999) Sleep Disorders Medicine: Basic Science, Technical Considerations, and Clinical Aspects. Butterworth-Heinemann: Boston, MA.

Quan, S.; Howard, B.; Iber, C.; Kiley, J.; Nieto, F.; O'Connor, G.; Rapoport, D.; Redline, S.; Robbins, J.; Samet, J.; et al. (1997) "The Sleep Heart Health Study: Design, Rationale, and Methods." Sleep, 20(12):1077–85.

Swihart, B.; Caffo, B.; Bandeen-Roche, K.; and Punjabi, N. (2008) "Quantitative Characterization of Sleep Architecture Using Multi-State and Log-Linear Models." Journal of Clinical Sleep Medicine, 4(4):349–355.

Swihart, B.; Caffo, B.; Strand, M.; and Punjabi, N. (2007) "Novel Methods in the Visualization of Transitional Phenomena." Johns Hopkins University, Dept. of Biostatistics Working Papers.

Young, T.; Peppard, P.; and Gottlieb, D. (2002) "Epidemiology of Obstructive Sleep Apnea: A Population Health Perspective." American Journal of Respiratory and Critical Care Medicine, 165(9):1217–1239.
Statistical Modeling of Sleep
James E. Slaven, Michael E. Andrew, Anna Mnatsakanova, John M. Violanti, Cecil M. Burchfiel, and Bryan J. Vila
Between 50 million and 70 million Americans have some form of chronic sleep disorder that makes them more vulnerable to accidents and disease, degrading their quality of life. It has been shown that quality of sleep affects a person's health and sense of well-being. Poor sleep can have a negative effect on mental and physical characteristics, as well as on social interaction. Hence, the impact of sleep disorders and lack of sleep can be detrimental to job performance. Police officers are especially likely to suffer from sleep disorders, as they get too little sleep because of work schedules that often involve night-shift work, rotating shifts, and overtime, as well as part-time secondary jobs. This can put the officers and the communities they serve and protect at great risk.

In a groundbreaking study, the State University of New York at Buffalo (SUNYAB) and the National Institute for Occupational Safety and Health (NIOSH), with additional funding from the National Institute of Justice (NIJ), are studying the effect of stress on the health of police officers, with the quality and quantity of sleep being part of a comprehensive investigation. Approximately 500 Buffalo, New York, municipal police officers are taking part in the study, which was approved by human subject review boards at both NIOSH and SUNYAB. For the sleep portion of the study, participants wear Motionlogger Actigraphs for 15 days, with the actigraph being removed only to protect from water damage. The actigraphs record
data every minute during those 15 days (Figure 1), giving an extremely large data set to analyze for each participant.

According to the American Academy of Sleep Medicine, one of the best tools for studying sleep outside the lab is the wrist actigraph. Actigraphy is the method of using accelerometers—instruments that detect when a person moves—to determine sleep/wake patterns. The actigraph records information about movement and then uses predetermined algorithms to determine if the wearer was awake or asleep at any given time during the day. Actigraphy corresponds well with polysomnography—which must be done in a sleep lab—with the added benefit that the method can record information outside a sleep lab for several days in a row, allowing for less invasive data acquisition. Actigraphs have been used in many sleep studies, including research on physical activity, obesity, and disease associations.

Unfortunately, actigraph data require painstaking analyses due to the amount of information collected during constant data recording and the number of possible derived parameters. Actigraphy analyses can range from simple sleep statistics (e.g., total sleep time and sleep efficiency) to more advanced statistical techniques, including structural equation modeling (to determine how groups of sleep variables cluster) and waveform analysis (to find peaks of activity). We have developed several statistical methods to help analyze these large data sets.
Figure 1. First several days of an actigraphy reading. The actigraph's PIM channel (top) measures the magnitude of acceleration. The life channel (bottom) records micro-vibrations, such as those caused by the wearer's respiration, for quality control.
Figure 2. Poor-quality data caused by data corruption or noncompliance. Note how the total, or near total, lack of signal on the life channel during some periods differs from a normal actigraph reading in Figure 1.
Statistical Methods

K-statistic

The first method developed to improve the efficiency of actigraphy analysis is a test of the quality of the actigraph data. Data quality can be poor for two reasons: data corruption and participant noncompliance. Data can become corrupted due to a malfunction of the actigraph or during transfer into a computer, giving large portions of time with no data or data that have been truncated. Noncompliance most often results from participants failing to wear the actigraph when they are supposed to. This results in zero readings over many time periods. Both of these issues present problems for data analysis, as simple statistics can be either completely incorrect, as in the case of truncated data, or biased, from noncompliance (Figure 2). We have developed a statistic (the K-statistic) to determine whether the quality of each officer's actigraph data is good enough to use in the analyses. This method looks at the average amplitude of consecutive time points:

AA = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{x_{i+1} + x_i}{2},

and the average distance between consecutive time points:

AD = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{|y_{i+1} - y_i|}{2},

of the data to detect whether the overall readings are high enough to provide statistically useful information. The K-statistic is then given as:

K = \sqrt{AA^2 + AD^2},

where x and y represent the running mean and running variance, in any order. If too large a portion of the data is zero or truncated, or if the overall readings vary enough at small intervals, then data quality may be too poor for use in data analysis. Either of these types of data corruption can produce long periods with the same reading, something that is not possible, even when a subject is at rest, because of basic bodily movements caused by respiration.
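A direct transcription of the K-statistic formulas above might look as follows. This is our sketch, not the authors' published implementation; in particular, the 10-minute window used to form the running mean and running variance, and the toy activity signal, are assumptions.

```python
import numpy as np

def k_statistic(x, y):
    """K-statistic: AA from series x, AD from series y (formulas above)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    aa = np.mean((x[1:] + x[:-1]) / 2.0)        # average amplitude of x
    ad = np.mean(np.abs(y[1:] - y[:-1]) / 2.0)  # average distance within y
    return np.sqrt(aa**2 + ad**2)

# Toy minute-by-minute activity signal; running mean/variance over 10 minutes
rng = np.random.default_rng(0)
signal = rng.gamma(2.0, 50.0, size=1440)                    # one day of readings
win = 10
run_mean = np.convolve(signal, np.ones(win) / win, "valid")
run_var = np.convolve(signal**2, np.ones(win) / win, "valid") - run_mean**2

print(k_statistic(run_mean, run_var))              # healthy, varying data: K > 0
print(k_statistic(np.zeros(100), np.zeros(100)))   # corrupted/flat data: K = 0.0
```

Consistent with the description above, long flat or zero stretches drive K toward zero, flagging the record as too poor for analysis.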
Nonlinear Dynamics and Dimensionality

Nonlinear dynamics play a part in many biological functions, such as heart rhythms and brainwaves. We used nonlinear analysis to compare participants who exhibit good sleep patterns with those who exhibit poor sleep patterns, according to basic sleep parameters provided by Maurice Ohayon and his colleagues in their Sleep article, "Meta-Analysis of Quantitative Sleep Parameters from Childhood to Old Age in Healthy Individuals: Developing Normative Sleep Values Across the Human Lifespan." The actigraphy data from these two groups have significantly different fractal dimensions. Participants with poor sleep patterns exhibit much more movement during sleep than those with good sleep patterns. This movement can be caused by sleep disorders—such as insomnia, apnea, and restless leg syndrome—or by other sources of discomfort. While specific sleep disorders cannot be determined with actigraphy, the movement and resultant poor sleep can be uncovered with the device. This means that analysis of nonlinear dynamics can be used to differentiate between participants' sleep quality (good versus poor) and determine the extent of an individual's poor-quality sleep.

Waiting Time Distributions of Sleep

One of the basic parameters used by medical professionals to evaluate sleep is the wake-to-sleep ratio. As this value is averaged across the entire night of sleep, it does not give an accurate description of the total time a participant spends waiting to fall asleep. We have developed a method for characterizing a participant's total distribution of waiting times. Rather than using an overall average, we analyzed actigraph data to calculate the length of time from sleep to wakefulness for every awakening. This gave a better picture of what was happening during sleep, from the number of awakenings per night to the total number of minutes of sleep before each awakening.

Differences in waiting time distributions were shown to be significant between participants with good sleep patterns and those with poor sleep patterns. These distributions were also quite skewed (Figure 3), indicating that the use of the mean waiting time may be unnecessarily inaccurate. The median would be a better parameter to use when the distribution is asymmetric, and our analyses indicate the main differences between participants with good-quality sleep and those with poor-quality sleep occur in the upper end of the distributions, at the 75th and 90th percentiles.

Figure 3. Waiting time distributions of several participants (number of awakenings versus time in sleep, in minutes).
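The waiting-time construction described above amounts to measuring runs of sleep between awakenings. The following sketch is ours, not the authors' code; the minute-by-minute 0/1 coding and the example sequence are invented for illustration.

```python
# Collect the length of each uninterrupted run of sleep preceding an
# awakening, then summarize the upper percentiles, where the article
# locates the main differences between good and poor sleepers.
import numpy as np

def sleep_runs_before_awakenings(asleep):
    """asleep: sequence of 0/1 flags, one per minute (1 = asleep)."""
    runs, current = [], 0
    for now_asleep in asleep:
        if now_asleep:
            current += 1
        elif current > 0:          # an awakening ends a run of sleep
            runs.append(current)
            current = 0
    return runs

example = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0]
runs = sleep_runs_before_awakenings(example)
print(runs)                                  # [3, 5, 2]
print(np.percentile(runs, [50, 75, 90]))     # median and upper tail
```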
Figure 4. Clustering of study participants for good and poor sleep as determined by additional sleep variables, using canonical correlation analysis (N=228). Axes: first canonical coefficient versus second canonical coefficient.

Advanced Sleep Variables
While the basic sleep variables (total sleep time, sleep efficiency, and sleep-to-wake onset) are typically used in classifying a study participant as having good or poor sleep patterns, the use of advanced variables may be of value. Actigraphy provides the opportunity to derive a large number of statistics from sleep/wake data, including sinusoid messor, sinusoid amplitude, maximum daily autocorrelation, time off 24-hour rhythm, number of awakenings during sleep, activity during sleep, sleep ratio, and wake-within-sleep percent. These variables offer additional mathematical and statistical information on sleep quality. We have performed cluster and discriminant analysis to show that these additional variables can perform exceptionally well in differentiating between good and poor-quality sleep. These additional variables had a low classification error rate of approximately 10% between the two sleep qualities (Figure 4). As can be seen from Figure 4, the first correlation coefficient is sufficient in differentiating the sleep qualities, with more than 90% of the variance explained. The first two coefficients describe nearly 99% of the variation in the data.

Although these additional variables may lengthen the time required to analyze sleep data, they can add considerable information. Waveform/cosinor analysis gives parameters that allow a sinusoid to be fit to the data, which has been used extensively as a mathematical model for sleep. Autocorrelation coefficients give information about how the quality of a participant's sleep pattern varies across days, as well as on the participant's circadian rhythm. Average awakenings, activity during sleep, and length of wakefulness give additional information about in-bed activity during sleep. The use of these additional sleep variables will give research investigators more tools to test for differences in sleep quality between groups and to better characterize and parameterize sleep.

Structural Equation Modeling
Figure 5. Structural equation model with standardized path (regression) coefficients between the factor solution and sleep variables and the correlations between factors
Due to the many possible variables that can be derived from actigraphy, it may be necessary to reduce the variable set to a smaller subset, which still provides unique and meaningful information. Structural equation modeling (SEM) can be used as a dimension-reducing analytic procedure where a large number of observed variables are reduced into smaller sets containing highly correlated variables that describe the same underlying construct. Structural equation models consist of two parts: a measurement model and a structural model. The measurement model describes the relationship between measured and latent variables (the subsets to be discovered). The structural model deals with the relationships between those latent variables.
SEM is also useful in that it gives regression coefficients and correlation values between factors. Our initial variable set was total sleep time, sleep-to-wake onset (the average time it took to awaken after falling asleep), wake-within-sleep percent, mean activity during sleep (as measured in volts by the accelerometer), sleep efficiency (the proportion of time spent in actual sleep while in bed), sinusoidal messor (the wavelength of a fitted sinusoid from cosinor analysis), sinusoidal amplitude (the height of a fitted sinusoid from cosinor analysis), daily autocorrelation (correlation coefficient derived from each participant's sleep/wake cycle), and the time off of a 24-hour sleep/wake rhythm. After initial analysis, the last two variables in the set were found to not be statistically significant.

Without them, our final model had excellent fit statistics, which enabled us to group the variables into corresponding latent factors. The variables total sleep time and sleep-to-wake onset grouped together into a latent factor we called "sleep time." Wake-within-sleep percent, mean activity during sleep, and sleep efficiency grouped together into a factor we named "during sleep activity." The sinusoidal messor and sinusoidal amplitude parameters grouped into a factor we called "circadian rhythm" (Figure 5). SEM is an excellent method for reducing the number of variables needed to represent sleep/wake patterns. It can help researchers choose the proper variables to analyze, depending on the hypothesis being tested, in order to make their data analysis more efficient.
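A measurement model with this factor structure can be written compactly in lavaan-style syntax. The sketch below is ours, using the third-party semopy package; the data file and column names are hypothetical, though the grouping of observed variables into the three latent factors follows the text.

```python
# Fit the three-factor measurement model described above (a sketch,
# assuming the semopy package and an invented input file/column names).
import pandas as pd
import semopy

spec = """
sleep_time        =~ total_sleep_time + sleep_to_wake_onset
during_sleep_act  =~ wake_within_sleep_pct + mean_activity + sleep_efficiency
circadian_rhythm  =~ sinusoid_messor + sinusoid_amplitude
"""

# One row per participant with the seven observed variables above
df = pd.read_csv("actigraphy_summaries.csv")   # hypothetical file
model = semopy.Model(spec)
model.fit(df)
print(model.inspect())   # path (loading) estimates and factor covariances
```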
Discussion

As a consequence of this work, we expect to be able to analyze the actigraphy data of the BCOPS study with more accuracy and with more tools to find differences not only in sleep quality, but also in the corresponding disease association analyses. Using the K-statistic makes determining the quality of data sets easier and faster, as opposed to manually viewing each data set. SEM is capable of reducing the set of possible variables by grouping them and allowing investigators to select the ones best suited for that particular study. Nonlinear analysis, waiting time distributions, and the ability to use many nonstandard variables give research investigators more tools to identify differences in participants' sleep quality and in study populations. They also give researchers the ability to conduct more in-depth statistical and mathematical analysis. Ultimately, our work should help make wrist actigraphy more accurate and less expensive for research investigators and physicians who study and treat the millions of workers around the United States who suffer from sleep disorders.

The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the National Institute for Occupational Safety and Health.
Further Reading

The National Academies. (2006) "U.S. Lacks Adequate Capacity to Treat People with Sleep Disorders; Efforts Needed to Boost Sleep Research, Treatment, and Public Awareness." The National Academies: Washington, DC. http://www8.nationalacademies.org/onpinews/newsitem.aspx?RecordID=11617.

Ohayon, M.; Carskadon, M.; Guilleminault, C.; Vitiello, M. (2004) "Meta-Analysis of Quantitative Sleep Parameters from Childhood to Old Age in Healthy Individuals: Developing Normative Sleep Values Across the Human Lifespan." Sleep, 27:1255–1273.

Slaven, J.E.; Andrew, M.E.; Violanti, J.M.; Burchfiel, C.M.; Vila, B.J. (2006) "A Statistical Test to Determine Quality of Accelerometer Data." Physiol Meas, 27:413–423.

Slaven, J.E.; Andrew, M.E.; Violanti, J.M.; Burchfiel, C.M.; Mnatsakanova, A.; Vila, B.J. (2008) "Dimensional Analysis of Actigraph Derived Sleep Data." Nonlinear Dynamics, Psychology, and Life Sciences, 12(2):153–161.

Slaven, J.E.; Mnatsakanova, A.; Burchfiel, C.M.; Li, S.; Violanti, J.M.; Vila, B.J.; Andrew, M.E. (2008) "Waiting Time Distributions of Actigraphy Data." The Open Sleep Journal, 1:1–5.

Slaven, J.E.; Andrew, M.E.; Violanti, J.M.; Burchfiel, C.M.; Vila, B.J. (2008) "Factor Analysis and Structural Equation Modeling of Actigraphy Derived Sleep Variables." The Open Sleep Journal, 1:6–10.

Vila, B.J. (2006) "Impact of Long Work Hours on Police Officers and the Communities They Serve." Am J Ind Med, 49:972–980.
Disability: It's Complicated

Michele Connolly
It can happen in an instant. A soldier in Iraq loses a leg from a roadside bomb. A baby is born with Down syndrome. An elderly woman loses the ability to speak due to a stroke. Or it can happen more insidiously. A diabetic with nerve damage can gradually become blind. A man with Alzheimer's disease may lose the ability to care for himself. All these individuals have a disability. Disabilities vary dramatically and can affect people at any time and at any age. An individual's disability also may change over time. Someone with multiple sclerosis may have a deteriorating condition, but may regain function with rehabilitation. Disability cannot be directly measured. There is no blood test, medical procedure, or functional test that absolutely measures disability. Disability is a subjective construct used to measure the impact of very real, but disparate, events. Just as the status of an individual changes over time, overall disability levels and trends change as new interventions are discovered, new types of disability emerge, existing types of disability change, and the population ages. The changes are reflected in disability programs, policy, and the definitions we use. There are dozens of federal disability programs, each of which has its own unique purpose. In 2002, according to Nanette Goodman and David Stapleton in their article "Federal Program Expenditures for Working-Age People with Disabilities," published in the Journal of Disability Policy Studies, 11.3% of all federal outlays, amounting to $226 billion, were spent on disability programs, just for the working-age population. During that time, it was estimated that states spent an additional $50 billion for joint federal-state programs. Disability data must be collected to address program and policy issues. But disability measures are challenging.

What Is Disability?
In general, disability is defined as a limitation or inability to perform usual societal roles due to a medical condition or impairment. Societal roles include growing, developing, and learning for people under age 18, working for working-age adults (ages 18–64), and living independently for the elderly (ages 65 and older). In addition, usual activities include recreation and interaction with family, friends, and neighbors. Usual activities vary by individual circumstances and age. For example, full-time college students are in their working years (18–64), but may not be ready to join the work force until graduation. The term “elderly” is often described as age 65 and older, yet many people retire later. In addition, growing, developing, learning, working, and living independently are all general terms open to interpretation. Not all people with medical conditions have a disability. According to 2007 estimates by the Centers for Disease Control and Prevention, 10.7% of the population aged 20 and older (23.5 million Americans) had diagnosed or undiagnosed diabetes. Yet, in December of 2002, about 237,000 (just 4.0%) of disabled workers received Social Security Disability Insurance (SSDI) program benefits as the result of “endocrine, nutritional, and metabolic disorders,” a category in which diabetes is the major (but not the only) condition. Disability can be permanent or temporary. Those who use crutches for a broken leg or who are recovering from knee replacement surgery may not be considered to have a disability by most federal programs because their condition is temporary. However, for purposes of accessibility mandated by the Americans with Disabilities Act (ADA), they are considered to have a disability because they need ramps or curb cuts to get around during recovery. Periodicity also can complicate disability definition. Disability can be ever present (e.g., blindness), episodic (e.g., cancer and mental illness), or somewhere in between (e.g., people with arthritis who have good and bad days). Disabling conditions can often be successfully treated and corrected. For example, congenital heart defects in infants can be resolved by surgical intervention. Disabilities also can disappear as situations (and thus definitions) change. Schoolage children with dyslexia may be regarded as having a learning disability, which may entitle them to special education (a disability program). But after they leave school, they may not continue to be considered as having a disability. Severity is also a consideration. Some measures contain implicit severity indicators. For example, a person may be
asked if she has difficulty climbing a flight of stairs. If she answers yes, she is asked if she is able to climb stairs at all. In this case, there are three levels of severity: able to climb stairs, limited in the ability to climb stairs, or unable to climb stairs at all. Some tasks are so basic that the inability to perform them is considered more severe. A person who reports he is limited in performing one of the activities of daily living, such as going to the bathroom, may be considered to have a more severe disability than someone who is limited in climbing a flight of stairs. Disabilities can be “visible,” as for people who use wheelchairs, scooters, or seeing-eye dogs, or “invisible,” as for those with mental illness or limited physical endurance. Technological advances, such as prostheses, can give functioning back to a person who has lost a limb due to cancer, combat, or an accident. Whether these individuals have a disability depends on the situation. No description of disability is complete without addressing ability. We all have many abilities—even if we have a disability. Abilities can be task-specific, such as lifting a bag of groceries, or more general, such as the ability to work. Many people with disabilities are able to work. President Franklin D. Roosevelt was confined to a wheelchair due to polio. One of the world’s most brilliant theoretical physicists is Stephen Hawking, a man who continues to publish and lecture while almost completely
paralyzed from amyotrophic lateral sclerosis (ALS, also known as Lou Gehrig’s disease).
Federal Disability Definitions and Programs

Altogether, the federal government employs a staggering 67 definitions of disability, which can be pared down to 41 after overlaps are accounted for. Besides federal definitions, hundreds more disability definitions exist for state, local, and tribal governments. As most are either derived from or similar to federal definitions, our focus is on federal programs. Federal disability definitions are not written in stone, as federal definitions (and regulations) are rooted in congressional legislation and executive branch rules and regulations. Changes to federal disability definitions result from new legislation, regulations, reauthorizations, and court decisions. These and other definition changes affect estimates of disability prevalence rates and trends (see History of Federal Disability Programs). Most recently, on September 25, 2008, President George W. Bush signed the ADA Amendments Act (ADA-AA), which broadened and clarified the interpretation of the definition of disability that had become narrowed due to court decisions. The many and varied federal disability programs suggest it is highly unlikely that there will ever be a single federal definition of disability.
History of Federal Disability Programs

Disability programs are as old as this country. The first disability program (and definition) was enacted by the Continental Congress on August 26, 1776, to provide compensation for "every officer, soldier, or sailor losing a limb in any engagement or being so disabled in the service of the United States so as to render him incapable of earning a livelihood." Federal disability programs in the United States are historically rooted in veterans' disability programs, which dominated federal disability for most of our history. As times changed, so did veterans' disability programs and their impact on society, including the formation of disability programs for the nonveteran population. During the Civil War, an estimated 360,222 soldiers died on the Union side and 281,881 were wounded, many of whom were amputees. The best estimate of Confederate dead is 258,000 (Drew Gilpin Faust in This Republic of Suffering). No figures are available for the number of wounded Confederate soldiers. The huge number of disabled soldiers (called invalids) and dependent widows and orphans called for an extensive Civil War pension system. The Civil War pension definition of disability, similar to that of the American Revolution, was dependent on the ability to work. Disability pensions were given to "… any person who served in the military or naval service, received an honorable discharge, and who was wounded in battle or in the line of duty and is not unfit for manual labor by reason thereof, or who from disease or other causes incurred in the line of duty." Pension benefits for the massive number of surviving dependents (i.e., widows) represented the first
large-scale social program in this country. This may have served as the precedent for dependent coverage in today's Social Security and other programs. Approximately 204,000 wounded veterans came home from World War I. Veterans' disability compensation was expanded and modernized to "… establish courses for rehabilitation and vocational training for veterans with dismemberment, sight, hearing, and other permanent disabilities." The focus shifted from providing disability benefits to those who were incapable of work to providing services to help veterans with disabilities return to work. This policy shift reached over into the civilian population, when, by 1920, the Basic Vocational Rehabilitation Services program was established to help people with disabilities (not just veterans) attain gainful employment. The 1944 Servicemen's Readjustment Act, known as the GI Bill, was enacted to provide returning veterans from World War II (including those with disabilities) a college or vocational education. Benefits included educational costs, a stipend, one year of unemployment compensation, and home and business loans. A striking social change occurred after World War II at the University of Illinois at Urbana-Champaign, where returning veterans with disabilities successfully obtained a college education. Some 30 years before the enactment of the Individuals with Disabilities Education Act and about 50 years before the ADA, these veterans showed the importance of architectural changes (accommodations) and personal assistance.
Functional Disability Measures in the NHIS-D

Functional disability measures were the most complex, as many body systems are involved, but they are also the most widely accepted and often the most useful for policy and program purposes. Functional measures included the following:

• Limitations in or the inability to perform a litany of physical activities (e.g., walking, lifting 10 pounds, reaching)
• Serious sensory impairments (e.g., inability to read newsprint, even with glasses or contact lenses; hearing and speaking impairments)
• Mental impairments (e.g., frequent depression or anxiety, frequent confusion, disorientation, or difficulty remembering) that seriously interfered with life during the past year
• Long-term care needs (e.g., needing the help of another person or special equipment for basic activities of daily living (bathing, dressing, going to the bathroom) and instrumental activities of daily living (going outside, managing money and/or medication))
• Use of selected assistive devices (e.g., scooters, wheelchairs, Braille)
• Developmental delays for children identified by a physician (e.g., physical, learning)
• Inability to perform age-appropriate activities for children under age 5 (e.g., sitting up, walking by age 3)
Questions about mental impairments were difficult to develop due to the stigma of mental illness. The NHIS-D based the question series on an earlier supplement on mental illness in conjunction with the cognitive questionnaire lab. It was found that four approaches needed to be used: symptoms (e.g., frequent depression, anxiety, confusion, disorientation, difficulty remembering, getting along with others), a diagnosis, use of prescription psychotropic drugs, and use of community mental health services. For example, some respondents would report a diagnosis of schizophrenia, but not report any symptoms, prescription drugs, or use of services. Other individuals would report use of psychotropic drugs for schizophrenia, but not report any symptoms, diagnosis, or use of services. The final question designed to determine disability was whether a person reported that his or her mental illness seriously interfered with his or her life during the past year. A major flaw in the mental disability measures was the lack of a question on psychosis. One was proposed: "… Do you see things other people don't see or hear things other people don't hear?" The question did not work. Non-psychotic respondents answered yes in the cognitive lab, explaining that they were color-blind, had better than 20/20 vision, or had excellent hearing.
Two federal disability programs illustrate the complexity we face in defining disability by specific criteria: IDEA, the Individuals with Disabilities Education Act, and the Social Security Administration's SSDI program. Besides disability, both programs employ a number of other factors in their eligibility criteria.
Measuring Disability Through Surveys

National population-based surveys are the best source of overall disability rates, profiles, and trends. Surveys collect data on a rich variety of other sociodemographic and economic characteristics, so it is possible to understand the lives of people with disabilities. These data can be used to understand policy issues that cannot be examined using the often limited information in administrative program records. For example, while the Social Security Administration has data on those who receive SSDI benefits, the agency does not collect data on who might be eligible. It is challenging to replicate disability definitions and legislative eligibility criteria in surveys, but not impossible. Perhaps the hardest part of disability survey measurement is translating federal legislative definitions into plain English survey questions that can be understood by respondents. It can be done, however, by careful work in the cognitive questionnaire labs, pretests, and statistical analyses. Statisticians also must address other critical survey issues, such as the effects of where the questions fit into the overall survey (context), mode (telephone, mail, or personal interview), and self versus proxy response. Data sets from the large surveys discussed here are made available to researchers as public use data (stripped of personal identifiers) after confidentiality and privacy concerns are met. Several major national population-based surveys contain items on disability. These surveys include the National Health Interview Survey (NHIS), conducted by the National Center for Health Statistics; the Survey of Income and Program Participation (SIPP); and the American Community Survey (ACS), conducted by the U.S. Census Bureau as the replacement for the long form of the decennial census. The purpose of the NHIS is to monitor the nation's health. The NHIS, which covers the civilian, noninstitutionalized population, was established in 1956 and is the world's longest-running health survey. Altogether, approximately 35,000 households containing about 87,500 individuals are sampled each year. Interviews are conducted in person. The major disability items on the NHIS ascertain limitation of activity (e.g., working and going to school). Besides basic health information, special supplements are collected on areas of public health concern, such as cancer screening, smoking, and mental health. The SIPP, sponsored by the U.S. Census Bureau, examines income, labor force data, assets, and participation in and eligibility for dozens of federal programs. The SIPP, which started in 1983, has been a rich source of sociodemographic data, including disability, but now is in the process of being redesigned. The SIPP was designed as an overlapping set of longitudinal panels, each generally lasting about three years. SIPP panels ranged from 14,000 to 36,700 households representing the civilian, noninstitutionalized population. Respondents were interviewed largely through in-person interviews every four months on a basic set of questions (income and
participation in the labor force and federal programs) and on topical modules on various subjects. The disability module was asked at two points one year apart. Unfortunately, it appears that under the SIPP redesign, the disability questions will be reduced and that disability will only be obtained at one point in time. Other features, such as panel length, may be changed. The ACS, which began in 1996, is designed to provide data every year that was previously collected by the long form of the decennial census every 10 years. The ACS (and previously the decennial census) is the only source of sociodemographic and economic data at the state and local levels. More than 3 million households participate in the ACS by mail, with follow-up personal interviews if necessary. ACS estimates can be obtained by states, counties, cities, metropolitan areas, and population groups of 65,000 or more. Starting in December 2008, the U.S. Census Bureau will release three-year estimates for population groups of 20,000 or more. At first, the ACS used the 2000 census questions, but much statistical and methodological testing has been done. The latest version—now being collected in the 2008 ACS and scheduled to be collected in the 2009 Annual Social and Economic Supplement to the CPS—contains six separate items on hearing impairments, visual impairments, mental impairments, physical impairments, activities of daily living (self-care), and instrumental activities of daily living. In the ACS, the item on work disability has been dropped for methodological concerns. A list of the current ACS questions is contained in Disability Items from the 2008 American Community Survey.
Prevalence of Disability

Disability is widespread, but the exact number of Americans with disabilities depends on the measure or definition used. In 2006, according to the ACS, nearly 41.3 million Americans (standard error 16,000), or more than one in seven aged 5 or older, reported a disability. During 2006, estimates from the NHIS indicated that 35.8 million people of all ages reported a limitation in their usual activity due to a chronic health condition. This comes to 12.2% (standard error of 0.2%), or about one in six Americans. It is important to note that while disability prevalence increases with age, most people with disabilities are not elderly. The ACS reports that 65% of those with a disability are under age 65; the NHIS estimates that figure to be 67%. A fair amount of variation is expected between results from the ACS and NHIS. Disability definitions differed, even though a great deal of overlap exists in the concepts, if not the specific questions, and each had a different design and data collection mode. The ACS focused on broad categories of disabilities (i.e., physical, mental, sensory, self care, going outside, and the ability to work). The NHIS measures disability as limitations in the ability to carry out usual activities by age group (i.e., going to school, play, work, self care). Disability estimates among working-age adults from the 1994 National Health Interview Survey Supplement on Disability (NHIS-D) yielded four figures, depending on which broad measure was used. These measures, based on about 100 questions, were functional, work disability, perceived as having a disability as defined in the ADA, and receipt of disability program benefits.
Social Security Disability Insurance (SSDI)

The SSDI program is the largest disability program in the world and the primary program for working-age adults in this country. SSDI and Social Security retirement are funded by the Federal Insurance Contributions Act, or FICA, payroll tax, paid by employers and employees. The SSDI definition of disability is "the inability to engage in any substantial gainful activity (SGA) by reason of any medically determinable physical or mental impairment which can be expected to result in death or to last for a continuous period of not less than 12 months." The SSDI definition of disability is "all or nothing." There is no partial disability, as under Workers' Compensation, nor are there degrees of disability, as for veterans' programs. Eligibility is achieved through a five-step sequential process based on coverage under Social Security, disability, employment, education, age, and vocational factors. Those who are denied SSDI have the right to appeal the decision under certain circumstances. After 24 months of SSDI benefits, Medicare is extended to SSDI beneficiaries, even though they are under age 65. The Social Security system pays benefits to two additional categories of people with disabilities: disabled widow(er)s aged 50 to 59 and adults aged 18 or older who were disabled in childhood (ADC) and who have
at least one parent who receives (or received, if deceased) Social Security retirement or disability benefits. The average age for disabled workers in December of 2005 was 51.9 years for men and 51.7 years for women. Early retirement under Social Security can be obtained starting at age 62. In 2002, the leading causes of disability for disabled workers were mental disorders other than mental retardation (e.g., schizophrenia, severe depression) at 28.1%, musculoskeletal system and connective tissue disorders (e.g., bad back) at 23.9%, diseases of the circulatory system at 10.1%, diseases of the nervous system and sense organs (e.g., multiple sclerosis, traumatic brain injury, epilepsy) at 9.6%, and mental retardation at 5.2%. Disabled widow(er)s had a similar pattern of leading disability causes. However, among ADC, the leading causes were mental retardation at 43.6%, other mental disorders besides retardation at 13.0%, and diseases of the nervous and sense organs at 8.6%. In December of 2007, slightly more than 7.1 million disabled workers received a monthly average of $1,004 in SSDI benefits. As of December of 2007, nearly 225,000 disabled widow(er)s received benefits, and slightly fewer than 795,000 received benefits as ADC. Altogether, more than 8.1 million workers, survivors, or dependents received Social Security benefits on the basis of their own disability.
About 25.7 million working-age adults reported a functional disability (e.g., climbing stairs, seeing); 16.9 million reported a limitation or inability in work; 11.1 million reported that they perceived themselves or others perceived them as having a disability; and 9.1 million reported receiving disability program benefits from SSDI, Supplemental Security Income (SSI), and/or the Veterans' Administration (VA) programs. Discussion of these broad measures is presented in Functional Disability Measures in the NHIS-D.
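Since the estimates above are quoted with standard errors, a reader can attach rough 95% confidence intervals to them; the sketch below simply applies the usual normal approximation, estimate ± 1.96 × SE, to the figures quoted in this section.

```python
# Rough 95% confidence intervals (normal approximation) for the quoted
# prevalence estimates; values are taken from the text above.
estimates = {
    "ACS 2006 count (millions)": (41.3, 0.016),  # SE of 16,000, in millions
    "NHIS 2006 rate (percent)": (12.2, 0.2),
}
for name, (est, se) in estimates.items():
    lo, hi = est - 1.96 * se, est + 1.96 * se
    print(f"{name}: {est} (95% CI {lo:.2f} to {hi:.2f})")
```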
Future Needs

Disability measurement is constantly evolving as society changes and medical and rehabilitation advances are made. Perhaps the greatest challenge is posed by the large number of veterans returning with disabilities from the wars in Iraq and Afghanistan. As of December 18, 2008, the Department of Defense reported that 4,211 members of the military were killed in Iraq and 558 in Afghanistan. The proportion of troops who survive their wounds is the greatest of any American war. The number wounded was 30,879 in Iraq and 2,605 in Afghanistan as of December 18, 2008. We do know that, besides physical wounds, many returning veterans have either post-traumatic stress disorder (PTSD) or traumatic brain injury (TBI), two challenging disability measures.
As of December 31, 2007, 2.9 million veterans received benefits from the VA Disability Compensation program, and 7.8 million were enrolled in the VA Health Care System during fiscal year 2007. These numbers will grow as more veterans return from Iraq and Afghanistan. It is too early to see what changes will occur as a result of these wars to veterans’ disability programs, nonveterans’ disability programs, and society, but this is clearly an area where additional work is needed to improve measurement as a way to improve the treatment and support for this new group of disabled Americans.
Next Steps

There are four major areas requiring attention so we can better examine our disability policy and programs to improve the lives of Americans with disabilities.

Work Disability – The U.S. Census Bureau announced that the work disability item would be dropped from the ACS and the planned 2010 census due to methodological concerns. This is unfortunate. The ability to work is the central focus of most federal disability programs serving the working-age population and is specifically cited as an example of a major life activity in the ADA-AA. Previously, the work disability question has been considered and tested within the context of the entire disability series.
Special Education: The Individuals with Disabilities Education Act (IDEA)

The IDEA is divided into Part B, which serves children aged 3 through 21, and Part C, serving children under age 3. The purpose of the IDEA, enacted in 1975, is to make special education and related services available to children with disabilities so they can receive a free and appropriate public education to prepare them for employment and independence as adults. Under Part C, early intervention services are provided to prepare children for an education and eventual independence when they reach adulthood. Not all children with disabilities need special education services. In the fall of 2005, approximately 6.8 million children aged 3 through 21 received special education and related services under Part B of the IDEA. Disability in Part B is defined as having one of the following 13 conditions: mental retardation, hearing impairment (including deafness), speech or language impairments, visual impairments (including blindness), serious emotional disturbance, orthopedic impairments, specific learning disabilities, traumatic brain injury, multiple disabilities, deafblindness, autism, other disabilities (e.g., asthma,
attention deficit disorder), and developmental delay (for ages 3–9 at state discretion). Individual states establish criteria for each of the 13 categories. There is variation among states and localities in how Part B is defined. Children are evaluated by an educational team, specific to particular schools and type of disability. An individualized education plan (IEP) is typically prepared by a multidisciplinary team, tailored to the needs of each child, and periodically reviewed with the child's parents. The IEP changes over time, as children mature and learn. Part C of IDEA, known as the Early Intervention Program for Infants and Toddlers with Disabilities, served nearly 300,000 children during the fall of 2006. Infants are under age 1, and toddlers are between the ages of 1 and 3. Part C is a federal grant program to states serving infants and toddlers with disabilities and their families. The purpose of Part C services is to enhance the development of infants and toddlers with disabilities, reduce the need for Part B, and help families meet the needs of their very young children with disabilities. Infants and toddlers served by Part C are defined as either having a developmental delay or a condition with a high probability of resulting in a developmental delay, based on diagnostic medical measures. Developmental delays include delays in cognitive, physical, communication, social or emotional, or adaptive functioning. Similar to Part B, eligibility for the disability categories is determined by states. Altogether, in the fall of 2006, approximately 2.4% of the population under age 3 was served by Part C. Slightly less than half (46%) of those served were under the age of 2.
Yet, work disability is also an aspect of employment. Could the U.S. Census Bureau look at work disability as an employment item? This may be a better fit given the methodological concerns. Second, disability is subjective and not easily verified through methodological work. It appears that participants in the cognitive questionnaire lab reported no limitations in work, even though they were collecting disability benefits. More program knowledge is needed. The SSDI and disability portion of the SSI programs allow and encourage employment and rehabilitative efforts for those receiving benefits (e.g., Ticket to Work).

Mental Impairments – The emphasis described in the U.S. Census Bureau report "Evaluation Report Covering Disability" was on methodological work geared toward the elderly in terms of cognitive impairments. Clearly, this focus needs to be expanded to include returning veterans with PTSD and TBI.

Instrumental Activities of Daily Living – The IADL item was dropped from the ACS. IADLs, which include shopping, using the telephone, and managing money and/or medication, typically refer to activities that involve social interaction and more sophisticated self care. IADLs tend to require more mental and cognitive skills. While in the past, methodological research on IADLs has focused on the elderly, it is worth reexamining these items in light of returning veterans.

ADA-AA – On January 1, 2009, the ADA-AA, which now includes specific examples of major life activities in the law, takes effect. Although many examples appear in the ACS and other surveys, many do not, the ability to work being the most critical. ADA-AA major life activities not typically included in surveys are manual tasks, eating, sleeping, standing, bending, speaking, breathing, learning, thinking, communicating, and working. Methodological work needs to be done to measure progress of the ADA-AA.

Methodological Work – The U.S. Census Bureau is to be commended for its methodological work performed on the ACS and its coordination with other federal agencies. Because definitions of disability are constantly evolving, methodological work and analyses must continue to evolve. Space and time constraints on surveys are real concerns. For example, if one question identifies disability for 90% of a certain category and 10 questions identify 97%, we could analyze who is in the extra 7% and potentially drop nine questions. Even though the NHIS-D is old, the 100+ questions can be analyzed for overlaps, and more efficient disability questions can be designed. Data from other surveys could be used, as well. Cooperation is required. No one federal program (or agency) is responsible for disability, and no single federal agency is responsible for disability statistics. Disability is too important to ignore. Creative work needs to be done by statisticians, federal agencies, academia, and advocacy groups.
Disability Items from the 2008 American Community Survey

The full questionnaire can be found at www.census.gov/acs/www/Downloads/SQuest08.pdf. The disability questions begin with question 16 and are asked of each person listed at the address. If the respondent is 5 years or older, then question 17 is asked. Otherwise, question 17 is skipped for that respondent. Question 18 is asked of those aged 15 or older. The answer to each question is yes or no.

16a. Is this person deaf or does he/she have serious difficulty hearing?
16b. Is this person blind or does he/she have serious difficulty seeing, even when wearing glasses?
17a. Because of a physical, mental, or emotional condition, does this person have serious difficulty concentrating, remembering, or making decisions?
17b. Does this person have serious difficulty walking or climbing stairs?
17c. Does this person have difficulty dressing or bathing?
18. Because of a physical, mental, or emotional condition, does this person have difficulty doing errands alone, such as visiting a doctor's office or shopping?
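A common analytic convention with these six items, used here as an assumption for illustration rather than an official ACS rule, is to flag a respondent as having a disability if any item is answered yes:

```python
# Building an any-of-six disability indicator from hypothetical 0/1 answers
import pandas as pd

items = ["hearing", "vision", "cognitive", "ambulatory", "self_care", "errands"]
df = pd.DataFrame({item: [0, 1, 0] for item in items})  # toy responses
df["any_disability"] = df[items].any(axis=1).astype(int)
print(df["any_disability"].mean())  # sample prevalence
```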
Further Reading

Clipsham, J.A. (2008) "Disability Access Symbols." Graphic Artists Guild: www.gag.org/resources/das.php.

Adler, M.C. and Hendershot, G. (2000) "Federal Disability Surveys in the United States: Lessons and Challenges." In ASA Proceedings, Section on Survey Research Methods, pp. 98–104, Alexandria, VA: American Statistical Association. www.amstat.org/sections/SRMS/proceedings/papers/2000_014.pdf.

Adler, M.C.; Clark, R.F.; DeMaio, T.J.; Miller, L.F.; Saluter, A. (1999) "Collecting Information in the 2000 Census: An Example of Interagency Cooperation." Social Security Bulletin, 62(4). www.ssa.gov/policy/docs/ssb/v62n4/v62n4p21.pdf.

Brault, M.; Stern, S.; Raglin, D. (2007) "Evaluation Report Covering Disability." 2006 American Community Survey Content Test Report P.4. Washington, DC: U.S. Census Bureau. www.census.gov/acs/www/AdvMeth/content_test/P4_Disability.pdf.

Turek, J. (2008) "Committee on Statistics and Disability." Amstat News, 370:10. www.amstat.org/publications/amsn/index.cfm?fuseaction=pres042008.

General Accounting Office. (2008) "Federal Disability Programs: Coordination Could Facilitate Better Data Collection to Assess the Status of People with Disabilities." Statement of Daniel Bertoni, Director of Education, Workforce, and Income Security; Testimony before the Subcommittee on Information Policy, Census, and National Archives, Committee on Oversight and Government Reform, House of Representatives, June 4. www.gao.gov/new.items/d08872t.pdf.

Goodman, N. and Stapleton, D. (2007) "Federal Program Expenditures for Working-Age People with Disabilities." Journal of Disability Policy Studies, 18(2):66–78.

National Health Interview Survey 1995 Supplement Booklet: Disability, Phase 1. www.cdc.gov/nchs/data/nhis/dis_ph1.pdf.
Jussi Jokinen, Regression to the Mean, and the Assessment of Exceptional Performance

W. J. Hurley
Dallas Stars forward Jussi Jokinen, of Finland, works the puck in a hockey game against the Montreal Canadiens on December 23, 2007, in Dallas, Texas. (AP Photo/Matt Slocum)
Late in the second period of a game between the Dallas Stars and Edmonton Oilers during the 2005–2006 National Hockey League (NHL) season, Jussi Jokinen, a rookie with the Stars, was awarded a penalty shot. For such shots, the goaltender has a sizable advantage, as NHL players score on roughly one in three penalty shots. Jokinen scored and, in the process, continued a rather remarkable streak. He scored on his three penalty shots during the NHL preseason and on nine straight penalty shots during the regular season up to the Oiler game. So, all in, he was successful on his first 13 penalty shots, an unofficial NHL record not likely to be broken any time soon. Jokinen's performance, and those of some lesser known players, led media commentators to assert
that there were players with exceptional ability on penalty shots and that NHL teams were actively looking for these penalty shot specialists. Jokinen has since cooled off. Throughout the complete 2005–2006 regular season, he went 10 for 13, and over the 2006–2007 season, five of 12, a frequency more in line with the league-wide rate. Jokinen’s streak, his subsequent cooling off, and the media discussions at the time the streak was maturing raise some interesting questions. The first is an assessment of the role of chance in explaining Jokinen’s rookie season performance. Based on a standard order statistics argument, one would expect the highest relative scoring frequency (defined as the fraction of penalty shots that result in a goal) to be fairly high. The question is whether Jokinen’s frequency, 10 for 13, is sufficiently high to warrant the conclusion that something other than chance is part of the explanation. The second question is whether the media prognosticators were correct that taking a penalty shot is a special skill, that some otherwise gifted NHL scorers did not possess this skill, and that it was important for teams to identify their shootout specialists. For example, Joe Sakic, the Colorado Avalanche captain and future Hall of Famer, is a prolific scorer, but did not score a single goal in his seven penalty shots over the 2005–2006 season. There is significant experimental evidence that people are quick to find a pattern in a random sequence where there isn’t one, especially when the sequence is relatively short. Amos Tversky and Daniel Kahneman, in a 1971 article published in Psychological Bulletin, termed this bias the “law of small numbers.” In sum, does the evidence support the existence of a subclass of players with shootout ability superior to proven NHL stars? Finally, the shootout data set covers two NHL seasons. This affords an opportunity to study the phenomenon of regression to the mean. Were players who exhibited high scoring rates on penalty shots throughout the 2005–2006 season able to maintain those rates over the 2006–2007 season? Did the worst players over the 2005–2006 season improve over the 2006–2007 season? Given the increasing competitiveness of the NHL, teams must do well in shootouts. Every point counts. For instance, during the 2005–2006 season of the Eastern Conference, Tampa Bay finished in the last playoff spot with two more points than Toronto. During the season, Tampa Bay won six of
10 shootouts, whereas Toronto won only three of 10. Clearly, the Toronto shootout performance cost them a playoff spot.
Shootouts and Uncertainty

At the start of the 2005–2006 season, the NHL introduced a shootout competition to determine which team gets an additional point (toward league standings) when there is a tie at the end of regulation time and five minutes of 4-on-4 overtime play. The shootout rules are specified in Rule 89 of the NHL Rulebook (see www.nhl.com/rules/rule89.html):

Each team will be given three shots, unless the outcome is determined earlier in the shootout. After each team has taken three shots, if the score remains tied, the shootout will proceed to a "sudden death" format. No player may shoot twice until everyone who is eligible has shot.

As in soccer, teams alternate taking penalty shots at the goalies until there is a winner. It is more difficult to score on a hockey penalty shot than on a soccer penalty kick. In Table 1, the shootout success percentages are calculated for the 2005–2006 and 2006–2007 NHL seasons. Based on all shots for both years, the frequency of scoring is 0.3311. It is interesting to note that scoring on an NHL goaltender on a penalty shot is comparable in difficulty to hitting a baseball thrown by a Major League Baseball pitcher, a task generally considered one of the toughest in sports.

Table 1—Shootout Performance of All NHL Players Over the 2005/06 and 2006/07 Seasons

           Attempts   Goals   Ratio
2005/06    981        329     0.3354
2006/07    1,209      396     0.3275
Totals     2,190      725     0.3311

Most NHL teams tend to rely on the same three to four players for shootouts. It is only when a shootout goes beyond six shots (three for each team) that additional shooters are employed. For instance, throughout the 2005–2006 season, the Dallas Stars used Jokinen, Mike Modano, Sergei Zubov, and Antti Miettinen for 37 of the team's 42 penalty shots. A sample of those who took the most shots should be useful for assessing how good the best shootout players are. Coaches select the players they think best at shootouts to take the shots. I decided to examine an "elite set," those players who took at least five penalty shots in both the 2005–2006 and 2006–2007 seasons. There were 52 such players. Over both seasons, they took 859 shots and were successful on 339 for a relative scoring frequency of 0.395. The relative scoring frequency for nonelite set players was only 0.290. Hence, there is a substantial difference in the performance of these two groups. Another important factor is the chance that an NHL game gets to a shootout. In Table 2, the number of shootouts for the 2005–2006 and 2006–2007 seasons is presented. Throughout both seasons, the frequency of games going to a shootout is 0.1253.

Table 2—Frequency of Shootout Games Over the 2005/06 and 2006/07 NHL Seasons

           Games    #Shootouts   Ratio
2005/06    1,230    145          0.1179
2006/07    1,230    163          0.1325
Totals     2,460    308          0.1253
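The Rule 89 format is easy to simulate. The toy sketch below assumes every shot is independent with the pooled scoring probability of 0.3311, which ignores shooter and goaltender differences, and estimates how long a typical shootout lasts.

```python
# A toy simulation of the Rule 89 shootout: three scheduled rounds, then
# sudden death, with every shot scoring independently at the pooled rate.
import random

def shootout_rounds(p=0.3311, rng=random.Random(2009)):
    a = b = 0
    for rnd in range(1, 4):                 # the three scheduled rounds
        a += rng.random() < p
        b += rng.random() < p
        if abs(a - b) > 3 - rnd:            # decided early
            return rnd
    while True:                             # sudden death
        rnd += 1
        goal_a, goal_b = rng.random() < p, rng.random() < p
        if goal_a != goal_b:
            return rnd

sims = [shootout_rounds() for _ in range(100_000)]
print(sum(sims) / len(sims))                # average rounds per shootout
```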
Assessing the Jokinen Streak

In 2005–2006, Jokinen's rookie season, he finished fourth in rookie scoring with 55 points. Over the 2006–2007 season, his point production fell to 48, but his plus-minus rating was eight. The plus-minus statistic is important because it is a rough approximation of the player's offensive and defensive capabilities. Jokinen's plus-minus was the second-highest on the Dallas Stars team. To assess Jokinen's shootout ability, it would be inappropriate to consider only his first 13 shots. Obviously, this sample would be highly selective and biased. Nonetheless, it is interesting
to determine the chances of it happening. Under the assumption that Jokinen's probability of scoring on a penalty shot was the average for the elite set, 0.395, and that these shots are independent, the chance of 13 consecutive goals is

\[
(0.395)^{13} = 0.0000057, \tag{1}
\]
a small chance indeed. One of the explanations for Jokinen's early success was his uncanny ability to execute a shot known as "The Paralyzer." This move requires the shooter to first fake to his forehand side, and in so doing, get the goalie to move that way. Thereafter, he quickly moves the puck back to his backhand and, with one hand on his stick and the puck as far away from his body as possible, guides it into the corner away from the direction of his and the goaltender's movement. Interested readers can visit YouTube for a video of Jokinen executing the technique (see www.youtube.com/watch?v=QJbI-nITjIM). It is a shot Peter Forsberg made famous in the shootout in the 1994 Olympic Gold Medal game between Sweden and Canada. It proved to be the winning goal and has since been commemorated on a Swedish stamp. I am not sure how often Jokinen tried it during his streak, but it was a lot. Hence, the combination of his rookie status and the almost flawless execution of a very difficult shot gave him a temporary
edge on goaltenders. But this edge was likely to be short-lived. Given the importance of the shootout to NHL regular season success, goaltenders and their coaches study shooters and gradually develop a 'book' on a player, just as Major League hitters develop a book on opposing pitchers, particularly young pitchers. Alternatively, suppose we assess his shootout performance throughout his complete rookie season (excluding pre-season), when he was successful on 10 of 13 shots. If we were to consider this performance in isolation, we could compute a one-sided p-value under the hypothesis that Jokinen was able to score at the rate of an elite set player, 0.395. Therefore, the chance Jokinen would score 10 or more times on 13 shots would be a sum of binomial probabilities:

\[
\sum_{i=10}^{13} \binom{13}{i} (0.395)^i (1 - 0.395)^{13-i} = 0.007018, \tag{2}
\]
which is certainly small enough to reject the null hypothesis that Jokinen scores with a 0.395 frequency in favor of an alternative that says he does better. One of the limitations of this calculation is that it ignores the uncertainty associated with the number of shots. To include the effects of this uncertainty, suppose elite set player $i$ takes an uncertain number of shots, $S_i$. As each team plays 82 games and a particular game goes to a shootout with probability 0.1253, $S_i$ follows a binomial random variable with parameters $n = 82$ and $q = 0.1253$ and density

\[
g(s_i) = \binom{82}{s_i} q^{s_i} (1 - q)^{82 - s_i}. \tag{3}
\]

Given a realization of $S_i$, say $s_i$, player $i$ will have a success frequency of at least 10/13 if he scores on at least

\[
\delta_i(s_i) = \left\lceil \tfrac{10}{13}\, s_i \right\rceil \tag{4}
\]

of these shots, where the notation $\lceil x \rceil$ means to round the number $x$ up to the next highest integer. Hence, conditional on $S_i = s_i$, a player will have a success frequency of at least 10/13 with probability

\[
r(s_i) = \sum_{j=\delta_i(s_i)}^{s_i} \binom{s_i}{j} (0.395)^j (1 - 0.395)^{s_i - j}. \tag{5}
\]

Therefore, the chance that player $i$'s relative scoring frequency would be at least 10/13 is

\[
\upsilon_i = \sum_{s_i} r(s_i)\, g(s_i). \tag{6}
\]

The only difficulty in this calculation is where to start the summation in (6). It would not be appropriate to include the outcome where a player was, say, two for two on penalty shots. For this reason, I imposed the restriction that a player had to take at least eight shots. Under this assumption, $\upsilon_i = 0.006623$, which is close to what I got above using (2). In this case, making allowances for the uncertainty in the number of shots has little effect on the p-value.

But there is a more serious problem. We are picking the player with the highest penalty shot scoring frequency during the 2005–2006 season and applying a binomial sampling distribution for the average elite set player. Instead, we should examine his performance using the distribution of the relevant order statistic; in this case, the maximum. What we are interested in is the chance the player with the highest relative scoring frequency has a relative scoring frequency of at least 10/13. This can be calculated as follows. Suppose there are $m$ players in our elite set. Let the relative scoring frequency for player $i$ be $X_i$. We need to consider the statistic

\[
Y_m = \max(X_1, X_2, \ldots, X_m). \tag{7}
\]

To get its cumulative distribution function, we proceed in the usual way, using the independence of the players' frequencies in the fourth step:

\[
F_m(y) = \Pr(Y_m \le y) = \Pr(\max(X_1, X_2, \ldots, X_m) \le y) = \Pr(X_1 \le y, X_2 \le y, \ldots, X_m \le y) = \Pr(X_1 \le y) \Pr(X_2 \le y) \cdots \Pr(X_m \le y) = [B(y)]^m, \tag{8}
\]

where $B(y)$ is the probability that a player has a relative scoring frequency no better than $y$. Note that

\[
B(10/13) = 1 - 0.006623 = 0.993377. \tag{9}
\]

Hence, the probability that the player with the highest relative scoring frequency would have one of 10/13 or better is

\[
\Pr(Y_m \ge 10/13) = 1 - F_m(10/13) = 1 - [B(10/13)]^m. \tag{10}
\]
Values of $\Pr(Y_m \ge 10/13)$ are shown below for three values of m:

m      Pr(Ym ≥ 10/13)
50     0.282707
75     0.392501
100    0.485490
Assuming there were somewhere between 50 and 100 players taking at least eight shots each, there was a relatively good chance (approaching the flip of a coin) that we would observe a relative scoring frequency of at least 10/13. This is hardly evidence that Jokinen’s shootout performance during the 2005–2006 regular season was exceptional.
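For readers who want to reproduce the chain of calculations, the sketch below recomputes equations (1) through (10) under the stated assumptions (independent shots, an elite-set rate of 0.395, and at least eight attempts); small discrepancies from the published figures may reflect details of the original computation.

```python
import math
from scipy.stats import binom

p, q, n_games = 0.395, 0.1253, 82
target = 10 / 13                       # Jokinen's 2005-06 frequency

print(p ** 13)                         # equation (1): about 5.7e-06
print(binom.sf(9, 13, p))              # equation (2): about 0.007018

# Equations (3)-(6): mix over the binomial number of shots, requiring at
# least eight attempts; should land near the article's 0.006623
upsilon = sum(
    binom.sf(math.ceil(target * s) - 1, s, p) * binom.pmf(s, n_games, q)
    for s in range(8, n_games + 1)
)
print(upsilon)

B = 1 - upsilon                        # B(10/13), equation (9)
for m in (50, 75, 100):                # equations (7)-(10)
    print(m, 1 - B ** m)               # compare with 0.2827, 0.3925, 0.4855
```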
While the use of the maximum order statistic to assess Jokinen’s performance is a step in the right direction, it could be criticized because all players are assumed to have the same chance of scoring on a penalty shot. An obvious way to relax this assumption is with shrinkage estimators.
Shrinkage Estimators

Charles Stein, in his 1955 article titled "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution," showed it is possible to improve upon maximum likelihood parameter estimates in terms of squared error loss when the parameters of several independent normal distributions are to be estimated. Bradley Efron and Carl Morris, in a 1975 paper published in the Journal of the American Statistical Association, applied this idea to estimating the batting averages of a subset of Major League Baseball players throughout the 1970 season based on averages over their first 45 at-bats. Their technique is not directly applicable to the assessment of relative scoring frequency on penalty shots for each player for a number of reasons, most notably because the sample data on penalty shots is not large enough to justify the normality assumption. Fortunately, Jim Albert has developed an empirical Bayes procedure for estimating binomial probabilities for any set of sample sizes, which he discusses in a 1984 paper titled "Empirical Bayes Estimation of a Set of Binomial Probabilities." A rough outline of his procedure is as follows: Suppose we have a sample of binomial observations $X_1, X_2, \ldots, X_p$ with probabilities $\theta_1, \theta_2, \ldots, \theta_p$ and numbers of observations $n_1, n_2, \ldots, n_p$. Let

\[
n = n_1 + n_2 + \cdots + n_p \tag{11}
\]

and

\[
X = X_1 + X_2 + \cdots + X_p. \tag{12}
\]
Applied to our shootout data, $n_i$ is the number of shots taken by player $i$ and $\theta_i$ is player $i$'s relative scoring frequency. Assuming heterogeneity of players, the maximum likelihood estimate of $\theta_i$ is

\[
\hat{\theta}_i^{MLE} = \frac{X_i}{n_i}. \tag{13}
\]

On the other hand, if all players were the same, we could estimate $\theta_i$ with

\[
\hat{\theta}_I = \frac{X}{n}. \tag{14}
\]

The empirical Bayes procedure, then, estimates $\theta_i$ with a linear combination of these two:

\[
\hat{\theta}_i^{EB} = (1 - \lambda_i)\,\hat{\theta}_i^{MLE} + \lambda_i\,\hat{\theta}_I, \tag{15}
\]

where the estimate of $\lambda_i$ depends on the assumption about the prior distribution for $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$. Albert employs a beta distribution, $\mathrm{Beta}(K\eta, K(1 - \eta))$, with a suitable joint distribution for the hyper-parameters $K$ and $\eta$. Under his assumption,

\[
\lambda_i = \frac{K}{n_i + K}. \tag{16}
\]
To apply this technique to estimate relative scoring frequencies for elite set players, I employed the method of moments to estimate $K$ and $\eta$. (The method of moment estimators for a beta distribution can be found in the Engineering Statistics Handbook, www.itl.nist.gov/div898/handbook.) With these in hand, equation (15) was used to estimate the relative scoring frequencies for players in the elite set. The data set for the estimation consisted of the aggregated elite set player shootout performances over the 2005–2006 and 2006–2007 seasons. The results for a subset of these players are shown in Table 3. Note the effect of these shrinkage estimators: for the highest relative scoring rates, the estimated scoring rates are lower, and for the lowest rates, they are higher. How, then, could we use this information to assess Jokinen's shootout performance over the 2005–2006 season? Suppose we take as given the empirical Bayes scoring frequencies in Table 3 and calculate the chance that the scoring frequency of the player with the highest scoring frequency would exceed 10/13. Under the assumption that these players take the same number of shots they did over the 2005–2006 season, I calculate this probability to be 0.842855. Under the assumption that all players take 13 shots (the same as Jokinen), the probability is 0.514710. Hence, this approach to assessing Jokinen's performance also suggests that, while it was a very good performance, it was well within the bounds of normal statistical fluctuation.
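The sketch below illustrates the shrinkage arithmetic of equations (13) through (16) on a few Table 3 rows, with a simple method-of-moments step for K. It is an illustration under assumptions, not Albert's exact procedure, and because it uses only a subset of players it will not reproduce the published values.

```python
import numpy as np

shots = np.array([18, 18, 24, 21, 13, 12])   # n_i, a subset of Table 3
goals = np.array([12, 12, 14, 12, 7, 1])     # X_i for the same players

rates = goals / shots                  # equation (13): theta_i^MLE
pooled = goals.sum() / shots.sum()     # equation (14): theta_I

# Method-of-moments beta fit: for Beta(a, b), mean = a/(a+b) and
# var = mean*(1-mean)/(a+b+1), so K = a+b is recovered from the two moments
m, v = rates.mean(), rates.var()
K = m * (1 - m) / v - 1

lam = K / (shots + K)                         # equation (16)
theta_eb = (1 - lam) * rates + lam * pooled   # equation (15)
print(np.round(theta_eb, 4))                  # shrunk toward the pooled rate
```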
Regression to the Mean

This data set covering two NHL seasons offers a good opportunity to examine the concept of regression to the mean. Suppose all shooters in the elite set score with probability 0.395 on every penalty shot they take. With this assumption in mind and for a specific period of time, the actual scoring frequencies will vary about 0.395. Now consider what would happen in a subsequent period. We would expect that the performance of the best shooters in the first period would fall in the second. In fact, this is precisely what has happened. Table 4 compares the relative scoring frequency of the best 10 players in the 2005–2006 season with their performance in the 2006–2007 season.
Table 3—Empirical Bayes Estimates of Scoring Frequencies for Players in the Elite Set During the 2005/06 and 2006/07 Seasons

Player   Name           Attempts   Goals   Rate     θ_i^EB
1        Kariya         18         12      0.6667   0.5525
2        Kozlov         18         12      0.6667   0.5525
3        Jokinen        24         14      0.5833   0.5170
4        Koivu          21         12      0.5714   0.5038
5        Kotalik        13         7       0.5385   0.4665
…
48       Hejduk         14         3       0.2143   0.3012
49       McDonald       16         3       0.1875   0.2804
50       Prucha         11         2       0.1818   0.2972
51       Boyes          12         2       0.1667   0.2853
52       Ponikarovsky   12         1       0.0833   0.2453
Table 4—The Shootout Performance of the Best 10 Shooters in 2005/06 During the 2005/06 and 2006/07 Seasons

                 2005/06 Season                      2006/07 Season
Player     Goals   Shots   Relative Frequency   Goals   Shots   Relative Frequency
Sykora     5       6       0.833                1       5       0.200
Whitney    4       5       0.800                0       5       0.000
Jokinen    10      13      0.769                5       12      0.417
Frolov     3       4       0.750                4       9       0.444
Kozlov     5       7       0.714                7       11      0.636
Kariya     5       7       0.714                7       11      0.636
Williams   5       7       0.714                3       9       0.333
Satan      7       10      0.700                5       13      0.385
Kozlov     8       12      0.667                5       13      0.385
Richards   6       9       0.667                5       12      0.417
Note that the relative scoring frequency of all 10 fell over the 2006–2007 season. These players had a combined relative scoring frequency of 0.725 over the 2005–2006 season and 0.420 over the 2006–2007 season. For Jokinen in particular, his scoring frequency for the 2006–2007 season fell to five goals in 12 attempts, a considerable drop from his stellar rookie
performance. For these 10 players, the correlation coefficient for their relative frequencies over the two seasons is –0.571, which, as expected, is negative and significant.

Regression to the mean also should apply to the worst performers in the elite set for the 2005–2006 season. The bottom 10 players in the elite set, as measured by relative scoring frequency over the 2005–2006 season, scored 12 goals in 68 attempts for a relative scoring frequency of 0.1765. This same group during the 2006–2007 season scored 35 goals in 84 attempts for a relative scoring frequency of 0.4167, a considerable improvement and comparable to the elite set average scoring frequency of 0.395.
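The −0.571 figure can be recomputed directly from the relative frequencies listed in Table 4:

```python
# Season-to-season correlation for the Table 4 top 10
import numpy as np

f_0506 = [0.833, 0.800, 0.769, 0.750, 0.714, 0.714, 0.714, 0.700, 0.667, 0.667]
f_0607 = [0.200, 0.000, 0.417, 0.444, 0.636, 0.636, 0.333, 0.385, 0.385, 0.417]
print(np.corrcoef(f_0506, f_0607)[0, 1])  # about -0.571
```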
Who Are the Shootout Specialists?
Toward the end of Jokinen's streak and the 2005–2006 NHL season, the hockey media, particularly the Canadian TV media, were suggesting the existence of a group of players who were better than average in shootouts and that, for specific teams, it was not necessarily the case that their best shootout players were their best 5-on-5 players. What does the evidence suggest? To this point, I have argued that the performances of the players with the best shootout percentages are consistent with normal statistical variation. Given the chance nature of the shootout, we would expect a subset of players to do very well, but the performances of these players at the top do not support the conclusion that they have above-average ability on shootouts. Moreover, for these players at the top, we observe the phenomenon of regression to the mean: The best players in one year are not the best the next year, and the worst in one year improve their performance the next. Here is another piece of supporting evidence. For the 2005–2006 season, I looked at all NHL players who took at least three shots in the penalty shootout. There were 143 such players. I then ranked these players according to their regular season points (goals plus assists) and put them into quartiles. Table 5 compares the quartile performances in the penalty shootout.

Table 5—A Comparison of the Shootout Performances of Players Grouped into Quartile Ranges by 5-on-5 Point Totals for the 2005/06 Season

                   Shootout Performance
Quartile           Average Points   #Shots   #Goals   Relative Frequency
Q1 (36 players)    83.3             305      108      0.3541
Q2 (36 players)    59.1             252      95       0.3770
Q3 (36 players)    43.9             225      82       0.3644
Q4 (35 players)    27.3             161      62       0.3851

In 5-on-5 play, there are clearly significant differences in the performance of the quartiles. The top quartile (36 players) had an average point count (goals + assists) almost three times that of the bottom quartile. But these shootout frequencies are statistically the same. All in, this evidence does not support the existence of a specialist shootout group. What it does suggest is that the game of hockey is a continuous flow game, where both sides compete to create and destroy scoring chances. We can think of the play leading to a goal in two parts. There is the play that leads to a scoring chance (opportunity generation) and, subsequently, the shooter converting the opportunity into a goal (opportunity conversion). The data suggest NHL players have comparable conversion skills but substantially different generation skills. This, of course, is the beauty of 5-on-5 play. Great goals require great team play and gritty determination, two characteristics not essential to success in a penalty shootout.
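One way to check the claim that the quartile frequencies are statistically the same is a chi-square test of homogeneity on the Table 5 counts; this test is a reasonable reading of the statement rather than the author's stated method.

```python
# Chi-square test of homogeneity across the Table 5 quartiles
import numpy as np
from scipy.stats import chi2_contingency

goals = np.array([108, 95, 82, 62])
shots = np.array([305, 252, 225, 161])
table = np.column_stack([goals, shots - goals])  # goals vs. misses

chi2, pval, dof, _ = chi2_contingency(table)
print(chi2, pval)  # a large p-value supports "no quartile differences"
```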
During his rookie NHL season, Jussi Jokinen had the highest relative scoring rate (10 of 13) in the NHL. The interesting question is whether this exceptional performance can be explained by chance. Based on an order statistic argument using shrinkage estimators to estimate scoring abilities, I found there is a high probability that the player with the highest relative scoring frequency would have one 10/13 or better. Hence, I conclude that Jokinen’s performance over the 2005–2006 season was consistent with normal statistical variation.
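The order-statistic argument can be checked with a crude simulation. The sketch below is a simplification of the article's shrinkage-based calculation: it assumes all 143 shooters share a common success probability near the elite-set average (0.39) and take exactly 13 attempts each, whereas attempt counts actually varied across players. It asks how often the best of the group goes 10-for-13 or better:

# Crude check of the order-statistic argument: P(best of 143 shooters
# with 13 attempts each at p = 0.39 scores 10 or more). The common p
# and equal attempt counts are simplifying assumptions.
import numpy as np

rng = np.random.default_rng(0)
goals = rng.binomial(13, 0.39, size=(100_000, 143))
print((goals.max(axis=1) >= 10).mean())  # roughly 0.6 under these assumptions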
Further Reading

Albert, J.H. (1984), "Empirical Bayes Estimation of a Set of Binomial Probabilities," Journal of Statistical Computation and Simulation, 20:129–144.

Efron, B., and Morris, C. (1975), "Data Analysis Using Stein's Estimator and Its Generalizations," Journal of the American Statistical Association, 70:311–319.

Everson, P. (2007), "Stein's Paradox Revisited," CHANCE, 20:49–56.

Engineering Statistics Handbook, www.itl.nist.gov/div898/handbook.

Gould, S.J. (1989), "The Streak of Streaks," CHANCE, 2:10–16.

Nevzorov, V.B. (2001), Records: Mathematical Theory, American Mathematical Society.

Stein, C. (1955), "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution," Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Berkeley: University of California Press, 197–206.

Tversky, A., and Kahneman, D. (1971), "Belief in the Law of Small Numbers," Psychological Bulletin, 76(2):105–110.
Go For It
What to consider when making fourth-down decisions

M. Lawrence Clevenson and Jennifer Wright
It is Sunday afternoon in the late fall, and a football game has just started between the Green Bay Packers and the St. Louis Rams on the frozen tundra of Lambeau Field. St. Louis received the opening kick-off, moved the ball forward for some plays, but their third-down attempt to reach a first down failed and left them with fourth down and two yards to go on their own 42-yard line. Without much thought, the punting team comes onto the field and St. Louis punts. The decision to punt on fourth down is so common that hardly anyone thinks about it. But, wait. Should St. Louis have punted, or should they have gone for it? One way to decide would be to quantify expected points after both decisions. But what does one need to know to do that?

For readers unfamiliar with American football, the two teams try to advance the ball toward the other's (defensive) goal line. The team with the ball (offense) gets four chances to advance the ball. These chances are called "downs." An important rule is that if the offense can advance the ball 10 yards or more before their four chances are over, they get a "new first down" and four more chances to advance the ball. Often, if a team has reached its fourth down (its last chance) and the players feel they cannot gain a new first down, they will punt the ball (kick it) as far from their defensive goal as possible. Other times, if the team is close enough to its offensive goal line, it will try for a field goal, which means a player kicks the football between posts at the end of the field; this is a possibility we do not address here. That is, we assume the offense has the ball at a point on the field where a field goal attempt is not a realistic option. If the offense punts, goes for a field goal, or tries for the first down and fails on its fourth down, then the other team becomes the team on offense, with the opportunity to advance the ball.

Fourth-Down Decisions

We must examine three scenarios to make the decision to punt or "go for it." Punting gives the other team the ball and a corresponding expectation of scoring from their new position on the field after the punt. Going for a first down and failing also gives the other team the ball and an even greater expectation of scoring, as it will be much closer to the goal line; this is the worst result of the three. Going for a first down and succeeding gives the offense a positive expectation of scoring, clearly the best of the three scenarios for the offense. Time may be a factor near the end of the half or game; here, we assume the team on defense will have adequate time to try to score after the fourth-down play. In these cases, a statistician would argue that the decision should be made by computing expected points (both positive and negative) from the decisions. Therefore, we need statistical models for these expectations.
Expected Points from a First Down

What is the expected number of points with a new first down from a given yard line? Let EP(x) represent a team's expected points when that team has a first down a given number of yards, x, from the opposing goal. We seek a model to estimate EP(x) for each team. There are 32 teams in the National Football League with various strengths in offensive and defensive play. Teams from Green Bay, San Francisco, St. Louis, Chicago, and Indianapolis in the 2005 season were chosen for this study. This selection reduces the required data collection effort while maintaining a variety of strengths and weaknesses among the teams used to build statistical models. A qualitative summary of the team strengths and weaknesses in 2005 appears in Table 1.

Table 1—Summary of Offensive and Defensive Team Strength, Mid-Season 2005

Teams | Offense | Defense
Chicago | Weak | Strong
Green Bay | Medium | Medium
Indianapolis | Strong | Strong
San Francisco | Weak | Weak
St. Louis | Strong | Weak

All the plays from all the games played by these five teams in the 2005–2006 season were collected (see www.nfl.com). When a team had a first down (first down with 10 yards to go or first and goal to go), a point in the data set for the team was created. Each point has a bivariate measurement. The explanatory variable (x) is yards from the goal line, and the response variable (y) is the number of points made before giving up the ball to the defensive team. The response, therefore, is a member of {0, 3, 6, 7, 8}. Zero points means no score. Three points are awarded for a field goal. A touchdown counts as six points. After a touchdown, a kick through the uprights adds one point (7), and a single play yielding a score adds two points (8). Negative scores—a safety or a fumble or interception returned for a touchdown—are rare and do not appear in these data sets. First downs after a penalty (e.g., first and 15 or first and five) were disregarded, as these all originally began with the standard first and 10.

Four models were examined for predicting points from a first down. Models 1 and 2 use least squares to estimate coefficients. That is, coefficients were estimated to minimize the average squared difference between the observed value of y and the expected value. Models 3 and 4 use logistic regression. Again, the coefficients were estimated to minimize the average squared difference between the observed value of y and the expected value, but the computation algorithm is more involved. The models are as follows:

1. Quadratic regression: EP(x) = b0 + b1x + b2x^2.

2. Cubic regression: EP(x) = b0 + b1x + b2x^2 + b3x^3.

3. Linear logistic regression with the response being one of three events—no score (0), a field goal (3), or a touchdown (7): EP(x) = 0 P(0) + 3 P(3) + 7 P(7), where

P(3) = P(Y = 3) = e^(β30 + β31x) / (1 + e^(β00 + β01x) + e^(β30 + β31x)),
P(0) = P(Y = 0) = e^(β00 + β01x) / (1 + e^(β00 + β01x) + e^(β30 + β31x)),
P(7) = 1 – P(3) – P(0).

For this model, we replaced actual points of 6, 7, or 8 with 7. There were actually very few such replacements, as there were few touchdowns that did not result in seven points.

4. Quadratic logistic regression: This is similar to Model 3, but there is a quadratic term in the exponential functions.

Figure 1. Graph of data points for Chicago's points vs. first down position (yards from the goal) in the 2005 regular season. Points are jittered for viewing repetitions.
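As a concrete illustration of Model 1, the sketch below fits a quadratic EP(x) by least squares. The (x, y) pairs are made-up placeholders; the article's fits used every first down for each team from the 2005 play-by-play data:

# Sketch of Model 1: quadratic least-squares fit of points scored on a
# possession (y) against yards from the goal at the first down (x).
# These few (x, y) pairs are stand-ins for the real play-by-play data.
import numpy as np

x = np.array([3, 12, 25, 40, 55, 70, 85, 95])  # yards from the goal
y = np.array([7, 7, 3, 3, 0, 0, 3, 0])         # points on the possession

b2, b1, b0 = np.polyfit(x, y, deg=2)           # least-squares coefficients

def ep(yards):
    """Estimated expected points from a first down `yards` from the goal."""
    return b0 + b1 * yards + b2 * yards ** 2

print(round(ep(48), 2))  # EP estimate 48 yards out (toy data)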
With the logistic regression models, expected points were computed using the value 7 for the event of a touchdown, which was the actual result in nearly all cases. There was little difference between EP(x) for Models 1, 2, and 4. Figure 1 shows all of the first-down data for Chicago; the points have been jittered to display repetitions of cases. Three polynomial models for EP(x) and the almost identical fits of the quadratic and cubic models can be seen in Figure 2. Table 2 exhibits the best-fitting polynomial and logistic equations, along with their R^2 values. The interesting general consistency, but slight variation, in the expected fits is displayed in Table 3. We calculated the average points at intervals of five yards to see more clearly how the average points scored varied with the yards from the goal. These are displayed in Figure 3 with the chosen quadratic model. Because of its greater simplicity, we chose a quadratic regression model to estimate EP(x) for each team. Of course, different teams had different coefficients resulting from the least squares estimates when EP(x) was fit to their data. The other teams had similar results.

Table 2—Estimated Equations and R^2 Values for the Models of EP(x) for Chicago

Model | Equation | R^2
Linear | EP(x) = -0.0481x + 4.4156 | 0.1795
Quadratic (Model 1) | EP(x) = 0.0009x^2 - 0.1408x + 6.1255 | 0.2260
Cubic (Model 2) | EP(x) = -2E-6x^3 + 0.0013x^2 - 0.154x + 6.2434 | 0.2262
Linear Logistic (Model 3) | P(0) = e^(-1.190+0.048x) / (1 + e^(-0.450+0.009x) + e^(-1.190+0.048x)); P(3) = e^(-0.450+0.009x) / (1 + e^(-0.450+0.009x) + e^(-1.190+0.048x)); P(7) = 1 - P(0) - P(3) | 0.1849
Quadratic Logistic (Model 4) | P(0) = e^(-2.832+0.140x-0.001x^2) / (1 + e^(-1.308+0.070x-0.0007x^2) + e^(-2.832+0.140x-0.001x^2)); P(3) = e^(-1.308+0.070x-0.0007x^2) / (1 + e^(-1.308+0.070x-0.0007x^2) + e^(-2.832+0.140x-0.001x^2)); P(7) = 1 - P(0) - P(3) | 0.2287
Figure 2. Graph of data points and models for Chicago's points vs. first down position (yards from the goal) in the 2005 regular season. Points indicate data. The solid, dotted, and dashed lines are the cubic, quadratic, and linear model fits.
Figure 3. Graph of mean points, joined by solid lines at yard intervals noted on the x-axis, and the fitted quadratic model (boxes joined by dashed lines) for EP(x) for Chicago's first-down data for the 2005 regular season.
Table 3—Mean of Actual Data Points for Chicago at the Indicated Intervals vs. the Models' Expected Points at the Same Interval Midpoint (2005 Regular Season)

Chicago | Count (n) | Mean | Quad Model | Cubic Model | Quadratic Logistic Model | Linear Logistic Model
Yards 1–2 | 8 | 6.375 | 5.934 | 6.034 | 5.766 | 4.535
Yards 3–7 | 8 | 5.125 | 5.399 | 5.455 | 5.403 | 4.365
Yards 8–12 | 13 | 5.462 | 4.823 | 4.840 | 4.942 | 4.155
Yards 13–17 | 8 | 4.250 | 4.201 | 4.185 | 4.349 | 3.886
Yards 18–22 | 15 | 2.400 | 3.649 | 3.614 | 3.761 | 3.605
Yards 23–27 | 10 | 2.800 | 3.180 | 3.136 | 3.234 | 3.328
Yards 28–32 | 15 | 3.400 | 2.772 | 2.727 | 2.772 | 3.053
Yards 33–37 | 25 | 1.560 | 2.325 | 2.286 | 2.279 | 2.710
Yards 38–42 | 19 | 3.053 | 1.986 | 1.955 | 1.923 | 2.412
Yards 43–47 | 21 | 2.000 | 1.723 | 1.704 | 1.663 | 2.156
Yards 48–52 | 26 | 1.423 | 1.469 | 1.463 | 1.426 | 1.878
Yards 53–57 | 26 | 1.385 | 1.240 | 1.249 | 1.225 | 1.591
Yards 58–62 | 25 | 0.760 | 1.093 | 1.112 | 1.100 | 1.373
Yards 63–67 | 35 | 0.657 | 0.970 | 0.998 | 0.997 | 1.143
Yards 68–72 | 44 | 1.568 | 0.911 | 0.941 | 0.943 | 0.976
Yards 73–77 | 21 | 0.762 | 0.887 | 0.914 | 0.915 | 0.805
Yards 78–82 | 35 | 0.486 | 0.915 | 0.929 | 0.925 | 0.663
Yards 83–87 | 15 | 1.133 | 0.985 | 0.978 | 0.972 | 0.551
Yards 88–92 | 10 | 0.300 | 1.115 | 1.071 | 1.076 | 0.444
Yards 93–99 | 10 | 2.100 | 1.352 | 1.240 | 1.313 | 0.340
Total | 389 | | | | |

4-19-CHI 3 (12:34) B.Maynard punts 40 yards to CHI 43, Center-P.Mannelly. A.Randle El to CHI 42 for 1 yard (H.Hillenmeyer). PENALTY on PIT-S.Morey, Offensive Holding, 10 yards, enforced at CHI 42.
Figure 4. Example of a play-by-play situation from the December 11, 2005, game, Chicago vs. Pittsburgh. This is in the third quarter, as extracted from www.nfl.com.
Expected Net Yards for a Punt

When St. Louis punts to Green Bay from their own 42-yard line, the position from which Green Bay starts their next series of downs will vary with the effectiveness of the punt. The data for all the punts made by our five teams in 2005 were examined. Figure 4 is a typical example of one play, extracted from NFL.com.
This is how you read the summary from Figure 4: On this fourth-down situation, Chicago punted 40 yards, and then Pittsburgh ran the punt reception back for one yard, bringing the net punt gain to 39 yards. A penalty was called (holding on the return team) for a loss of 10 yards. The net punt distance became 49 yards. Similarly, for each punt, the net punt distance
was calculated as the distance from the starting punt position to the new punt position at the end of the play. The example above was an effective punt, with a net punt distance of 49 yards. Most punts are less effective, and we chose to simplify this part of the analysis by assuming all punts net the average punting effectiveness for that team. Because the quadratic model for EP(x) is nearly linear, there will be little change in the expected points after a punt with this assumption.
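For the Figure 4 play, the net-distance bookkeeping looks like this (a worked version of the computation described above; the variable names are ours):

# Net punt distance for the Figure 4 play: punt distance, minus the
# return, plus the holding penalty assessed against the return team.
punt_yards = 40       # B.Maynard punts 40 yards
return_yards = 1      # A.Randle El returns 1 yard
penalty_yards = 10    # offensive holding on the return team

net_punt = punt_yards - return_yards + penalty_yards
print(net_punt)  # 49, matching the text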
Expected Yards When Getting a First Down

If St. Louis goes for it and obtains a first down, what do they gain, on average? They gain the opportunity to score points on this possession. They will have a first-and-10 some number of yards from the goal line. How many yards from the goal? That depends on how much yardage they gained beyond the necessary two yards (recall it was fourth and two at their own 42 in our example). Data for all of St. Louis' successful attempts at a first down from third-down positions showed that, on average, they gained approximately eight more yards than the first-down marker. We use this value to estimate St. Louis' position after a successful first down. The analysis will change little if we use the detailed distribution, as EP(x) is nearly a linear function. In addition, the quadratic function EP(x) is convex, and so Jensen's inequality says this analysis understates, slightly, the value of a successful first-down attempt. A study looking at variability and expectations would need to address this issue more carefully. Unsuccessful attempts usually result in little change from the current position and were not analyzed separately. That is, it is assumed an unsuccessful attempt delivers the football to the opposing team at the line of scrimmage, where the fourth-down play started.
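To spell out the Jensen step: if X denotes the (random) yards from the goal after a successful conversion, convexity of EP gives

\[
\mathbb{E}\left[ EP(X) \right] \;\ge\; EP\!\left( \mathbb{E}[X] \right),
\]

so evaluating EP at the average position slightly understates the expected points from a successful attempt. This is a restatement of the argument above, not an additional result.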
Probability of a Fourth-Down Conversion

If the offense does not punt on fourth down, what is the probability of it successfully achieving a new first down? How do we answer this question? Teams rarely try for a new first down on fourth down, and thus not much data exist on fourth-down conversion attempts. Of course, teams always try for a new first down on third down. We decided to use success rates on third-down and fourth-down conversions together to model the probability of a successful conversion on fourth-down attempts. While defensive teams might try even harder to prevent fourth-down conversions (by risking longer gains in an all-out attempt to stop the conversion), they already usually align their defense to prevent third-down conversions. We believe the fourth-down conversion rates would be quite close to the third-down conversion rates. The cases are again bivariate, with the explanatory variable being yards to go for a new first down and the response variable being success or failure. Because the response variable is binary, several logistic regression models were compared. The linear logistic model gave approximately the same estimated probabilities as the quadratic logistic model, and so was chosen for its greater simplicity. Figure 5 exhibits the quadratic logistic probability graph for Indianapolis, shown with the actual relative frequency of success (for Indianapolis).

Figure 5. Graph of the relative frequency of successful first-down conversions and the fitted probability model for Indianapolis, by yards to go for a new first down.
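A minimal sketch of fitting such a conversion-probability model follows. The records here are synthetic stand-ins for the third-down (and the few fourth-down) plays collected from www.nfl.com, and scikit-learn's logistic regression stands in for whatever software the authors used:

# Logistic model for P(successful conversion | yards to go),
# fit to synthetic stand-in data (success = 1, failure = 0).
import numpy as np
from sklearn.linear_model import LogisticRegression

yards_to_go = np.array([[1], [1], [2], [2], [3], [4], [5], [7], [8], [10]])
converted = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])

model = LogisticRegression().fit(yards_to_go, converted)
print(model.predict_proba([[2]])[0, 1])  # estimated P(convert | 2 to go)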
Comparing the Expected Points

Recall that we are questioning St. Louis' decision to punt on fourth and two from their own 42-yard line. For St. Louis, the average net punt is 34 yards. From St. Louis' perspective, if the punt nets 34 yards, the St. Louis average, then Green Bay will be 76 (42 + 34) yards from the goal and have EP(76) = 1.320 expected points. Punting therefore puts St. Louis down, on average, 1.320 points.

If they go for it, they need two yards, and, if successful, they average eight additional yards, so a successful attempt gains 10 yards on average. This would leave them 100 – 42 – 10 = 48 yards from the goal. Their expected points from this position are 2.635. However, the previous analysis asked what Green Bay's scoring potential was when they received the ball after a punt. For a proper comparison, we need to measure every outcome relative to Green Bay's next possession, regardless of what St. Louis does and how many points they score. Remember, we are considering scenarios in which the team on defense will have enough time to try to score as the offense. So, the gain from a successful fourth-down conversion has to be decreased by Green Bay's scoring potential on their next possession. Of course, we do not know where they will start that next possession. Assuming St. Louis does score a touchdown or field goal, the average position would be approximately the 25-yard line. The exact position is not so important, because EP(x) changes little when x is large. Green Bay's EP(75) is 1.338. Thus, a successful first down gains St. Louis 2.635 – 1.338 = 1.297 points.

The linear logistic model for St. Louis shows their estimated probability of a successful fourth-down conversion, at two yards to go, is 0.576. So, by going for it, St. Louis expects to be behind by

P(failure) x (expected points for Green Bay 42 yards from the goal) – P(success) x (expected gain from a successful conversion) = (1 – 0.576)(2.548) – 0.576(1.297) = 0.333

expected points the next time Green Bay has the ball. Recall that they expect to be down 1.320 points by punting. St. Louis thus gains an average of almost a full point by deciding to go for it in this situation.

Notice that this analysis shows the correct choice is to go for it, and by a wide margin: the expected loss from punting is not close to the expected loss from attempting to convert a first down. Yet, with the prevailing understanding of NFL games, St. Louis would be strongly criticized by every NFL expert for 'gambling' or 'not playing the percentages' or being 'wild risk takers' if they went for it and failed. Pundits (punt-its) might even say, "They should have gone with the percentages." But our analysis shows that going for it is the percentage play, and many experts probably have never looked at any percentages. If St. Louis goes for a first down, they have a reasonable chance (58%) of keeping the ball and thus scoring points with this possession. Their expectation goes from -1.320 to -0.333, from clearly negative to almost even. The data show the decision to go for it is the "percentage play."
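The whole comparison reduces to a few lines of arithmetic. The sketch below reuses the numbers quoted above for the St. Louis example; in practice, the EP values and the conversion probability would come from each team's fitted models:

# Fourth-and-two at the St. Louis 42: expected points behind (relative
# to Green Bay's next possession) for punting vs. going for it, using
# the quantities quoted in the text.
p_convert = 0.576             # P(success) at 2 yards to go (logistic model)
ep_punt = 1.320               # Green Bay's EP, first down 76 yards out
ep_fail = 2.548               # Green Bay's EP, first down 42 yards out
gain_convert = 2.635 - 1.338  # St. Louis EP at 48 yards, net of GB's next drive

deficit_go = (1 - p_convert) * ep_fail - p_convert * gain_convert
print(round(deficit_go, 3))  # ~0.333, vs. 1.320 for punting: go for it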
Comparison of Teams

We computed the values of expected points, EP(x), using each team's quadratic regression model. To obtain an idea of how optimal fourth-down choices vary from team to team, we chose to average our EP(x) values for the five teams studied when computing the 'other' team's scoring potential. When a team is on offense, the following tables use EP(x) for that specific team. Tables 4 and 5 show calculations similar to the one done for St. Louis, with fourth and two at their own 42-yard line, for all cases. The columns indicate the yards necessary for a first down (1 to 20 yards), and the rows specify the yards from the goal line (30 to 99). Teams should not punt when they are within 30 yards of the goal line; there, the relevant decisions are to go for it or try a field goal. The values in the table cells are the differences between the expected points for punting and the expected points for attempting to get a first down. An X in a cell means the situation is not possible. Grey areas indicate situations in which the team should punt (positive expected difference; in other words, the expected loss for attempting a first down is larger than the expected loss for punting). Table 4 gives results for Chicago. Table 5 gives results for St. Louis. Results for the other three teams used in the analysis can be found at www.amstat.org/publications/chance.

Interestingly, our results show that even a poor offensive team such as Chicago should go for a first down more often than they actually do. For example, on fourth and one at midfield (50 yards from the end zone), this analysis shows Chicago should go for a first down. Intuitively, this may be more obvious than NFL coaches seem to realize. Chicago has an estimated probability of success of 0.4747—about 50/50. So, essentially, they are taking a 50/50 shot at leaving each side in nearly the same situation, which should favor neither team on average. Going for it is not disadvantageous, and punting produces a disadvantage.

Offensively strong and defensively weak teams such as St. Louis should go for a first down even more often. These teams have high expected points when they have the ball and concede high expected points when the opponent has the ball. They also have higher probabilities of successfully converting a fourth-down attempt for a new first down. They should try to keep the ball.
Table 4—Expected Difference in Points for Going for It vs. Punting on Fourth Down for Chicago

Note: Yards from the end zone and yards to go for the first down are on the axes. Points are more for going for it in the grey area. Impossible situations are marked with X.

Table 5—Expected Difference in Points for Going for It vs. Punting on Fourth Down for St. Louis

Note: Yards from the end zone and yards to go for the first down are on the axes. Points are more for going for it in the grey area. Impossible situations are marked with X.

Summary

Our analysis shows there are many situations in which the correct decision on fourth down is to go for a first down. This analysis assumed it was early in the game and the correct decision would be determined by expected points. Coaches may be more comfortable with punting when the expectation analysis shows they gain only a small amount by trying to keep the ball. After all, if the team fails on an attempted fourth down, the coach will likely receive some criticism. So, for factors beyond those considered in our analysis, they may want to widen the grey areas in which to punt. However, even allowing a margin of, say, 0.25 or 0.5 points, the coach could make a better decision in many circumstances by going for it.

Our analysis only applies when the game is not close to finished. Near the end of the game, models for expected points should be replaced by models for the probabilities of particular results—no score, a field goal, or a touchdown—probabilities that we have modeled in our analysis. At the beginning of the season, teams would be without the data we used to analyze fourth-down decisions to go for it. To use our analyses, the decisionmakers might try to find the team among our five most similar to their own with regard to offensive and defensive strength. As their season progresses, they could then use the data from the current season.

Similar analyses comparing kicking a field goal with going for a first down were done by Jennifer Wright in her master's thesis at California State University. Again, she found that the decision to go for a first down or touchdown, rather than kick a field goal, should be made more often than it is.
Further Reading

Agresti, A. (2002), Categorical Data Analysis (2nd edition), John Wiley & Sons.

Bartshe, P. (2005), "An NFL Cookbook: Quantitative Recipes for Winning," STATS, 6:12–13.

Myers, R. (2000), Classical and Modern Regression with Applications (2nd edition), Duxbury Press.

Sackrowitz, H. (2000), "Refining the Point(s)-After-Touchdown Decision," CHANCE, 13:29–34.

Stern, H. (1998), "Football Strategy: Go for It!" CHANCE, 11:20–24.

Theismann, J., and Tarcy, B. (2001), The Complete Idiot's Guide to Football (2nd edition), Alpha Books.

NFL Football Data, www.nfl.com/stats/2005/regular.
Application of Machine Learning Methods to Medical Diagnosis

Michael Cherkassky
The technological boom in recent years has come with great advances in medical technologies. New technologies, such as MRI (magnetic resonance imaging) and ECG (electrocardiogram), enable better understanding of the functions and malfunctions of the human body. However, technological progress adds pressure on medical professionals, who are now faced with an influx of data to interpret. Moreover, inundated with data, doctors may be more prone to mistakes. Misdiagnosis, though seemingly rare, is actually quite common. A recent study of Patient Safety in American Hospitals by HealthGrades found that, per 1,000 patients, more than 150 were classified as "failure to rescue," meaning a failure to diagnose correctly or in time. Twenty percent of patients in the emergency department (ED) are misdiagnosed, according to reports at http://wrongdiagnosis.com. In addition, John Davenport, in his paper "Documenting High-Risk Cases to Avoid Malpractice Liability," observes that the majority of the cases of misdiagnosis occur in serious diseases such as breast cancer, appendicitis, and colon cancer. Clearly, from these statistics, it can be concluded that misdiagnosis is a significant problem. Thus, it is necessary to find an efficient, unbiased, and accurate method for diagnosis.

Machine learning computer-aided diagnostics (CAD) provides a realistic solution to the problem of misdiagnosis. Though the intuition of a human can never be replaced, machine learning can provide a useful secondary opinion to help in the diagnosis of a patient. In CAD applications, empirical data from various medical studies are used to estimate a
predictive diagnostic model, which can be used for diagnosing new patients. The simplest type of predictive problem is classification. In a classification setting, the goal is to estimate a model that classifies patients' data into two classes (e.g., healthy and sick) based on available features of each patient. The input features may include clinical data (e.g., the number of lymph nodes, results of a blood test), demographic data (e.g., age and sex), genomic data, etc. Often, the number of input variables d is large, say, d = 10 or 100 inputs, and can even reach hundreds of thousands of variables in genomic data. Current medical technologies, such as heart monitors, take numerous readings on several characteristics each minute. The scenario considered here is more limited, but future work could explore several of the settings just described.

Due to the large number of input variables, designing a classifier amounts to estimating a decision boundary in a high-dimensional space based on available diagnostic data about past patients. This is a difficult task for medical doctors because (a) humans have no intuition for working in, or visualizing, a high-dimensional input space and (b) there may be many models (decision boundaries) that explain the available historical data equally well, but have different prediction accuracies. Machine learning methods address both problems and enable estimation of statistically reliable classification models for medical diagnosis. Once such a predictive model is estimated, it can be used in future diagnosis (classification) of new patients. Prediction refers to the CAD model assigning the diagnosis to a new patient, based on the values of the input features for that patient.

Two binary classification methods are considered here: the k-nearest neighbors (kNN) method and the support vector machine (SVM) classifier. The kNN method is a simple classical method based on the intuitive idea of classifying a new patient based on his or her similarity to other patients (with known classification labels). The SVM method is a more recent technology that has become widely used since the late 1990s. It is based on a solid theoretical foundation of statistical learning theory and uses a new concept (i.e., the margin) to control prediction accuracy.
Statistical Learning Methods

The field of statistical learning, or pattern recognition, studies the process of estimating an unknown (input, output) dependency or structure of a system from a finite number of (input, output) samples (i.e., observations or cases). Learning methodologies have been of growing interest in modern science and engineering when the underlying system under study is unknown or too complex to be described mathematically. Machine learning can estimate a 'useful' model to characterize the unknown system using available data. The estimated model is expected to have good prediction accuracy for future data.

The general scenario for machine learning involves three components (see Figure 1): a generator of random input samples, a system that returns an output for a given input vector, and a learning method that estimates an unknown (input, output) mapping of the system from the observed samples (x_i, y_i), i = 1, ..., n. Here, the input vector x denotes the patient's characteristics relevant for diagnosis (classification), and y denotes the class label. We only consider applications with two possible classes (so-called binary classification). Note that the observed, or training, data have classification labels. However, the future (test) data are unlabeled and have to be classified by a learning method.

The learning method is a computer algorithm that implements a set of possible models (or a set of functions) f(x, w) describing the unknown system, where the parameter w indexes the set of functions. The learning method then attempts to select the 'best' predictive model f(x, w) from this set of functions, using only the available data samples. For the classification problems used in this study, the quality of a model is measured by its error rate (for future samples). That is, for a given input x, if a model correctly predicts the class label y, its error is zero; if it makes an incorrect prediction, its error is one. The prediction error rate is the fraction of incorrectly classified future samples over the total number of future samples. The main problems are that the future or test data are unknown and the model has to be estimated using only a finite number of training samples.

Statistical learning does not necessarily use statistical models as they would be presented in an introductory statistics course. The procedures in statistical learning can be defined by algorithms with necessary values (e.g., parameters, tuning constants) being estimated using available data. Whereas a model in the world of statistics corresponds to a statistical model that relies on probability distributions, the procedures (or models) in statistical learning are algorithmic and do not have to correspond to underlying statistical models. Occasionally, algorithms in statistical learning incorporate statistical models (e.g., mixture models), but at other times the predictions have little or no statistical interpretation. In any case, researchers in statistical learning theory refer to their procedures as models and their tuning constants as parameters.

Figure 1. General setting for machine learning: a generator of input samples, the system under study, and the learning method that approximates the system's (input, output) mapping.

k Nearest Neighbors Method

kNN is a simple method of classification based on the notion that, for a given input x, its estimated class label should be similar to the class labels of its 'neighbors,' or surrounding points. So the new (unlabeled) input can be classified by a majority vote of its neighbors from the (labeled) training set. Intuitively, this method makes sense. For example, if a new patient is 50 years old, female, and has five cancerous nodes, she probably should have the same diagnosis as a woman who is 49 years old with six cancerous nodes.
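A minimal kNN sketch in this spirit appears below: inputs already scaled to [0, 1], Euclidean distance, and a majority vote among the k nearest labeled patients. The data and the choice k = 3 are made up for illustration:

# Toy kNN classifier: majority vote of the k nearest (Euclidean) training
# points. Rows of X are patients with inputs already scaled to [0, 1].
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0.50, 0.38], [0.49, 0.46], [0.90, 0.85], [0.95, 0.80]])
y = np.array([1, 1, 0, 0])  # e.g., 1 = healthy, 0 = sick

print(knn_predict(X, y, np.array([0.52, 0.40])))  # -> 1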
One critical consideration is how to measure distance in X-space. In this method, the similarity between input samples is measured as the Euclidean distance between these samples in the input space. However, there are many ways to quantify distance, including statistical—or Mahalanobis—distance. Moreover, each input variable is normalized to the range zero to one to prevent one variable from 'outweighing' another. The main practical issue in the design of kNN classifiers is proper selection of the value of k, which usually depends on the training data. However, as the number of input variables—or the dimensionality of the input space X—increases, the concept of k nearest neighbors can become problematic. As dimensionality increases, the kNN method should become less accurate because points are likely to be more dispersed in high dimensions. This phenomenon is known as "the curse of dimensionality" in statistics.

Support Vector Machine Classifier

The SVM method was introduced for estimating a linear decision boundary with high prediction accuracy. It uses a new concept in binary classification: the margin. For example, in Figure 2(a), there are many linear decision boundaries that classify (separate) the available data into two classes equally well (with 100% accuracy). However, these linear decision boundaries will have different prediction accuracies (for future samples). It can be shown that the best decision boundary has the maximum separation margin between the training samples from the two classes (as shown in Figure 2(b)). Therefore, the optimal model (decision boundary) is the one that maximizes the distance between the decision boundary and the nearest points from each class. The motivation for achieving the maximum margin is relatively simple. If the margin of the decision boundary is small, there is large variability in terms of how many ways the model can separate the data. However, if the margin is large, there is only one way to separate the data. Because of this lower variability, the large-margin model is less prone to random fluctuations of the training data and thus yields higher prediction accuracy (for future data). Note that the concept of margin is independent of dimensionality, so a large-margin decision boundary can guarantee good generalization, even for high-dimensional data.

Figure 2(a). Multiple linear decision boundaries separating the data with zero error.
Figure 2(b). Linear decision boundary ("Optimal Linear Boundary") with maximum margin.

Vladimir Vapnik, in The Nature of Statistical Learning Theory, extended the idea of margin to situations where the training data are not separable and where the decision boundary is nonlinear. These extensions have resulted in the (nonlinear) SVM methodology for classification, where nonlinear models are implemented via so-called kernel mapping. Nowadays, many publicly available software implementations of SVM exist, and this study used MATLAB-based software. When applying SVM classification software to available data, one has to specify two SVM parameters: the parameter C that controls the margin size and a kernel parameter that controls the degree of nonlinearity (of the decision boundary). This study uses the radial basis function (RBF) kernel, with a single parameter, Sigma, denoting the RBF width. Proper tuning of the SVM parameters C and Sigma for a given data set is analogous to the optimal selection of k in the kNN method.

Description of Data Sets

I used three publicly available data sets from the UCI Machine Learning Repository. Preprocessing the data included scaling each input variable to the same range (zero to one) and organizing the data into MATLAB format. Scaling was necessary to make sure one input variable did not dominate other input variables because of its much larger values.

Haberman's Survival Data Set

Haberman's data set contains data from studies conducted on the survival of patients who had undergone surgery for breast cancer. The studies were conducted between 1958 and 1970 at the University of Chicago's Billings Hospital. The data set offers three numerical input variables: age of patient at time of operation, patient's year of operation, and number of positive axillary nodes detected. I disregarded the second input variable (patient's year of operation) for this study because it has very low association with survival and most likely would not have any influence on the output. The output, or class attribute, represents whether a patient survived or died within five years of the operation. There are a total of 306 samples; 81 are labeled as dead and 225 as alive. The correlation coefficients between the two inputs and the output are shown in Table 1.

Statlog Heart Disease Data Set

The Statlog Heart Disease data set contains 270 patient records, where each record (sample) has 13 input variables (see Table 2) and an output (diagnosis) indicating whether the patient has heart disease or is normal. Input variables include age, sex, blood pressure, cholesterol level, etc. Inputs 1, 4, 5, 8, 10, and 12 are real valued. Inputs 2, 6, and 9 are categorical. Inputs 3, 7, 11, and 13 are ordinal (i.e., high, medium, low). There are a total of 270 samples in this data set; 120 are positive and 150 are negative. Correlation coefficients between each of the inputs and the output are shown in Table 2.

Wisconsin Diagnostic Breast Cancer Data Set

The Wisconsin Diagnostic Breast Cancer data set includes results of a medical test for 569 female patients suspected of having breast cancer. Each patient record has 30 input variables computed from a fine needle aspirate (i.e., a diagnostic procedure used to investigate lumps and tumors) of a breast mass. The input variables describe characteristics of the cell nuclei present in the breast, including radius, area, symmetry, etc. (see Figure 3). The output is a diagnosis of whether the cells are benign or malignant. Of the 569 data points, 357 samples are classified as benign instances and 212 as malignant instances. All the input variables are real valued.
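For readers who want to experiment before the experimental procedure is described, here is a small RBF-kernel SVM sketch. The study used MATLAB-based software; scikit-learn is substituted here, and its gamma parameter plays the role of the text's Sigma (larger Sigma corresponds to smaller gamma). The data are random stand-ins, not one of the three data sets above:

# RBF-kernel SVM sketch with the two tuning parameters named in the
# text: C (margin) and the kernel width (gamma here, Sigma in the text).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.random((80, 2))                    # inputs scaled to [0, 1]
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # toy class labels

clf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)
print(clf.predict([[0.2, 0.3], [0.8, 0.9]]))  # most likely [0 1]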
Experimental Procedure

In the problem of estimating a classifier from available data (i.e., training data), we are faced with two goals: (1) an accurate explanation of the training data and (2) good generalization for future data. All modeling approaches implement some sort of data fitting, but the true goal of modeling is prediction. The trick to finding the optimal model is balancing these goals. For example, a model may be great at explaining the training data but can be poor in generalization for future data. This problem is addressed using complexity control, which amounts to choosing an optimal model complexity for a given (training) data set. For example, as k increases, the complexity of the kNN model decreases, and vice versa. Likewise, increasing the SVM Sigma will decrease the complexity of the SVM model. An optimal model complexity should help generalization for future data.
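As a concrete illustration of complexity control, the sketch below picks k for kNN by the M-fold cross-validation described in the next section (M = 5 here). The data are synthetic, and scikit-learn stands in for the study's MATLAB code:

# Choosing k by 5-fold cross-validation: estimate the prediction error
# for each candidate k and keep the k with the smallest error.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.random((150, 2))
y = (X[:, 0] > X[:, 1]).astype(int)  # toy labels

for k in (1, 3, 5, 7, 9, 11):
    accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, round(1 - accuracy.mean(), 3))  # estimated error rate per k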
Resampling

An effective tool for implementing complexity control (i.e., model selection) is resampling, or cross-validation. An extremely complex model may classify all the training data correctly, but it will usually yield poor generalization for future data. A solution lies in cross-validation, which partitions the available data into training and validation sets. Thus, we can use the training set for model estimation and the validation set to validate our model. Next, we change the model complexity, repeat the prior steps, and finally select the model that provides the lowest prediction error for the validation data. This approach solves the dilemma of overfitting because the most complex model (low training error) will not necessarily provide the best validation error. However, the results are sensitive to how the data are split; thus, it is necessary to partition the data randomly. A specific type of cross-validation, illustrated in Figure 4, is called M-fold cross-validation, which involves dividing the available data set Z of size n into M randomly selected separate subsets of size n/M each. Then, one subset is left out and the remaining M-1 subsets are used to estimate the model. The prediction error for this model is estimated using the left-out subset. This is repeated for all M partitions (or M folds) to find the average prediction error. In model selection, we would test many parameter values and select the ones that gave the least cross-validation error. A special case of M-fold cross-validation (with M = n) is called leave-one-out (LOO) cross-validation. In LOO cross-validation, each left-out subset includes only one point; the remaining n-1 data points are used for model estimation. In large data sets, LOO cross-validation and higher-fold cross-validation (i.e., 10–15 fold) would yield the same results. However, in smaller data sets (