Editor’s Letter
Mike Larsen,
Executive Editor
Dear Readers,

It is with great pleasure that I take over as the executive editor of CHANCE. I began reading the magazine in graduate school and have enjoyed it ever since. It has accessible articles about interesting and important topics that highlight the contributing role of probability and statistics. Topics and the background of authors are diverse. I look forward to working with contributors, columnists, and editors to produce a quality magazine each and every issue.

I would like to thank Michael Lavine for his great service as executive editor and for his tremendous assistance to me during the transition. Most of the articles in this issue were first submitted to CHANCE during his term as editor, and he and his editors deserve credit for their efforts. Michael has advised me on handling submissions and answered many questions without complaint. ASA Communications Manager Megan Murphy also deserves thanks for her guidance and help to me and for her work on this issue.

There are a few changes in CHANCE I would like to see occur during my time as editor. First, I would like to see CHANCE go online. This will increase access, impact, and attractiveness to people who might consider submitting articles. Wouldn’t it be neat to have all the election-related articles from CHANCE’s 20-year history immediately available on your desktop? I have begun discussions with the ASA and Springer about options. Second, the posting of online supplemental material for CHANCE can be greatly increased. Supplemental material can include supporting documents, survey questionnaires, videos, color graphics, and teaching material, including handouts and study questions. The ASA has already begun working on the web site. Have you noticed changes?

Besides these two goals, I primarily want to maintain the high level of quality, relevance, and entertainment past editors have achieved. This issue contains the diversity of topics that make probability and statistics such an interesting field of study.

Erik Heiny conducts an in-depth analysis of the leaders of the PGA Tour, America’s top professional golf tour. David Peterson describes algorithms for drawing electoral boundaries to reduce the politicization of the process. Methods are illustrated for districts in North Carolina. Three articles are related to probability. M. Leigh Lunsford, Ginger Holmes Rowell, and Tracy Goodson-Espy discuss an assessment of student understanding of the central limit theorem. Jay Kadane relates the Cauchy-Schwarz inequality to crowding on airplanes. I always suspected the airlines had goals other than my comfort! Jarad Niemi, Brad Carlin, and Jonathan Alexander compare strategies for playing sports betting pools.

Two articles concern gender differences. Kris Moore, Dawn Carlson, Dwayne Whitten, and Aimee Clement report on a survey of female and male executives. Chris Sink and three high-school students (Matthew Sink, Jonathan Stob, and Kevin Taniguchi) present a well-planned study of computer literacy. Two entries concern experiments and observational studies. Michael Proschan describes the procedure he and his wife used to evaluate a medical treatment for their son. Determining whether a treatment works for a specific individual is quite challenging, but important. David Freedman offers an editorial review of the merits of experiments versus observational studies and subtleties of statistical approaches.

Completing the issue are the Visual Revelations and Puzzle Corner columns. Sam Savage and Howard Wainer demonstrate a visualization tool for probabilities of false positive determinations in the war on terror (and many other applications). Thomas Jabine gives us Statistical Spiral Puzzle No. 9.

I look forward to your comments, suggestions, and article submissions. Enjoy the issue!

Mike Larsen
Self-Experimentation and Web Trials
Michael Proschan
A clinical trial is a great way to see whether treatment works “on average,” but not a great way to see whether it works in an individual patient. One problem is that each patient typically receives only one treatment, so no within-patient comparison is possible. Crossover trials do offer such a comparison, but because the patient is on treatment for either the entire first or second half of the trial, any observed difference in a patient might be confounded by a temporal trend or other effect. For example, a traumatic event occurring early in the trial might affect the patient’s results for the entire first half of the trial, during which time he/she is on the same treatment. This does not cause a problem for estimating the average treatment effect over patients because half the patients receive treatment followed by placebo and half the opposite order; temporal trends and other effects tend to balance across treatments. Because it is possible for a treatment to work on average but not in the person we care most about, we would like to conduct an “n of 1” trial to see if the treatment works in the individual of interest. In limited settings where it is safe to receive treatment intermittently as needed, we can conduct such a trial.
This problem is of more than just academic interest to me. When my older son, Daniel, was in elementary school, a counselor from his school informed my wife, Maria, and me that he had attention deficit disorder (without hyperactivity) and that the school could do little to help unless he was placed on medication. Daniel’s doctor concurred and prescribed Ritalin, a drug that remains in the system for only a matter of hours, and therefore can be given as needed. Like most drugs, Ritalin can have side effects (e.g., insomnia, decreased appetite, stomach ache, headache, jitteriness, etc.), and we read it might even stunt growth. Moreover, the Council on Scientific Affairs of the American Medical Association pointed out that “the response rate for any single stimulant drug in ADHD is approximately 70%,” which implies that 30% of children do
not respond to a given stimulant drug. Therefore, we wanted to make sure Ritalin helped Daniel before we kept him on it longer-term.
Self-Experimentation: Our ‘n of 1’ Trial of Ritalin
We decided on the following trial design to answer the question quickly and rule out a temporal trend as a possible explanation. On Monday of each week, Maria flipped a coin and recorded whether Daniel would get Ritalin on Monday or Tuesday, with the other day being an “off” day. Then on Wednesday, she flipped a coin and recorded whether to give Ritalin on Wednesday or Thursday, with the other day being an “off” day. She flipped a coin on Friday to determine whether it would be an “on”
or “off” day. This design ensured balance of temporal trends across on and off days. The primary outcome for this trial was Daniel’s teacher’s assessment, blinded to his medication status, of how well Daniel was able to work independently on each day, on a scale of 1 (worst) to 5 (best). We chose a simple five-point scale because we did not want to burden the teacher with extra work and teachers routinely use such five-point scales to score students in the form of letter grades—A, B, C, D, and F. We selected as primary outcome the teacher’s, rather than our own, assessment of Daniel’s ability to work independently because his teacher was the one who first identified a problem and was therefore in the best position to judge any improvement. My assessment of Daniel’s ability to do his homework independently—using the same 1–5 scale—was an important secondary outcome. I, rather than Maria, was blinded to his medication status because I was expected to help Daniel with homework. In actuality, my travel forced Maria to help Daniel with homework. Therefore, we worried about potential bias because she knew his medication status.

Although we had no data on which to base sample size, we deduced from the teacher’s feedback that he was probably about 1.5 on the five-point scale, and we hoped medication would move him to at least average, which would be a 1.5-point improvement. We conservatively assumed a large standard deviation of 1.5 and determined that about seven weeks (35 days) were needed for 80% power to detect an improvement of 1.5 points.
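For readers curious how such a back-of-the-envelope calculation goes, here is a minimal sketch (not the family’s actual computation) using the normal approximation for an unpaired comparison of on and off days, under the assumptions stated above: a 1.5-point improvement, a standard deviation of 1.5, two-sided α = .05, and 80% power.

```python
# A rough sketch of the sample-size reasoning described above (not the
# authors' actual calculation): normal approximation for an unpaired
# comparison of "on" and "off" days.
from scipy.stats import norm

delta = 1.5    # hoped-for improvement on the 1-5 scale
sigma = 1.5    # conservatively assumed standard deviation
alpha = 0.05   # two-sided type I error rate
power = 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # about 1.96
z_beta = norm.ppf(power)            # about 0.84

# days needed in each arm ("on" and "off")
n_per_arm = 2 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2
print(n_per_arm)        # about 15.7, so roughly 16-17 days per arm
print(2 * n_per_arm)    # roughly 32-34 school days in total, on the order
                        # of the "about seven weeks (35 days)" quoted above
```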
Analysis and Results
Even though the design of the trial seemed to call for a paired t-test, we analyzed it using an unpaired t-statistic, comparing on and off days. It has been shown that whenever the intra-pair correlation is nonnegative, applying an unpaired t-test always preserves the type 1 error rate and can increase power when the number of pairs is small. We also did a sensitivity analysis using a permutation test.

After roughly half of the planned trial, the p-value for the t-statistic on the primary outcome—his teacher’s ratings—was .06. He was doing better on Ritalin by about three-fourths of a point on the five-point scale, which we interpreted as a three-fourths letter grade improvement. Maria’s assessment of his improvement in homework was even more dramatic: about 1.3 points. We felt time was of the essence, and even though the evidence did not rise to the level required to stop early in a conventional clinical trial, it was enough to convince us that Ritalin helped Daniel.

Our concerns that Maria’s scores might be influenced by her knowledge of Daniel’s medication status appear to have been justified; her ratings were higher than the teacher’s when Daniel was on Ritalin (3.83 versus 3.57), but lower than the teacher’s when Daniel was off Ritalin (2.50 versus 2.80). This underscores the need to maintain as much blinding as possible.

Lessons Learned
Overall, the trial went well, though we identified two major weaknesses. The first was that Maria was not blinded to Daniel’s medication status. A student in a class I teach on clinical trial methodology suggested the following clever way we could have avoided this problem: I would number envelopes from 1 to 36 and pair them: (1,2), (3,4), … ,
(35,36). For each pair, I would flip a coin to determine which envelope would contain Ritalin. I would put Ritalin in that envelope and a vitamin in the other envelope of the pair, writing down—but not sharing with Maria—which numbered envelopes contained Ritalin and which contained the vitamin. On the Monday beginning the experiment, Maria would pull out the first two envelopes and flip a coin to determine whether to give Daniel envelope 1 or 2. She would give him the randomly selected envelope and record on a sheet of paper, but not share with me, the number of the envelope given on Monday. On Tuesday, she would give Daniel the remaining envelope of the pair and record its number. On Wednesday and Thursday, she would repeat the steps above for envelopes 3 and 4. Friday would not be included in the experiment. Maria would repeat these steps for the remaining weeks and envelope pairs. At the end of the trial, we would unblind by comparing my wife’s list of the envelopes given on the different days to my list of the contents of each envelope. This procedure would have guaranteed that neither Maria nor I would know until the end of the trial which days were on and which were off. This might have prevented the possible bias noted above. Another lesson from our trial is that carryover can occur in unexpected ways. Daniel sometimes had trouble sleeping as a result of taking Ritalin too late in the afternoon (he had no problems on off days), causing him to be tired the next day. Therefore, an undesirable effect of Ritalin may have carried over and affected Daniel’s performance the next day.
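As an illustration of the design and analysis described above, the sketch below simulates a schedule of on and off days randomized in Monday/Tuesday and Wednesday/Thursday pairs (with a coin flip for Friday) and compares hypothetical teacher ratings with an unpaired t-test and a permutation test. The ratings and the effect size are invented for illustration; they are not the family’s data.

```python
# Sketch of the n-of-1 randomization and analysis described above, using
# simulated ratings rather than the family's actual data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2008)

def randomize_week():
    """Return 'on'/'off' labels for Mon-Fri: coin flips within the
    (Mon, Tue) and (Wed, Thu) pairs, plus a coin flip for Friday."""
    week = []
    for _ in range(2):
        week += ["on", "off"] if rng.integers(2) else ["off", "on"]
    week.append("on" if rng.integers(2) else "off")
    return week

schedule = [day for _ in range(7) for day in randomize_week()]  # ~7 weeks

# Hypothetical teacher ratings on the 1-5 scale, with on-days 0.75 higher.
ratings = np.array([
    float(np.clip(rng.normal(2.25 + (0.75 if d == "on" else 0.0), 1.5), 1, 5))
    for d in schedule
])
on = ratings[np.array(schedule) == "on"]
off = ratings[np.array(schedule) == "off"]

# Unpaired t-test comparing on and off days, as in the article.
t_stat, p_value = ttest_ind(on, off)
print(f"unpaired t = {t_stat:.2f}, p = {p_value:.3f}")

# Permutation test as a sensitivity analysis: reshuffle the labels.
observed = on.mean() - off.mean()
diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(ratings)
    diffs.append(shuffled[:len(on)].mean() - shuffled[len(on):].mean())
print(f"permutation p = {np.mean(np.abs(diffs) >= abs(observed)):.3f}")
```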
Web Trials
If n of 1 trials are so good, why not get an even better answer by amassing data from n of 1 trials from millions of people all following a common protocol on the web? Web trials sound good on the surface, but there are potential biases. What if only those participants who do very well or very poorly report their
results? That is exactly what happened when the Literary Digest predicted—on the basis of 2.3 million responses to 10 million questionnaires—that Kansas Governor Alfred M. Landon would beat President Franklin D. Roosevelt by a 3:2 ratio in the 1936 presidential election. Roosevelt’s landslide victory illustrates how misleading it can be to rely on the opinions of only those who feel most strongly about an issue. Similarly, treatment A might be better than B for the majority of people, but only those with strong results report, and among them, treatment B may be superior.

A closely related issue is that you may get data only from people who are able to remain on a given treatment. It is well-known that compliers often differ from noncompliers. For example, investigators in the Coronary Drug Project found the five-year mortality rate to be .15 for participants taking at least 80% and .25 for patients taking fewer than 80% of their clofibrate pills, suggesting possible benefit of the drug. Then, they found similar mortality rates for patients taking at least 80% and fewer than 80% of their placebo pills. The act of taking pills did not lower mortality; rather, those who took the pills differed in important ways from those who did not. Moreover, nothing guarantees that treatment A compliers are comparable to treatment B compliers. For this reason, traditional clinical trials expend tremendous effort to maintain compliance and obtain outcome data on all participants, regardless of their compliance. I question whether web trials could ensure the same level of compliance and outcome ascertainment.

I am troubled by the suggestion that nonrandomized and randomized data from web trials be lumped. How can we know whether participants who choose treatment A are comparable to those who choose B? Such biases played a prominent role in the hormone replacement fallacy. The possible benefit of hormones on heart disease was first postulated when researchers noted the much lower rate of heart disease in women than in men until roughly the age of menopause, after which women’s risk paralleled that of men’s. Observational studies seemed to confirm the benefit of hormone replacement therapy (HRT) on heart disease, but the women choosing HRT were different from the women not choosing HRT. Then came the Women’s Health Initiative (WHI), a huge clinical
trial whose random assignment of postmenopausal women to HRT or placebo eliminated such selection bias. The WHI found HRT not only fails to provide cardiovascular benefit, but appears to increase the incidence of certain cardiovascular events, especially early after HRT initiation. The early harm in the WHI was consistent with what was observed in another large trial of postmenopausal women, the Heart and Estrogen/Progestin Replacement Study. WHI investigators were concerned enough to announce the cardiovascular results before the scheduled end of the trial. The bottom line is that selection bias from nonrandomized studies can lead to totally false conclusions. The issues raised above may be described as unintentional biases, but I also would worry about more nefarious plots. What is to stop rabid proponents or opponents of certain therapies from reporting bogus data to support their cause? For that matter, what is to stop the hooligans—such as those who introduce computer viruses—from sabotaging web trials? All in all, I believe the advantages of web trials—the large sample sizes that allow detection of rare side effects and the ready availability of data to anyone who wants to analyze it—are far outweighed by the potential biases inherent in web trials.
Summary
Traditional clinical trials tell whether treatment works “on average,” but not necessarily whether it will work in a given individual. In limited settings in which patients can repeatedly go on and off treatment safely without the worry of carryover, an n of 1 trial can help answer that more focused question. (Always discuss your planned trial with your doctor before undertaking it.) Because you (or a loved one) are the only participant, it should be easy to obtain complete data and avoid bias. Such is not the case if we try to combine results from many n of 1 trials reported on the web. The resulting large sample size creates a false sense of reliability. In actuality, the potential biases make the results of web trials
less credible than those of a smaller, but carefully designed, traditional clinical trial.
Further Reading

Bryson, M.C. (1976). “The Literary Digest: Making of a Statistical Myth.” The American Statistician, 30:184–185.

The Coronary Drug Project Research Group. (1980). “Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project.” The New England Journal of Medicine, 303:1038–1041.

Diehr, P.; Martin, D.; Koepsell, T.; et al. (1995). “Breaking the Matches in a Paired T-Test for Community Interventions When the Number of Pairs Is Small.” Statistics in Medicine, 14:1491–1504.

Goldman, L.S.; Genel, M.; Bezman, R.J.; et al. for the Council on Scientific Affairs, American Medical Association. (1998). “Diagnosis and Treatment of Attention Deficit/Hyperactivity Disorder in Children and Adolescents.” Journal of the American Medical Association, 279:1100–1107.

Hulley, S.; Grady, D.; Bush, T.; et al. (1998). “Randomized Trial of Estrogen Plus Progestin for Secondary Prevention of Coronary Heart Disease in Postmenopausal Women.” Journal of the American Medical Association, 280:605–613.

Manson, J.E.; Hsia, J.; Johnson, K.C.; et al. (2003). “Estrogen Plus Progestin and the Risk of Coronary Heart Disease.” The New England Journal of Medicine, 349:523–534.

Proschan, M.A. (1996). “On the Distribution of the Unpaired T-Statistic with Paired Data.” Statistics in Medicine, 15:1059–1063.
PGA Tour Pro: Long but Not So Straight
Erik L. Heiny
During recent years on the PGA Tour, there have been significant technological advancements in both golf balls and clubs. As a result, today’s PGA Tour players are hitting the ball farther than ever before. Golf course designers are conscious of these increased distances and are designing longer courses in response. In a 1999 Golf Magazine article by S. Kramer, noted golf architect Rees Jones was quoted as saying, “Our turning point on doglegs is now 280 yards off the tee, not 250 like it’s been traditionally. The driver is the major component that’s changed our thinking.” Golf course design is trying to keep up with the increased distance, but with the ball traveling so much farther and PGA Tour pros playing on increasingly longer courses, has the importance of individuals’ golf skills changed during this time? Jack Nicklaus mentioned this as a possibility in 1998 when he wrote the following in a Golf Magazine article: “The farther
the ball goes, the harder it becomes to separate the best from the rest. That’s because the more players are able to depend on power, the fewer shot-making skills they need to develop beyond driving, pitching, and putting.”

Driving distance increased steadily over the time period of this study, with the largest jumps in distance occurring in 2001 and 2003. Driving distance increased from 260.40 in 1992 to 273.17 in 2000, a steady increase of one to two yards per year over a nine-year period. However, in 2001, the average driving distance increased six yards to 279.33. In 2003, driving distance increased to 286.38 from 279.76 in 2002. See Figure 1.

The increased driving distance on the PGA Tour appears closely related to technological advances in equipment. Titanium drivers were introduced in 1995 with club head sizes in the 245 cc to 285 cc range. Average driving distance on tour that year was 263.43 yards.
Figure 1. Average driving distance on PGA Tour: 1992–2003

Table 2 — %I/O for Driving Distance on the PGA Tour

Year    Average Driving Distance    %I/O
1992    260.40                      -0.37%
1993    260.16                      -0.09%
1994    261.89                       0.67%
1995    263.43                       0.59%
1996    266.36                       1.11%
1997    267.61                       0.47%
1998    270.53                       1.09%
1999    272.46                       0.71%
2000    273.17                       0.26%
2001    279.33                       2.25%
2002    279.76                       0.16%
2003    286.38                       2.37%
Table 1 — Driving Distances for Selected PGA Tour Players

Name              1997 Average Driving Distance    2003 Average Driving Distance
Phil Mickelson    284.1                            306.0
Ernie Els         271.6                            303.3
Vijay Singh       280.9                            301.9
Robert Allenby    276.6                            294.8
Peter Lonard      259.2                            292.8
In 2000, drivers began to be developed with club head sizes over 300 cc. Also, the majority of tour pros began playing Titleist’s solid-core Pro V1 ball, instead of the old Titleist wound ball in late 2000. The combination of larger drivers with the new ball coincided with the six-yard spike in driving distance for 2001. In 2003, drivers became even larger, with club head sizes in the 365 cc to 400 cc range. Again, this coincided with another spike in driving distance of nearly seven yards.

Some might argue the increased driving distance is due to younger and stronger players with better fitness joining the tour. However, there is a large core of players continuing on the tour from one year to the next, as the top 125 on the money list are exempt for the following season. (A player who is “exempt” has qualified to be a full-time member of the PGA Tour.) In addition, consider five players whose driving distances from 1997 were compared to their driving distances in 2003 in the Golf Digest article “The Gap,” written by Jaime Diaz in 2003 (see Table 1). These players were not younger and stronger in 2003 than they were in 1997.

Ray Stefani discussed a way to identify technological breakthroughs in sports in his 2005 CHANCE article, “Politics, Drugs, and the Olympics and the Winners Are … Politics and Drugs.” He defines percent improvement per Olympiad, %I/O, to be 100(w(n)/w(n – 1) – 1), where w(n) is the winning performance at Olympics n and w(n – 1) is the winning performance at Olympics n – 1. For this study, %I/O will be the percent increase in driving distance from one year to the next. Stefani writes that when there is a technological breakthrough enhancing athletic performance, there is a one-time increase
in %I/O, and after that, %I/O returns to past values. For this study, %I/O was calculated for the years 1992 through 2003 (see Table 2). In 2001, %I/O was roughly 4.6 times the average %I/O from 1992 to 2000 (2.25% versus 0.49%). In 2003, %I/O was roughly 3.8 times the average %I/O from 1992 to 2002 (2.37% versus 0.62%). These numbers certainly indicate significant technological breakthroughs on the PGA Tour regarding driving distance.

Between 1992 and 2003, average driving distance on tour increased by more than 25 yards. Clearly, technology has played an important part in this change. Even if the longer distances are due to better swing mechanics, improved physical fitness, or faster course conditions, the result remains the same. PGA Tour pros are hitting the ball farther than ever before. Here, we attempt to measure the change, if any, in the importance of different golf skills as driving distance increases.
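As a quick check of the %I/O values reported in Table 2, the short sketch below recomputes Stefani’s measure directly from the average driving distances; the 1992 entry cannot be reproduced here because it depends on the 1991 average, which is not listed.

```python
# Recompute Stefani's %I/O = 100 * (w(n) / w(n-1) - 1) from the average
# driving distances in Table 2.
avg_drive = {
    1992: 260.40, 1993: 260.16, 1994: 261.89, 1995: 263.43,
    1996: 266.36, 1997: 267.61, 1998: 270.53, 1999: 272.46,
    2000: 273.17, 2001: 279.33, 2002: 279.76, 2003: 286.38,
}

years = sorted(avg_drive)
for prev, year in zip(years, years[1:]):
    pct_io = 100 * (avg_drive[year] / avg_drive[prev] - 1)
    print(f"{year}: %I/O = {pct_io:+.2f}%")   # 2001 and 2003 stand out at
                                              # roughly +2.25% and +2.37%
```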
Variables
1. GIR (Greens in Regulation): Prior research identified GIR as the most important statistic on the PGA Tour. GIR is defined as whether a player has reached the putting surface in par minus two strokes. For example, if a player is playing a par 4, he must reach the green in two. If he is playing a par 5, he must reach the green in three, and his tee shot on a par 3 must reach the putting surface. The GIR statistic represents the percentage of holes played that the player reached the green in regulation. A disadvantage of GIR is that it does not account for a par 5 reached in two instead of three. PGA golfers frequently reach par 5 greens in two, so this deficiency in GIR should be kept in mind. Another disadvantage is that GIR does not count additional strokes to reach the green. For instance, GIR will not distinguish between a player who reaches a par 4 green in three strokes and a player who reaches a par 4 green in four strokes.
In a 2002 The Statistician article, Robert Ketzscher and Trevor J. Ringrose introduced a statistic called NonGreen that represents the number of strokes taken by the player to reach the green. NonGreen does not have the same disadvantages as GIR. It accounts for par 5 greens reached in two and it counts additional shots required to reach the green. In many cases, these additional strokes are the result of a tee shot that was hit out of bounds, or an approach shot to the green that landed in the water. However, these additional shots are played around the green frequently. For instance, a player may land in deep greenside rough, or try to get cute with an awkward chip, and take additional strokes. These mistakes actually are being made in the short game, and do not represent full swing shots that are considered ball-striking skills. The disadvantage of NonGreen is that it combines full swing shots and short game strokes, and it is difficult to say what type of skill is being measured with NonGreen. GIR was used in this study because it is a good measure of ball-striking skill and is well-known and of interest in golfing circles.

2. PPR (Putts per Round): PPR, the average number of putts per round, has been the most frequently used measure of putting skill in prior studies, and not surprisingly, has been found to be a key indicator of performance. For this statistic, a putt is defined as only those strokes taken while the player is on the green. Strokes taken with the putter from off the green, say the fringe, are not included in this statistic. Another measure of putting, putts per green in regulation (PPGIR), is available on the PGA Tour web site. This records the average number of putts taken only on holes where the player hit the green in regulation. Players who hit the green in regulation will usually be putting from a longer distance, as they typically reached the green with a full swing shot from the fairway. Conversely, players who have missed the green will usually be putting from a shorter distance because they just had to chip onto the green. Thus, the more greens hit in regulation, the higher PPR will tend to be. However, PPGIR has a disadvantage. Better ball-strikers will have shorter putts on greens that were hit in regulation, which can make them appear to be better putters than they really are using PPGIR. Neither measure is perfect, and only one measure could be included in this study due to multicollinearity issues in regression. PPR was chosen because it records a player’s putting on all holes, not just the ones where they hit the green in regulation, and direct comparisons can be made to prior studies.

3. Scrambling: Scrambling is a relatively new statistic on the PGA Tour, and was not recorded until 1992. The great majority of prior research did not consider scrambling, but it was recently incorporated into a 2004 study by Peter S. Finley and Jason Halsey and found to be second in importance only to GIR. Scrambling is defined as the percentage of holes where the green was missed in regulation, but a score of par or better was obtained. To scramble successfully usually requires a good chip shot followed by a solid putt from short range, probably inside 10 feet. There is certainly correlation between scrambling and PPR, but not enough to make multicollinearity an issue, so scrambling was included in the model.

4. Driving Distance (DrDist): DrDist on the PGA Tour is measured only twice each round.
Two holes are selected that run in opposite directions to minimize the effect of wind and where players are most likely to hit driver. DrDist is then computed to be the average of these measurements compiled
during the year. Historically, DrDist has been significantly correlated with performance on the PGA Tour, but it has not been as important as GIR and PPR.

5. Driving Accuracy (DrAccu): DrAccu on the
PGA Tour is recorded on every hole, excluding par 3s. Only whether the player hit the fairway with their tee shot is recorded. DrAccu represents the percentage of fairways hit over the course of the year, and it certainly isn’t a perfect measure of driving accuracy. For example, a player can be one foot off the fairway in the first cut of rough and be perfectly fine, or he could have hit his tee shot 30 yards off line into deep rough and a forest of trees. In both cases, the measure is recorded in the same way: fairway missed. However, over the course of the season, this measure should identify the most accurate drivers on tour. In prior studies, both DrDist and DrAccu have been found to be significantly correlated with performance on tour. However, they have not been found to be as important as GIR and PPR. Between DrDist and DrAccu, DrAccu has been found to be more highly correlated with scoring average on tour. We will attempt to determine if this relationship still holds, when PGA Tour pros are hitting the ball farther than ever before.

6. Sand Saves: Sand Saves is defined as the percentage of up-and-downs from greenside bunkers. Prior research indicates sand saves has not been found to be important at all. This is probably an indication that PGA Tour pros do not spend much time in sand traps, and when they are in bunkers, the level of play across players is fairly uniform. Robert Moy and T. Liaw analyzed 1993 PGA Tour data using money earned as the performance measure. They found sand saves not to be an important statistic and concluded all PGA pros are good sand players. In other words, good and bad tour pros cannot be distinguished by their sand play.

7. Bounce Back: The PGA Tour began recording the bounce back statistic in 1986. It is defined as the percentage of holes with a bogey or worse followed on the next hole with a birdie or better. The only study to include bounce back in the analysis was the previously mentioned article by Finley and Halsey. They concluded bounce back added a unique contribution to predicting player performance and should be included in future studies.
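To make the percentage-type definitions above concrete, here is a small sketch that computes GIR, scrambling, and bounce back from hypothetical hole-by-hole records; the record format is invented for illustration and is not the PGA Tour’s.

```python
# Hypothetical hole-by-hole records for one player; the fields are
# illustrative, not the PGA Tour's. Shows how the GIR, scrambling, and
# bounce back percentages defined above could be computed from such data.
holes = [
    # (par, strokes_to_reach_green, score)
    (4, 2, 4), (5, 3, 4), (3, 1, 3), (4, 3, 4), (4, 2, 5),
    (5, 2, 4), (4, 3, 5), (3, 2, 3), (4, 2, 3), (4, 2, 4),
]

gir_flags = [reach <= par - 2 for par, reach, _ in holes]
gir_pct = 100 * sum(gir_flags) / len(holes)

# Scrambling: par or better on holes where the green was missed in regulation.
missed = [(par, score) for (par, _, score), g in zip(holes, gir_flags) if not g]
scramble_pct = 100 * sum(score <= par for par, score in missed) / len(missed)

# Bounce back: birdie or better immediately after a bogey or worse.
over_par = [score > par for par, _, score in holes]
under_par = [score < par for par, _, score in holes]
chances = [i for i in range(len(holes) - 1) if over_par[i]]
bounce_pct = 100 * sum(under_par[i + 1] for i in chances) / len(chances)

print(f"GIR {gir_pct:.0f}%, scrambling {scramble_pct:.0f}%, bounce back {bounce_pct:.0f}%")
```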
Performance Measures
1. Scoring Average: Actual scoring average—the total number of strokes divided by total number of rounds—was one of the performance measures used in this study. Another possibility could have been adjusted scoring average, which is a weighted average that takes the stroke average of the field into account. It was reasoned that if a player plays a more difficult course, where the fairways are tougher to hit, which corresponds to a lower fairway percentage and a higher score, then that relationship should appear in the data. Using adjusted scores might distort the real relationship between the independent variables and the stroke average.
2. Ln(Money per Event): Ultimately, PGA Tour pros are
judged by how much money they make. The top 125 on the money list are exempt for the following season, the top 30 qualify for the season-ending tour championship, and the top 40 are exempt for the next year’s Masters Tournament, which is the first of four major championships in men’s professional golf each year. Players falling out of the top 125 can even use a one-time exemption if they are high enough on the career money list. Considering the importance of money earned, it seemed necessary to use money as a measure of performance. Money per event was used, instead of total money, to remove the disparity in the number of events played by different professionals. Many of the top players play in events all over the world and therefore don’t play a full PGA Tour schedule. However, due to the skewness of money earned, the natural logarithm of money per event was used, similar to Moy and Liaw, who used the natural logarithm of total money. Using the natural logarithm of money per event tended to dampen the effect of the majority of money going to the top-place finishers and provided a better fit for the model.
Correlation Analysis
To investigate the importance of individual golf skills on golf performance, time series plots of the correlation of each skill with scoring average and ln(money) were constructed. As expected, all independent variables have negative correlations with scoring average and positive correlations with ln(money). The better a player’s performance in a category, the lower their score and the more money they earn. The one exception is PPR, where lower numbers obviously produce lower scores and more money. To make the side-by-side plots more comparable, the absolute magnitude of each correlation, ignoring sign, was used in figures 2 through 15.
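The sketch below shows one way the year-by-year correlation magnitudes plotted in figures 2 through 15 could be computed, assuming a data frame with one row per player per year; the column names are hypothetical stand-ins for the PGA Tour data.

```python
# Sketch of the year-by-year correlation analysis described above; the data
# frame and column names are hypothetical, not the actual PGA Tour fields.
import numpy as np
import pandas as pd

SKILLS = ["DrDist", "DrAccu", "GIR", "PPR", "SandSaves", "Scrambling", "BounceBack"]

def correlation_magnitudes(players: pd.DataFrame) -> pd.DataFrame:
    """Absolute correlation of each skill with scoring average and
    ln(money per event), computed separately for each year."""
    players = players.assign(LnMoney=np.log(players["MoneyPerEvent"]))
    records = []
    for year, season in players.groupby("Year"):
        for skill in SKILLS:
            records.append({
                "Year": year,
                "Skill": skill,
                "ScoringAvg": abs(season[skill].corr(season["ScoringAvg"])),
                "LnMoney": abs(season[skill].corr(season["LnMoney"])),
            })
    return pd.DataFrame(records)

# corrs = correlation_magnitudes(tour_data)   # tour_data: one row per player-year
# corrs.pivot(index="Year", columns="Skill", values="ScoringAvg").plot()
```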
Figure 2. Correlation: scoring average and driving distance
Figure 3. Correlation: ln(money) and driving distance
Figure 4. Correlation: scoring average and driving accuracy
Figure 5. Correlation: ln(money) and driving accuracy
Figure 6. Correlation: scoring average and greens in regulation
Figure 7. Correlation: ln(money) and greens in regulation
Figure 8. Correlation: scoring average and putts per round
Figure 9. Correlation: ln(money) and putts per round
Figure 10. Correlation: scoring average and sand saves
Figure 11. Correlation: ln(money) and sand saves
Figure 12. Correlation: scoring average and scrambling
Figure 13. Correlation: ln(money) and scrambling
Figure 14. Correlation: scoring average and bounce back
Figure 15. Correlation: ln(money) and bounce back
Figure 16. Correlations of driving distance and driving accuracy with scoring average
Figure 17. Correlations of driving distance and driving accuracy with ln(money)
Figure 18. PGA Tour scoring average by year
Greens in regulation (GIR) was consistently the most highly correlated independent variable with both scoring average and ln(money). Scrambling was a close second to GIR, and even had a higher correlation with scoring average in 1999 and higher correlations with ln(money) in 1995, 1999, and 2002. These two variables clearly separated themselves from the rest of the independent variables, which is consistent with prior research. However, interesting trends were noticed in regard to scrambling. The correlation between scrambling and the performance measures was trending down throughout the 1990s, rebounded in 1999, and then dropped again in 2002 and 2003, when driving distances were at the highest levels. The large drop in 2003 may be an anomaly, but it is something to watch in future studies.

DrAccu tended to be more highly correlated than DrDist with scoring average (see Figure 16). The difference was not large, but it was consistent, which supports prior research. However, it doesn’t appear either measure is more important than the other when money is used to measure performance (see Figure 17). An interesting trend with DrDist is that correlations between DrDist and the performance measures went down after 2000, just when actual driving distance on tour spiked. Scoring averages on tour also have dropped during this time, which makes the low correlations even more surprising. Figure 18 shows a time series plot of scoring average on tour during this study. This figure clearly shows scoring average on tour has dropped since 1999, while average driving distance has increased (see Figure 1).

If lower scores on tour coincide with large increases in driving distance, why is the correlation between these two variables going down? This seems counterintuitive, but it could be an indication that nearly all the pros on tour hit the ball a long way now. Certainly, the longer hitters can hit par 5 greens in two with irons and use short irons and wedges into par fours, but maybe their advantage isn’t as pronounced today because shorter hitters can now reach par 5s in two as well. Consider the quote from Jay Haas, 49, in a 2003 Golf Digest article by J. Diaz: “Now even I can reach par 5s. I’m thinking: Wow, is this what it’s like to finally be long? No wonder these guys have been kicking my butt for so long.”

Scott Berry, in his article “A Game of Which I Am Not Familiar” published in CHANCE in 2003, estimated that golfers
today are about 1.5 strokes better than golfers in 1980, based on their driving skills. Scoring average on tour in 1980 was 72.15 and 70.95 in 2003, a difference of 1.20 strokes. Berry’s estimate is difficult to verify empirically because course conditions on tour have changed during this time. Course designers are incorporating more distance into their designs, and tour officials can cut pins closer to the edge of greens, grow the rough higher, and increase the speed and firmness of greens. In light of these changes, Berry’s estimate certainly seems to be reasonable, and indicates that, everything else being equal, better driving skills on tour lead to lower scores. The distance component on driving certainly has improved, but it has improved for all golfers, and it now appears more difficult for players to separate themselves from the field, based on driving distance. Correlations between DrAccu and the performance measures were stable throughout this study, and then took a large
drop in 2003, just when DrDist jumped seven yards to an all-time high of 286.38 yards. Such a large drop in correlation between DrAccu and the performance measures raises the question of whether DrAccu on tour went down dramatically in 2003. A time series plot of DrAccu was constructed and is displayed in Figure 19. DrAccu dropped almost two percentage points in 2003, which isn’t too surprising considering the ball was traveling farther. The longer the tee shot is, the more accurate it must be to stay in the fairway. (It’s actually surprising that DrAccu did not decrease in 2001, when there was a previous spike in DrDist.) Although DrAccu trended down over the past two years, it only decreased from 68% to 66%. This alone does not seem to explain the drop in correlation between DrAccu and scoring average from 0.35 to 0.14, and the correlation between DrAccu and ln(money) dropping from 0.22 to zero. The large decreases in correlation, even though DrAccu only decreased two percentage points, seem to suggest that with the driver going so far, it just didn’t matter whether the player was in the fairway. With longer drives, players could get close enough to the green to play short irons or wedges. Even from rough, they could control the shot into the green.

A final observation regarding the correlation plots is made with respect to bounce back. The correlation between this statistic and the performance measures is clearly trending down after 2000. Prior research indicated this statistic was a meaningful measure of performance and should be included in future studies. The correlation plots seem to indicate this statistic is actually becoming less meaningful as driving distance increases.

Figure 19. Driving accuracy on PGA Tour: 1992–2003

Regression Analysis
To supplement the correlation analysis, linear regression models were used to determine how well the independent variables as a group could predict performance and to measure the unique contribution of each independent variable to prediction. For each year, two regression models were developed. One model regressed scoring average against the seven independent variables and a second model regressed ln(money) against the seven independent variables.

The residual plot for the model predicting scoring average for 1992 is shown in Figure 20. Residual plots for the other years were similar, so they were not included. The residual plot shows a nice random scatter pattern about zero—indicating a good fit—and no major departures from the typical regression assumptions of normality, linearity, homogeneity, and independence. The R-square value, equal to .9457, is listed in the chart as well. Throughout this study, the seven independent variables typically explained around 94% or 95% of the variation in scoring average. The lowest R-square value was in 2003 at 0.90. High R-square values were expected from these models. It should not be considered a “finding” that GIR, PPR, scrambling, sand saves, DrDist, DrAccu, and bounce back predict score accurately. However, the high R-square values indicate the independent variables included in this study almost completely account for any variability in a player’s score. Therefore, when analyzing which independent variables are the most important to PGA Tour players, we are not missing any important skills with regard to score.

Figure 20. Residuals versus predicted values, 1992. DV: scoring average (R² = .945693)

Figure 21. Residuals versus predicted values, 1992. DV: ln(money) (R² = .789406)
The residual plot for the model predicting ln(money) for 1992 is shown in Figure 21. As mentioned earlier, money earned in each event is highly skewed toward the top-place finishers. Using the natural logarithm of money per event tends to dampen this effect and provide a better fit for the models. Again, the residual plot shows a nice random scatter pattern about zero, indicating that the regression assumptions of normality, linearity, homogeneity, and independence are reasonable. The R-square values for ln(money) were much less than what was observed for scoring average in this study. Typically, values ranged from 0.65 to 0.79 until 2003, when they dipped to 0.52. The independent variables certainly explain less of the variation for money than for scoring average, but this was expected due to money earned being weighted heavily toward high-place finishers. However, using the natural logarithm of money, between 65% and 80% of the variation in the response can be explained, which gives meaningful results. In regression, the unique contribution of each independent variable can be measured using an F-test, which measures whether the addition of one independent variable significantly improves the prediction of the response, given that the rest of
the independent variables are already included in the model. For example, in this study, we can test whether adding GIR to a model that already includes DrDist, DrAccu, PPR, sand saves, scrambling, and bounce back significantly improves our prediction of scoring average. In our study, if a variable does not make a unique contribution to the prediction of performance, it might indicate that while correlated to a degree with performance, it is not an important variable in determining performance compared to the other variables in the study, and therefore, not an important skill on the PGA Tour. Tables 3 and 4 list the p-values measuring the unique contribution of each independent variable per year for both measures of performance. These tables indicate that GIR and PPR are two important statistics for predicting player performance, whether measured by scoring average or money. They each made significant and unique contributions to both models in every year of this study. As expected from its high correlation with scoring average noted earlier, scrambling also made a significant and unique contribution to the scoring average model for each year. However, scrambling only made a unique contribution to the ln(money) model during two years, 1992 and 2001. This indicates scrambling is not an important statistic for predicting money earned on the PGA Tour, and due to money being skewed toward the top-place finishers, the players at the top are not separating themselves by saving pars, but by making birdies. When a player is scrambling well, it only indicates they are recovering from poor shots. To earn large paychecks, players must be at the top of their game, consistently giving themselves birdie opportunities and converting. DrDist was not strongly correlated with either performance measure, with correlations ranging from 0.10 to 0.40. However, it made a unique contribution to predicting both performance measures during the time period of this study. While the correlation analysis indicates DrDist is not the most important measure of performance, regression shows
its contribution to prediction is unique and not explained by other variables, making it an important statistic. An interesting trend regarding DrDist and ln(money) should be noted: Average driving distance on tour spiked in 2001 and then again in 2003. The p-value for DrDist has increased since 2001, and rose above 0.05 in 2003, when average driving distance on tour was at its highest level. This might be another indication that nearly all the pros on tour can hit the ball a long way now, and if driving distances continue to increase, it may become increasingly difficult for a player to separate himself from the field on driving distance.

DrAccu was more highly correlated with the performance measures than DrDist, but its contribution to prediction was not as important. For scoring average, DrAccu only made a unique contribution (at α = .05) three times during this study, and not once since 1999. With respect to ln(money), DrAccu made a significant and unique contribution to prediction in all but two years before 2001. However, when driving distances spiked in 2001, the p-value shot up to 0.47, stayed high in 2002 at 0.18, and then shot up again to 0.89 in 2003, when there was another spike in driving distance. Considering that DrAccu did not make a significantly unique contribution to predicting either performance measure during the last three years of this study, it seems evident that accuracy off the tee is becoming less important as driving distances increase.

The regression analysis supports the conclusions of prior research that sand saves is not an important statistic. In this study, sand saves never made a significant contribution to predicting scoring average, and was only significant three times when predicting ln(money).

Prior research indicates bounce back should be included in future studies. Regression results from this study regarding the importance of bounce back to future research are inconclusive. In some years, bounce back made a unique contribution to performance; in other years it did not, but there was no pattern to the results. Bounce back should probably be included
in future studies, but it is clear from the regression results that GIR, PPR, scrambling, and DrDist are much more important statistics.

It should be noted that inferences in this study with regard to regression were only made one year at a time. In a given year, all observations are independent (different players), so these inferences are valid. However, a primary interest of this study is to examine trends that may appear from one year to the next. As the top 125 pros on the money list are exempt for the following year, a large cohort of players will remain on tour from one year to the next. This means there is dependence among observations from one model to another. However, no formal statistical tests are being made comparing one year to another. This study is simply attempting to identify any changes in the style of play on tour, specifically with regard to the changing importance of different statistics from one year to the next. Whether these changes are due to the same players changing their games, or to new players with different styles joining the tour, is not being examined in this study.

Table 3 — P-Values Measuring Unique Contribution to Predicting Scoring Average

Year    DrDist     DrAccu    GIR        PPR        Sand Saves    Scrambling    Bounce Back
1992    < .0001    .1193     < .0001    < .0001    .2763         < .0001       .1206
1993    .0104      .6473     < .0001    < .0001    .8092         < .0001       .0107
1994    < .0001    .0311     < .0001    < .0001    .5456         < .0001       .0117
1995    < .0001    .6597     < .0001    < .0001    .9007         < .0001       .1747
1996    < .0001    .2361     < .0001    < .0001    .4440         < .0001       .0022
1997    < .0001    .0033     < .0001    < .0001    .6801         < .0001       .1006
1998    < .0001    .0417     < .0001    < .0001    .6082         < .0001       .2043
1999    .0030      .0776     < .0001    < .0001    .7804         < .0001       .1023
2000    < .0001    .3940     < .0001    < .0001    .8550         < .0001       .9308
2001    < .0001    .0586     < .0001    < .0001    .0632         < .0001       < .0001
2002    < .0001    .1222     < .0001    < .0001    .7276         < .0001       .0242
2003    .0011      .3959     < .0001    < .0001    .5982         < .0001       .7718

Bold indicates unique contribution is not significant at α = .05.

Table 4 — P-Values Measuring Unique Contribution to Predicting ln(Money)

Year    DrDist     DrAccu     GIR        PPR        Sand Saves    Scrambling    Bounce Back
1992    < .0001    .0473      < .0001    < .0001    .1786         .0464         .3027
1993    < .0001    .1196      < .0001    < .0001    .2606         .0706         .6819
1994    < .0001    < .0001    < .0001    < .0001    .8295         .7269         .9717
1995    < .0001    < .0001    < .0001    < .0001    .2924         .0508         .0311
1996    < .0001    .0003      < .0001    < .0001    .1126         .9412         .0111
1997    < .0001    .0032      < .0001    < .0001    .0124         .8537         .5065
1998    < .0001    .0592      < .0001    < .0001    .7134         .9805         .1475
1999    < .0001    .0063      < .0001    < .0001    .4592         .7024         .4875
2000    < .0001    .0156      < .0001    < .0001    .6515         .4512         .5931
2001    .0008      .4681      < .0001    < .0001    .0069         .0045         .0005
2002    .0022      .1845      < .0001    < .0001    .0005         .1450         .4496
2003    .0545      .8893      < .0001    < .0001    .2034         .8234         .1621

Bold indicates unique contribution is not significant at α = .05.
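To make the procedure behind Tables 3 and 4 concrete, here is a sketch of the per-year regressions, assuming a player-by-year data frame with hypothetical column names. Because each "unique contribution" test adds a single variable to a model already containing the other six, the partial F-test is equivalent to the t-test on that variable's coefficient, so the p-values can be read from the full fitted model.

```python
# Sketch of the per-year regressions and unique-contribution p-values behind
# Tables 3 and 4; the data frame and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

SKILLS = ["DrDist", "DrAccu", "GIR", "PPR", "SandSaves", "Scrambling", "BounceBack"]

def unique_contribution_pvalues(players: pd.DataFrame, response: str) -> pd.DataFrame:
    """Fit one regression per year; each skill's p-value (equivalent to the
    partial F-test that adds it to the other six) plus the model R-square."""
    rows = {}
    for year, season in players.groupby("Year"):
        X = sm.add_constant(season[SKILLS])
        fit = sm.OLS(season[response], X).fit()
        rows[year] = {**fit.pvalues[SKILLS].to_dict(), "Rsquare": fit.rsquared}
    return pd.DataFrame(rows).T

# tour_data["LnMoney"] = np.log(tour_data["MoneyPerEvent"])
# table3 = unique_contribution_pvalues(tour_data, "ScoringAvg")
# table4 = unique_contribution_pvalues(tour_data, "LnMoney")
```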
Cluster Analysis
Correlation and regression analysis support prior research that GIR is the most important single statistic on the PGA Tour. However, the results also seem to indicate that as driving distance continues to increase, driving accuracy is not as important as it used to be. To further examine this trend, the most successful golfers on tour were profiled using cluster analysis. Cluster analysis attempts to divide the data set into groups where the observations in each group have similar characteristics. The premise is that based on the independent variables,
cluster analysis would combine players into groups where the players in each group had similar styles of play. For instance, in this study, one group might be characterized as long, inaccurate drivers with average short games. Another group might be characterized as short, accurate drivers with good short games. In order to separate the observations into like groups, some type of distance measure must be used and observations with a small distance from each other are combined into the same group. This study used Ward’s hierarchical clustering method, which attempts to minimize the error sum of squares for each cluster. The error sum of squares is the sum of the squared deviations for every item in the cluster from the cluster mean. The cluster mean is the average player for the group. This average player has the mean driving distance, the mean putting average, etc., for the entire group or cluster. The closer any player’s statistics are to the average player, the smaller their distance is from the cluster mean. The error sum of squares is the sum of all these squared distances from the cluster mean. In other words, the more alike the items are in each cluster, the smaller the error sum of squares will be. Ward’s cluster analysis was performed by year, using the independent variables in this study after they were standardized to produce four and five clusters. Standardized variables were used because the independent variables are measured on different scales. Average performance for a category is zero, and above average performance is given by positive numbers. The one exception, of course, is PPR. The lower this number is, the better the performance, so groups of good putters would have negative numbers in this category. In order to make the results easier to read, the sign was reversed for PPR so positive values imply above average performance for all independent
variables. Deciding on the number of clusters to use was somewhat arbitrary, but combining the players into four or five groups seemed reasonable. After the clustering was finished, money per event and stroke average were averaged over all players for each cluster. Remember that the clusters were constructed without any regard to the performance measures. In this study, the cluster with the highest average earnings was also the cluster with the lowest stroke average and labeled the “best” cluster. Both four- and five-cluster solutions were similar, so only the fourcluster solution is reported. Table 5 lists the standardized values for the independent variables, averaged over all players in the best cluster from the four-cluster solution, for each year in the study. Using Ward’s hierarchical clustering method, it became clear that golfers in clusters with high earnings and low stroke averages tended to perform well in several categories, supporting Moy and Liaw. Sometimes these elite clusters would be average, or even below average, in a skill, but they clearly did many things well. Despite new technology enabling golfers to hit their drives farther than ever before, simply bombing the ball off the tee is not enough to be successful. Clusters with low earnings and high stroke averages were usually below average in multiple categories, especially the short game. One consistency for the elite cluster in each year was high performance in greens in regulation. This can be seen clearly in Figure 22, which is a time series plot of GIR performance for the best cluster. The only exceptions were 1995 and 2001, when the average performance in greens in regulation was compensated for with high marks in all short game skills. This gives further evidence that GIR is the most important skill on the PGA Tour, dating back to a study by James Davidson and Thomas Templin in 1986, and continuing through to today’s PGA Tour, with its advanced technology. An interesting trend can be seen regarding driving accuracy and driving distance. As mentioned before, titanium drivers were first introduced in 1995. The top cluster was above aver-
age in driving accuracy and below average in driving distance during 1993, 1994, and 1995. However, beginning in 1996, the characteristics reversed. The top cluster was at or below average in driving accuracy and above average in driving distance. This can be seen more clearly in the time series plot in Figure 23. The only exception is 2002, when driving accuracy was better than driving distance. Recall that 2002 was also a season when driving distance leveled off for a year. The cluster analysis is entirely exploratory, but it seems to support the conclusions of the correlation and regression analysis. GIR is the most important statistic on tour, and driving accuracy has not been an important statistic in recent years.
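For readers who want to experiment with the clustering step described above, here is a minimal sketch in Python. It is not the authors' code; the data frame and its column names (for example, MoneyPerEvent and StrokeAvg) are hypothetical stand-ins for the PGA Tour statistics used in the study.

```python
# A minimal sketch of Ward's clustering on standardized skills, assuming a
# pandas DataFrame `golf` with one row per player-year. Column names are
# hypothetical; the article's data and software are not reproduced here.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

skills = ["DrDist", "DrAccu", "GIR", "PPR", "SandSaves", "Scrambling", "BounceBack"]

def best_cluster_summary(one_year: pd.DataFrame, k: int = 4) -> pd.DataFrame:
    X = one_year[skills].apply(zscore)   # standardize each skill within the year
    X["PPR"] = -X["PPR"]                 # flip sign so positive = above-average putting
    tree = linkage(X, method="ward")     # Ward's minimum-variance hierarchical clustering
    labels = fcluster(tree, t=k, criterion="maxclust")
    out = one_year.assign(cluster=labels)
    # Average the performance measures by cluster; the cluster with the highest
    # average earnings (and lowest stroke average) would be labeled "best".
    return (out.groupby("cluster")
               .agg(money=("MoneyPerEvent", "mean"),
                    strokes=("StrokeAvg", "mean"),
                    **{s: (s, "mean") for s in skills}))

# summary_by_year = golf.groupby("Year").apply(best_cluster_summary)
```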
Conclusions The analysis shows a high greens in regulation percentage and a good short game are the keys to low scoring averages and high earnings. This is no surprise and should not be considered a ‘finding.’ However, as technology improves and drives go farther, the importance of driving accuracy, even among the top golfers, is becoming less important. Consider the driving statistics for the 2003 season, listed in Table 6, of the top three ranked players in the world at the time of this study. These are just three players, but they were the best three players in the world at the time of this study. The message is clear. If a player is driving the ball far enough, driving accuracy is not an important statistic. The comments of Hal Sutton in the 2003 Golf Digest article “The Gap” seem to confirm this as the feeling on tour: “If you were to ask everybody out here whether they wanted distance or accuracy, they’d all tell you distance,” he said. “Forget accuracy. The fairways are soft, the greens are hard, and there is no rough. Let’s kill it. There’s no such thing as thinking. It’s just grab your driver and hit it as far as you can.” An interesting direction for future research would be to continue to examine the relationship between driving distance and driving accuracy. As mentioned by Ketzscher and Ringrose,
Table 5 — Characteristics of the 'Best' Cluster, 1992–2003

Year   Size   DrDist   DrAccu   GIR     PPR     Sand Saves   Scrambling   Bounce Back
1992   39     -0.67    0.99     0.33    0.50    0.49         0.93         -0.07
1993   41     -0.06    1.03     1.15    -0.14   0.08         0.84         0.66
1994   67     -0.11    0.55     0.50    0.22    0.05         0.46         0.64
1995   63     -0.29    0.51     -0.06   0.83    0.61         0.92         0.10
1996   14     1.64     -0.32    1.00    -0.16   1.00         0.61         1.03
1997   27     1.47     -0.67    0.47    -0.08   -0.10        -0.06        0.64
1998   31     1.02     -0.26    0.94    -0.45   -0.40        -0.24        1.13
1999   66     0.55     0.03     0.84    -0.33   0.04         0.16         0.53
2000   44     0.56     0.02     0.53    0.58    0.98         0.81         0.73
2001   46     0.45     -0.62    -0.04   0.67    0.74         0.48         0.59
2002   42     0.17     0.54     1.11    -0.36   0.21         0.38         0.11
2003   53     0.56     -0.50    0.13    0.50    0.46         0.76         0.54
Figure 22. GIR performance of "best" cluster (average standardized score by year, 1992–2003; series: GIR and AVG)

Figure 23. Driving characteristics of "best" cluster (average standardized score by year, 1992–2003; series: DrDist and DrAccu)
Table 6 — 2003 Driving Statistics for the Top Three Ranked Players in the World

World Rank   Name          DrDist   DrDist Rank   DrAccu   DrAccu Rank
1            Tiger Woods   299      11            62.7%    142
2            Vijay Singh   302      6             63.4%    132
3            Ernie Els     303      5             61.0%    161
driving ability is a skill that can be measured accurately because it does not depend on any other statistics, and therefore, researchers should be able to accurately measure its impact on performance. In addition, club and ball manufacturers will continue to try to improve their products, and unless the USGA or some other governing body intervenes, driving distances will almost certainly continue to increase. The distance the ball travels and its effect on the game should be an interesting and hotly debated topic in the years to come.
Further Reading Belkin, D.S.; Gansneder, B.; Pickens, M.; Rotella, R.J.; and Striegel, D. (1994). “Predictability and Stability of Professional Golf Association Tour Statistics.” Perceptual and Motor Skills, 78:1275–1280. Berry, Scott M. (2003). “A Game of Which I Am Not Familiar.” CHANCE, 16(3):45–48. Davidson, J.D. and Templin, T.J. (1986). “Determinants of Success Among Professional Golfers.” Research Quarterly for Exercise and Sport, 57:60–67. Diaz, J. (2003). “The Gap.” Golf Digest, 54(5):134–140. Engelhardt, G.M. (1995). “ ‘It’s Not How You Drive, It’s How You Arrive’: The Myth.” Perceptual and Motor Skills, 80:1135–1138. Finley, P.S. and Halsey, J.J. (2004). “Determinants of PGA Tour Success: An Examination of Relationships Among
Performance, Scoring, and Earnings." Perceptual and Motor Skills, 98:1100–1106. Hale, T. and Hale, G. (1990). "Lies, Damned Lies, and Statistics in Golf." In A. Cochran (Ed.) Science and Golf: Proceedings of the First World Scientific Congress of Golf. London: E & FN Spon. Jones, R. (1990). "A Correlation Analysis of the Professional Golf Association (USA) Statistical Rankings for 1988." In A. Cochran (Ed.) Science and Golf: Proceedings of the First World Scientific Congress of Golf. London: E & FN Spon. Ketzscher, R. and Ringrose, T.J. (2001). "Exploratory Analysis of European Professional Golf Association Statistics." The Statistician, 51(2):215–228. Kramer, S. (1999). "State of the Game." Golf Magazine, 41(10):144–146, 187. Moy, R. and Liaw, T. (1998). "Determinants of Professional Golf Tournament Earnings." American Economist, 42:65–70. Nicklaus, J. (1998). "Too Hot for the Game's Good." Golf Magazine, 40(9):42–46. Stachura, M. (2003). "Bigger Is Better." Golf Digest, 54(5):146–148. Stefani, R.T. (2005). "Politics, Drugs, and the Olympics and the Winners Are … Politics and Drugs." CHANCE, 18(4):20–28.
Putting Chance to Work: Reducing the Politics in Political Redistricting David W. Peterson
Peterson shows the role chance can play in creating hundreds—or perhaps thousands—of partial solutions to a problem, of which we select only the best for further consideration.
To many people, chance is something that befalls them. It causes their neighbors to win the lottery instead of them; it causes their officemates' daughters to be admitted to Stanford instead
of their own daughters; and it causes their old college roommates to win those glamorous job assignments they covet for themselves. Chance is not always perverse, though, for it acts as well to keep some of us free
from illnesses or injuries that afflict others, it gives us the occasional sunny day when we have an outing planned, and it gives us the happy coincidence—as when a gift turns out to be just exactly what we want. But chance in all such instances is something that happens to us; we don’t count or rely on it, for to do so too often leads to disappointment. How odd it must seem, then, to learn that statisticians—far from passively suffering the vagaries and indignities of chance—actively harness it to perform amazing feats. Indeed, they harness chance in such a way that chance not only performs a valuable function, but eliminates—or at least greatly reduces— the adverse effects of, well, chance. Consider the designed experiment, perhaps a comparison of the effects of two headache treatments. In such an experiment, patients are assigned treatment based on chance so that any differences in response can be attributed to no more than two possible causes: the chance assignment process or a difference in the effectiveness of the two treatments. Because of the way the experiment is structured, one can calculate the probability that the assignment process alone would cause the observed difference in effectiveness. If this probability is sufficiently small, it strongly suggests the differences in response are due
Shading indicates super districts. Numbers indicate the number of senators for the super district.
Figure 1. A super district cover for the North Carolina Senate
Shading indicates super districts. Numbers indicate the number of senators for the super district.
Figure 2. A second super district cover for the North Carolina Senate
to a difference in the effectiveness of the two treatments—that one treatment is more effective (or at least has different effects) than the other. Or consider the random sample, perhaps a survey of the sort conducted regularly on behalf of politicians. A thousand people are chosen at random from a certain target population—say, all vot-
ing-age residents of Ohio—and each is asked some yes-or-no question, such as whether he or she favors an increase in gasoline tax as part of a national effort to encourage use of alternative energy sources. Because of the way chance works in such situations, the result of this poll is almost guaranteed to reveal—to within about 5%—the
percentage of all voting-age residents of Ohio who, if asked, would say they favor such a tax. Chance can be harnessed usefully in other ways, as well. Consider Monte Carlo simulations, in which elaborate computer models of an evolving economic system or an aircraft in flight are fed randomly generated shocks to see
Figure 3. Final 2003 North Carolina Senate voting districts showing super districts (county “clusters”) and incumbent residences
how they respond. And consider, too, the rather odd application that follows, where we harness chance to obtain a partial solution to a problem that seems too complex to solve outright, a problem that has plagued the voting public for decades and that seems to get worse as technology improves: How to define voting districts without undue reliance on politics. The role of chance here is in creating hundreds, or perhaps thousands, of partial solutions to our problem, from which we select only the best for further consideration.
The Problem
The North Carolina Legislature consists of a 50-member Senate and a 120-member General Assembly. For purposes of electing members of the Senate, the state is divided into 50 nonoverlapping geographic regions ("Senate voting districts"), and the voters in each region elect their own senator. Similarly, the state is divided into 120 "General Assembly voting districts" for purposes of electing members of that legislative body. To elect its members in the U.S. House of Representatives, the state is likewise divided into 12 congressional voting districts. These three divisions of the state have a common quality. Each Senate voting district is supposed to contain the same number of residents as every other Senate district; each General
Assembly voting district the same number as every other General Assembly district, and each congressional voting district the same number as every other congressional district. Because populations grow, shrink, and shift over time, a set of district boundaries that meets these requirements for one year’s elections often fails to meet them for another’s. So, with every decennial census, North Carolina, like many states, usually has to redefine at least one of its several sets of voting districts. Over the years, drawing new political boundaries has become a fine art—a fine political art. Armed with data on past voting patterns and modern mapping software, politicians can fine-tune the process to almost guarantee a particular incumbent will be re-elected or not, or that the party in control of the redrawing process will remain in control indefinitely. The whole process is so fraught with political nuance that representatives almost choose their voters, rather than the other way around. Of course, there are certain restrictions on how one may define districts. They should be contiguous—no fair plucking a few acres from each of several distant towns and calling that a district. And districts should be compact; an oblong or square district is preferable to one shaped like a salamander. Natural barriers, such as mountains or bodies of water, may be of some account, as
may a difference in interests of, say, people living in rural areas relative to city-dwellers. But all in all, aside from the requirement that each voting district encompass nearly the same number of residents as every other, anyone drawing new district boundaries has plenty of latitude. The problem, then, is how to do the job in a way that limits this opportunity for political mischief. Others before us have tried to solve this problem more or less objectively. In a 1998 article in Management Science, Anuj Mehrotra, Ellis L. Johnson, and George L. Nemhauser describe an approach using a mix of heuristics and mathematical programming, an approach much different from ours. In Harvey Wagner’s Principles of Operations Research, students are challenged to formulate (but not solve) a version of the problem. Additionally, this problem was posed as part of the 2007 High School Mathematical Contest in Modeling run by the Consortium for Mathematics and Its Applications (COMAP).
The Super District Method The North Carolina Supreme Court took on this knotty problem and contributed the following sensible requirement: Voting district boundaries should, wherever possible, be synonymous with county boundaries. By adding a require-
ment that maximal use be made of existing geopolitical boundaries in forming voting districts, the court cuts down sharply on the number of acceptable ways of forming districts and removes myriad opportunities for fine political tuning. It also raises a new issue: How in the world does one construct a set of districts that makes the greatest possible use of county boundaries? The court, itself, describes how. Suppose there is a county that just happens to contain one-fiftieth the state’s population. (The court makes clear it really means here one-fiftieth, plus or minus 5%.) Then, declares the court, any such county is a Senate voting district. Suppose there is a county that just happens to have two-fiftieths of the state’s population. Then, any such county needs to be divided into two Senate voting districts, but neither district may include any area outside that county. Suppose there is a county that just happens to contain three-fiftieths … well, you see where this is going. You also see this solution is far from complete. There may not be any counties of the sort described above, and what are we to do then? What we need to do is divide the state into groups or clusters of adjacent counties in a way that every county cluster
contains either one-fiftieth, two-fiftieths, three-fiftieths, etc., of the state’s population. We need to do this in a way that a county group cannot be further subdivided into two county groups, each having a whole number of fiftieths of the state’s population. Although a county cluster might consist of a single county, it is more likely to contain several counties, and we need to insist, in the latter instance, that the counties be contiguous. We call any such county group a “super district,” and any nonoverlapping collection of super districts that covers the entire state a “super district cover.” By definition, then, a super district is a geographic area bounded entirely by county lines that has just enough residents to support one, two, or more state senators. A super district with population sufficient to support one senator could serve as a Senate voting district with no further adjustment—it is an area bounded entirely by county lines, consisting of either a single county or two or more contiguous counties. A super district with population sufficient to support two or more senators has to be split into two or more pieces, each containing one-fiftieth of the state’s population, but at least the outer limits of these voting districts will follow the county lines that define the super district. A super district cover is a collection of super districts. There are many ways North Carolina can be covered with super districts. As a starting point for defining actual voting districts, we would like to have in hand a super district cover consist-
ing of as many super districts as possible, because such a cover makes maximal use of existing county boundaries. Figure 1 depicts a super district cover for North Carolina. All the counties within a super district are shaded the same way; a change in shading corresponds to a change in super district. Each county within a super district contains a number indicating the number of senators associated with the super district as a whole. In all, there are 30 super districts in this cover and they collectively support 50 senators. With this super district cover as a starting point, the task of defining the 50 Senate voting districts within the state becomes 30 separate tasks, each much simpler than the original. In fact, as 19 of the 30 have a population sufficient to support exactly one senator, there is no more work to be done with them—each of the 19 already qualifies as a Senate voting district. Of the remaining 11 super districts in this cover, seven have populations just sufficient to support two senators. Each of these seven must be divided into two geographic areas of equal population. Of the other four super districts, three have populations supporting four senators and one supports five; these, too, need to be subdivided into Senate voting districts. These remaining 11 tasks are left to the politicians or their agents; our super district method does not address them. Note, however, that the magnitude of these 11 tasks, each of which can be executed completely
What the Computer Does Our computer program finds a cover through a two-step process. In the first step, it creates a library of super districts, which is a long list of counties and groups of contiguous counties having just enough residents to support a whole number of state senators. In the second step, the computer randomly picks one of these super districts, and then another that does not overlap it, and then another that does not overlap the first two, and so forth, until the whole state is covered. In the picking, heavy preference is given to those remaining super districts having the smallest numbers of counties, so as to pack as many super districts as possible into the cover. The process is tricky, because after each pick, there is a possibility that the portion of the state yet to be covered is neither a super district nor coverable by a nonoverlapping collection of super districts. Once the computer emerges from these steps with a super district cover for the state, the cover is stored for later comparison with other covers it finds. The best of these covers—the ones that contain the most super districts—are good candidates for use in defining voting district boundaries.
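A rough sketch of this two-step search follows. It is an illustration rather than the authors' SAS program; the county adjacency map, county populations, and the library of qualifying county groups are assumed inputs, and the preference weighting is one plausible way to favor small super districts.

```python
# A sketch of the randomized cover construction described in the sidebar.
# `library` is assumed to be a list of frozensets of counties, each already
# verified to be contiguous; `pop` maps county -> population.
import random

def n_senators(group, pop, pop_total, seats=50, tol=0.05):
    """Number of senators a county group supports, or 0 if it is not a
    super district (population must be within 5% of a whole multiple of
    one-fiftieth of the state's population, per the court's rule)."""
    p = sum(pop[c] for c in group)
    k = round(p / (pop_total / seats))
    ideal = k * pop_total / seats
    return k if k >= 1 and abs(p - ideal) <= tol * ideal else 0

def random_cover(library, counties, tries=1000):
    """Randomly assemble non-overlapping super districts until the whole
    state is covered, favoring super districts made of few counties."""
    for _ in range(tries):
        remaining, cover = set(counties), []
        while remaining:
            candidates = [g for g in library if g <= remaining]
            if not candidates:
                break                                   # blind alley: start over
            weights = [1.0 / len(g) ** 2 for g in candidates]  # prefer small groups
            cover.append(random.choices(candidates, weights)[0])
            remaining -= cover[-1]
        if not remaining:
            return cover
    return None

# library = [g for g in contiguous_groups if n_senators(g, pop, pop_total)]
# covers = [random_cover(library, counties) for _ in range(2000)]
# best = max((c for c in covers if c), key=len)          # most super districts
```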
independently of the other, is by any reasonable measure far simpler than splitting the entire state into 50 compact and contiguous areas of equal population. Furthermore, while politics may play a role in subdividing these 11 super districts, such considerations do not involve trading portions of one super district for those of another; any political juggling must take place entirely within the borders of the super district. The cover shown in Figure 1 is not unique. There are many others, one of which is shown in Figure 2. This cover also consists of 30 super districts and is a possible starting point for the creation of Senate voting districts. It differs from the cover shown in Figure 1 in that it has 20 (rather than 19) single-senator super districts. Of the 10 other super districts, six support two senators each, one supports three, two support four, and one supports seven. For use in its 2003 state Senate redistricting, the North Carolina Legislature used the cover depicted in Figure 3. To create these covers and many others, we wrote and ran a computer program that discovered several hundred covers for the state. Sifting through these, we found 30 to be the largest number of super districts packed into a cover, and we found about a dozen distinct ways of doing this, all of which we presented to the Legislature. The Legislature chose to work with the cover depicted in Figure 3 and went on to subdivide the super districts into the voting districts shown there. This cover features 20 super districts having one senator, five with two, one with three, three with four, and one with five. The Legislature’s work consisted, then, of choosing one of the dozen covers, and, having done so, solving 10 independent and highly constrained little problems, instead of one large and amorphous one.
How To Create Super District Covers So, from where do super district covers come? And what does all this have to do with chance? Glad you asked. We generated covers randomly using a computer program (see sidebar), discarding those
covers that are obviously inferior and saving only those that consist of the largest numbers of super districts. For North Carolina’s Senate redistricting in 2003, we randomly generated more than 2,000 covers for the state. A lot of them were duplicates, and many of them contained far fewer than 30 super districts. None contained more than 30 super districts. We do not know for sure whether there exist covers having more than 30 super districts, but as 30 was the most we found out of the more than 2,000 we examined, it seems a good bet that there are few, if any, such covers. And how exactly does our computer go about the process of building a cover? It isn’t pretty. It does what computers do best; it slogs randomly, but intelligently, step-by-step through myriad possibilities, running more often than not into blind alleys from which it must back up and try again before it finally completes a cover. Even so, generating 1,000 covers is the work of but a few minutes on a fairly standard PC running SAS.
Other States Could Do This, Too In essence, defining North Carolina’s Senate voting districts is no different than defining its General Assembly voting districts or its U.S. congressional voting districts. The size of the population associated with each representative is different, but that is all. Indeed, the state’s Legislature used the super district method to redraw both its General Assembly and Senate voting districts in 2003. Other states wishing to use existing county, city, or parish boundaries wherever possible in defining voting districts could use this method, too.
Further Reading Stephenson v. Bartlett, 355 N.C. 354 (2002) Mehrotra, A.; Johnson, E.L.; and Nemhauser, G.L. (1998). “An Optimization-Based Heuristic for Political Redistricting.” Management Science, 44(8):1100–1114. Wagner, H.M. (1975). Principles of Operations Research, 2nd ed. New Jersey: Prentice-Hall.
Assessment of Student Understanding of the Central Limit Theorem M. Leigh Lunsford, Ginger Holmes Rowell, and Tracy Goodson-Espy
What is classroom or action research? In the spring semester of 2004, we used a classroom research model (see supplemental material at www.amstat.org/publications/chance) to investigate our students' understanding of concepts related to sampling distributions of sample means and the Central Limit Theorem (CLT). Our goal, when implementing our teaching methods and assessing our students, was to build on the work of Robert delMas, Joan Garfield, and Beth Chance. We applied their "classroom research model" to Math 385, the first course of a two-semester, post-calculus mathematical probability and statistics sequence taught at the University of Alabama in Huntsville (a small engineering- and science-oriented, PhD-granting university in the southeastern United States with an approximate undergraduate enrollment of 5,000). The CLT is one of the most fundamental and important results in probability and statistics. For students who are learning statistics, it provides a gateway from probability and descriptive statistics to inferential statistics. Recall that, from
properties of mathematical expectation, we know the sample mean statistic, X̄, is an unbiased estimator of the population mean μ (i.e., E(X̄) = μ) and has a standard deviation of σ/√n, where n is the size of the random sample and σ is the population standard deviation. Intuitively, the CLT tells us that not only does the sample mean statistic have the above stated mean and standard deviation, but that as n 'gets large,' the sampling distribution of the sample mean statistic approaches a normal distribution with mean μ and standard deviation σ/√n.
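A short simulation conveys the same picture as the illustration that follows. This is a sketch only; the exponential mean of 0.35 is an arbitrary choice and not necessarily the value used to produce the figure.

```python
# Sampling distribution of the mean of an exponential population: 1,000
# sample means for several sample sizes n, in the spirit of the sidebar below.
import numpy as np

rng = np.random.default_rng(seed=1)
mu = 0.35                      # exponential mean (and standard deviation); arbitrary
for n in (1, 12, 144):
    xbar = rng.exponential(scale=mu, size=(1000, n)).mean(axis=1)
    # CLT: the mean of xbar stays near mu, its spread shrinks like sigma/sqrt(n),
    # and its histogram looks increasingly normal as n grows.
    print(n, round(xbar.mean(), 3), round(xbar.std(ddof=1), 3), round(mu / np.sqrt(n), 3))
```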
The Central Limit Theorem Illustrated
Values of X are generated independently from an exponential distribution. The first plot is a histogram of 1,000 single values. The second is a plot of 1,000 averages, each computed as the average of 12 observations. The third is a histogram of 1,000 averages, each of 144 observations.
Stage 1: What Is the Problem? What Is Not Working in the Classroom?
In the first stage of the classroom research model, the instructor reflects on classroom experiences and uses available data— observations of students’ behaviors and analysis of students’ work—to identify areas where students are experiencing learning difficulties. Once a problem area is identified, the instructor can refer to published research concerning the topic, if any exists, to gain insight into what might be causing student difficulties and how such difficulties might be alleviated. While introductory statistics courses have been the focus of reform curricula and pedagogy, the typical two-semester mathematical probability and statistics sequence has not received the same degree of attention and is generally taught using traditional lecture methods. By “typical,” we refer to the sequence in which the first semester mainly consists of probability topics, culminating with sampling distributions and the CLT, and the second semester mainly consists of further statistical concepts. Since the majority of our students were science, education, engineering, and computer science majors who would take only the first course of the sequence, they needed a good understanding of the statistical concepts in this course. However, due to the generally late, short, and fast coverage of sampling distributions and the CLT in the first semester, our students—particularly those who only completed that course—may not have had the opportunity to develop a deep understanding of these important concepts. Such students might be unprepared to understand applications of sampling distributions and the CLT in their own discipline areas (e.g., mechanical engineers involved in quality-control processes, electrical/software engineers performing high-speed network analyses, biologists studying animal populations). Thus, we wanted to assess and improve our teaching of sampling distributions and the CLT in our first-semester course.
Stage 3: Collecting Evidence of Implementation Effectiveness

Once the instructor decides to use particular materials or teaching strategies, he or she must identify types of data that can be collected to assess the effectiveness of the materials and/or methods. We endeavored to measure how well students understood sampling distributions and the CLT before and after coverage of the topic in Math 385. Again building on the work of previous researchers, we used a quantitative assessment tool provided by delMas. This tool featured graphically oriented questions and questions of a fact-recollection nature, as well as straightforward computational questions. We used this tool as both a pretest and a post-test to measure student understanding of sampling distributions and the CLT. The pretest was administered just before covering sampling distributions and the CLT. We did not return the pretest to the students, nor did we give the students feedback regarding their pretest performance. The post-test was administered on the last day of class as an in-class quiz. The students turned in their reports from the activity during the class period before the quiz was administered. We also developed a qualitative assessment tool that was given to the students at both the beginning and end of the semester. This tool measured students' attitudes and beliefs about several aspects of the course, including their use of technology and their understanding of concepts. As expected, we saw significant improvement in student performance from pretest to post-test on the quantitative assessment tool. Even with our small sample size, the paired differences of post-test minus pretest scores were significantly greater than zero for the Math 385 class (t(17) = 6.7, p < .0001).

Appendix: The Cauchy-Schwarz Inequality

Assume E(X²)E(Y²) > 0, since otherwise X or Y is zero with probability 1. Let a and b be any numbers, and let Z = aX − bY. Then,

0 ≤ E(Z²) = a²E(X²) − 2abE(XY) + b²E(Y²)   (A.2)

for all a and b. If this equation, viewed as a quadratic in a for fixed b, had two real roots, then there would be values of a and b for which the right-hand expression in (A.2) would be negative, contradicting (A.2). Hence, (A.2) has, at most, one real root. Then, its discriminant,

4b²(E(XY))² − 4E(X²)b²E(Y²) = 4b²{(E(XY))² − E(X²)E(Y²)}   (A.3)

must be nonpositive, proving (A.1). If the discriminant (A.3) is zero for some a and b, then E(Z²) = 0, so P(aX = bY) = 1 for those a and b. If W and Z have zero mean, then the Cauchy-Schwarz inequality is the same mathematical fact as ρ²(W,Z) ≤ 1, where ρ is the correlation between variables W and Z.
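A quick numerical sanity check of the inequality is easy to run; the joint distribution below is arbitrary and purely illustrative.

```python
# Verify (E[XY])^2 <= E[X^2] E[Y^2] by simulation with an arbitrary pair (X, Y).
import numpy as np

rng = np.random.default_rng(0)
X = rng.gamma(2.0, size=100_000)
Y = 0.3 * X + rng.normal(size=100_000)      # correlated with X
lhs = np.mean(X * Y) ** 2
rhs = np.mean(X**2) * np.mean(Y**2)
print(lhs <= rhs, round(lhs, 3), round(rhs, 3))   # the inequality holds
```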
Hypothetical Airline Numerical Example
Suppose a tiny airline company has two planes, one with four seats and one with six, each flown half the time. Imagine the probabilities p_{k,n} of carrying k passengers on an n-seat plane are given in the table below.
Passengers   Four-Seat Plane   Six-Seat Plane
0            0.01              0.01
1            0.04              0.02
2            0.13              0.04
3            0.12              0.10
4            0.20              0.11
5            —                 0.12
6            —                 0.10
Total        0.50              0.50
But the Cauchy-Schwarz Inequality assures us this is always the case. Furthermore, the equality is strict unless X is a constant times Y, which would mean that k/√n is proportional to √n, or k is proportional to n. This would mean every airplane flies exactly k/n full, in which case k/n is both CR and LF. The other extreme occurs when the airline flies one full plane and many empty ones. Then, every passenger experiences CR = 1, but the airline's LF is nearly zero. Thus, we may conclude that crowdedness as experienced by the typical passenger is never lower—and is usually higher—than the load factor experienced by the airline. Isn't that a comforting thought? Of course, the same considerations apply to every other mode of transportation—highways, buses, ships, etc.—to classes in a college or university, to health facilities, to internet queues …
Further Reading Grimmett, G. and Stirzaker, D. (2001). Probability and Random Processes, 3rd ed., Oxford University Press: Oxford.
The average number of seats flown is 4(0.50) + 6(0.50) = 5 seats. The average number of passengers flown is computed as follows: 0(0.01) + 1(0.04) + 2(0.13) + 3(0.12) + 4(0.20) + 0(0.01)+1(0.02)+2(0.04)+3(0.10)+4(0.11)+ 5(0.12) + 6(0.10) = 3.5 seats. Thus, the load factor (LF) for the airline is 3.5/5.0 = 0.70, or 70%. The computation of the numerator of the average crowdedness is similar, but differs by the factor k/n in each term. It necessarily is smaller than the average number of occupied seats: 0(0.01)(0/4) + 1(0.04)(1/4) + 2(0.13)(2/4) + 3(0.12)(3/4) + 4(0.20)(4/4) + 0(0.01)(0/6) + 1(0.02)(1/6) + 2(0.04)(2/6) + 3(0.10)(3/6) + 4(0.11)(4/6) + 5(0.12)(5/6) + 6(0.10)(6/6) = 2.783 seats. Therefore, the average crowdedness (CR) for a passenger is 2.783/3.5 = 0.795, or nearly 80%. Now that we understand this, doesn’t everyone feel better about airline travel?
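The same arithmetic can be written out as a short script, using only the probabilities from the table above.

```python
# Load factor (LF) and average crowdedness (CR) for the hypothetical airline,
# using the probabilities in the table above.
probs = {4: {0: .01, 1: .04, 2: .13, 3: .12, 4: .20},                    # four-seat plane
         6: {0: .01, 1: .02, 2: .04, 3: .10, 4: .11, 5: .12, 6: .10}}    # six-seat plane

E_seats = sum(n * sum(p.values()) for n, p in probs.items())             # E(n) = 5.0
E_pass  = sum(k * pk for n, p in probs.items() for k, pk in p.items())   # E(k) = 3.5
E_kk_n  = sum(k * pk * (k / n) for n, p in probs.items() for k, pk in p.items())

LF = E_pass / E_seats        # 0.70
CR = E_kk_n / E_pass         # about 0.795
print(round(LF, 3), round(CR, 3), CR >= LF)   # crowdedness never falls below load factor
```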
Contrarian Strategies for NCAA Tournament Pools: A Cure for March Madness? Jarad B. Niemi, Bradley P. Carlin, and Jonathan M. Alexander
Every March, the National Collegiate Athletic Association selects 65 Division I men's basketball teams to compete in a single-elimination tournament to determine a single national champion. Due to the frequency of upsets that occur every year, this event has been dubbed "March Madness" by the media, who cover the much-hyped and much-wagered-upon event. Though betting on NCAA games is illegal, the tournament's existence tempts people to wager money in online and office pools in which the goal is to predict—prior to its onset—the outcome of every game. A prespecified scoring scheme, typically
Table 1 — Assumed Win Probabilities, Idealized Four-Team Tournament

       A     B     C     D
A      -     .57   .70   .78
B      .43   -     .64   .73
C      .30   .36   -     .60
D      .22   .27   .40   -

The (x, y) entry is the probability that team x beats team y.
Table 2 — Assumed Opponents' Choices, Idealized Four-Team Tournament

         Round 1   Round 2
Team
A        .90       .75
B        .75       .20
C        .25       .05
D        .10       .00

The (x, y) entry is the proportion of opponent sheets that have team x winning in round y.
assigning more points to correct picks in later tournament rounds, is used to score each entry sheet. The players with the highest-scoring sheets win predetermined shares of the total money wagered. Given that one wants to play an office or online pool, the question at hand is: How to make picks in order to maximize profit? Many strategies exist for filling in a pool sheet, based on everything from the teams' AP rankings to the colors of their uniforms. One sensible mathematical approach might be to try to maximize your sheet's expected score and assume this will, in turn, maximize profit. This approach can, indeed, be profitable when the scoring scheme is complex, particularly when it awards a large proportion of the total points for correctly predicting upsets. Tom Adams' web site, www.poologic.com, provides a Java-based implementation of this approach for a variety of pool scoring systems. This web site also can produce the highest expected score sheet subject to the constraint that the champion is a particular team. In a tournament, a pool sheet consists of picks for the winner of every game, where these winners can only come from
the winners of the previous round. We define the probability of a sheet as the product of the win probabilities for each game chosen on that sheet. Since the probability of an individual game outcome turns out to be well-approximated using a normal approximation, which we describe below, this calculation is tedious but straightforward. Most office pool scoring schemes are relatively simple, awarding a set number of points for each correct pick in a given tournament round. In such cases, the sheet that maximizes expected score will typically predict few upsets, and thus have too much in common with other bettors’ sheets to be profitable. What’s more, pool participants tend to “overback” heavily favored teams—a fact it seems a bettor could use to competitive advantage. Betting strategies that attempt to choose reasonable entries while simultaneously seeking to avoid the most popular team choices are sometimes referred to as contrarian strategies. To quantify the amount a given sheet has in common with the other sheets entered in a pool, we define the similarity of a sheet as the sum of the similarity of each game chosen by that sheet to the picks for that game on every other sheet. To compute the similarity of a single game, we multiply the points available for that game by the proportion of people in the pool who chose that team to win that game. After summing the individual similarities, we normalize the statistic by dividing by the maximum possible points available. This provides a statistic that ranges from 0 to 1; it is 0 when the sheet has no picks in common with any other sheet and 1 when the sheet has exactly the same picks as every other sheet in the pool.
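As a concrete illustration of these two definitions, here is a short sketch for the four-team example developed below, using the win probabilities in Table 1, the pick proportions in Table 2, and the one-point/two-point scoring adopted in that example. It reproduces the first example sheet's values of 0.28 and 0.79 quoted below.

```python
# Probability and similarity of one pool sheet in the four-team example.
win = {("A", "D"): .78, ("B", "C"): .64, ("A", "B"): .57}    # from Table 1
pick_share = {("A", 1): .90, ("B", 1): .75, ("A", 2): .75}   # from Table 2
points = {1: 1, 2: 2}                                        # round -> points available

# sheet: list of (winner, loser, round); here A and B advance, A wins the title
sheet = [("A", "D", 1), ("B", "C", 1), ("A", "B", 2)]

prob = 1.0
for w, l, _ in sheet:
    prob *= win[(w, l)]                                      # product of picked-game win probs

sim = sum(points[r] * pick_share[(w, r)] for w, _, r in sheet) / sum(
    points[r] for _, _, r in sheet)                          # normalized by max points

print(round(prob, 2), round(sim, 2))                         # 0.28 0.79
```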
To illustrate the concepts of probability and similarity, consider a fourteam tournament where team A plays team D and team B plays team C in the first round, and the winners of these games play each other in a second-round game for the championship. We adopt a simple scoring scheme where we award one point for each correct first-round pick and two points for a correct secondround (championship) pick. In order to calculate the probability of a particular set of picks, we need the probability that each team will beat each other team in the pool. Suppose we adopt the probabilities shown in Table 1 (we illustrate how these can be calculated from published betting lines or team computer ratings below). In order to calculate similarity, we instead need to know how all the other players in our pool have made their picks. The necessary summary of these picks is shown in Table 2, where the entries are the proportions of opponent sheets that have the team in that row winning their game in the round indicated by the column. So, for example, we see that 0.20 (20%) of our opponents have team B winning the championship. Now consider two sheets entered into this pool. The first sheet picks A and B in the first round, followed by A winning the championship. The second sheet picks A and C in the first round, followed by C winning the championship. The first sheet appears to be a ‘safe’ strategy, as A and B are better than their first-round opponents and A is better than B. The second sheet appears to be more of a long-shot, as it predicts C will defeat two superior teams. Using our definitions above, the first sheet has probability 0.28 and similarity 0.79, while the second
has probability 0.13 and similarity 0.31. The second entry has less than half the probability of the first, but also enjoys a much lower similarity score. A player submitting this sheet has a smaller chance of winning the pool, but figures to share the winnings with far fewer opponents if these predictions do pan out. The previous example assumed knowledge of both the individual game win probabilities and our opponents' betting behavior. In practice, either or both of these may be difficult to estimate. In the remainder of this article, we discuss a contrarian method to increase profit without precise knowledge of opponents' bets. This method's objective is to identify teams that have a high probability of winning, but are likely to be "underbet," relative to other teams in the pool.

Available Data
Our main source of data is three years' worth of betting sheets and actual tournament results for an ongoing Chicago-based office pool, summarized in Table 3. "Champions bet by seed" indicates how many people chose each of the top four seeds to win the tournament, with the corresponding percentages in parentheses. For example, in 2003, 86 out of 113 sheets (76%) had either Kentucky, Arizona, Oklahoma, or Texas (the four #1 seeds that year) winning the championship. The pattern appears consistent, except for 2004, when 44% of the sheets had a #1 seed winning and 42% of the sheets had a #2 seed winning. This was due to 22% of the sheets having Connecticut (a #2 seed) as their champion. In that year, Connecticut was widely regarded as the best team in its region and did end up winning the tournament. The scoring scheme for this office pool awards 1, 2, 4, 8, 16, and 32 points for correctly predicted victors in rounds 1–6, respectively. The score for each sheet is then the sum of the points earned for each correctly predicted game. The percentage of the total pot awarded is 45%, 22.5%, 15%, 10%, and 7.5% for the first-place through fifth-place finishers, respectively.

Table 3 — Exploratory Data Analysis of Chicago Office Pool Sheets

Year                     2003        2004        2005
Participants             113         138         167
Champions bet by seed
  1                      86 (76%)    61 (44%)    137 (82%)
  2                      14 (12%)    58 (42%)    18 (11%)
  3                      4 (4%)      10 (7%)     5 (3%)
  4                      7 (6%)      3 (2%)      4 (2%)

Seed is the rank of the team in one of the four regions of the country.

Simulating Return on Investment

Figure 1. Histogram of game outcomes (favorite's score minus underdog's score) minus the imputed point spreads (based on final Sagarin ratings), 2003–2005 NCAA tournament data
A number of numerical methods can be used to analyze tournament data. Perhaps the simplest approach would be to enumerate all possible tournaments, determine probabilities for each, and then obtain expected winnings for each sheet as it competes with opponent sheets. Unfortunately, because there are 63 games (excluding the pretournament "play-in" game between the 64th and 65th teams selected), there are 2⁶³ possible tournament outcomes. For this reason, one of the most useful tools in analyzing tournaments is to simulate a large number of tournaments, using the resulting relative frequencies of the outcomes to reduce the computation but preserve realism. After simulating these tournaments, we can calculate the return on investment of each sheet. If we think of the outcome of a basketball game as the number of points scored by the favorite minus the number scored by the underdog, it turns out a histogram of these outcomes looks like a normal distribution (bell curve) centered at the point spread, the amount by which the betting public expects the favorite to win. Figure 1 provides this histogram for our data—the 63 x 3 = 189 games in the 2003, 2004, and 2005 tournaments. After subtracting the point spread from each game outcome (favorite's score minus underdog's score), we do get an approximately normal distribution centered on 0. Thus, in the long run, the bettors seem to "get it right": half the time, the favorite wins by more than expected ("covers the spread") and half the time not. About 95% of games land within roughly 24 points (two standard deviations) of the spread (i.e., between -24 and 24 in Figure 1). This means there is a simple formula (using the area under the correct bell curve) for converting a point spread to the probability that the favorite will win the game. Sadly, true point spreads will not be available for every possible match-up prior to the tournament. However, many computer ratings are designed so the difference between two teams' ratings is an estimate of the point spread for
Figure 2. Histograms of simulated ROI for all pool sheets across years and rating systems
a game between these two teams at a neutral site. This is true of all the ratings used below, though they differ in their emphases. Vegas ratings are based on only point spreads and over/under betting lines in the first round of the tournament. Elochess ratings are based on the win-loss results of all games in the regular season. Predictor ratings use the point differentials in all games in the regular season. Finally, Sagarin ratings are a combination of Elochess and Predictor. To simulate a tournament, we start by picking one of the rating systems. Looking at a particular match-up between two teams, we calculate the favorite's rating minus the underdog's rating. Using a bell curve centered at this difference, we calculate the area under the curve and to the right of zero. This gives us the probability that the favorite will win that game. We then draw a uniform random number between zero and one, for example using
RAND() in Microsoft Excel. If this number is less than the probability, our simulation says the favorite won that game, otherwise the underdog won it. We can repeat this process for all of the Round 1 games, followed by the resulting Round 2 match-ups and so on to simulate an entire 63-game tournament. Now, for each simulated tournament, all office pool sheets for that year can be scored, ranked, and awarded prizes as described previously. Repeating this process over many simulated tournaments, the return on investment (ROI) for each sheet may be calculated as ROI = (total won − total invested) / total invested. This calculation is standardized, so each sheet costs $1. A ROI of zero indicates a break-even strategy; whereas, a negative (positive) value indicates a losing (winning) strategy. We will calculate ROI for each actual sheet for each year and probability model, as well as certain 'optimal' sheets.

Figure 3. Scatterplot of similarity versus log(probability) with ROI indicated by shading, Predictor ratings

Figure 4. Filled contour plot of ROI by similarity and log(probability), Predictor ratings
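To make the simulation procedure above concrete, here is a minimal sketch. It is not the authors' implementation; the ratings dictionary and bracket are assumed inputs, and the 12-point standard deviation simply reflects the roughly 24-point two-standard-deviation band noted above.

```python
# Game-level simulation under the normal approximation: the favorite's margin
# is treated as normal, centered at the rating difference, with SD ~ 12 points.
import random
from statistics import NormalDist

SD = 12.0

def p_win(rating_a: float, rating_b: float) -> float:
    """Probability that team A beats team B under the normal approximation."""
    return 1.0 - NormalDist(mu=rating_a - rating_b, sigma=SD).cdf(0.0)

def play(team_a, team_b, ratings):
    """Simulate one game: A wins if a uniform draw falls below its win probability."""
    return team_a if random.random() < p_win(ratings[team_a], ratings[team_b]) else team_b

def simulate_round(teams, ratings):
    """Play bracket neighbors against each other, returning the round's winners."""
    return [play(teams[i], teams[i + 1], ratings) for i in range(0, len(teams), 2)]

# ratings = {...}   # e.g., Sagarin-style ratings keyed by team (hypothetical)
# bracket = [...]   # 64 teams in bracket order
# while len(bracket) > 1:
#     bracket = simulate_round(bracket, ratings)   # repeat to crown a champion
```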
Table 4 — Simulated ROI for Various Sheet Selection Methods and True Probability Models

                              2003                    2004                    2005
Method                    S     P     E     V     S     P     E     V     S     P     E     V
Maximum Expected Score   -0.2   2.5   0.8   14    3.2   3.1   3.4   2     0.3   3.2   2.3   14
Contrarian                3.7   2.5   3.5   14    4.9   3.1   5.4   7     3.6   3.2   2.7   16
Contrarian Motivation Before developing a contrarian strategy, an important question is whether the idea has demonstrable merit. In this section, we show that most sheets in an office pool have low ROI and that maximizing point total methods also do not have
high ROI. We then turn to the problem of producing contrarian sheets with improved ROI. We simulated 1,000 tournament outcomes for each of the four rating systems and each year of our data. We then evaluated how the 418 sheets from
our 2003–2005 Chicago office pools would have fared in these simulated tournaments. Thus, we are evaluating sheets not based on what actually happened, but on what was likely to have happened according to our rating systems. Histograms of these results can be seen in Figure 2. The rows correspond to years and the columns to different rating systems. The histograms then provide the proportion of sheets falling into each ROI category. For example, in 2003 using Predictor as the rating, about 37% of the players had a simulated ROI between –1 and –0.5; one player had a simulated ROI between 3.5 and 4. Also shown on each histogram is the percentage of sheets having an ROI below zero, indicating a losing investment, and the percentage above one, a substantial (at least money-doubling) winning investment. Note that in all 12 cases, at least 60% of the strategies are losers in the long run, while the proportion that double one’s money or better rarely exceeds 15%. Figure 3 investigates the differences between those pool sheets consistently near the top and bottom of the simulated ROI distributions in Figure 2 by plotting similarity versus log(probability) using the Predictor rating for the sheets in our data set. The plotting character indicates the sheet’s year, while its shading indicates its simulated ROI (with darker shading corresponding to higher ROI). The figure suggests the first requirement for a high ROI is to have a relatively high probability, as there are few dark points with log-probability less than –30. However, low similarity also appears to be a general characteristic of high ROI sheets. This relationship is further clarified by the filled contour plot in Figure 4, which indicates that given a sheet’s log-probability, low similarity tends to maximize ROI, and given a sheet’s similarity, high probability tends to maximize ROI. As mentioned above, previous thinking in this area has focused on identifying sheets that maximize score. To further illustrate that this method may not deliver a sheet with high expected ROI, we derived the maximizing sheet for each year. We then repeated our ROI simulation under all four rating systems. The expected score-maximiz-
ing sheet was entered into the simulated pools and its ROI performance evaluated in a competition with that year's actual sheets. The average ROI of the maximum expected score sheets is displayed in the first row of Table 4, where the column headings indicate the Sagarin (S), Predictor (P), Elochess (E), and Vegas (V) rating systems. We see many high ROI values, but also one negative and two moderate values. Since this method does not take opponents' bets into account, it is affected by how many opponents happen to have similar sheets. Moreover, its performance may degrade further over time, as more players discover www.poologic.com and other sites that can perform these same calculations with just a few mouse clicks.

Table 5 — Actual (A) versus Expected (S, P, E, V) Champion Picks, 2003–2005

2003
Team             A    S    P    E    V    U*
Kentucky         58   18   15   19   15
Arizona          15   13   13   12   12
Kansas           9    10   19   6    0    P
Oklahoma         8    6    5    6    4
Illinois         5    2    3    2    0
Texas            5    9    7    9    40   V
Syracuse         4    8    4    15   1    E
Florida          3    5    4    4    2
Pittsburgh       2    11   11   9    2    S
Dayton           1    0    0    1    0
Indiana          1    0    0    0    0
Maryland         1    2    4    1    0
Louisville       1    4    6    2    4
Other            0    26   23   27   32

2004
Team             A    S    P    E    V    U*
UConn            30   24   12   23   5
Kentucky         27   6    8    6    29
OK State         23   14   8    15   4
Duke             19   24   31   18   24   P
Stanford         12   5    5    6    13
Gonzaga          5    5    9    3    17
Pitt             4    4    3    9    0
Georgia Tech     3    9    10   9    2
St. Joseph's     3    15   12   18   17   S, E, V
Texas            3    3    3    3    0
Wisconsin        2    2    3    1    0
Syracuse         2    1    0    2    0
Michigan St.     1    0    0    0    0
Wake Forest      1    2    4    2    7
Cincinnati       1    1    5    1    2
North Carolina   1    2    4    1    2
Maryland         1    1    2    1    3
Other            0    19   19   18   13

2005
Team             A    S    P    E    V    U*
Illinois         83   31   15   67   18
North Carolina   38   32   56   20   51   P
Duke             13   18   21   13   3
Oklahoma St.     12   10   14   4    0
Washington       3    11   7    9    37   E, V
Wake Forest      3    14   10   7    28   S
Kentucky         3    7    4    8    1
Gonzaga          2    1    0    2    0
Florida          2    2    4    1    0
Michigan St.     2    3    3    3    0
Boston College   1    1    0    2    0
Arizona          1    3    1    4    0
Georgia Tech.    1    1    2    0    0
Louisville       1    6    10   6    1
Kansas           1    6    4    2    0
Oklahoma         1    5    5    2    22
Other            0    15   11   17   5

*The most underappreciated team is indicated in column U.
Contrarian Strategy
To increase our ROI, we will use information about how our 2003–2005 opponents bet to pick an underbet champion (i.e., the championship game's most underappreciated team) in each year, and then simply use a maximum score strategy to fill in the remainder of our sheet. As before, our calculations may vary with the rating system we are using. If the resulting contrarian sheet does not perform well, this bodes ill for the practical setting, where opponent behavior can only be guessed. Table 5 contains information about the teams chosen to win the championship in each year. The column headed A gives the actual number of sheets that chose that team to win the championship. Subsequent columns give the number of sheets expected to pick that team as champion (i.e., the probability that team wins the championship times the number of people in the pool) under the S, P, E, and V ratings. The most underappreciated team in the championship (i.e., the team with the biggest difference between expected and actual championship picks) under the four probability models is indicated by the corresponding letter in the rightmost column labeled U. From Table 5, we can see that, in general, the heaviest favorites have more people choosing them than the probability models expect. An exception to this rule arises from an apparent "Duke-hating factor," as even when Duke is a favorite, it tends not to be overbacked. However, Kentucky seems overappreciated in 2003 and 2004, and the extreme devotion to Illinois in 2005 is not too surprising in this Chicago-based pool. Returning then to our quest for a high ROI sheet, we simulate ROI for a sheet taking the most underbet champion and then score-maximizing for all previous games subject to this constraint. The results are displayed in Table 4 in the row marked "contrarian." Comparing these results to the maximum expected score results, we can see that in four of 12 cases, the ROI is the same, and in the remaining eight cases, it is higher for the underbet champion sheet. Surprisingly, the average ROI under the Predictor model is the same in all three years using both maximum score and contrarian methods, as the underappreciated champion happens to also be the most probable champion in each year.
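A sketch of the "most underbet champion" calculation behind Table 5 follows. The championship probabilities would come from the tournament simulation described earlier; the function name and the example inputs are hypothetical, not the authors' code or data.

```python
# Expected champion picks = (probability the team wins the title) x (pool size);
# the most underbet champion is the team whose expected picks most exceed the
# actual number of sheets that chose it.
def most_underbet(actual_picks: dict, title_prob: dict, pool_size: int) -> str:
    """Return the team with the largest shortfall of actual versus expected picks."""
    def shortfall(team):
        return title_prob[team] * pool_size - actual_picks.get(team, 0)
    return max(title_prob, key=shortfall)

# Example with made-up title probabilities:
# most_underbet({"Kentucky": 58, "Kansas": 9, "Texas": 5},
#               {"Kentucky": 0.16, "Kansas": 0.17, "Texas": 0.08}, pool_size=113)
```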
However, in all but one of the remaining cases, the contrarian approach offers an often substantial improvement. These results confirm our intuition that if we can guess how our opponents select their champions, we may be able to improve our ROI by being contrarian.
Discussion
We have shown that a contrarian strategy improved simulated ROI over straight score-maximization strategies in NCAA tournament pools with standard scoring schemes. Our approach requires only that the user select a contrarian champion and fill in the rest of his sheet using maximization of expected score subject to this constraint, free software for which is available at www.poologic.com. Without further modeling of opponent betting strategy, one needs to make an educated guess about which team will be the most underbet in the championship. With this educated guess, one could obtain a good sheet with minimal effort using the poologic calculator. One ad hoc rule for most pools is to avoid the heaviest favorites (say, the two or three #1 seeds with the highest AP rankings), as they are typically overbacked. Another ad hoc rule is to avoid local teams. In 2005, we correctly guessed bettors in our Chicago-based pool would overback Illinois because it was both a 'local' team (it received heavy media coverage in Chicago) and one of the two heaviest favorites. Other ad hoc rules may arise from experience with one's own pool; we will certainly be looking carefully at Duke in future years. We caution that all our empirical results are based on our 2003–2005 Chicago pool data. Illinois' 2005 success might have hurt a contrarian in Chicago, but helped one in Connecticut. Finally, you may be wondering how we've been doing as real-life contrarians. Our results have been mixed. Our strategy was not a winner in 2004, 2005, or 2007, but did come through in 2006. Specifically, underbet champion St. Joseph's did not quite make the Final Four in 2004 (losing to Oklahoma State on a three-pointer at the buzzer), and 2005 and 2007 were certainly not years to be contrarian, with two heavy favorites
(North Carolina and Illinois in 2005 and Florida and Ohio State in 2007) successfully arriving at the championship game in those years. In 2006, however, we would have won our pool if UCLA had won the championship game; as it was, we still finished in third place. Over those four years, our total return on investment is small but positive. We find this encouraging enough that we look forward to being contrarians again during the 2008 version of March Madness. Editor's note: CHANCE does not encourage wagering on sports. Before placing a bet, you are encouraged to make sure it is legal to do so. The probability calculations, simulations, and data analysis in this article should be of general interest to statisticians.
Further Reading Breiter, D.J. and Carlin, B.P. (1997). “How To Play Office Pools if You Must.” CHANCE, 10: 324–345. Carlin, B.P. (1996), “Improved NCAA Basketball Tournament Modeling via Point Spread and Team Strength Information.” The American Statistician, 50:39–43. Clair, B. and Letscher, D. (2005). “Optimal Strategies for Sports Betting Pools.” Department of Mathematics, Saint Louis University. Kaplan, E.H. and Garstka, S.J. (2001). “March Madness and the Office Pool.” Management Science, 47:369–382. Metrick, A. (1996). “March Madness? Strategic Behavior in NCAA Basketball Tournament Betting Pools.” Journal of Economic Behavior & Organization, 96:159–172. Niemi, J.B. (2005). “Identifying and Evaluating Contrarian Strategies for NCAA Tournament Pools.” Master’s thesis, Division of Biostatistics, University of Minnesota. Schwertman, N.C.; McCready, T.A.; and Howard, L. (1991). “Probability Models for the NCAA Regional Basketball Tournaments.” The American Statistician, 45:35–38. Schwertman, N.C.; Schenk, K.L.; and Holbrook, B.C. (1996). “More Probability Models for the NCAA Regional Basketball Tournaments.” The American Statistician, 50:34–38. Stern, H. (1991). “On the Probability of Winning a Football Game,” The American Statistician, 45:179–183.
Gender Differences at the Executive Level: Perceptions and Experiences Kris Moore, Dawn Carlson, Dwayne Whitten, and Aimee Clement
This survey of men and women executives sought to answer questions about differing perceptions of women in the workplace and experiences of men and women executives. The researchers focused on two major arenas. First, an examination was done of the difference in gender regarding perceptions of working for a woman, what makes a successful executive woman, and personal aspects of a businesswoman. Second, an investigation into the difference men and women perceive in their experiences of work-family conflict and enrichment and in multiple dimensions of satisfaction was performed.
Methodology
A survey questionnaire was distributed to 2,000 executives, divided equally between females and males (inferred by name), selected by systematic sampling from an alphabetical list of approximately 75,000 people in the 2004 Standard & Poor's Register of Corporations, Directors, and Executives and approximately 56,000 in the 2005 Dun & Bradstreet Reference Book of Corporate Management. Dividing the number of possible individuals (131,000) by the desired sample of 2,000 meant selecting every 65th or 66th person. Of the 2,000 individuals selected, 153 surveys were returned because of an incomplete address or because the person was no longer at the address. In addition, 197 "not participating" forms were completed and returned. A total of 292 people responded, for an overall response rate of 17.7%, with 100 males (11.7%) and 192 females (22.5%) responding. Of the completed responses, eight were removed because their salaries were under $50,000, which we did not feel represented our executive sample, resulting in a final sample size of 284. There were 204 respondents to the first mailing and 80 respondents to the second. One test for nonresponse bias assumes nonrespondents resemble late respondents. Early responders were those responding to the first mail-out, while late responders were those responding to the second mail-out. A comparison of the two groups using a t-test for two independent samples found no significant differences, which is indicative of little or no nonresponse bias. Other information was not available to compare respondents to nonrespondents.
On average, males were about 4.6 years older than female respondents, had 5.6 more years of experience, and were more likely to be married. The distribution of education level for males and females was similar. As seen in Figure 1, the distribution of incomes was significantly different. The median income for women was approximately $173,000; whereas, the median income for men was more than $200,000.
Figure 1. Salaries by gender: percentage of respondents in each of four salary brackets ($50,001–$100,000, $100,001–$150,000, $150,001–$200,000, and over $200,000); 52% of male respondents reported salaries over $200,000.
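For readers who want to run this kind of check on their own surveys, here is a minimal sketch in Python of the systematic-sampling interval and the early- versus late-responder comparison described above. The data and variable names are illustrative assumptions, not the authors' data or code.

```python
# A sketch of the early- vs. late-responder comparison described above,
# using simulated data (not the survey's); names and values are illustrative.
import numpy as np
from scipy import stats

interval = 131_000 / 2_000        # systematic sampling: every 65th or 66th name
rng = np.random.default_rng(0)

early = rng.normal(loc=52, scale=8, size=204)   # e.g., ages of first-mailing respondents
late = rng.normal(loc=52, scale=8, size=80)     # e.g., ages of second-mailing respondents

# Two-independent-sample t-test; a nonsignificant difference is taken as
# (weak) evidence against nonresponse bias on this variable
t_stat, p_value = stats.ttest_ind(early, late)
print(f"sampling interval = {interval:.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```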
Working for a Woman
The initial question in the survey was a follow-up to a survey conducted 40 years ago. The question asked individuals how comfortable they felt working for a woman. For the questionnaire, a scale of 1 to 5 was used, with 1 representing "strongly disagree," 3 representing "neutral," and 5 representing "strongly agree." Based on the data collected in 2005, about 71% of men indicated they were comfortable working for a woman. This percentage has increased steadily over a 40-year period. In 1965, only 27% of men expressed comfort working for a woman. Seventy-six percent of women in 2005 expressed comfort working for a woman. Interestingly, this proportion has varied little from 1965 to the present. Thus, even in today's society, women feel slightly more comfortable working for women than men do, although
the gap has greatly narrowed and is no longer statistically significant. Nonetheless, this discrepancy between how men and women feel about working for a female executive prompted further investigation in an attempt to understand the differing perceptions of males and females.
Perceptions of Successful Executive Women
Three questions in the survey asked the executives about their perceptions of women at the executive level. Females were more likely than males to perceive that different expectations exist for executive-level women and that women must be exceptional to succeed in today's business world. Furthermore, female respondents believed more often that women are judged more critically than men at the executive level. These differences in perception are not surprising. A 2003 Catalyst survey (www.catalyst.org) revealed a perceptual gap between Fortune 1000 CEOs and female executives with regard to barriers facing businesswomen. Between 60% and 80% of females surveyed identified several organizational, cultural, and personal factors that hinder women from reaching the executive level. The CEOs, mostly male, acknowledged these challenges but considered them far less significant. These survey findings are consistent with our research, suggesting that female executives perceive a woman's business success differently than do male executives.
Perceptions of Personal Aspects of Businesswomen
According to sources cited in the references, because masculinity remains the socially preferred role for managers, a businesswoman may climb higher on the corporate ladder if she adopts a masculine gender role. Stereotypical male traits such as assertiveness and individualism may be considered key for business success; whereas, stereotypical feminine traits such as empathy and relationship-focus may be deemed less important. Our study (Item 4 in Figure 4) reveals that the women in this study were more likely than the men to believe that highly competitive businesswomen are viewed unfavorably. Regardless of how competitive women are perceived in a business organization, research suggests executive women display more masculinity than lower-level female managers. Some research suggests that feminine disposition and experience may be an asset for an organization, enabling a businesswoman to add value. Juggling two roles—at home and at work—improves a female manager's ability to multitask, enriches her interpersonal skills, and develops her leadership experience, consequently enhancing her effectiveness as a manager. Furthermore, a woman brings beneficial diversity to her workplace because her personal interests and individual experiences provide fresh ideas and new cultural aspects. Our findings, seen in items 5 and 6 of Figure 4, suggest women see more uniqueness and value in the organizational contributions of women than men do. Female executives were more likely than males to believe women make valuable contributions to
management. Similarly, females believed more strongly than males that a woman’s unique temperament represents a strong advantage for an organization. Thus, females more strongly agree that it is hard for women in business, but that women have something unique to offer.
Figure 2. Comfortable working for a woman: percentage expressing comfort in 1965 (men 27%, women 75%) and 2005 (men 71%, women 76%).
Figure 3. Differing perceptions of successful executive women (mean agreement on the 1–5 scale). Q1: I believe there are different expectations of women than men at this level. Q2: A woman has to be exceptional to succeed in business today. Q3: I feel women are judged more critically than men at my level. Female means (3.79, 3.53, and 3.35) exceeded male means (2.98, 2.88, and 2.84) on all three items.
Figure 4. Unique personal aspects of businesswomen (mean agreement on the 1–5 scale). Q4: Highly competitive women are often viewed unfavorably in business today. Q5: Women can and do make unique and valuable contributions to management. Q6: The unique temperament of a woman is a strong advantage for the organization. Female means (4.59, 3.89, and 3.81) exceeded male means (4.38, 3.47, and 3.26) on all three items.
Figure 5. Work–life conflict and enrichment (mean agreement on the 1–5 scale). Q7: My work takes up time that I'd like to spend on nonwork activities. Q8: After work, I come home too tired to do some of the things I'd like to do. Q9: My home life helps me relax and feel ready for the next day's work. Q10: Talking with someone at home helps me deal with problems at work.
Experience of Work-Family Conflict and Enrichment
Next, we looked into how males and females experience the workplace differently. Work-family conflict occurs when role pressures from work and family are incompatible in some respect. Work-family enrichment occurs when experiences in one role improve the quality of life in the other role. Thus, work can spill over both negatively and positively onto family and family can spill over both negatively and positively onto work. Prior work-family conflict research shows that women believe work interferes with family significantly more than men believe it does. Although both genders clock comparable hours spent
on paid work, women report considerably more hours of family work. This is consistent with gender-role expectations of women bearing the primary responsibilities of the household. Furthermore, consistent with gender-role expectation of role responsibilities, this translates into identity issues. Men are more likely to get their identity from the work domain, as they see this as their primary responsibility. Questions 7 and 8 in Figure 5 tap into work to family conflict and support the notion that females experience higher levels of work interference with family than do males. Enrichment, or the positive side of the work-family interface, has received little empirical attention to date, and there is no support for gender differences. Our research, shown in questions 9 and 10 of Figure 5, examined family to work enrichment and found inconclusive results. In the first question, it appears men are able to get more relaxation to transfer from home to work than women. Again, this may be due to gender-role expectations; the fact that men have lower role expectations in the family domain may allow them to get greater relaxation in that domain. Question 10, however, found no differences in the experience of family to work enrichment, which may, in part, be because men and women do not differ on this form of enrichment. Nonetheless, it is clear that males and females deal with the competing demands of work and family uniquely.
Experience of Satisfaction: Job, Family, and Work-Life Balance
Figure 6. Satisfaction with job, family, and work–life balance (mean agreement on the 1–5 scale). Q11: All in all, I am satisfied with my job (males 4.04, females 4.03). Q12: Overall, I am satisfied with my family life (males 4.28, females 3.88). Q13: I am very satisfied with my level of work-life balance (males 3.55, females 3.16).
In order to further examine the experience of the work domain and how it differs for male and female executives, we examined three forms of satisfaction: job, family, and work-life balance. Not surprisingly, no significant difference was found in job satisfaction between male and female respondents. This finding is consistent with most empirical studies in this area. Female and male managers most often report similar levels of job satisfaction, as gender demonstrates no significant relationship with overall job satisfaction. However, males reported more satisfaction with family life and with their level of work-life balance than females. One explanation for the differences, as before, involves gender-role expectations. Women may perceive their role in the home carries with it a significant responsibility to care for their families; whereas, men may expect to bear fewer "second-shift" responsibilities at home. Men, however, may experience greater satisfaction in these areas because they perceive less inconsistency between their role expectations and their ability to fulfill those expectations. Further, because many of the nonwork role expectations fall to women, it is more difficult for them to balance competing demands and they are more likely to have lower satisfaction with their level of
work-life balance. Thus, males generally have higher family and balance satisfaction, but no significant difference in gender is found for job satisfaction.
Conclusion
The current study was undertaken to examine the differing perceptions of males and females regarding women in business today. We found that over the last 40 years, the gap between men's and women's comfort with executive women has narrowed. However, there are still significant differences between men's and women's perceptions regarding several aspects of women in today's business world. First, female respondents consistently perceived more workplace challenges and difficulties for women than male respondents did. Females perceived more often than males that women face different expectations, more critical judgment, and the need to be exceptional in order to succeed. Second, we found that women see more uniqueness and value in the organizational contributions of women than men do. Third, this study found that females experience higher levels of work interference with family than do males. Results were inconsistent in terms of family-to-work enrichment. Lastly, no significant difference was found in job satisfaction between female and male respondents. However, men did experience greater family satisfaction and satisfaction with work-life balance.
As females join the executive ranks of organizations, the gap between males and females narrows. We looked at the gap that still remains in both perceptual and experiential factors. First, we found there is still a significant gap, and a great deal of variance, in the perceptions each gender holds regarding women. Second, we found that gender differences still exist in how men and women manage the competing demands of the family domain, with women finding this more challenging than men.
Further Reading
Gutek, B.A.; Searle, S.; and Klepa, L. (1991). "Rational Versus Gender Role Explanations for Work-Family Conflict." Journal of Applied Psychology, 76(4):560–568.
Kirchmeyer, C. (1998). "Determinants of Managerial Career Success: Evidence and Explanation of Male/Female Differences." Journal of Management, 24:673–692.
Manning, T. (2002). "Gender, Managerial Level, Transformational Leadership, and Work Satisfaction." Women in Management Review, 20:368–390.
Ruderman, M.; Ohlott, P.; Panzer, K.; and King, S. (2002). "Benefits of Multiple Roles for Managerial Women." Academy of Management Journal, 45(1):369–398.
Wellington, S.; Kropf, M.B.; and Gerkovich, P.R. (2003). "What's Holding Women Back?" Harvard Business Review, June:18–19.
Further Evidence of Gender Differences in High School–Level Computer Literacy Christopher Sink, Matthew Sink, Jonathan Stob, and Kevin Taniguchi
The so-called digital divide is a well-documented problem among various U.S. subpopulations, including those defined by ethnicity, age, and economic level. Classroom educators, however, may not fully recognize these disparities in technological skills as they extend to school-age boys and girls. Joel Cooper's recent review of the computer technology literature,
published in the Journal of Computer Assisted Learning, documents this alarming fact. Specifically, research continues to show that women and girls are less proficient in information technology skills and report less positive attitudes toward technology than men and boys. The gender gap emerged more than two decades ago with the widespread use of personal computers in schools. As
far back as 1985, articles published in the professional journal Sex Roles documented this educational concern. For instance, one seminal study by Gita Wilder, Diane Mackie, and Cooper explored gender attitudes toward computer technology with school children. Males, beginning in kindergarten and persisting through high school, reported better attitudes toward computing than their female peers. Interestingly, this attitude gap became quite noticeable around the fifth grade, just as students were probably relying more on their computing skills to do their classroom assignments. Another investigation by Marlaine E. Lockheed showed male adults and children use computers more often for programming and game-playing than females. No major gender differences were reported, however, in the use of various computer applications. Also, Mark Fetler reported in a 1985 article published in Sex Roles the findings of a statewide survey of computer science and computer literacy knowledge, attitudes, and experiences conducted with sixth- and 12th-grade California public school students. Not only did the boys consistently outperform the female respondents, they also reported more experience with computers, both at home and at school. Perhaps this latter finding explains, in part, why boys had stronger positive attitudes toward computer usage than their female peers. More recent computer literacy–related research provides further evidence of the digital divide. For example, Eric C. Newburger reported on a study by the U.S. Census Bureau in 2000 in which males spent more time (hours) on their home computers, largely using the computer for recreational purposes, than did females. High-school males also employed the computer more at school than females, suggesting that adolescent boys tend to use computers more for academic aims. In 2003, Matthew DeBell of the U.S. Department of Education studied the rates of computer use by children up to 12th grade, reporting that 91% of each gender used the computer on a regular basis and at about the same rate. DeBell failed to discuss which gender spent more time on different computer applications, nor did he examine computer literacy issues between genders. Reasons for the digital divide between boys and girls are multifarious, ranging from students' cultural histories to various psychosocial explanations.
Cooper's literature review largely attributes the gender differences to computer anxiety situated in social developmental differences between boys and girls, societal stereotypes of what is appropriate for the two genders, and gender-specific attributional patterns. He states, "These factors are intertwined to create the expectation that computers are the province of boys and men, not girls and women." The stereotype phenomenon Cooper discusses was further explored in two studies with middle-school students—"Images of Self and Others as Computer Users: The Role of Gender and Experience," published in the Journal of Computer Assisted Learning in 2006, and "Gender and Computers II: The Interactive Effects of Knowledge and Constancy on Gender-Stereotyped Attitudes," published in Sex Roles in 1995—showing the boys in the sample were far less negatively biased toward computer science overall than the female group. Linked to these explanations is the relationship between computer experience and computer attitudes. A major European study of elementary school teachers by Johan van Braak, Jo Tondhur, and Martin Valcke showed gender was strongly correlated with different computer experience variables, with male teachers reporting on average more computer experience, more intensive use of technology, and more favorable general computer attitudes. This latter investigation provided further insight into why girls may be less inclined to attain higher levels of computer literacy and appreciate information technology; male teachers, as opposed to their female counterparts, integrated computers more often into their primary school classrooms. As a result, boys may be modeling their male teachers, seeing computers as a valuable tool for learning. In sum, schoolchildren's computer skills and attitudes appear to be influenced by their home and learning environments, culture, peer groups, and various psychosocial factors. The purpose of our study was to examine potential gender differences in computer literacy among high-school students attending small, private Christian high schools. While multiple studies have been conducted on this topic with different age groups and grade levels, none that we could locate examines the issue with this specific student population.
We assumed the gender digital divide exists in this type of private school, but there was no clear empirical evidence to substantiate this belief. Thus, our study was designed to test two research hypotheses: that there is a statistically significant difference in overall computer literacy scores between female and male high-school students, and that significant differences would be found between boys and girls on various subdimensions of the computer literacy survey.
Method
Design and Participants
A survey research method was used to test these suppositions using intact groups of 9th (n = 27), 10th (n = 22), 11th (n = 20), and 12th graders (n = 15) in a small, private, faith-based high school in a suburban area of Seattle, Washington. About 81% (84/104 = 0.81) of the entire school population (N = 104) completed the survey. Computer literacy scores from two surveys were unusable due to missing data. Of the respondents, 43 were female. Gender frequency for each grade was largely similar. Ages ranged from 14 to 19 years.
Instrumentation
A researcher-constructed, 21-item, fill-in-the-blank computer literacy measure (see supplemental material at www.amstat.org/publications/chance) was designed to test six key dimensions of adolescent school and home computer skill sets. To discriminate between different male and female computer literacy skill levels, the following computing domains were included: word processing, internet knowledge, instant messaging (i.e., IM)/email use, downloading music, gaming, and computer programming. As an indicator of the instrument's content validity, survey items were based, to some extent, on previous computer literacy research and feedback from the school's computer teacher and other computer-proficient educators in the building. An attempt also was made to maximize ecological validity of the measure; that is, the survey items needed to reflect those computer literacy skills considered to be important in the current school setting and the local community and colleges. Participants answered items in four sections: (1) basic demographic information; (2) general computer usage; (3) five-point Likert scale (1 = very limited skills to 5 = very strong skills) items asking for estimates of computer skill level in each of the six areas above; and (4) the 21 computer literacy test items. Each computer literacy skill dimension (i.e., six specific areas and one general skill dimension) comprised one simple (s), one intermediate (i), and one difficult (d) question. Questions were scored as two points for a correct answer, one point for a partially correct response, and zero points for an incorrect answer. Each intermediate and difficult item was then weighted as follows: i = raw score x 2 pts and d = raw score x 3 pts. For instance, a correct answer on a simple question received a score of two and a correct answer on a difficult question received six points (2 [raw score] x 3 [weighting] = 6). Subjectivity in scoring was kept to a minimum by ensuring all scorers reached consensus on the points allotted to each survey question. Weighted scores on each dimension could range from 0 to 12. Thus, students received a total (General Computer Literacy) score and six dimension scores. A draft version of the instrument was piloted with a random sample of six junior-high students and five high-school teachers, including the instructor for the computer lab. Adjustments to the survey were made based on their feedback.
Data Collection Procedures
Tests were largely individually administered to participants by one of three trained student researchers. The administration process and procedures were closely monitored by the first author. About 10 of the surveys were group-administered in the computer lab or library/study hall room. Most students required five to 10 minutes to complete the test.

Table 1 — Descriptive Statistics and T-Test Results for Computer Literacy Survey Dimensions

Survey Area               Male Mean (SD)    Female Mean (SD)
Word Processing           4.41 (3.03)       3.76 (3.12)
Internet Use              5.77 (3.80)       3.88 (3.64)
IM/Emailing               7.59 (4.17)       7.24 (4.32)
Downloading Music**       8.23 (3.62)       4.81 (3.45)
Gaming**                  6.08 (4.65)       0.67 (1.86)
Computer Programming**    3.21 (3.77)       0.95 (1.65)

Note. ** p < .001. Male n = 40, female n = 42; t-tests (two-tailed) were computed for each of these six dimensions with the alpha level (p < .008) corrected for possible Type I error using the Bonferroni method.
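To make the weighted scoring rule described under Instrumentation concrete, here is a minimal sketch. The function and dictionary names are our own illustrative assumptions, not the authors' code.

```python
# A sketch of the weighted scoring rule described above. Each dimension has one
# simple, one intermediate, and one difficult item, each scored 0, 1, or 2 raw
# points; intermediate items are weighted x2 and difficult items x3.
WEIGHTS = {"simple": 1, "intermediate": 2, "difficult": 3}

def dimension_score(raw_points: dict[str, int]) -> int:
    """Weighted score for one survey dimension; ranges from 0 to 12."""
    for level, pts in raw_points.items():
        if level not in WEIGHTS or pts not in (0, 1, 2):
            raise ValueError("each item must be scored 0, 1, or 2")
    return sum(WEIGHTS[level] * pts for level, pts in raw_points.items())

# Example: fully correct simple item (2), partially correct intermediate (1),
# incorrect difficult item (0)  ->  1*2 + 2*1 + 3*0 = 4
print(dimension_score({"simple": 2, "intermediate": 1, "difficult": 0}))
```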
Women in Computer Science: The Experience at Carnegie Mellon University In the 1990s, Carnegie Mellon University’s Computer Science Department made fundamental changes to improve their program for all students and to bring more women undergraduates into computer science. The story of what they did and results to date can be found in the following: Margolis, J., and Fisher, A. (2002). Unlocking the Clubhouse: Women in Computing. Cambridge, Massachusetts: The MIT Press. See the papers link in the right column of http://women.cs.cmu.edu.
Results
To determine students' current patterns of computer usage, background questions were asked. All respondents possessed a home computer. The majority of students (n = 74, 88%) used their home computer most frequently (10 used school computers most often). Section 2 asked respondents to estimate how much time they spent in each of the six skill areas by rank-ordering the six areas from the greatest (1) to the least (6) amount of time. For males and females combined, of those dimensions marked most often used, internet was ranked first (n = 45), instant messenger/email second (n = 17), word processing third (n = 3), gaming fourth (n = 8), downloading music fifth (n = 1), and programming sixth (n = 0). Some gender-specific rankings were found. For example, girls reported they did more music downloading than boys. Frequency counts further suggested boys and girls are relatively similar in their internet use, emailing/instant messaging, gaming, and programming. Both genders do minimal programming.
Figure 1. Mean weighted scores plotted by gender for the six survey dimensions (word processing, internet use, instant messaging/emailing, downloading music, gaming, and computer programming).
Table 2 — Intercorrelation Matrix Among Computer Literacy Survey Dimensions—Females (n = 42). [The table reported the pairwise correlations among the six survey dimensions; for females, the correlations ranged from -.08 to .53, and only the three largest reached statistical significance.] Note. * p < .05, ** p < .01; IM = instant messaging.
Table 3 — Intercorrelation Matrix Among Computer Literacy Survey Dimensions—Males (n = 40). [For males, the pairwise correlations ranged from .05 to .68, and all but the smallest were statistically significant.] Note. ** p < .01; IM = instant messaging.
Items from Section 3 asked for students' perceptions of their skill level in six tasks and general computer literacy. Aggregating male and female rankings across grade levels, overall students were most confident in their instant messaging and emailing skills, followed by internet usage, word processing, gaming, general skills, and programming. Gender-specific frequency analyses of computer skill levels indicated males in general were more confident than females. Males rated themselves as possessing strong skills more frequently than did females in internet, downloading music, gaming, and general computer literacy. Girls and boys indicated similar levels of emailing/instant messaging proficiencies. Not unexpectedly, no student reported that programming was a strong skill area.
Descriptive statistics calculated for each computer literacy survey dimension are summarized in Table 1. Clear gender differences are evident, especially for music downloading, gaming, and programming (see Figure 1). Because the general skills dimension was found to be unreliable, correlating weakly with
actual performance scores (rmean = .12, p > .05), it was dropped from subsequent analyses. In general, participants answered correctly the questions designated as “simple,” but provided largely incorrect answers for the more challenging items. This observation provided further evidence that weighting item responses was needed. To examine whether these differences were statistically significant, six t-tests (two-tailed) were computed using the formula that assumes unequal variances. The family-wise alpha level was adjusted using the Bonferroni method from .05 to .008 in an attempt to correct for possible Type I errors. Significant gender differences favoring males were found for three dimensions: downloading music [t(82) = 4.23, p < .001, Cohen’s d = .99], gaming [t(82) = 7.01, p < .001, Cohen’s d = 1.53], and programming [t(82) = 3.54, p < .001, Cohen’s d = 0.82]. Results from the correlational analyses by gender are presented in Table 2 for females and Table 3 for males. Correlations near zero were not significant; otherwise, most correlations were positive in direction and statistically significant (p < .05). Compared to the girls’ response patterns, boys who were knowledgeable in one skill area tended to be more proficient in one of the other dimensions. Moreover, boys’ scores across the survey dimensions generally produced higher correlations.
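As an illustration of the comparisons reported above, the following sketch recomputes the gaming contrast from the rounded summary statistics in Table 1 and applies the Bonferroni threshold; because the table's means and SDs are rounded, the results will be close to but not exactly the published t(82) = 7.01 and d = 1.53.

```python
# A sketch of one of the six gender comparisons described above, computed from
# the rounded Table 1 summary statistics for the gaming dimension.
import math
from scipy import stats

m_mean, m_sd, m_n = 6.08, 4.65, 40   # males
f_mean, f_sd, f_n = 0.67, 1.86, 42   # females

# Two-tailed t-test from summary statistics (unequal-variance form)
t_stat, p_value = stats.ttest_ind_from_stats(
    m_mean, m_sd, m_n, f_mean, f_sd, f_n, equal_var=False
)

# Bonferroni-adjusted threshold for the six dimension-wise tests
alpha_per_test = 0.05 / 6            # about .008, as reported in the note

# Cohen's d using the pooled standard deviation
sd_pooled = math.sqrt(((m_n - 1) * m_sd**2 + (f_n - 1) * f_sd**2) / (m_n + f_n - 2))
d = (m_mean - f_mean) / sd_pooled

print(f"t = {t_stat:.2f}, p = {p_value:.2g}, significant: {p_value < alpha_per_test}")
print(f"Cohen's d = {d:.2f}")
```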
Discussion
The supposition underlying this research project was that high-school male students would generate higher computer literacy scores than their female peers. The results presented above suggest this hypothesis was substantiated. The boys had, on average, higher scores across each of the computer literacy categories. This trend was especially evident in the programming category, which was the most difficult, as well as in the gaming and downloading music dimensions. The findings here support earlier gender-related computer research conducted in other educational settings, where girls seem to lag behind boys in certain computer skills. Moreover, the results tentatively suggest adolescent females do not explore the full range of computer applications as much as males seem to.
These findings provide evidence that the computer classes at this particular high school should concentrate more on improving the computer literacy skill base of students, especially female learners. In programming—the one skill area most likely to be needed in college, but not commonly learned in high school—female scores were significantly lower than male scores. Although girls may require alternative ways to learn programming skills, both groups would benefit from further instruction and practical experience in this subject matter. Girls in computer classes might try interfacing with A.L.I.C.E. (Artificial Linguistic Internet Computer Entity) to learn computer programming. Or, they might try creating their own "chat bot" through the use of AIML (Artificial Intelligence Markup Language). Examples of how teachers have used A.L.I.C.E. in their coursework are available at www.alicebot.org/Scholarly.html. (In a 2007 Journal of Educational Computing Research article, Cathy Bishop-Clark, Jill Courte, Donna Evans, and Elizabeth V. Howard provide evidence of the value of using the ALICE programming language with introductory college computing students.) Moreover, in order to better prepare for college-level technology courses and demands, the school should consider offering higher-level computer courses and teach students helpful tips and shortcuts for better computer fluency. Limitations inherent in this causal-comparative study with intact groups diminish the generalizability of its findings.
In conclusion, this study's findings provide further evidence that girls in both public and private schools continue to lag behind boys in their computer literacy skills, particularly in the most challenging areas such as computer programming. Strategic interventions by computer teachers are clearly needed to remedy this deficiency. Also, for comparison purposes, standardized tests of computing literacy for the general high-school population need to be developed to supplement existing school-based evaluation tools.
Further Reading
Bishop-Clark, C.; Courte, J.; Evans, D.; and Howard, E.V. (2007). "A Quantitative and Qualitative Investigation of Using A.L.I.C.E. Programming to Improve Confidence, Enjoyment, and Achievement among Nonmajors." Journal of Educational Computing Research, 37:193–207.
Cooper, J. (2006). "The Digital Divide: The Special Case of Gender." Journal of Computer Assisted Learning, 22(5):320–334.
DeBell, M. (2005). "Rates of Computer and Internet Use by Children in Nursery School and Students in Kindergarten through Twelfth Grade: 2003." National Center for Education Statistics Issue Brief. Jessup, MD: U.S. Department of Education.
Fetler, M. (1985). "Sex Differences on the California Statewide Assessment of Computer Literacy." Sex Roles, 13(3/4):181–191.
Lockheed, M.E. (1985). "Women, Girls, and Computers: A First Look at the Evidence." Sex Roles, 13(3/4):115–122.
Mercier, E.M.; Barron, B.; and O'Conner, K.M. (2006). "Images of Self and Others as Computer Users: The Role of Gender and Experience." Journal of Computer Assisted Learning, 22:335–348.
Newman, L.S. and Cooper, J. (1995). "Gender and Computers II: The Interactive Effects of Knowledge and Constancy on Gender-Stereotyped Attitudes." Sex Roles, 33(5/6):325–351.
Newburger, E.C. (2000). "Home Computers and Internet Use in the United States" (U.S. Census Bureau). Retrieved November 28, 2007, from www.census.gov/population/www/socdemo/computer.html.
Stone, J.; Hoffman, M.; Madigan, E.; and Vance, D. (2006). "Technology Skills of Incoming Freshman: Are First-Year Students Prepared?" The Journal of Computing Sciences in Colleges, 21(6):117–121.
van Braak, J.; Tondhur, J.; and Valcke, M. (2004). "Explaining Different Types of Computer Use Among Primary School Teachers." European Journal of Psychology of Education, XIX:407–422.
Wilder, G.; Mackie, D.; and Cooper, J. (1985). "Gender and Computers: Two Surveys of Computer-Related Attitudes." Sex Roles, 13(3/4):215–228.
Visual Revelations
Howard Wainer,
Column Editor
Until Proven Guilty: False Positives and the War on Terror Sam Savage and Howard Wainer
Print and electronic media have been filled with debate concerning the tactics employed in the War on Terror. These must invariably walk the line between maintaining civil liberties and screening for possible terrorists. Discussions have typically focused on issues of ethics and morality. Is it ethical to eavesdrop on phone and email conversations; to use ethnic profiling in picking possible terrorists for further investigation; to imprison suspects without legal recourse for indefinite amounts of time; to make suspects miserable in an effort to get them to reveal cabals and plots? While we believe these are important questions to ask, we are surprised by how little of the debate has dealt with the likely success of these tactics. Given the obvious social costs, the efficacy of such surveillance programs must be clearly understood if a rational policy is to be developed. Perhaps the biggest barrier to public understanding of this problem surrounds the issue of false positives. Not only is this subject not intuitive, but getting it wrong can result in counter-productive policies. Bayesian analysis is the customary tool for determining the rate of false positives, and we will illustrate its use on a particularly vexing problem: the use of wiretaps to uncover possible terrorists living in our midst. The use of such wiretaps has been hotly debated. To evaluate their effectiveness, we must consider both the chance we will correctly identify a terrorist if we have one on the other end of the line and the overall prevalence of terrorists in the population we are listening in on. We shall start with the latter.
Figure 1. A target whose total area is 2,999,970 + 2,970 = 3,002,940 and whose bull's eye has an area of 2,970, representing those terrorists who would be identified by wiretapping. The bull's eye is the 99% of 3,000 true terrorists (2,970) who would be correctly reported; the rest of the target is the 1% of 299,997,000 nonterrorists (2,999,970) who would be falsely reported. The chance of hitting the bull's eye is 2,970 ÷ 3,002,940, about 0.1%, or 1 in 1,000. (From The Flaw of Averages by Sam Savage, in preparation for John Wiley & Sons, all rights reserved.)
How many terrorists are currently in the United States? (By terrorists, we mean hard-core extremists intent on mass murder and mayhem.) Let us assume for argument's sake that among the 300,000,000 people living in the United States, there are 3,000 terrorists. Or, in other words, the prevalence is one terrorist per 100,000. Once this case is understood, it will be easy to generalize for other numbers. Now, consider a magic bullet for this threat: unlimited wiretapping tied to advanced voice analysis software on everyone's line that could detect would-be terrorists within the utterance of their
first three words on the phone. The software would automatically call in the FBI, as required, to arrest and question those who triggered the terror detector. Let’s assume the system was 99% accurate. That is, if a true terrorist was on the line, it would notify the FBI 99% of the time, while for nonterrorists, it would call the FBI (in error) only 1% of the time. Although such detection software probably could never be this accurate, it is instructive to think through the effectiveness of such a system if it did exist.
The False Positive Problem
When the FBI gets a report from the system, what is the chance it has identified a true terrorist? There are two possibilities. Either there has been the correct report of a true terrorist or the false report of a nonterrorist. Of the 3,000 true terrorists, 99%—or 2,970—would be correctly identified. Of the 299,997,000 nonterrorists (300 million
minus the 3,000 terrorists), only 1%—or 2,999,970—would be falsely reported. This may be thought of as a target whose total area is 2,999,970+2,970 and whose bull’s eye has an area of 2,970, as shown in Figure 1. Thus, the chance of correctly identifying a true terrorist is analogous to hitting the bull’s eye with a dart—roughly one in a thousand, even with a 99% accurate detector. If there were fewer than 3,000 terrorists, this probability would decrease still further. And even if the number of terrorists went up tenfold to 30,000, the chances of a correct identification would still be only 1 in 100. What looked at first like a magic bullet doesn’t look as attractive once we realize that about 999 out of 1,000 suspects are innocent. This is the “false positives” problem. But, of course, we started with an extreme case for dramatic effect. In reality, the bad news is that your terrorist detector would be nowhere near 99%
effective. But the good news is that you would be much more selective in who you wire tapped in the first place. So, in reality, the accuracy would be lower, but the prevalence could be expected to be higher. Suppose you can sufficiently narrow your target population to the point that the prevalence is up to one actual terrorist per 100 people wiretapped. Also, assume a 90% effective test. That is, a true terrorist will be correctly identified 90% of the time and an innocent person 10% of the time. The chance of a false positive is still 91.7%. That is, even when someone triggers an arrest, the odds are 11 to 1 they are not a terrorist. Table 1 shows the probability that someone implicated by a terror screening system is actually innocent, based on the prevalence of terrorists in the screened population and the accuracy of the test. In this table, the accuracy is defined as both the probability a true terrorist is identified as such and an innocent person is identified as innocent. There is no need for these two probabilities to be the same, and later we provide a calculator that will solve the general case of asymmetric accuracy. Upon calculating this table, even the authors were surprised to find that if 1 person in 10 were an actual terrorist, and if the screen were 90% accurate, you would still have as many innocent suspects as guilty ones. For example, consider a battle with insurgents who make up 10% of a population. Table 1 implies that if you call in air strikes on suspected enemy positions based on targeting intelligence that is 90% accurate, you will be killing as many civilians as combatants. This result suggests arresting and prosecuting terrorists in our midst is a real challenge. A fascinating web site at Syracuse University provides real statistics in this area. According to the Transactional Records Access Clearinghouse at http://trac.syr.edu/tracreports/ terrorism/169, in recent federal criminal prosecutions under the Justice Department Program of International Terrorism, roughly 90% of the cases brought in have been declined for further action.
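The arithmetic behind these statements, and behind Table 1, is just Bayes' rule. Here is a minimal sketch, assuming (as the table does) that the screen is equally accurate at flagging terrorists and at clearing innocents; the function name is our own.

```python
# A sketch of the false positive calculation described above.
def prob_innocent_given_flag(prevalence, sensitivity, specificity):
    """P(not a terrorist | flagged), by Bayes' rule."""
    p_true_flag = prevalence * sensitivity                # terrorist and flagged
    p_false_flag = (1 - prevalence) * (1 - specificity)   # innocent but flagged
    return p_false_flag / (p_true_flag + p_false_flag)

# 3,000 terrorists among 300 million, 99% accurate detector:
print(prob_innocent_given_flag(3_000 / 300_000_000, 0.99, 0.99))   # ~0.999 (999 in 1,000)

# 1 actual terrorist per 100 people wiretapped, 90% accurate test:
print(prob_innocent_given_flag(1 / 100, 0.90, 0.90))               # ~0.917

# Reproducing one cell of Table 1: 10 screened per actual terrorist, 80% accuracy
print(round(100 * prob_innocent_given_flag(1 / 10, 0.80, 0.80), 1))  # 69.2
```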
Table 1 — Percentage of False Positives by Prevalence and Indicator Accuracy

Prevalence (number screened          Accuracy of Screen
per actual terrorist)           50%       60%       70%       80%       90%
100,000                       100.0%    100.0%    100.0%    100.0%    100.0%
10,000                        100.0%    100.0%    100.0%    100.0%     99.9%
1,000                          99.9%     99.9%     99.8%     99.6%     99.1%
100                            99.0%     98.5%     97.7%     96.1%     91.7%
10                             90.0%     85.7%     79.4%     69.2%     50.0%

The Cost of False Positives
It is tempting for politicians to play off our fears of horrific, but extremely unlikely, events. When they do, it is
easy for the nonstatistically trained to fall for faulty logic. For example, a few years ago, a supporter of an anti-missile system for protecting the United States from rogue states argued that the specter of a nuclear weapon destroying New York was so horrible that the U.S. government should stop at nothing to deter it. Oddly, he didn’t bring up the fact that it is much more likely such a weapon would be delivered to New York by ship, and that a missile attack—aside from being much more expensive and difficult—is the only delivery method providing a definitive return address for our own nuclear response. Interestingly, recent studies have shown that an effective program for detecting weapons of mass destruction smuggled on ships would cost about as much as an anti-missile system. So, here, the question is easy. Should we defend against a likely source of attack, or a rare one? But, in general, the decision will be how much incursion of our personal freedom we are willing to endure per life saved. In making this calculation, we must be mindful of the extent to which the harsh treatment of innocents can create terrorists. For example, in a 2003 memo, then U.S. Defense Secretary Donald Rumsfeld asked, “Are we capturing, killing, or deterring and dissuading more terrorists every day than the madrassas and the radical clerics are recruiting, training, and deploying against us?” We must add to our cost function the chance that the prosecution of the innocent will bolster the recruiting efforts of our enemies. Is this probability algebra limited to the War on Terror? Consider what would be the likely outcome if we had universal
AIDS testing. Because the test is far from perfect, we would surely find that the number of false positives would dominate the number of AIDS cases uncovered. And how much of the agony associated with receiving such an incorrect diagnosis would compensate for finding an otherwise undetected case? Our point is not that all testing, whether for disease or terrorism, is fruitless; only that we should be aware of the calculus of false positives and use whatever ancillary information is available and suitable to shrink the candidate population and probability of errors enough so that the false positive rates fall into line with a reasonable cost function.
The False Positive Probability Calculator
Calculating the rate of false positives involves comparing the ratio of areas in the target of Figure 1. This is usually done with Bayes' formula, which is known to all statisticians, but apparently to few policymakers. Thus, as a service to mankind, we have created a false positive calculator in Microsoft Excel, available for free download at www.FlawOfAverages.com. This calculator was created with XLTree (see www.AnalyCorp.com). First, a probability tree was generated based on the prevalence and probabilities shown in Figure 2. This tree was then inverted (flipped) to perform what is known as Bayesian Inversion. Figure 3 displays the formula section of the calculator, along with the other outputs in the dark boxes and intermediate calculations in grey boxes.
Figure 2. False positive calculator, www.FlawOfAverages.com. Example inputs: prevalence of trait X in the screened population, 1 in 100; probability that trait X is correctly detected, 90%; probability that lack of trait X is correctly detected, 95%. Output: probability of a false positive, 84.62%.
Figure 3. Trees underlying the false positive calculator (courtesy of Sam Savage; model developed using XLTree, available at www.AnalyCorp.com). The forward tree combines the inputs into joint probabilities (true positive 0.90%, false negative 0.10%, false positive 4.95%, true negative 94.05%); inverting the tree gives a 5.85% chance of a positive test, an 84.62% chance that a positive result is a false positive, and a 0.11% chance that a negative result is a false negative.
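For readers without Excel, a few lines of Python reproduce the calculator's Bayesian inversion. This is an illustration of the same calculation, not the spreadsheet itself; it allows the two accuracy inputs to differ, as the calculator does.

```python
# A sketch of the Bayesian inversion performed by the calculator in Figures 2-3.
def invert_tree(prevalence, p_detect, p_correct_negative):
    """Return P(positive test), P(false positive), and P(false negative)."""
    p_joint_true_pos = prevalence * p_detect
    p_joint_false_pos = (1 - prevalence) * (1 - p_correct_negative)
    p_joint_false_neg = prevalence * (1 - p_detect)
    p_joint_true_neg = (1 - prevalence) * p_correct_negative

    p_positive = p_joint_true_pos + p_joint_false_pos
    p_negative = p_joint_false_neg + p_joint_true_neg

    p_false_positive = p_joint_false_pos / p_positive   # lacks trait, given a positive test
    p_false_negative = p_joint_false_neg / p_negative   # has trait, given a negative test
    return p_positive, p_false_positive, p_false_negative

# The example shown in Figures 2 and 3: prevalence 1 in 100, 90% detection,
# 95% correct negatives
p_pos, p_fp, p_fn = invert_tree(1 / 100, 0.90, 0.95)
print(f"P(positive test)  = {p_pos:.2%}")    # 5.85%
print(f"P(false positive) = {p_fp:.2%}")     # 84.62%
print(f"P(false negative) = {p_fn:.2%}")     # 0.11%
```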
Last, we are certainly not the first statisticians to use the tools of our trade to look into the topic of terrorism. Nor even the first to use the pages of CHANCE to do so. The challenge is not to convince other statisticians, but to get decisionmakers to take both Type I and Type II errors into consideration when making policy. Toward this end, we believe embodying these ideas in a widely available calculator increases the chances of holding decisionmakers accountable. So, the next time you become aware of a politician or bureaucrat about to make a decision that may
bring more harm through false positives than benefit through true ones, we suggest you send them the link to the calculator.
Further Reading
Banks, D. (2002). "Statistics for Homeland Defense." CHANCE, 15(1):8–10.
Rumsfeld, D. (2005). "Rumsfeld's War-on-Terror Memo." USA Today, www.usatoday.com/news/washington/executive/rumsfeld-memo.htm.
Savage, S. (In preparation for John Wiley & Sons). The Flaw of Averages. www.FlawOfAverages.com.
Stoto, M.A.; Schonlau, M.; and Mariano, L.T. (2004). "Syndromic Surveillance: Is It Worth the Effort?" CHANCE, 17:19–24.
Wein, L.M.; Wilkins, A.H.; Baveja, M.; and Flynn, S.E. (2006). "Preventing the Importation of Illicit Nuclear Materials in Shipping Containers." Risk Analysis, 26(5):1377–1393.
Editorial: Oasis or Mirage? David A. Freedman
In a randomized controlled experiment, investigators assign subjects to treatment or control, for instance, by tossing a coin. In an observational study, the subjects assign themselves. The difference is crucial because of confounding. Confounding means a difference between the treatment group and the control group, other than the causal factor of primary interest. The confounder may be responsible for some or all of the observed effect that is of interest. In a randomized controlled experiment, near enough, chance will balance the two groups. Thus, confounding is rarely a problem. In an observational study, however, there often are important differences between the treatment and control groups. That is why experiments provide a more secure basis for causal inference than observational studies. When there is a conflict, experiments usually trump observational studies. However, experiments are hard to do, and well-designed observational studies can be informative. In social science and medicine, a lot of what we know—or think we know—comes from observational studies. Most studies on smoking are observational. Taken together, they make a powerful case that smoking kills. A great many lives have been saved by tobacco-control measures, which were prompted by observational studies. There are a few experiments; the treatments (e.g., counseling) were aimed at getting smokers to quit. Paradoxically, results from the experiments are inconclusive.
Even with studies on smoking, confounding can be a problem. Smokers die at higher rates from cirrhosis than nonsmokers. Cigarettes cause heart disease and cancer, but they do not cause cirrhosis. What explains the association? The confounder is drinking. Smokers drink more, and alcohol causes cirrhosis. Here, confounding can be handled by sorting people into groups by the amount they drink and the amount they smoke. At each level of drinking, there will be little association between smoking and the death rate from cirrhosis. By contrast, at each level of smoking, there will be a strong association between drinking and cirrhosis. This sort of analysis can be done by cross-tabulation. The key idea is making comparisons within smaller and more homogeneous groups of subjects. However, large samples are required. Furthermore, confounding variables are often hard to spot. (Generally, a confounder has to be associated with the effect, and with the factor thought to be causal.) With more variables, cross-tabulation gets complicated and the sample gets used up rather quickly. Applied workers may then try to control for the confounders by statistical modeling, which is our chief topic.
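A small simulation makes the point concrete. The numbers below are invented purely for illustration: in this toy model smoking does not cause cirrhosis, drinking does, and smokers drink more; the crude comparison shows an association with smoking that disappears within drinking strata.

```python
# A simulation sketch of the cross-tabulation idea above (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

heavy_drinker = rng.random(n) < 0.3
# Smokers drink more; drinking (not smoking) drives cirrhosis in this toy model
smoker = rng.random(n) < np.where(heavy_drinker, 0.6, 0.3)
cirrhosis = rng.random(n) < np.where(heavy_drinker, 0.02, 0.002)

def rate(mask):
    return cirrhosis[mask].mean()

print("Crude comparison:   smokers", round(rate(smoker), 4),
      " nonsmokers", round(rate(~smoker), 4))
for label, group in [("heavy drinkers", heavy_drinker), ("light drinkers", ~heavy_drinker)]:
    print(f"Within {label}: smokers", round(rate(group & smoker), 4),
          " nonsmokers", round(rate(group & ~smoker), 4))
```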
What Is a Statistical Model?
A statistical model assumes a relationship between the effect and the primary variable the investigators think of as the cause, as well as potential confounders. The objective is to get
statistical (if not experimental) control over the confounders, isolating the effect of the primary variable. Applied workers tend to use relationships that are familiar and tractable; linearity often plays a key role. Certain numerical features of the model are estimated from the data, for instance coefficients in a linear combination of variables. The investigators determine whether such coefficients are statistically significant, that is, hard to explain by chance. If the coefficient of the primary variable is significant and has the right sign, the causal hypothesis has been “verified” by the data analysis.
The Search for Significance
Significance testing is an integral part of modeling. This creates problems because significant findings can be due to chance. If investigators test at the 5% level and nothing is going on except chance variation, then 5% of the "significant" findings will be due to chance. In short, with many studies and tests, significant findings are bound to crop up. Journals often look for significant findings; authors oblige. There is tacit agreement to ignore contradictory results found along the way, and search efforts are seldom reported. In consequence, published significance levels are difficult to interpret. Given the null hypothesis, the chance of a spurious but significant finding can be held to a desired level, such as 5%. Given a significant finding, however, the chance of the null hypothesis being true is ill-defined—especially when publication is driven by the search for significance.
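A quick simulation illustrates the arithmetic: when every null hypothesis is true, tests at the 5% level still declare roughly 5% of comparisons significant. The setup below is illustrative, not drawn from any particular study.

```python
# A simulation sketch: 1,000 "studies" of pure noise, tested at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies, n_per_group = 1_000, 50

false_positives = 0
for _ in range(n_studies):
    a = rng.normal(size=n_per_group)   # no real effect anywhere
    b = rng.normal(size=n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(false_positives, "of", n_studies, "null comparisons were 'significant'")  # about 50
```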
The Validity of the Models
Fitting models is justified when the models derive from strong prior theory, or the models can be validated by data analysis.
(This is trickier than it sounds: high R²s and low p-values don't do much to justify causal inference.) In social science and medicine, the picture is often untidy. Do we have the right variables in the model? Are variables measured with reasonable accuracy? Is the functional form correct? What about assumptions on error terms? Does causation run in the direction assumed by the model? These questions seldom have satisfactory answers, and the list of difficulties can be extended. Assumptions behind models are rarely articulated, let alone defended. The problem is exacerbated because journals tend to favor a mild degree of novelty in statistical procedures. Modeling, the search for significance, the preference for novelty, and the lack of interest in assumptions—these norms are likely to generate a flood of nonreproducible results.
Case Studies
(i) Vitamins, fruits, vegetables, and a low-fat diet protect in various combinations against cancer, heart disease, and cognitive decline, according to many observational studies. After controlling for confounders by modeling, investigators find reductions in risk that are statistically significant. Dozens of big experiments have been done to confirm the findings. Surprisingly—or not—the experiments generally contradict the observational data. On balance, vitamin supplements are not beneficial. The low-fat diet rich in fruits and vegetables is not beneficial. (By contrast, there is good evidence to show the Mediterranean diet does protect against heart failure.) The chief problem with the observational studies seems to be confounding. People who eat five helpings of fruits and vegetables a day are different from the rest of us in ways that are hard to model.
(ii) Hormone replacement therapy protects against heart disease. According to its proponents, consistent evidence from more than 40 epidemiologic studies demonstrates postmenopausal women who use estrogen therapy after menopause have significantly lower rates of heart disease than women who do not take estrogen. However, large-scale experiments show that hormone replacement therapy is, at best, neutral. Again, the most plausible explanation is confounding. Women who take hormones are different from other women in ways that are not picked up by the models. Medical opinion changes slowly in response to data. Believers in hormone replacement therapy claim if you adjust using the "right" model, the observational studies agree with the experiments. Skeptics might reply that without the experiments, the modelers wouldn't know when to stop. In any case, the degree of agreement between the two kinds of studies is rather imperfect, even after the modelers have done what they can.
(iii) Get-out-the-vote campaigns. There are many attempts to mobilize voters by nonpartisan telephone canvassing. Do these campaigns increase the rate at which people vote? Statistical modeling suggests a big effect, but experimental evidence goes the other way.
(iv) Welfare programs. For two decades, investigators have compared experimental and nonexperimental methods for evaluating job training programs and the like. There are substantial discrepancies, and there does not seem to be any analytic method that reliably eliminates the biases in the observational data.
(v) Meta-analysis is often proposed as a way to distill the truth from a body of disparate studies. Systematic reviews of the literature can be useful. However, formal meta-analysis of observational studies—with effect sizes, confidence intervals, and p-values—often comes down to an unconvincing model for results from other models. Even at that, measurement problems can be intractable because many published reports lack critical detail. Such difficulties are rarely acknowledged. And what about unpublished reports? Conventional solutions to the "file drawer" problem depend on another layer of unconvincing assumptions. From a critical perspective, a great deal of meta-analysis appears to be problematic. From another perspective, the software is easy to use and results are welcome in the journals—especially when effects are highly significant and go in the right direction.
Needless to say, proponents of vitamins, low-fat diets, hormone replacement therapy, matching, modeling, and meta-analysis will disagree with every syllable (this sentence apart).
How Should Experimental Data Be Analyzed? Experimental data are frequently analyzed through the prism of models. This is a mistake, as randomization does not guarantee the validity of the assumptions. Bias is likely unless the sample is large, and standard errors are liable to be wrong. Before investigators turn to modeling, they should compare rates or averages in the treatment group and control group. As is to be expected, experiments have problems of their own. One is crossover. Subjects assigned to treatment may refuse, while subjects assigned to control may insist on doing something else. The intention-to-treat analysis compares results for those assigned to the treatment group with those assigned to the control group. This should be the primary analysis, as it avoids bias by taking advantage of the experimental design. Intention-to-treat measures the effect of assignment, rather than treatment. Under some circumstances, the effect of treatment can be estimated, even if there is crossover. After the intention-to-treat tables, there is room for secondary analyses that might illuminate the results or generate hypotheses for future investigation. For instance, investigators might fit models, or look for subgroups of subjects with unusually strong responses to treatment.
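As a sketch of the primary analysis recommended above, the following hypothetical example (function name and toy data invented for illustration, not taken from the article) computes the intention-to-treat contrast, comparing outcome rates by assigned group and ignoring the treatment subjects actually received; the as-treated contrast is shown only to emphasize what ITT deliberately avoids.

```python
# A hedged sketch of the recommended primary analysis; the function name and
# the toy data below are my own invention, not from the article.
import numpy as np

def itt_difference(assigned, outcome):
    """Intention-to-treat estimate: difference in mean outcome between
    subjects ASSIGNED to treatment (1) and to control (0), regardless of
    the treatment they actually received."""
    assigned = np.asarray(assigned)
    outcome = np.asarray(outcome)
    return outcome[assigned == 1].mean() - outcome[assigned == 0].mean()

# Toy data with crossover: 'received' differs from 'assigned' for two
# subjects, but the ITT analysis deliberately ignores 'received'.
assigned = np.array([1, 1, 1, 1, 0, 0, 0, 0])
received = np.array([1, 1, 0, 1, 0, 1, 0, 0])
outcome  = np.array([1, 0, 1, 1, 0, 1, 0, 0])

print("Effect of assignment (ITT):", itt_difference(assigned, outcome))

# The naive 'as-treated' contrast sorts subjects by what they actually did,
# which breaks the randomization and can introduce bias.
as_treated = outcome[received == 1].mean() - outcome[received == 0].mean()
print("As-treated contrast (biased in general):", as_treated)
```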
What About Subgroup Analysis? Richard Peto says you should always do subgroup analysis and never believe the results. (The same might be true of modeling.) In a large-scale study, whether experimental or observational, the principal tables should be specified in advance of data collection. After that, of course, investigators should look at their data and report what they see. However, the analyses that were not pre-specified should be clearly differentiated from the ones that were. Such recommendations might even apply to small-scale studies, although repeating the study may be an adequate corrective.
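A small simulation, again my own sketch rather than the author's, illustrates why unplanned subgroup findings deserve caution: with twenty subgroups and no true effect in any of them, about one "significant" result at the 0.05 level is expected by chance.

```python
# My own simulation of Peto's warning, not the author's: with 20 subgroups
# and no true treatment effect in any of them, roughly one subgroup is
# expected to clear p < 0.05 by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_arm, n_subgroups = 100, 20
false_positives = 0
for _ in range(n_subgroups):
    treated = rng.normal(size=n_per_arm)    # pure noise: no effect anywhere
    control = rng.normal(size=n_per_arm)
    _, p_value = stats.ttest_ind(treated, control)
    false_positives += p_value < 0.05

print(f"'Significant' subgroups out of {n_subgroups}: {false_positives}")
```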
Whose Data Are They Anyway? Social scientists are often generous in sharing their data. However, replicating the results—in the narrow sense of reproducing the coefficient estimates—is seldom possible. The data have been cleaned up, the exact form of the equations has been lost … In the medical sciences, it is seldom possible to get the data, or even any detail on the modeling, except for the version number of the statistical package. Studies are often run according to written protocols, which is good, but deviations from protocol are rarely noted. Some agencies make data available to selected researchers for further analysis, but the conditions are restrictive. The equations and data should be archived and publicly available. The scope of data analysis should not be restricted. Confidentiality of personal information needs to be protected, but this is a tension that can be resolved. Deviations from protocol should be disclosed. So should the search effort. Assumptions should be identified. Journal articles should explain which assumptions have been tested, and why untested assumptions are plausible. Authors have responsibilities here; so do the journals and funding agencies. Empirical studies are expensive and usually funded by tax dollars. Why shouldn’t taxpayers have access to the data? Even if crass arguments about money are set aside, isn’t transparency a basic scientific norm?
Peer Review Takes Care of It? Some experts think peer review validates published research. For those of us who have been editors, associate editors, reviewers, or the targets of peer review, this argument may ring hollow. Even for careful readers of journal articles, the argument may seem a little farfetched.
The Perfect Is the Enemy of the Good? There are defenders of the research practices criticized here. What do they say? Some maintain there is no need to worry about multiple comparisons. Indeed, worrying in public reduces the power of the studies and impedes the ability of the epidemiologist to protect the public, or the ability of the social scientist to assist the decision-maker. Only one thing is missing: an explanation of what the p-values might mean. Other defenders love to quote variations on George Box’s old saw: No model is perfect, but some models are useful. A moment’s thought generates uncomfortable questions. Useful to whom, and for what? How would the rest of us know? If models could be calibrated in some way, objections to their use would be much diminished. (As ongoing scholarly disagreements might suggest, calibration is not the easiest of tasks in medicine and the social sciences.) Until a happier future arrives, imperfections in models require further thought, and routine disclosure of imperfections would be helpful. A watermark might be a good interim step. Samples were shown earlier in this article.
Further Reading Supplemental material can be found at www.amstat.org/publications/chance.
Puzzle Corner
Thomas B. Jabine,
Column Editor
Statistical Spiral Cryptic No. 9 For this issue, we offer another spiral cryptic puzzle. The answers to the clockwise clues are to be entered in order, starting with space 1 and ending at space 100. The answers to the counterclockwise clues go in the opposite direction, starting with space 100. Words in one direction overlap those in the other direction. This isn’t as difficult as it might seem; when you get a word in one direction, it will help you with two or more words in the opposite direction. As usual, each clue for an entry contains two elements: a reasonably straightforward definition and a cryptic indication. The latter will be in forms such as anagrams, hidden words (parts of two or more words), puns, double definitions, homophones, and other arcane devices. It isn’t always obvious which part of the clue is the definition and which is the cryptic indication. Punctuation and capitalization are often meant to mislead. A one-year (extension of your) subscription to CHANCE will be awarded for each of two correct solutions chosen at random from among those received by the column editor by May 1, 2008. As an added incentive, a picture and short biography of each winner will be published in a subsequent issue. Please mail your completed diagram to Thomas B. Jabine, CHANCE Puzzle Corner Column Editor, 3158 Gracefield Road, Apt. 506, Silver Spring, MD 20904, or send him your solution (a listing of the answers to the clockwise clues will suffice) by email ([email protected]), to arrive by May 1, 2008. Please note that winners of the puzzle contest in any of the three previous issues will not be eligible to win this issue’s contest.

Past Winner Stephen Walter earned his PhD at the University of Edinburgh. After appointments at the University of Ottawa and Yale University, he joined the Department of Clinical Epidemiology and Biostatistics at McMaster University, Canada, where he is now professor. Walter collaborates with clinicians in evidence-based medicine, internal medicine, and developmental pediatrics. He also works with epidemiologists in environmental health, cancer etiology, and screening. He is interested in related areas of biostatistical methodology, particularly the design and analysis of medical research studies. Email: [email protected]
Results of the ‘Two-Word Loops’ puzzle This puzzle, which appeared in CHANCE, Vol. 20, No. 3, Page 64, asked readers to submit a list of all the two-word loops they could find. A two-word loop was defined as consisting of two seven-letter words, with the last three letters of each word being the first three letters of the other word. The example given was desires/resides. We have only one winner, Stephen Walter, who was the only reader to send in an entry. Was this puzzle too difficult, or just not to your taste? I would appreciate your views on this. Walter submitted a list of 54 pairs, of which I was able to verify 38 on the One Look Dictionary web site. Here they are:
ailment/entrail amperes/restamp bedotes/testbed calamus/musical cleaves/vesical derider/derider* desirer/rerides desires/resides genomes/mesogen heritor/torcher illudes/deskill inerter/tercine inerter/termine inerter/terrine ingenit/nithing ingesta/staging ingesta/staking ingesta/staling ingesta/staning
ingesta/staring ingesta/stating ingesta/staving ingesta/staying inglobe/obeying ingoing/ingoing* ingrave/avering insides/deskins insides/desmins insides/destins insures/reskins ionises/session manakin/kinsman manikin/kinsman pinxter/terrapin reddens/ensured redfins/insured redskin/kindred ternata/atafter
* A single word that can be used to form a loop was considered acceptable.
Column Editor: Thomas B. Jabine, 3158 Gracefield Road, Apt. 506, Silver Spring, MD 20904;
[email protected].
Statistical Spiral Cryptic No. 9
[Spiral puzzle grid: 100 numbered spaces arranged in a spiral; clockwise answers run from space 1 to space 100, counterclockwise answers from space 100 back toward space 1.]
Clockwise 1. Teaser transformed a saint. 7. Note: likely to be dependable. 15. Greek god is back for a rest. 18. Measure inner covering for bonnet maker. 26. Alliance has second agreement. 32. Unusual standard for sculptor. 35. Like a wren, e.g., with a cosine transformation. 41. From Rev. Eugene about Arthur’s wife. 50. Collect a type of cue. 54. Turn lights down, fail to understand. 59. Trajectory of environment. 64. Pitch to New England. 68. Most frequently a mold is broken. 73. Illuminated writings. 76. Wanderer, returning, plays Jason Bourne. 81. Beggar has ingredient for foo yung. 84. Romans were concealing the response. 90. After 100, I’m covered by an agreement for one-tenth. 97. Bound by a misguided plea.
Counterclockwise 100. Father and Ella make Spanish delicacy. 94. One thousand diamonds for rodents. 90. Barrymore took another card. 86. Disorganized gangs Ed captured. 79. Laotian mold used in production of wine. 68. Composer not confined to irregular time. 61. Foolishly, bid more for commonplace item. 54. Lob rope carelessly, a real mistake! 47. Unruffled by endless affair. 43. Copper and obliging servant mixed up about selective breeding. 36. So parson upset some choir members. 28. Get used to anger about Northeastern University. 23. In Boston, I’ll be unwilling. 19. With lean imp, form peer group. 12. In samba, I’ll find release. 8. Remove from Vera’s envelope. 3. Soak Butler, we hear.
Results for Statistical Cryptic Puzzle No. 21 Because there was no Puzzle Corner in Vol. 20, No. 4, we also present in this issue the results of the puzzle that appeared in Vol. 20, No. 2. Much to my surprise, I did not receive any correct solutions to Statistical Cryptic Puzzle No. 21. There were only two submissions and each had one error. We have declared these two submitters to be our winners. What happened? Were the clues too obscure? Was there insufficient time prior to the deadline? Or were the theme terms relating to statistical confidentiality unfamiliar to you? I would appreciate your feedback.
Solution to Statistical Cryptic Puzzle No. 21 This puzzle appeared in CHANCE, Vol. 20, No. 2, Spring 2007, pp. 63–64. The answers to the clues are: Across: 1. DRILL BIT [rebus: drill + bit] 5. LOSES [anagram: o (nothing) less] 8. STEPMOM [anagram: Mme. Post] 9. BUREAU [double definition: dresser, agency] 10. OLEO [rebus with container: o(le)o] 11. LITERAL [anagram: all tire] 12. SANDAL [reverse rebus: lad + NA’s] 15. SOFIA [rebus with container: S(of + i)A] 17. ATTIC [double definition: Athenian, loft] 18. PRIVACY [anagram: a VIP cry] 19. REESE [hidden word: fREE SEats] 20. EASES [rebus: creases – cr] 21. RESORT [container: res(or)t] 24. DISTAFF [rebus: DI + staff] 26. EMMA [truncated word: (l)emma] 27. SCORNS [container: s(corn)s] 28. DISTILL [container with reversal: di(sti)ll] 29. OASIS [rebus: OAS + is] 30. SYRINGES [anagram: Yes, grins] Down: 1. DISCLOSURE RISK [anagram: Kurd’s oil crises] 2. IDENTIFIERS [anagram: define it, sir] 3. LAMER [double definition: Debussy composition, of less substance] 4. IMMOLATION [anagram: Moonlit, I am] 5. LOBOS [rebus with reversal: lob + os] 6. SARAN [rebus: SA + ran] 7. PUBLIC USE FILES [anagram: Blue fuse clip is] 13. DATA SHARING [anagram: hard against] 14. GRANDDADDY [rebus: GR + and + daddy] 16. APE [reversal: EPA] 17. AYE [homophone: eye, aye] 22. OGRES [hidden word with reversal: LoSER GOes] 23. TESTS [rebus with reversal: tes + TS (Eliot)] 25. SUSHI [hidden word: CenSUS HIreling]
Past Winners A. Narayanan is a principal scientist in the Consumer Research Analytics group of The Procter & Gamble Company. He lives with his wife (Sunitha) and two children (Ritu and Arvind) in West Chester, Ohio. He does modeling at work and remodeling at home. Though he would prefer to be playing at Wimbledon, he loves watching tennis and plays with his children. He enjoys one thing for sure: cryptic crossword puzzles. He also likes to assemble jigsaw puzzles (with the help of his children, of course).
Stephen Ellis is a statistician at the Duke Clinical Research Institute in Durham, North Carolina. He received his bachelor’s degree in mathematics and physics from Washington University before earning a PhD in statistics from Stanford University. Previous work experience includes MathSoft (now Insightful Corp.) in Seattle, Washington. Away from work, he enjoys bicycling, basketball, and playing cards.