Editor’s Letter Mike Larsen, Executive Editor
Dear Readers,

This issue of CHANCE contains articles about history, probability, statistical modeling, messy data, risk, graphics, and sports. Actually, several of the articles concern sports in one context or another. The sports articles are not just for the sports enthusiast, however. These articles explore diverse statistical topics and methods and illustrate a variety of concerns in probability and statistics.

The first couple of articles concern historical themes and the development of statistical practice. Brian Clauser discusses the relationship between R. A. Fisher and Karl Pearson and the impact of strict adherence to set significance levels for hypothesis testing. The author suggests personal relations, copyright restrictions, and economic conditions related to World War I influenced the format of tables and subsequent statistical practice. The ‘file drawer problem’ and the registration of clinical trials prior to initiation are two related modern issues. In a related article, Stephen Stigler presents his view of the origin of the 5% significance level standard for hypothesis testing.

As I mentioned, many articles have sports themes. Two international contributions concern winter sports. Moudud Alam, Kenneth Carling, Rui Chen, and Yuli Liang investigate the progression of young skiers in Sweden. Their growth curve modeling adjusts for age and gender effects and produces individualized evaluations of progress. It is conceivable that such methods would be useful in health or education studies. Three of the authors of this paper are graduate students.

Bill Hurley’s sports example is the National Hockey League. His substantive topic is the birthday matching problem and probability calculations. Did you know birthdays of hockey players in the NHL are becoming increasingly nonuniform? Read the article to find out why and what impact this has on probability calculations.

Rachel Croson, Peter Fishman, and Devin Pope compare golf and poker. Is poker a game primarily of skill or of luck? The authors confront the difficulty of answering this question when data are available on only the top players.

Philip Price discusses a variant of a betting pool for college basketball’s March Madness single elimination tournament. Here, there are data that could be useful in estimating probabilities, expectations, and variability. These data and the questions raised in the article could be used to motivate teaching introductory concepts. Can your students devise a better strategy?

Brian Schmotzer examines the relation of leads and time remaining to the eventual winner of a college basketball game. The data are noisy, so the author employs smoothing techniques. He then translates solutions into simple rules and equations that you can use in real time. Katherine McGivney, Ray McGivney, and Ralph Zegarelli address the same question, but for professional basketball. The authors employ logistic regression models to describe the performance of simple rules. Simulation is used to further evaluate the rules. Both articles are possible only because the authors made considerable efforts to produce usable data.

In Mark Glickman’s Here’s to Your Health column, Stephanie Land critically comments on failings of risk perception and the challenge of effectively communicating risks. Clear comparative graphical presentations of risk are part of her story. In the Visual Revelations column, Paul Velleman and Howard Wainer take a look at blood sugar measurements and diabetes. The data here are for one individual. Outliers and trends are investigated in detail. In this article, teachers of statistics will find comparisons of measures of center and examples of smoothing noisy time-varying data.

The issue is completed with a new puzzle from Jonathan Berkowitz in his Goodness of Wit Test column and the announcement of a CHANCE graphics contest. As always, a one-year (extension of your) subscription to CHANCE will be awarded for each of two correct puzzle solutions chosen at random from among those received by the deadline. The graphics contest similarly will award one-year (extensions of) subscriptions to winners as described in the announcement. We look forward to your puzzle and graphics entries! And, as always, I look forward to your comments, suggestions, and article submissions. Enjoy the issue!

Mike Larsen
CHANCE
5
War, Enmity, and Statistical Tables
Brian E. Clauser
“I am surely not alone in having suspected that some of Fisher’s major views were adopted simply to avoid agreeing with his opponents.”
— Leonard J. Savage, Annals of Statistics, 1976
Although the use of strict nominal levels for significance testing (e.g., .01 or .05) has passed in and out of vogue, it is beyond question that this approach to statistics has had a profound impact on 20th century science. Numerous authors have raised concerns about the theoretical appropriateness and practical impact of this approach. In fact, a recent book by Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Economics, Cognition, and Society), goes so far as to argue that the emphasis on significance testing (rather than on measures of effect size) has damaged the study of economics, psychology, and medicine, and, in the process, cost society both jobs and lives.
6
VOL. 21, NO. 4, 2008
The File-Drawer Effect
Among the unintended and problematic consequences of statistical testing using fixed levels of significance is the potential for this practice to lead to publication bias. If experiments that fail to achieve the nominal level of significance are unlikely to become part of the literature, significant results arising from type I error may appear to provide a basis for sound conclusions. This phenomenon, referred to as the file-drawer effect, has been recognized as a theoretical problem in the psychological literature for decades. Theodore Sterling, in 1959, was among the first to document the impact nominal levels of significance have had on the psychological literature. He reviewed 362 articles selected at random from four journals.
From left: R. A. Fisher, Karl Pearson, and William Gosset
Of the 294 articles that included significance tests, only eight reported nonsignificant results; 97% reported results that were significant at a level of p≤.05. Emphasis on these nominal levels of significance has been explicitly espoused by both journal editors and the influential Publication Manual of the American Psychological Association. As editor of the Journal of Experimental Psychology, Arthur Melton stated in 1962 that papers were unlikely to be accepted if they did not report a statistically significant result. Later, the Publication Manual supported strict adherence to nominal levels of significance when it advised authors, “Do not infer trends from data that fail by a small margin to meet the usual levels of significance.” Although there has been active debate more
recently about the appropriate application of tests of statistical significance, the impact of this practice continues to be evident in numerous fields. In medical research, such as drug trials, publication bias literally may be a matter of life and death. When pharmaceutical companies support hundreds of drug trials, many of the significant findings reported for the smaller number of studies that make it to publication are likely to have occurred by chance. This is so serious a problem that the Food and Drug Administration (FDA) has instituted requirements that all trials be registered, regardless of whether the outcome results in publication. Similarly, the International Committee of Medical Journal Editors, which represents the
editors of hundreds of journals—including the New England Journal of Medicine, The Lancet, and the Journal of the American Medical Association—has instituted a policy stating a clinical trial must have been registered before the first subject was recruited to qualify for publication. In an editorial announcing the policy, the committee stated that individuals who have agreed to participate in clinical trials “deserve to know that the information that accrues from their altruism is part of the public record, where it is available to guide decisions about patient care, and deserve to know that decisions about their care rest on all of the evidence, not just the trials that authors decided to report and that journal editors decided to publish.”

A recent paper in the New England Journal of Medicine used these reporting requirements to examine the extent of publication bias for studies of selected antidepressant medications. The authors concluded that publication of a study was dependent on whether it produced a significant result. Out of 74 registered studies, 51 were published and 23 were not. Of the studies viewed by the FDA as having significant positive results, 37 were published and only one was not. Meta-analysis indicated the publication bias increased the effect size for the 12 studied medications from 11% to 69% (32% overall). This suggests a strong bias toward publication for studies that show significant positive results and a sizable change in the resulting interpretation of the efficacy of the medications.

More recent texts are designed for use with computer programs that automatically provide the specific level of significance associated with the calculated test statistic, but the use of strict levels of statistical significance for interpreting the results continues. David Moore and George McCabe, in Introduction to the Practice of Statistics, identify a four-step process for interpreting the results of a statistical test: 1) state the null hypothesis, 2) calculate the test statistic, 3) identify the associated level of significance, and 4) state a conclusion. They go on to say, “One way to do this is to choose a level of significance α.” What they don’t offer is an alternative approach to drawing conclusions.
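The file-drawer mechanism described here can be illustrated with a small simulation. The sketch below is illustrative only (it is not the Turner et al. analysis); the effect size, sample size, and number of trials are arbitrary choices, and it assumes NumPy and SciPy are available. When only significant positive results are "published," the average published effect overstates the true one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.2      # small true standardized effect (assumed for illustration)
n, trials = 30, 5000   # per-trial sample size and number of simulated trials

published = []
for _ in range(trials):
    sample = rng.normal(true_effect, 1.0, size=n)
    t_stat, p = stats.ttest_1samp(sample, 0.0)
    # the file drawer: only significant positive results reach print
    if p <= 0.05 and t_stat > 0:
        published.append(sample.mean())

print(f"true effect: {true_effect}")
print(f"mean published effect: {np.mean(published):.2f}")  # noticeably larger
```

The published literature in this toy world reports an average effect well above the true value, for exactly the reason Sterling identified: the nonsignificant replications stay in the drawer.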
Textbook Prescriptions
Emphasis on the use of strict levels of significance for interpreting experimental results also has been reflected in introductory statistical textbooks (used to train future researchers). In Elementary Statistics, originally published in 1954 and republished in 1968, Janet Spence, Benton Underwood, Carl Duncan, and John Cotton present an example to illustrate the use of Student’s t-statistic. Results are presented for an experiment with five degrees of freedom that results in a t-statistic of 2.19, and the reader then is instructed to compare this value to the value in the table associated with a .05 level of significance (2.34). The conclusion that follows is, “In short, we cannot reject the null hypothesis.” The authors are being consistent with the previously noted advice from the Publication Manual and teaching by example that the values in the table (in this text, .05 and .01) are the values that matter. In this example, the t-statistic was associated with a probability of .06, but there is no way for the reader to know this.
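Modern software gives the exact probability that the textbook's readers could not see. A sketch of the decision logic, assuming SciPy; note the textbook's tabled value (2.34) and reported probability differ somewhat from the standard two-sided t with 5 degrees of freedom used here, so this reproduces the reasoning rather than the book's table:

```python
from scipy import stats

t_obs, df = 2.19, 5
# exact two-sided p-value, which the table-only reader never sees
p_two_sided = 2 * stats.t.sf(t_obs, df)
print(f"p = {p_two_sided:.3f}")

alpha = 0.05
if p_two_sided <= alpha:
    print("reject the null hypothesis at the 5% level")
else:
    print("cannot reject the null hypothesis at the 5% level")
```

The exact p-value falls just short of significance, which is precisely the information lost when only the .05 and .01 columns of a table are consulted.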
Origin of the 0.05 Cut-Off
Given the widespread and potentially problematic influence of nominal levels of significance described so far, it is of interest to consider the origin of the use of specific levels of statistical significance as strict guidelines for interpreting the results of experiments. Historically, this development is of particular interest because the originators of the statistical procedures recommended the application of rough guidelines, not strict cut-offs. For example, Howard Wainer and Daniel Robinson summarize Fisher’s approach in an article published in Educational Researcher by stating:

Throughout Fisher’s work, he used statistical tests to come to one of three conclusions. When p (the probability of obtaining the sample results when the null hypothesis is true) is small (less than .05), he declared that an effect had been demonstrated. When it is large (p is greater than .2), he concluded that if there is an effect, it is too small to be detected with an experiment this size. When it lies between these extremes, he discussed how to design the next experiment to estimate the effect better.

Ultimately, it may be impossible to provide a definitive answer to the question of why these nominal levels of significance have become so widespread, but one possibility is that this trend in statistical practice arose out of historical conditions essentially unrelated to statistical theory. It may be that specific levels of statistical significance are used simply because that is how R. A. Fisher structured the tables included in his 1925 monograph, Statistical Methods for Research Workers. Furthermore, it may well be that it was war and personal enmity that brought about the existence of these tables in the first place.
That is, Fisher may have introduced tables structured around these levels of significance for the sole purpose of creating a new and alternative format for his tables—a format created not because Fisher believed the new format was advantageous, but because he was prevented by copyright restriction from publishing the tables in their original form. Finally, it may be that the enforcement of those copyright restrictions resulted in part from the economic difficulties faced by the journal Biometrika after World War I and in significant part from the copyright being controlled by Karl Pearson. Pearson disliked Fisher and never would have considered allowing him to reproduce the original tables.
Figure 1. A section from “Student’s” table of the z statistic Reprinted from “The Probable Error of a Mean,” “Student” (1908), Biometrika.
Figure 2. Section from Fisher’s table of the t-statistic Reprinted from Statistical Methods for Research Workers, R. A. Fisher (1925), Oliver & Boyd.
Though all of this is speculation, it is speculation based on historical evidence. The argument is presented in two parts. First, it is suggested that Fisher’s modification of the format of the statistical tables appearing in his monograph changed the way the tables were read and unavoidably moved toward a focus on specific levels of significance. Second, it is argued that the change in the format of the tables can be traced to a variety of historical factors, but that there is little evidence to support Fisher’s claim that he viewed the format as superior.
The Tables
Consider, for example, two statistical tables that existed before Fisher’s monograph: Pearson’s χ² distribution and “Student’s” t (or, as “Student” originally referred to it, z) distribution. In both Pearson’s table from the 1900 publication in Philosophical Magazine and the table in “Student’s” 1908 Biometrika paper, rows and columns are defined in terms of the observed values, and the body of the table consists of probabilities. Figure 1 shows a section of the table from Biometrika. With this table, the researcher completes the calculations, finds the row corresponding to the sample size and the column corresponding to the magnitude of the statistic, and thus identifies the level of significance within the body of the table. So, for example,
with a sample size of five and a z equal to 1.0, the associated probability level is 0.9419. Interpolation is required for the user to identify the magnitude of the statistic that would be associated with a .01 or .05 significance level. Fisher inverted this design so the user selects the level of significance and identifies the value of the statistic associated with that level. With the earlier design, the user read the level of significance associated with the results. With Fisher’s approach, shown in Figure 2, the user finds the value of the statistic required to achieve a specified level of significance. So, for example, with five degrees of freedom, a significance level of .05 is achieved when t is at least 2.571. With this change, specified levels of significance take on a new and central importance. Fisher acknowledged that the original tables could not be reproduced due to copyright restrictions, but he described the new format of his tables as that “which experience has shown to be more convenient.” It appears to be a matter of historical fact that Fisher is the first to have published tables in this form. It is also clear that Fisher’s versions of the tables have become ubiquitous. It is impossible to prove that Fisher’s table led to the use of established levels of significance, but it is clear the new format facilitated their use, and it is difficult to believe they did not influence practice.
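The two directions of table use can be reproduced with modern software. A sketch assuming SciPy; the conversion between "Student's" z and the modern t follows the standard relation t = z·√(n−1) on n−1 degrees of freedom:

```python
from scipy import stats

# "Student's" direction: given the statistic, read off the probability.
# For a sample of n = 5 and z = 1.0, t = 1.0 * sqrt(4) = 2.0 on 4 df.
p = stats.t.cdf(1.0 * (5 - 1) ** 0.5, df=5 - 1)
print(f"P = {p:.4f}")  # 0.9419, matching the Biometrika table entry

# Fisher's direction: given the level, read off the required statistic.
# Two-sided 5% point with 5 degrees of freedom:
t_crit = stats.t.ppf(1 - 0.05 / 2, df=5)
print(f"t required: {t_crit:.3f}")  # 2.571, matching Fisher's table
```

The first call answers "how improbable is my result?"; the second answers "how big must my result be?" — and it is the second question that puts a fixed significance level at the center of the analysis.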
Fisher and Gosset
One opportunity for a direct comparison of how the two forms of the table are interpreted comes from the writings of William Gosset and Fisher. Gosset (who published under the name “Student”) provided an illustration of the use of his table in his 1908 paper. The example compared the additional hours of sleep gained from the use of two separate medications. After discussing the effect of each of the medications in comparison to no medication, Gosset stated:

I take it the real point of the authors was that [medication] 2 is better than [medication] 1. This we must test by making a new series, subtracting 1 from 2. The mean value of this series is +1.58, while the S.D. is 1.17, the mean value being 1.35 times the S.D. From the table the probability is 0.9985, or the odds are about 666 to 1 that [medication] 2 is the better soporific.

Gosset makes no comparison of his results to nominal levels of significance. By comparison, Fisher uses the same data set to illustrate the use of the t-test in Statistical Methods. He calculates t to be equal to 4.06 and then concludes, “For n=9 only one value in a hundred will exceed 3.250 by chance, so the difference between the results is clearly significant.” The format of the table makes it necessary for Fisher to draw a conclusion relative to this single nominal level (p≤.01).

When Fisher published his 1925 monograph, the copyrights to Pearson’s χ² table and Student’s z table were owned by Biometrika and controlled by Karl Pearson. Fisher was unable to publish the tables in their original form without Pearson’s permission. Fisher’s comment about the advantage of this new form of the table aside, there is considerable evidence that the choice of format was the result of copyright restriction and not attributable to any advantage to the reader.

During the years leading to the publication, Fisher discussed his interest in publishing the tables—in their original form—in correspondence with Gosset. Although the letters Gosset received from Fisher have been lost, Fisher carefully filed his correspondence from Gosset, and Gosset’s comments strongly suggest that, as late as the end of 1923, Fisher hoped to publish Student’s table in its original format. In July of that year, Gosset responded to an apparent inquiry about the possibility of reproducing the table from the 1908 Biometrika paper:

As to “quoting” the table in Biometrika it depends just what you mean by quoting. I imagine they have the copyright and would be inclined to enforce it against anyone. The journal doesn’t now pay its way though it did before the war and they are bound to make people buy it if they possibly can. I don’t think, if I were editor, that I would allow much more than a reference!

Subsequently, Gosset referred to the table he and Fisher had been working on, which ultimately appeared in Metron, saying, “I take it that this table is, if it gets finished in time, to be published in your book.” Fisher’s continued interest in publishing the table in its original form appears to be confirmed by a reply from Gosset, written at a point when Gosset was considering offering this new table to Pearson for publication in Biometrika:

Re your postscript about publication, I quite agree: when the thing is put together I will either send it or take it to K.P. [Karl Pearson] and will make it quite clear that you wish to have the right of publication in case you wish to include it in any book that you may be bringing out.

Clearly, Gosset assumed Fisher intended to publish the tables in their original form in December of 1923. By July of 1924, Fisher had instructed his publisher to forward proofs of Statistical Methods to Gosset; these reflected the new form of the tables. Apparently, Fisher’s change of view occurred as publication approached. In this context, it is also worth noting that Fisher continued to work with Gosset on the new table. That table ultimately was published after Statistical Methods, but it was published in the earlier format.
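Gosset's and Fisher's calculations on the sleep data can both be recovered from the summary figures quoted above. A sketch assuming SciPy; the z-to-t conversion is the standard relation t = z·√(n−1) with n = 10 paired observations:

```python
from math import sqrt
from scipy import stats

n = 10
mean_diff, sd = 1.58, 1.17  # Gosset's reported mean and S.D. (population form)

# Student's route: z = mean / S.D., then read the probability from his table
z = mean_diff / sd                          # "1.35 times the S.D."
p_student = stats.t.cdf(z * sqrt(n - 1), df=n - 1)
odds = p_student / (1 - p_student)
print(f"z = {z:.2f}, P = {p_student:.4f}")  # Student reports 0.9985
print(f"odds about {odds:.0f} to 1")        # Student reports about 666 to 1

# Fisher's route: the same data expressed as a t-statistic on 9 df
t_stat = z * sqrt(n - 1)
print(f"t = {t_stat:.2f}")                  # Fisher reports 4.06
```

The arithmetic is identical; what differs is the reading. Student reports a probability and odds, while Fisher's table format forces the comparison against a single tabled cut-off (3.250 at the 1% level).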
Fisher and Pearson
The impact of the war and the continuing economic difficulties for Biometrika to which Gosset referred may have been real factors in restricting use of the copyright, but when matters relating to Pearson arose, it is clear Gosset engaged in a level of diplomacy that recognized the extremely strained relationship between Pearson and Fisher. In a 1922 letter, Gosset commented, “In most of your differences with Pearson I am altogether on your side …” Few professional antagonisms have been more public than that between Fisher and Pearson. It is difficult to identify the original source of the antagonism. From the start, Pearson may have felt threatened by Fisher’s mathematical ability. In 1912, Gosset received a letter from Fisher providing a mathematical proof of Student’s formula for the probable error of the mean. Such a proof was missing from Student’s 1908 paper and was beyond Gosset’s mathematical ability. Gosset forwarded the proof to Pearson for his opinion with the suggestion that the
proof be published in Biometrika. Pearson responded, “I do not follow Mr. Fisher’s proof …” The relationship between Fisher and Pearson certainly was not helped by Fisher’s identification of an error in the way Pearson applied the χ² test in his 1900 paper. In Fisher’s article, he identified the importance of degrees of freedom in interpretation of the table; the suggestion that Pearson had failed to understand what was arguably his own most important contribution to mathematical statistics must have been a considerable affront. Regardless of the origin of this animosity, it was clear Pearson took his anger to the grave and that even Pearson’s death did not end Fisher’s animosity. Pearson’s final Biometrika publication was an attack on a paper by R. S. Koshal, who had argued in print for the superiority of Fisher’s maximum likelihood estimation over Pearson’s method of moments. Although Koshal’s paper was the immediate target of the attack, Pearson was explicit about his attack being directed at Fisher. The paper had a single author, Koshal, but Pearson refers to it as the Koshal-Fisher paper. Pearson complained about the methodology used by Koshal and noted that some of the tabulated results contained computational errors. Pearson, of course, pointed these errors out: “Had Professor Fisher … investigated … Koshal’s arithmetic, he would probably have been less dogmatic about the importance of the Koshal-Fisher results.” Pearson then could not resist repeatedly reminding his readers of these errors: “Two of which be it remembered are in error,” “… there are blunders in the arithmetic,” “… owing to faulty arithmetic,” and “… also springs from the occurrence of blunders in Koshal’s arithmetic.” Shortly after Pearson’s death, Fisher responded in the pages of the Annals of Eugenics with a paper titled “Professor Karl Pearson and the Method of Moments.” Fisher identified what he considered to be errors in the procedure implemented by Pearson.
Based on these errors, he referred to Pearson as a “clumsy mathematician” and went on to say, “Had it not been for his arrogant temper, his taste for numerical exemplification might well have saved him from serious theoretical mistakes.” Judging the magnitude of Pearson’s errors to be 20 times those of Koshal’s, Fisher continued, “The English language does not seem to possess a word 20 times as forcible as ‘blunder.’ May we hope that some Eastern tongue known to Koshal is more amply provided.” In 1950, more than a decade later, Fisher’s introduction to this same paper reprinted in Contributions to Mathematical Statistics showed his hostility had not diminished: “Pearson was an old man when it occurred to him to attack Koshal, but it would be a mistake to regard either the errors or the venom of that attack as a sign of failing powers. … If peevish intolerance of free opinion in others is a sign of senility, it is one which he had developed at an early age. Unscrupulous manipulation of factual material is also a striking feature of the whole corpus of Pearson’s writing …”
Summary
History does not provide the sort of randomized experiments Fisher would have advocated; we cannot know whether the loose guidelines advocated by Fisher for interpreting levels of significance would have evolved into the strict rules they
became if Fisher had published the tables in their original form. It is difficult to believe the format did not have an impact on the way the tables were used and the specific values perceived. Was Pearson’s dislike of Fisher the factor that led to Fisher’s development of a new format for statistical tables (highlighting levels of nominal significance)? Stephen Stigler, in an article beginning on Page 12, suggests that the complexity of presenting the tables for Fisher’s F statistic may have led to Fisher’s choice of format for his new tables. Surely, Fisher’s decision was influenced by a combination of factors, but Fisher’s own words suggest that copyright restriction led to the construction of the new tables. Fisher acknowledged in Statistical Methods for Research Workers that he did not present the previously available χ² table owing to copyright restriction. His correspondence with Gosset during the period in which he was working on this monograph similarly made clear his interest in publishing the original form of Student’s table. Was it inevitable that the owner of the copyright for these tables would prevent publication by others? The simple answer is “no.” To this day, many of us learn statistics from texts that credit Fisher as the source of the tables; certainly, Fisher and his publisher—Oliver and Boyd—were not in the business of losing money. Fisher allowed G. Udny Yule and M. G. Kendall to reproduce his table of the z distribution when he was still producing revised editions of his own monograph, and he allowed the same consideration to nearly 200 other authors. Clearly, it is possible that the commonly used nominal levels of significance would have gained prominence without the series of events described in this article. But, the available evidence suggests the conditions created by World War I and the subsequent economic difficulties of Biometrika—combined with Pearson’s tight-fisted nature and, most of all, his animosity for R. A.
Fisher—produced the circumstances that led to Fisher’s tables establishing nominal levels of significance.
Further Reading
Box, J.F. (1978) R. A. Fisher: The Life of a Scientist. New York: John Wiley & Sons.
De Angelis, C., Drazen, J.M., Frizelle, F.A., Haug, C., Hoey, J., Horton, R., et al. (2004) “Clinical Trial Registration: A Statement from the International Committee of Medical Journal Editors.” Annals of Internal Medicine, 141:477–478.
Gigerenzer, G. (2004) “Mindless Statistics.” Journal of Socio-Economics, 35:587–609.
Pearson, E.S. (1990) “Student”: A Statistical Biography of William Sealy Gosset. (R. L. Plackett & G. A. Barnard, eds.). Oxford: Clarendon Press.
Rosenthal, R. (1979) “The File Drawer Problem and Tolerance for Null Results.” Psychological Bulletin, 86:638–641.
Turner, E.H., Matthews, A.M., Linardatos, E., Tell, R.A., Rosenthal, R. (2008) “Selective Publication of Antidepressant Trials and Its Influence on Apparent Efficacy.” New England Journal of Medicine, 358:252–260.
Wainer, H. and Robinson, D.H. (2003) “Shaping Up the Practice of Null Hypothesis Significance Testing.” Educational Researcher, 32:22–30.
Fisher and the 5% Level
Stephen Stigler
Surely R. A. Fisher played a major role in the canonization of the 5% level as a criterion for statistical significance, although broader social factors were involved. Fisher needed tables for his 1925 book and, evidently, Karl Pearson would not permit the free reproduction of the Biometrika tables, so Fisher computed his own. Fisher found it convenient to table values in the extremes for levels such as 10%, 5%, 2%, 1%—roughly halving the level with each step.

One simple explanation for the format he selected lies in the fact that the book introduced “analysis of variance,” or ANOVA. For most readers, this would be their first exposure to ANOVA, and Fisher needed a way to make the new test accessible—essentially the F-test, although he preferred to work in terms of z = log(F). The table here was entirely novel, requiring entry via two parameters: the numerator and denominator degrees of freedom (df). It would have been impractical to provide a full table of the distribution for each pair of values: With the 10 levels of both dfs he wished to include, 100 tables would have been required if he gave the same level of detail he gave for his normal distribution table, or 10 tables if he gave the reduced level of detail that Gosset gave in his 1908 table for the t-distributions. So, Fisher initially settled on only giving one table for the 5% point. Once that was decided, it is not implausible that Fisher chose (in a book for practical workers) to make the other tables conform to that same simple format. This was not a huge task, and it had the bonus of casting all assessments of significance in the same accessible form.

The first edition (1925) of Fisher’s book Statistical Methods for Research Workers had six tables:

I. and II. Tables of the inverse cumulative normal distribution (of z in terms of P, where P = F(–z) + 1 – F(z) = Pr{|Z|>z} and Z has a standard normal distribution). He gave this for P = .01 to .99 (increments of .01) and for P = .001, .0001, ..., .000000001.
III. Percent points y, where P = 1–F(y), for chi-square, df = 1, 2, ..., 30, and P = .99, .98, .95, .90, .80, .70, .50, .30, .20, .10, .05, .02, .01.

IV. Percent points for the t-distributions, df = 1, 2, ..., 30, ∞, and P = .9, .8, .7, .6, .5, .4, .3, .2, .1, .05, .02, .01.

V. Percent points for the correlation coefficient r, for n = 1 (1) 20 (5) 50 (10) 100 and for P = .1, .05, .02, .01. He also gave (as Table V (B)) the hyperbolic tangent transformation of r.

VI. Table VI gave only the P = .05 percent points for the distribution of z (the log of the F-statistic) by numerator df and denominator df, for df = 1, 2, 3, 4, 5, 6, 8, 12, 24, ∞.

By the third edition (1930), he had added a table giving the 1% points and enlarged the range of denominator df considerably. Note that only Fisher’s Table VI strongly emphasized the 5% point. The others gave varying degrees of extended coverage, especially for the Normal, t, and chi-square distributions, where they gave a pretty good idea of each whole distribution. Later editions of Statistical Methods for Research Workers (from the seventh of 1938) moved all the tables from the end of the book and interspersed them through the text. All these tables and more were given in Fisher and Frank Yates’ book, Statistical Tables for Biological, Agricultural and Medical Research. There, the table for (essentially) the F-distribution was expanded to include a range of values from the 20% to 0.1% points.

My own view is that while Fisher’s initial Table VI (but only that table) fixed attention at the 5% level (rather than, say, 6%, 10%, or 2%), that fixation is largely the result of a social process extending back well before Fisher. Even in the 19th century we find people such as Francis Edgeworth taking values “like” 5%—namely 1.5%, 3.25%, or 7%—as a criterion for how firm evidence should be before considering a matter seriously.
Odds of about 20 to 1, then, seem to have been found a useful social compromise with the need to allow some uncertainty: a compromise between (say) .2 and .0001. That is, 5% is arbitrary (as Fisher knew well), but it fulfils a general social purpose. People can accept 5% and achieve it in reasonable-sized samples, as well as have reasonable power to detect effect sizes that are of interest. In my 1986 book, The History of Statistics, I speculate that the lack of such a moderate standard of certainty was among the factors that kept Jacob Bernoulli and Thomas Bayes from publishing. The use of Fisher's tables only served to make the choice more specific.

One may look to Fisher's table for the F-distribution and his use of percentage points as leading to subsequent abuses by others. Or, one may consider the formatting of his tables as a brilliant stroke of simplification that opened the arcane domain of statistical calculation to a world of experimenters and research workers who would begin to bring a statistical measure to their data analyses. There is some truth in both views, but they are inextricably related, and I tend to give more attention to the latter, while blaming Fisher's descendants for the former. After all, a perceptive 1919 article by the psychologist Edwin G. Boring, warning of the potential misuse of what we now call statistical significance, is ample evidence that the abuse predated Fisher.
Further Reading

Boring, Edwin G. (1919), "Mathematical vs. Scientific Significance," Psychological Bulletin, 16:335–338.

Fisher, Ronald A. (1925), Statistical Methods for Research Workers (first ed.), Edinburgh: Oliver & Boyd.

Fisher, Ronald A., and Yates, Francis (1938), Statistical Tables for Biological, Agricultural and Medical Research (first ed.), London: Oliver & Boyd.

Stigler, Stephen M. (1986), The History of Statistics: The Measurement of Uncertainty Before 1900, Cambridge, Mass.: Harvard University Press.
How to Determine the Progression of Young Skiers? Moudud Alam, Kenneth Carling, Rui Chen, and Yuli Liang
Skiers ages 14–17 as forerunners in Svenska Skidspelen 2006. The skiing style is skate.
Sports are popular among children in most countries, and many parents enjoy watching their children participate in sports as leisure activities. Various surveys suggest children play sports because sport provides an arena for social relations and, as the child grows older, the stimulus of positive feedback from training, in the form of race results, plays a greater role. A sad aspect of sports and children is the high drop-out rate in the early teenage years. To the extent the drop-out is caused by an emerging interest in other activities or a greater focus on studies, it is fine. But a substantial number of teenagers drop out because they do not get positive feedback from their training in terms of racing results. Sports associations could probably take various measures to reduce the drop-out rate, and an ongoing debate concerns such measures.
One explanation for the high drop-out frequency is that puberty affects individuals differently, and the differential effects impact performance in sports. It is well known that puberty occurs at somewhat different ages for individual teenagers. The risk of dropping out of sports could be high for individuals who experience puberty late, because they might perform worse than their peers. That is, it is possible they would be dejected by comparatively weak racing results. Val Abbassi summarized male and female progression curves and the puberty effect in an article titled "Growth and Normal Puberty," published in Pediatrics, the journal of the American Academy of Pediatrics. According to this work, the female curve is at its steepest around the age of 12. For males, it is around 14 years of age and steeper than for females. These curves are conventionally assumed to approximate the progression in sports. During the period of rapid change, the estimation of the progression curve is challenging.
Figure 1. Average skiing velocity for Julia Forsmark, computed from her participation in two annual races in six consecutive years (velocity in m/s plotted against age).
Table 1—Number of Skiers Divided by Gender and Age

Age     Boys   Girls   Total
18      23     15      38
17      25     24      49
16      32     25      57
15      27     32      59
Total   107    96      203
Repeated Race Measures Are Unreliable in Skiing

To get to the core of the problem, consider a girl named Julia Forsmark, whose progression in skiing velocity is shown in Figure 1. The figure shows her average racing velocities in two races each year from the age of 10 to the age of 15. (She turned 15 in 2008.) Julia's slowest speed was recorded at age 10, as one would expect. However, her fastest race was at the age of 12, when she reached almost seven meters per second. Is her recent training scheme bearing no fruit? Did she experience an effect of early puberty and then level off? Or are the data unreliable?

The answer is that the data are unreliable in the sense that there are confounding variables that need to be taken into consideration. To understand why, one needs to know a bit about cross-country racing. Obviously, the velocity of skiing depends on the skill of the skier, but many other factors influence the speed in a particular race. Wind and temperature are two such factors; strong wind and low temperatures generally decrease the speed. Snow condition is an even more important factor, as the friction of the skis on the snow depends on the type of snow. Fresh snow and a low dew point will make the skis glide slowly on the snow surface, whereas icy, granular snow will make them glide fast.
How the skis interact with the snow condition is also affected by the ski waxing. The classical style of skiing uses glide wax on one part of the ski, which is always in contact with the snow, and grip wax on another part, which touches the snow only when the skier is kicking forward. In freestyle skiing (or skate), only a glide wax is used, as the skis are constantly in contact with the snow. The skier must prepare the waxing before the race; she cannot change skis during the race. Consequently, the race result will be contingent on whether the applied wax was optimal for the conditions at hand. Yet another factor that influences the speed is the profile of the track (e.g., the total climbing meters and the sharpness of curves along the track). Most races take place in forests, and the organizers may need to modify the tracks from year to year depending on ongoing timber logging and the snow depth. Hence, the tracks for the same race might differ over time. The Swedish Ski Association and the International Ski Federation (FIS) provide recommendations for the profile of a track, but race organizers are fairly free to decide the details. Another influencing factor is the actual distance, as the velocity is computed as the ratio of the stipulated distance to the racing time. The stipulated racing distances are, at best, indicative of the actual length of the track. Indeed, due to the complicating factors, cross-country ski organizers are not careful in setting a track that matches the stipulated distance. Even though it is hard to tell, we would not be surprised if a stipulated racing distance of 3 kilometers turned out to be anything between 2.5 and 3.5 kilometers. The problem of unreliable race data due to confounding factors extends to other sports. The reader might, for instance, want to reflect upon potential confounders in mountain and road biking, rowing, and cross-country running. A distinction can be made between sports.
Some sports, such as swimming and indoor track and field, have standardized race environments. Others, such as those mentioned in this article, are subject to more confounding factors and variability in conditions. For the former class of sports, the repeated race data are reliable because the race environment is standardized. Therefore, it is fairly simple to determine an adolescent’s progression. For the latter class of sports, it is more difficult to measure progression.
Examination of Young Skiers Selected from Two Races

The race data for Julia shown in Figure 1 came from her participation in Lilla SS, which has been held annually in mid-February since 1974 in Falun, and Morapinglan, which has been held annually around New Year's Eve since the 1980s in Mora. Her race data come from 2003–2008 (i.e., six years of data). We selected all cross-country skiers ages 15–18 who participated in one or both of the races in 2008. Thereafter, we traced the skiers' results back to 2003. Table 1 shows the number of skiers in the sample, divided by gender and age.

There are several reasons why the two races, Lilla SS and Morapinglan, were selected. Lilla SS started the same year the World Championship in skiing (cross-country skiing, ski jumping, and Nordic combined) was held in Falun. Lilla SS benefited from rapid technological development of timing and electronic recording, due to the staff members involved with the championship. From the beginning, the race
Table 2—Median Velocity in Lilla SS 2004 (m/s)

Age     10     11     12     13     14     15     16     17     18
Boys    2.87   3.28   3.51   3.68   3.85   5.25   5.33   5.40   5.40
Girls   2.79   3.21   3.27   3.53   3.68   4.44   4.68   4.80   4.80
Table 3—Deciles of Boys' (18 Years Old) Velocities in Lilla SS 2004 and MP 2008

Decile          10%    20%    30%    40%    50%    60%    70%    80%    90%
Lilla SS 2004   5.07   5.14   5.21   5.35   5.40   5.46   5.61   5.71   5.81
MP 2008         5.12   5.18   5.26   5.32   5.40   5.53   5.58   5.63   5.76
has attracted skiers ages 7–20, who have tried to return every year. The same is true for Morapinglan. Moreover, Mora and Falun are geographically close and belong to the same skiing district; therefore, the likelihood of a skier participating in both races is high. The races also have in common that skiers race individually, in the sense that one skier at a time departs and her racing time is recorded. An important consequence is that the skiers' racing times can be regarded as independent, which renders the statistical modeling simpler.

For the 203 skiers, we have 967 racing results, which means not every skier participated on all 12 occasions. We have six or more repeated measures for 35% of the skiers, and the average number of racing times per skier is 4.8. We do not believe this will bias the results, because the main reasons for a skier not to participate are illness, the race falling on the same date as another important race, or the skier being committed to another sport with an event on the same date.

Sports competitions for adolescents are usually organized in age classes, and this is true for skiing. Conventionally, participants born in the same year compete in the same class. As a consequence, there might be an age difference of one year between the participants, and the data obtained from Lilla SS and Morapinglan make it possible to deduce only the year of birth of the skiers in the sample. Such a measure of age is too crude for a valid estimate of the progression curve. Luckily, we have had access to the skiing clubs' member registers, from which we retrieved the date of birth of all the skiers in the sample. We used this information jointly with the race dates to compute the age of the skiers, measured in days and converted into fractions of years.
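The age computation described above, converting the span between date of birth and race date from days into fractions of a year, can be sketched in a few lines of Python. The divisor 365.25 and the example dates are our assumptions for illustration; the article states only that ages were measured in days and converted into fractions of years.

```python
from datetime import date

def age_in_years(born: date, race_day: date) -> float:
    """Age at race time: days between the two dates, converted to
    fractional years. The 365.25-day year (averaging over leap years)
    is an assumption, not stated in the article."""
    return (race_day - born).days / 365.25

# A skier born February 20, 1993, racing in mid-February 2008
# (hypothetical dates, for illustration only):
age = age_in_years(date(1993, 2, 20), date(2008, 2, 17))
```

This gives an age just under 15 years, precise enough to separate skiers who share a birth year but were born months apart, which is exactly what the crude class-based age cannot do.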
Standardization of the Velocity

Figure 1 showed that Julia skied her fastest race, at almost 7 m/s, at the age of 12 (in Morapinglan). In Lilla SS the same year, her velocity was 4 m/s. These two values cannot possibly describe the difference in her performance that year, but rather the difference in the confounding factors of the two races. To render the velocities of the 12 race occasions comparable, we standardized the velocity, using Lilla SS in 2004 as the reference race. That occasion was selected as the reference because there is reason to believe the race distances are accurate and other conditions were fairly stable. Moreover, the number of participants in each gender and age class was sufficiently large. Table 2 shows the median velocity for each racing class by age and gender. The median is calculated based on all participants, not only the 203 skiers included in this study of progression. Skiers ages 17 and 18 race in the same class at Lilla SS, which is why the medians are equal for these two age groups. The median velocity was calculated instead of the mean because it is fairly common to find outliers (e.g., breaking a ski pole during the race would make the velocity very low and aberrant).

If we believed the confounders altered the distribution of velocity only by shifting its center, it would be natural to add or subtract the races' medians. However, we believe the variance of the velocity is also affected by the confounders. We therefore compute the standardized velocity by scaling the skier's velocity at a race occasion by a factor equal to the ratio of the median velocity in Lilla SS 2004 for the relevant racing class to the corresponding median velocity at that race occasion. After standardizing the velocity, we believe the distribution should be about the same for the 12 race occasions. This belief is a consequence of seeing no reason why the ability of the skiers should differ between the races.

One potential problem with scaling the velocity might be that the standardization only works for the center of the distribution but fails to render the extremes, such as the top velocities, comparable. To check this, we compared the deciles of the standardized velocities by age and gender. Table 3 provides one example of this checking procedure: it shows the deciles for 18-year-old boys at Lilla SS in 2004 and the deciles for boys of the same age in Morapinglan in 2008.
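The scaling rule just described multiplies each observed velocity by the ratio of the reference median (Lilla SS 2004, for the same age and gender class) to the median of the race actually skied. A minimal Python sketch follows, using the girls' age-12 reference median of 3.27 m/s from Table 2; the other race occasion's velocities are made-up illustrative numbers, not data from the article.

```python
from statistics import median

def standardize(velocity, ref_median, race_median):
    """Scale a velocity so the racing class's median at this occasion
    maps onto the reference median (Lilla SS 2004, same class)."""
    return velocity * (ref_median / race_median)

# Girls age 12: reference median 3.27 m/s (Table 2). Suppose the same
# class's median at another race occasion is computed from that race's
# results (illustrative velocities only):
race_velocities = [4.1, 4.4, 4.6, 4.9, 5.2]
race_median = median(race_velocities)  # 4.6 m/s
standardized = standardize(5.2, 3.27, race_median)
```

A skier who covered that occasion's course at 5.2 m/s, against a class median of 4.6 m/s, is thus credited with about 3.70 m/s on the reference scale; by construction, scaling (rather than shifting) adjusts the spread of the distribution along with its center.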
We find the standardization has made the distributions similar across the 12 race occasions and the different classes. Figure 2 shows box plots of the standardized velocities for each age. Most noteworthy is the substantial jump in velocity that appears to occur at about age 14. For the boys, the jump is high, whereas the girls seem to have a weaker, more prolonged progression around 14.
Figure 2. The distribution of the 203 skiers' standardized velocity by age and gender; boys to the left and girls to the right (box plots of velocity in m/s against age, 10–18).
The Four-Parameter Logistic Model

Figure 2 suggests the relationship between velocity and age is neither linear nor quadratic. Instead, we use the four-parameter logistic (FPL) model, which is suitable for modeling an S-shaped relationship. Details about this model can be found in Mixed-Effects Models in S and S-PLUS by José Pinheiro and Douglas Bates. The nonlinear logistic growth function is defined as

Velocity = β1 + (β2 − β1) / (1 + exp[(β3 − Age) / β4]),   (1)

where the parameters have a meaningful interpretation in this context. Parameters β1 and β2 are the lower and upper asymptotes for the response variable. The upper asymptote, β2, states the velocity the skier will eventually reach (provided she continues the training and racing). The lower asymptote, β1, refers to the lowest velocity of the skier (as a toddler, presumably) and can be taken as another example of the danger of extrapolating outside the range of the data. The parameter β3 is the age at which the skier has reached halfway to her highest velocity; we think of it as identifying the age at puberty. The parameter β4 expresses the rate of progression at puberty.

To illustrate the role of β3 and β4 in the shape of the curve, we provide three examples in Figure 3. In the examples, we set β1 and β2 equal to three and six, which correspond roughly to the estimates for the boys' progression.

Figure 3. Examples of the four-parameter logistic curve. β3 and β4 are set to 14 and 1 (solid line), 14 and 5 (dashed line), and 15 and 1 (dotted line); curves of velocity in m/s against age, 10–20.

The four-parameter logistic model can be estimated by use of the freeware R, even though we extend the model to allow for random effects. As it is reasonable to believe the progression curve might differ from individual to individual in the sample, random effects for the four parameters are required for estimating an individual curve for each skier.
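As a check on the parameter interpretations, equation (1) can be written down directly. A small pure-Python sketch follows, using the hypothetical parameter values from the Figure 3 examples rather than fitted estimates:

```python
import math

def fpl(age, b1, b2, b3, b4):
    """Four-parameter logistic curve of equation (1):
    b1, b2 = lower and upper asymptotes,
    b3     = age at the halfway point (read as the age at puberty),
    b4     = governs how fast the transition occurs
             (larger b4 gives a more gradual progression)."""
    return b1 + (b2 - b1) / (1.0 + math.exp((b3 - age) / b4))

# Figure 3's solid-line example: b1 = 3, b2 = 6, b3 = 14, b4 = 1.
# At Age = b3 the exponent is zero, so the curve sits exactly
# halfway between the asymptotes: (3 + 6) / 2 = 4.5 m/s.
halfway = fpl(14, 3, 6, 14, 1)
```

For very young ages the curve approaches β1 = 3 and for large ages it approaches β2 = 6, which is why β1 and β2 are read as the lower and upper asymptotes; changing β4 from 1 to 5, as in the dashed curve of Figure 3, stretches the same transition over many more years.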
Table 4—Fixed Effects and Random Effects Estimates (Standard Errors in Parentheses)

Fixed effects    Boys             Girls
β1               3.518 (0.032)    3.179 (0.044)
β2               5.327 (0.034)    4.871 (0.063)
β3               14.044 (0.041)   14.098 (0.087)
β4               0.219 (0.017)    0.723 (0.056)

Random effects (σj; j = 1, 2, 3, 4)
                 Boys             Girls
β1i              0.199            0.186
β2i              0.277            0.248
β3i              0.294            0.343
β4i

Figure 4. Typical progression curves for boys (solid curve) and girls (dashed curve); velocity in m/s against age, 10–18.