Statistics in Psychology An Historical Perspective Second Edition
Michael Cowles York University, Toronto
2001
LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS London Mahwah, New Jersey
Copyright © 2001 by Lawrence Erlbaum Associates, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without the prior written permission of the publisher. Lawrence Erlbaum Associates, Inc., Publishers 10 Industrial Avenue Mahwah, NJ 07430
Cover design by Kathryn Houghtaling Lacey
Library of Congress Cataloging-in-Publication Data Cowles, Michael, 1936- Statistics in psychology: an historical perspective / Michael Cowles.-2nd ed. p. cm. Includes bibliographical references (p.) and index. ISBN 0-8058-3509-1 (c: alk. paper)-ISBN 0-8058-3510-5 (p: alk. paper) 1. Psychology--Statistical methods--History. 2. Social sciences--Statistical methods--History. I. Title. BF39.C67 2000 150'.7'27-dc21
00-035369
The final camera copy for this work was prepared by the author, and therefore the publisher takes no responsibility for consistency or correctness of typographical style. However, this arrangement helps to make publication of this kind of scholarship possible. Books published by Lawrence Erlbaum Associates are printed on acid-free paper, and their bindings are chosen for strength and durability. Printed in the United States of America 10 9 8 7 6 5 4 3 2 1
Contents
Preface
Acknowledgments
1 The Development of Statistics
    Evolution, Biometrics, and Eugenics
    The Definition of Statistics
    Probability
    The Normal Distribution
    Biometrics
    Statistical Criticism
2 Science, Psychology, and Statistics
    Determinism
    Probabilistic and Deterministic Models
    Science and Induction
    Inference
    Statistics in Psychology
3 Measurement
    In Respect of Measurement
    Some Fundamentals
    Error in Measurement
4 The Organization of Data
    The Early Inventories
    Political Arithmetic
    Vital Statistics
    Graphical Methods
5 Probability
    The Early Beginnings
    The Beginnings
    The Meaning of Probability
    Formal Probability Theory
6 Distributions
    The Binomial Distribution
    The Poisson Distribution
    The Normal Distribution
7 Practical Inference
    Inverse Probability and the Foundations of Inference
    Fisherian Inference
    Bayes or p < .05?
8 Sampling and Estimation
    Randomness and Random Numbers
    Combining Observations
    Sampling in Theory and in Practice
    The Theory of Estimation
    The Battle for Randomization
9 Sampling Distributions
    The Chi-Square Distribution
    The t Distribution
    The F Distribution
    The Central Limit Theorem
10 Comparisons, Correlations, and Predictions
    Comparing Measurements
    Galton's Discovery of Regression
    Galton's Measure of Co-relation
    The Coefficient of Correlation
    Correlation - Controversies and Character
11 Factor Analysis
    Factors
    The Beginnings
    Rewriting the Beginnings
    The Practitioners
12 The Design of Experiments
    The Problem of Control
    Methods of Inquiry
    The Concept of Statistical Control
    The Linear Model
    The Design of Experiments
13 Assessing Differences and Having Confidence
    Fisherian Statistics
    The Analysis of Variance
    Multiple Comparison Procedures
    Confidence Intervals and Significance Tests
    A Note on 'One-Tail' and 'Two-Tail' Tests
14 Treatments and Effects: The Rise of ANOVA
    The Beginnings
    The Experimental Texts
    The Journals and the Papers
    The Statistical Texts
    Expected Mean Squares
15 The Statistical Hotpot
    Times of Change
    Neyman and Pearson
    Statistics and Invective
    Fisher versus Neyman and Pearson
    Practical Statistics
References
Author Index
Subject Index
Preface
In this second edition I have made some corrections to errors in algebraic expressions that I missed in the first edition and I have briefly expanded on some sections of the original where I thought such expansion would make the narrative clearer or more useful. The main change is the inclusion of two new chapters; one on factor analysis and one on the rise of the use of ANOVA in psychological research. I am still of the opinion that factor analysis deserves its own historical account, but I am persuaded that the audience for such a work would be limited were the early mathematical contortions to be fully explored. I have tried to provide a brief non-mathematical background to its arrival on the statistical scene. I realized that my account of ANOVA in the first edition did not do justice to the story of its adoption by psychology, and largely due to my re-reading of the work of Sandy Lovie (of the University of Liverpool, England) and Pat Lovie (of Keele University, England), who always write papers that I wish I had written, I decided to try again. I hope that the Lovies will not be too disappointed by my attempt to summarize their sterling contributions to the history of both factor analysis and ANOVA.
As before, any errors and misinterpretations are my responsibility alone. I would welcome correspondence that points to alternative views.
I would like to give special thanks to the reviewers of the first edition for their kind comments and all those who have helped to bring about the revival of this work. In particular Professor Niels Waller of Vanderbilt University must be acknowledged for his insistent and encouraging remarks. I hope that I have deserved them. My colleagues and many of my students at York University, Toronto, have been very supportive. Those students, both undergraduate and graduate, who have expressed their appreciation for my inclusion of some historical background in my classes on statistics and method have given me enormous satisfaction. This relatively short account is mainly for them, and I hope it will encourage some of them to explore some of these matters further.
There seems to be a slow realization among statistical consumers in psychology that there is more to the enterprise than null hypothesis significance testing, and other controversies to exercise us. It is still my firm belief that just a little more mathematical sophistication and just a little more historical knowledge would do a great deal for the way we carry on our research business in psychology.
The editors and production people at Lawrence Erlbaum, ever sharp and efficient, get on with the job and bring their expertise and sensible advice to the project and I very much appreciate their efforts.
My wife has sacrificed a great deal of her time and given me considerable help with the final stages of this revision and she and my family, even yet, put up with it all. Mere thanks are not sufficient.
Michael Cowles
Acknowledgments
I wish to express my appreciation to a number of individuals, institutions, and publishers for granting permission to reproduce material that appears in this book:
Excerpts from Fisher Box, J. (c) 1978, R. A. Fisher: The life of a scientist, from Scheffe, H. (1959) The analysis of variance, and from Lovie, A. D. & Lovie, P. Charles Spearman, Cyril Burt, and the origins of factor analysis. Journal of the History of the Behavioral Sciences, 29, 308-321. Reprinted by permission of John Wiley & Sons Inc., the copyright holders.
Excerpts from a number of papers by R. A. Fisher. Reprinted by permission of Professor J. H. Bennett on behalf of the copyright holders, the University of Adelaide.
Excerpts, figures and tables by permission of Hafner Press, a division of Macmillan Publishing Company from Statistical methods for research workers by R. A. Fisher. Copyright (c) 1970 by University of Adelaide.
Excerpts from volumes of Biometrika. Reprinted by permission of the Biometrika Trustees and from Oxford University Press.
Excerpts from MacKenzie, D. A. (1981). Statistics in Britain 1865-1930. Reprinted by permission of the Edinburgh University Press.
Two plates from Galton, F. (1885a). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263. Reprinted by permission of the Royal Anthropological Institute of Great Britain and Ireland.
Professor W. H. Kruskal, Professor F. Mosteller, and the International Statistical Institute for permission to reprint a quotation from Kruskal, W., & Mosteller, F. (1979). Representative sampling, IV: The history of the concept in statistics, 1895-1939. International Statistical Review, 47, 169-195.
Excerpts from Hogben, L. (1957). Statistical theory. London: Allen and Unwin; Russell, B. (1931). The scientific outlook. London: Allen and Unwin; Russell, B. (1946). History of western philosophy and its connection with political and social circumstances from the earliest times to the present day. London: Allen and Unwin; von Mises, R. (1957). Probability, statistics and truth. (Second revised English Edition prepared by Hilda Geiringer) London: Allen and Unwin. Reprinted by permission of Unwin Hyman, the copyright holders.
Excerpts from Clark, R. W. (1971). Einstein, the life and times. New York: World. Reprinted by permission of Peters, Fraser and Dunlop, Literary Agents.
Dr D. C. Yalden-Thomson for permission to reprint a passage from Hume, D. (1748). An enquiry concerning human understanding. (In D. C. Yalden-Thomson (Ed.). (1951). Hume, Theory of Knowledge. Edinburgh: Thomas Nelson).
Excerpts from various volumes of the Journal of the American Statistical Association. Reprinted by permission of the Board of Directors of the American Statistical Association.
An excerpt reprinted from Probability, statistics, and data analysis by O. Kempthorne and L. Folks, (c) 1971 Iowa State University Press, Ames, Iowa 50010.
Excerpts from De Moivre, A. (1756). The doctrine of chances: or, A method of calculating the probabilities of events in play. (3rd ed.), London: A. Millar. Reprinted from the edition published by the Chelsea Publishing Co., New York, (c) 1967 with permission and Kolmogorov, A. N. (1956). Foundations of the theory of probability. (N. Morrison, Trans.). Reprinted by permission of the Chelsea Publishing Co.
An excerpt reprinted with permission of Macmillan Publishing Company from An introduction to the study of experimental medicine by C. Bernard (H. C. Greene, Trans.), (c) 1927 (Original work published in 1865) and from Science and human behavior by B. F. Skinner (c) 1953 by Macmillan Publishing Company, renewed 1981 by B. F. Skinner.
Excerpts from Galton, F. (1908). Memories of my life. Reprinted by permission of Methuen and Co.
An excerpt from Chang, W-C. (1976). Sampling theories and sampling practice. In D. B. Owen (Ed.), On the history of statistics and probability (pp. 299-315). Reprinted by permission of Marcel Dekker, Inc. New York.
An excerpt from Jacobs, J. (1885). Review of Ebbinghaus's Ueber das Gedachtnis. Mind, 10, 454-459 and from Hacking, I. (1971). Jacques Bernoulli's Art of Conjecturing. British Journal for the Philosophy of Science, 22, 209-229. Reprinted by permission of Oxford University Press and Professor Ian Hacking.
Excerpts from Lovie, A. D. (1979). The analysis of variance in experimental psychology: 1934-1945. British Journal of Mathematical and Statistical Psychology, 32, 151-178 and Yule, G. U. (1921). Review of W. Brown and G. H. Thomson, The essentials of mental measurement. British Journal of Psychology, 2, 100-107 and Thomson, G. H. (1939) The factorial analysis of human ability. I. The present position and the problems confronting us. British Journal of Psychology, 30, 105-108. Reprinted by permission of the British Psychological Society.
Excerpts from Edgeworth, F. Y. (1887). Observations and statistics: An essay on the theory of errors of observation and the first principles of statistics. Transactions of the Cambridge Philosophical Society, 14, 138-169 and Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society, 29, 492-510. Reprinted by permission of the Cambridge University Press.
Excerpts from Cochrane, W. G. (1980). Fisher and the analysis of variance. In S. E. Fienberg, & D. V. Hinckley (Eds.), R. A. Fisher: An Appreciation (pp. 17-34) and from Reid, C. (1982). Neyman - from life. Reprinted by permission of Springer-Verlag, New York.
Excerpts and plates from Philosophical Transactions of the Royal Society and Proceedings of the Royal Society of London. Reprinted by permission of the Royal Society.
Excerpts from Laplace, P. S. de (1820). A philosophical essay on probabilities. (F. W. Truscott, & F. L. Emory, Trans.). Reprinted by permission of Dover Publications, New York.
Excerpts from Galton, F. (1889). Natural inheritance, Thomson, W. (Lord Kelvin). (1891). Popular lectures and addresses, and Todhunter, I. (1865). A history of the mathematical theory of probability from the time of Pascal to that of Laplace. Reprinted by permission of Macmillan and Co., London.
Excerpts from various volumes of the Journal of the Royal Statistical Society, reprinted by permission of the Royal Statistical Society.
An excerpt from Boring, E. G. (1957). When is human behavior predetermined? The Scientific Monthly, 84, 189-196. Reprinted by permission of the American Association for the Advancement of Science.
Data obtained from Rutherford, E., & Geiger, H. (1910). The probability variations in the distribution of α particles. Philosophical Magazine, 20, 698-707. Material used by permission of Taylor and Francis, Publishers, London.
Excerpts from various volumes of Nature reprinted by permission of Macmillan Magazines Ltd.
An excerpt from Forrest, D. W. (1974). Francis Galton: The life and work of a Victorian genius. London: Elek. Reprinted by permission of Grafton Books, a division of the Collins Publishing Group.
Excerpts from Fisher, R. A. (1935, 1966, 8th ed.). The design of experiments. Edinburgh: Oliver and Boyd. Reprinted by permission of the Longman Group, UK, Limited.
Excerpts from Pearson, E. S. (1966). The Neyman-Pearson story: 1926-34. Historical sidelights on an episode in Anglo-Polish collaboration. In F. N. David (Ed.). Festschrift for J. Neyman. London: Wiley. Reprinted by permission of John Wiley and Sons Ltd., Chichester.
An excerpt from Beloff, J. (1993) Parapsychology: a concise history. London: Athlone Press. Reprinted with permission.
Excerpts from Thomson, G. H. (1946) The factorial analysis of human ability. London: University of London Press. Reprinted by permission of The Athlone Press.
An excerpt from Spearman, C. (1927) The abilities of man. New York: Macmillan Company. Reprinted by permission of Simon & Schuster, the copyright holders.
An excerpt from Kelley, T. L. (1928) Crossroads in the mind of man: a study of differentiable mental abilities. Reprinted with the permission of Stanford University Press.
An excerpt from Gould, S. J. (1981) The mismeasure of man. Reprinted with the permission of W. W. Norton & Company.
An excerpt from Boring, E. G. (1957) When is human behavior predetermined? Scientific Monthly, 84, 189-196. And from Wilson, E. B. (1929) Review of The abilities of man. Science, 67, 244-248. Reprinted with permission from the American Association for the Advancement of Science.
An excerpt from Sidman, M. (1960/1988) Tactics of scientific research: Evaluating experimental data in psychology. New York: Basic Books. Boston: Authors Cooperative (reprinted). Reprinted with the permission of Dr Sidman.
An excerpt from Carroll, J. B. (1953) An analytical solution for approximating simple structure in factor analysis. Psychometrika, 18, 23-38. Reprinted with the permission of Psychometrika.
Excerpts from Garrett, H. E. & Zubin, J. (1943) The analysis of variance in psychological research. Psychological Bulletin, 40, 233-267 and from Grant, D. A. (1944) On "The analysis of variance in psychological research." Psychological Bulletin, 41, 158-166. Reprinted with the permission of the American Psychological Association.
An excerpt from Wilson, E. B. (1929) Review of Crossroads in the mind of man. Journal of General Psychology, 2, 153-169. Reprinted with the permission of the Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th St. N.W. Washington, D.C. 20036-1802.
1
The Development of Statistics
EVOLUTION, BIOMETRICS, AND EUGENICS
The central concern of the life sciences is the study of variation. To what extent does this individual or group of individuals differ from another? What are the reasons for the variability? Can the variability be controlled or manipulated? Do the similarities that exist spring from a common root? What are the effects of the variation on the life of the organisms? These are questions asked by biologists and psychologists alike. The life-science disciplines are defined by the different emphases placed on observed variation, by the nature of the particular variables of interest, and by the ways in which the different variables contribute to the life and behavior of the subject matter.
Change and diversity in nature rest on an organizing principle, the formulation of which has been said to be the single most influential scientific achievement of the 19th century: the theory of evolution by means of natural selection. The explication of the theory is attributed, rightly, to Charles Darwin (1809-1882). His book The Origin of Species was published in 1859, but a number of other scientists had written on the principle, in whole or in part, and these men were acknowledged by Darwin in later editions of his work.
Natural selection is possible because there is variation in living matter. The struggle for survival within and across species then ruthlessly favors the individuals that possess a combination of traits and characters, behavioral and physical, that allows them to cope with the total environment, exist, survive, and reproduce.
Not all sources of variability are biological. Many organisms to a greater or lesser extent reshape their environment, their experience, and therefore their behavior through learning. In human beings this reshaping of the environment has reached its most sophisticated form in what has come to be called cultural evolution. A fundamental feature of the human condition, of human nature, is our ability to process a very great deal of information. Human beings have originality and creative powers that continually expand the boundaries of knowledge. And, perhaps most important of all, our language skills, verbal and written, allow for the accumulation of knowledge and its transmission from generation to generation. The rich diversity of human civilization stems from cultural, as well as genetic, diversity.
Curiosity about diversity and variability leads to attempts to classify and to measure. The ordering of diversity and the assessment of variation have spurred the development of measurement in the biological and social sciences, and the application of statistics is one strategy for handling the numerical data obtained. As science has progressed, it has become increasingly concerned with quantification as a means of describing events. It is felt that precise and economical descriptions of events and the relationships among them are best achieved by measurement. Measurement is the link between mathematics and science, and the apparent (at any rate to mathematicians!) clarity and order of mathematics foster the scientist's urge to measure. The central importance of measurement was vigorously expounded by Francis Galton (1822-1911): "Until the phenomena of any branch of knowledge have been submitted to measurement and number it cannot assume the status and dignity of a Science." These words formed part of the letterhead of the Department of Applied Statistics of University College, London, an institution that received much intellectual and financial support from Galton. And it is with Galton, who first formulated the method of correlation, that the common statistical procedures of modern social science began.
The nature of variation and the nature of inheritance in organisms were much-discussed and much-confused topics in the second half of the 19th century. Galton was concerned to make the study of heredity mathematical and to bring order into the chaos.
Francis Galton was Charles Darwin's cousin. Galton's mother was the daughter of Erasmus Darwin (1731-1802) by his second wife, and Darwin's father was Erasmus's son by his first. Darwin, who was 13 years Galton's senior, had returned home from a 5-year voyage as the naturalist on board H.M.S. Beagle (an Admiralty expeditionary ship) in 1836 and by 1838 had conceived of the principle of natural selection to account for some of the observations he had made on the expedition. The careers and personalities of Galton and Darwin were quite different. Darwin painstakingly marshaled evidence and single-mindedly buttressed his theory, but remained diffident about it, apparently uncertain of its acceptance. In fact, it was only the inevitability of the announcement of the independent discovery of the principle by Alfred Russel Wallace (1823-1913) that forced Darwin to publish, some 20 years after he had formed the idea. Galton, on the other hand, though a staid and formal Victorian, was not without vanity, enjoying the fame and recognition brought to him by his many publications on a bewildering variety of topics. The steady stream of lectures, papers and books continued unabated from 1850 until shortly before his death.
The notion of correlated variation was discussed by the new biologists. Darwin observes in The Origin of Species:
Many laws regulate variation, some few of which can be dimly seen, ... I will here only allude to what may be called correlated variation. Important changes in the embryo or larva will probably entail changes in the mature animal ... Breeders believe that long limbs are almost always accompanied by an elongated head ... cats which are entirely white and have blue eyes are generally deaf ... it appears that white sheep and pigs are injured by certain plants whilst dark-coloured individuals escape ... (Darwin, 1859/1958, p. 34)
Of course, at this time, the hereditary mechanism was unknown, and, partly in an attempt to elucidate it, Galton began, in the mid-1870s, to breed sweet peas.1 The results of his study of the size of sweet pea seeds over two generations were published in 1877. When a fixed size of parent seed was compared with the mean size of the offspring seeds, Galton observed the tendency that he called then reversion and later regression to the mean. The mean offspring size is not as extreme as the parental size. Large parent seeds of a particular size produce seeds that have a mean size that is larger than average, but not as large as the parent size. The offspring of small parent seeds of a fixed size have a mean size that is smaller than average but now this mean size is not as small as that of the fixed parent size. This phenomenon is discussed later in more detail. For the moment, suffice it to say that it is an arithmetical artifact arising from the fact that offspring sizes do not match parental sizes absolutely uniformly. In other words, the correlation is imperfect. Galton misinterpreted this statistical phenomenon as a real trend toward a reduction in population variability. Paradoxically, however, it led to the formation of the Biometric School of heredity and thus encouraged the development of a great many statistical methods.
1 Mendel had already carried out his work with edible peas and thus begun the science of genetics. The results of his work were published in a rather obscure journal in 1866 and the wider scientific world remained oblivious of them until 1900.
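The arithmetical character of regression to the mean can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not a reconstruction of Galton's sweet pea data: parent and offspring sizes are each taken to be a shared component plus independent noise, so that the two are correlated imperfectly (about .5 here). The function name `simulate_pairs` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_pairs(n, shared_sd=1.0, noise_sd=1.0, seed=0):
    """Return n (parent, offspring) size pairs, as deviations from the mean.

    Assumed model (not Galton's data): each size is a shared component
    plus independent noise, so the parent-offspring correlation is
    imperfect.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        shared = rng.gauss(0.0, shared_sd)
        parent = shared + rng.gauss(0.0, noise_sd)
        offspring = shared + rng.gauss(0.0, noise_sd)
        pairs.append((parent, offspring))
    return pairs


pairs = simulate_pairs(50_000)
parents = [p for p, _ in pairs]
offspring = [o for _, o in pairs]

# Offspring of parents of a (nearly) fixed large size, and of a fixed small size.
large = [o for p, o in pairs if 1.9 <= p <= 2.1]
small = [o for p, o in pairs if -2.1 <= p <= -1.9]

mean_large = statistics.fmean(large)   # above average, but less extreme than 2.0
mean_small = statistics.fmean(small)   # below average, but less extreme than -2.0

# The spread of the whole offspring generation is not reduced.
sd_parents = statistics.pstdev(parents)
sd_offspring = statistics.pstdev(offspring)

print(f"mean offspring of large parents: {mean_large:.2f}")
print(f"mean offspring of small parents: {mean_small:.2f}")
print(f"SD of parents: {sd_parents:.2f}; SD of offspring: {sd_offspring:.2f}")
```

Under these assumptions the offspring means fall between the selected parental size and the population average, while the two generations have essentially the same standard deviation. This is why the phenomenon does not imply any real reduction in population variability.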
Over the next several years Galton collected data on inherited human characteristics by the simple expedient of offering cash prizes for family records. From these data he arrived at the regression lines for hereditary stature. Figures showing these lines appear in Chapter 10.
A common theme in Galton's work, and later that of Karl Pearson (1857-1936), was a particular social philosophy. Ronald Fisher (1890-1962) also subscribed to it, although, it must be admitted, it was not, as such, a direct influence on his work. These three men are the founders of what are now called classical statistics and all were eugenists. They believed that the most relevant and important variables in human affairs are inherited. One's ancestors rather than one's environmental experiences are the overriding determinants of intellectual capacity and personality as well as physical attributes. Human well-being, human personality, indeed human society, could therefore, they argued, be improved by encouraging the most able to have more children than the least able. MacKenzie (1981) and Cowan (1972, 1977) have argued that much of the early work in statistics and the controversies that arose among biologists and statisticians reflect the commitment of the founders of biometry, Pearson being the leader, to the eugenics movement.
In 1884, Galton financed and operated an anthropometric laboratory at the International Health Exhibition. For a charge of threepence, members of the public were measured. Visual and auditory acuity, weight, height, limb span, strength, and a number of other variables were recorded. Over 9,000 data sets were obtained, and, at the close of the exhibition, the equipment was transferred to the South Kensington Museum where data collection continued. Francis Galton was an avid measurer.
Karl Pearson (1930) relates that Galton's first forays into the problem of correlation involved ranking techniques, although he was aware that ranking methods could be cumbersome. How could one compare different measures of anthropometric variables? In a flash of illumination, Galton realized that characteristics measured on scales based on their own variability (we would now say standard score units) could be directly compared. This inspiration is certainly one of the most important in the early years of statistics. He recalls the occasion in Memories of my Life, published in 1908:
As these lines are being written, the circumstances under which I first clearly grasped the important generalisation that the laws of heredity were solely concerned with deviations expressed in statistical units are vividly recalled to my memory. It was in the grounds of Naworth Castle, where an invitation had been given to ramble freely. A temporary shower drove me to seek refuge in a reddish recess in the rock by the side of the pathway. There the idea flashed across me and I forgot everything else for a moment in my great delight. (Galton, 1908, p. 300)2
This incident apparently took place in 1888, and before the year was out, Co-relations and Their Measurement Chiefly From Anthropometric Data had been presented to the Royal Society. In this paper Galton defines co-relation: "Two variable organs are said to be co-related when the variation of one is accompanied on the average by more or less variation of the other, and in the same direction" (Galton, 1888, p. 135). The last five words of the quotation indicate that the notion of negative correlation had not then been conceived, but this brief but important paper shows that Galton fully understood the importance of his statistical approach. Shortly thereafter, mathematicians entered the picture with encouragement from some, but by no means all, biologists.
Much of the basic mathematics of correlation had, in fact, already been developed by the time of Galton's paper, but the utility of the procedure itself in this context had apparently eluded everyone. It was Karl Pearson, Galton's disciple and biographer, who, in 1896, set the concept on a sound mathematical foundation and presented statistics with the solution to the problem of representing covariation by means of a numerical index, the coefficient of correlation.
From these beginnings spring the whole corpus of present-day statistical techniques. George Udny Yule (1871-1951), an influential statistician who was not a eugenist, and Pearson himself elaborated the concepts of multiple and partial correlation. The general psychology of individual differences and research into the structure of human abilities and intelligence relied heavily on correlational techniques. The first third of the 20th century saw the introduction of factor analysis through the work of Charles Spearman (1863-1945), Sir Godfrey Thomson (1881-1955), Sir Cyril Burt (1883-1971), and Louis L. Thurstone (1887-1955).
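Galton's insight about scales based on their own variability, and the numerical index that Pearson later supplied, can be sketched in a few lines. The example below is illustrative only: the measurements are invented, the function names `standard_scores` and `correlation` are hypothetical, and the coefficient is computed in its familiar modern form as the mean product of paired standard scores rather than by any procedure Galton or Pearson actually used.

```python
import math


def standard_scores(values):
    """Express each value as a deviation from the mean in SD units."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def correlation(xs, ys):
    """Product-moment coefficient: mean product of paired standard scores."""
    zx = standard_scores(xs)
    zy = standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Invented anthropometric measurements on five people, in different units.
heights_cm = [160, 165, 170, 175, 180]
forearm_in = [16.5, 17.5, 17.0, 18.5, 18.0]

r = correlation(heights_cm, forearm_in)

# Changing the unit of measurement leaves the coefficient unchanged.
heights_in = [h / 2.54 for h in heights_cm]
r_inches = correlation(heights_in, forearm_in)

# Reversing the direction of one variable gives a negative coefficient.
r_negative = correlation(heights_cm, [-f for f in forearm_in])

print(round(r, 3), round(r_inches, 3), round(r_negative, 3))
```

For these made-up figures the coefficient is .8 whether height is recorded in centimetres or inches, which is the sense in which standard score units make different measures directly comparable. Reversing one variable yields -.8, the negative correlation that Galton's 1888 definition did not yet contemplate.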
A further prolific and fundamentally important stream of development arises from the work of Sir Ronald Fisher. The technique of analysis of variance was developed directly from the method of intra-class correlation - an index of the extent to which measurements in the same category or family are related, relative to other categories or families.
2 Karl Pearson (1914-1930) in the volume published in 1924, suggested that this spot deserves a commemorative plaque. Unfortunately, it looks as though the inspiration can never be so marked, for Kenna (1973), investigating the episode, reports that: "In the grounds of Naworth Castle there are not any rocks, reddish or otherwise, which could provide a recess, ..." (p. 229), and he suggests that the location of the incident might have been Corby Castle.
Fisher studied mathematics at Cambridge but also pursued interests in biology and genetics. In 1913 he spent the summer working on a farm in Canada. He worked for a while with a City investment company and then found himself declared unfit for military service because of his extremely poor eyesight. He turned to school-teaching for which he had no talent and which he hated. In 1919 he had the opportunity of a post at University College with Karl Pearson, then head of the Department of Applied Statistics, but chose instead to develop a statistical laboratory at the Rothamsted Experimental Station near Harpenden in England, where he developed experimental methods for agricultural research.
Over the next several years, relations between Pearson and Fisher became increasingly strained. They clashed on a variety of issues. Some of their disagreements helped, and some hindered, the development of statistics. Had they been collaborators and friends, rather than adversaries and enemies, statistics might have had a quite different history. In 1933 Fisher became Galton Professor of Eugenics at University College and in 1943 moved to Cambridge, where he was Professor of Genetics. Analysis of variance, which has had such far-reaching effects on experimentation in the behavioral sciences, was developed through attempts to tackle problems posed at Rothamsted.
It may be fairly said that the majority of texts on methodology and statistics in the social sciences are the offspring (diversity and selection notwithstanding!) of Fisher's books, Statistical Methods for Research Workers,3 first published in 1925(a), and The Design of Experiments, first published in 1935(a).
In succeeding chapters these statistical concepts are examined in more detail and their development elaborated, but first the use of the term statistics is explored a little further.
3 Maurice Kendall (1963) says of this work, "It is not an easy book. Somebody once said that no student should attempt to read it unless he had read it before" (p. 2).
THE DEFINITION OF STATISTICS
In an everyday sense when we think of statistics we think of facts and figures, of numerical descriptions of political and economic states (from which the word is derived), and of inventories of the various aspects of our social organization. The history of statistical procedures in this sense goes back to the beginnings of human civilization. When trade and commerce began, when governments imposed taxes, numerical records were kept. The counting of people, goods, and chattels was regularly carried out in the Roman Empire, the Domesday Book attempted to describe the state of England for the Norman conquerors, and government agencies the world over expend a great deal of money and energy in collecting and tabulating such information in the present day. Statistics are used to describe and summarize, in numerical terms, a wide variety of situations.
But there is another more recently developed activity subsumed under the term statistics: the practice of not only collecting and collating numerical facts, but also the process of reasoning about them. Going beyond the data, making inferences and drawing conclusions with greater or lesser degrees of certainty in an orderly and consistent fashion is the aim of modern applied statistics. In this sense statistical reasoning did not begin until fairly late in the 17th century and then only in a quite limited way. The sophisticated models now employed, backed by theoretical formulations that are often complex, are all less than 100 years old. Westergaard (1932) points to the confusions that sometimes arise because the word statistics is used to signify both collections of measurements and reasoning about them, and that in former times it referred merely to descriptions of states in both numerical and non-numerical terms.
In adopting the statistical inferential strategy the experimentalist in the life sciences is accepting the intrinsic variability of the subject matter. In recognizing a range of possibilities, the scientist comes four-square against the problem of deciding whether or not the particular set of observations he or she has collected can reasonably be expected to reflect the characteristics of the total range. This is the problem of parameter estimation, the task of estimating population values (parameters) from a consideration of the measurements made on a particular population subset - the sample statistics. A second task for inferential statistics is hypothesis testing, the process of judging whether or not a particular statistical outcome is likely or unlikely to be due to chance. The statistical inferential strategy depends on a knowledge of probabilities.
This aspect of statistics has grown out of three activities that, at first glance, appear to be quite different but in fact have some close links. They are actuarial prediction, gambling, and error assessment. Each addresses the problems of making decisions, evaluating outcomes, and testing predictions in the face of uncertainty, and each has contributed to the development of probability theory.
PROBABILITY
Statistical operations are often thought of as practical applications of previously developed probability theory. The fact is, however, that almost all our present-day statistical techniques have arisen from attempts to answer real-life problems of prediction and error assessment, and theoretical developments have not always paralleled technical accomplishments. Box (1984) has reviewed the scientific context of a range of statistical advances and shown that the fundamental methods evolved from the work of practising scientists.
John Graunt, a London haberdasher, born in 1620, is credited with the first attempt to predict and explain a number of social phenomena from a consideration of actuarial tables. He compiled his tables from Bills of Mortality, the parish accounts of deaths that were regularly, if somewhat crudely, recorded from the beginning of the 17th century. Graunt recognizes that the question might be asked: "To what purpose tends all this laborious buzzling, and groping? To know, 1. the number of the People? 2. How many Males, and Females? 3. How many Married, and single?" (Graunt, 1662/1975, p. 77), and says: "To this I might answer in general by saying, that those, who cannot apprehend the reason of these Enquiries, are unfit to trouble themselves to ask them." (p. 77). Graunt reassured readers of this quite remarkable work:
The Lunaticks are also but few, viz. 158 in 229250 though I fear many more than are set down in our Bills ... So that, this Casualty being so uncertain, I shall not force my self to make any inference from the numbers, and proportions we finde in our Bills concerning it: onely I dare ensure any man at this present, well in his Wits, for one in the thousand, that he shall not die a Lunatick in Bedlam, within these seven years, because I finde not above one in about one thousand five hundred have done so. (pp. 35-36)
Here is an inference based on numerical data and couched in terms not so very far removed from those in reports in the modern literature. Graunt's work was immediately recognized as being of great importance, and the King himself (Charles II) supported his election to the recently incorporated Royal Society.
A few years earlier the seeds of modern probability theory were being sown in France.4 At this time gambling was a popular habit in fashionable society and a range of games of chance was being played. For experienced players the odds applicable to various situations must have been appreciated, but no formal methods for calculating the chances of various outcomes were available. Antoine Gombauld, the Chevalier de Méré, a "man-about-town" and gambler with a scientific and mathematical turn of mind, consulted his friend, Blaise Pascal (1623-1662), a philosopher, scientist, and mathematician, hoping that he would be able to resolve questions on calculation of expected (probable) frequency of gains and losses, as well as on the fair division of the stakes in games that were interrupted. Consideration of these questions led to correspondence between Pascal and his fellow mathematician Pierre Fermat (1601-1665).5 No doubt their advice aided de Méré's game. More significantly, it was from this exchange that some of the foundations of probability theory and combinatorial algebra were laid.
Christian Huygens (1629-1695) published, in 1657, a tract On Reasoning With Games of Dice (1657/1970), which was partly based on the Pascal-Fermat correspondence, and in 1713, Jacques Bernoulli's (1654-1705) book The Art of Conjecture developed a theory of games of chance. Pascal had connected the study of probability with the arithmetic triangle (Fig. 1.1), for which he discovered new properties, although the triangle was known in China at least five hundred years earlier. Proofs of the triangle's properties were obtained by mathematical induction or reasoning by recurrence.
4 But note that there are hints of probability concepts in mathematics going back at least as far as the 12th century and that Girolamo Cardano wrote Liber de Ludo Aleae (The Book on Games of Chance) a century before it was published in 1663 (see Ore, 1953). There is also no doubt that quite early in human civilization, there was an appreciation of long-run relative frequencies, randomness, and degrees of likelihood in gaming, and some quite formal concepts are to be found in Greek and Roman writings.
FIG. 1.1 Pascal's Triangle
5 Poisson (1781-1840), writing of this episode in 1837 says, "A problem concerning games of chance, proposed by a man of the world to an austere Jansenist, was the origin of the calculus of probabilities" (quoted by Struik, 1954, p. 145). De Méré was certainly "a man of the world" and Pascal did become austere and religious, but at the time of de Méré's questions Pascal was in his so-called "worldly period" (1652-1654). I am indebted to my father-in-law, the late Professor F.T.H. Fletcher, for many insights into the life of Pascal.
Pascal's triangle, as it is known in the West, is a tabulation of the binomial coefficients that may be obtained from the expansion of (P + Q)^n where P = Q = 1/2. The expansion was developed by Sir Isaac Newton (1642-1727), and, independently, by the Scottish mathematician, James Gregory (1638-1675). Gregory discovered the rule about 1670. Newton communicated it to the Royal Society in 1676, although later that year he explained that he had first formulated it in 1664 while he was a Cambridge undergraduate. The example shown in Fig. 1.1 demonstrates that the expansion of (1/2 + 1/2)^4 generates, in the numerators of the expression, the numbers in the fifth row of Pascal's triangle. The terms of this expression also give us the five expected frequencies of outcome (0, 1, 2, 3, or 4 heads) or probabilities when a fair coin is tossed four times. Simple experiment will demonstrate that the actual outcomes in the "real world" of coin tossing closely approximate the distribution that has been calculated from a mathematical abstraction.
During the 18th century the theory of probability attracted the interest of many brilliant minds. Among them was a friend and admirer of Newton, Abraham De Moivre (1667-1754). De Moivre, a French Huguenot, was interned in 1685 after the revocation by Louis XIV of the Edict of Nantes, an edict which had guaranteed toleration to French Protestants. He was released in 1688, fled to England, and spent the remainder of his life in London. De Moivre published what might be described as a gambler's manual, entitled The Doctrine of Chances or a Method of Calculating the Probabilities of Events in Play. In the second edition of this work, published in 1738, and in a revised third edition published posthumously in 1756, De Moivre (1756/1967) demonstrated a method, which he had first devised in 1733, of approximating the sum of a very large number of binomial terms when n in (P + Q)^n is very large (an immensely laborious computation from the basic expansion).
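The arithmetic described above can be sketched in a few lines of Python, using only the standard library. The sketch lists the coefficients of (P + Q)^4, the corresponding probabilities for four tosses of a fair coin, and then compares the binomial terms for n = 12 with a normal curve having the same mean and standard deviation. The function names are hypothetical, and evaluating the normal density at each whole number of heads is a simplifying assumption made here for illustration; it is not a reconstruction of De Moivre's own derivation.

```python
from math import comb, exp, pi, sqrt


def binomial_probabilities(n, p=0.5):
    """Terms of the expansion of (p + q)**n: P(k successes in n trials)."""
    q = 1.0 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]


def normal_density(x, mean, sd):
    """Height of the normal curve with the given mean and standard deviation."""
    return exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))


# Fifth row of Pascal's triangle: the coefficients of (P + Q)**4.
row = [comb(4, k) for k in range(5)]  # [1, 4, 6, 4, 1]

# Probabilities of 0, 1, 2, 3, or 4 heads in four tosses of a fair coin,
# i.e. the coefficients divided by 2**4 = 16.
probs4 = binomial_probabilities(4)

# Binomial terms for n = 12 set beside a normal curve with the same
# mean (n * p) and standard deviation (sqrt(n * p * q)).
n = 12
probs12 = binomial_probabilities(n)
mean, sd = n * 0.5, sqrt(n * 0.5 * 0.5)
approx12 = [normal_density(k, mean, sd) for k in range(n + 1)]
largest_gap = max(abs(b - a) for b, a in zip(probs12, approx12))

for k in range(n + 1):
    print(f"{k:2d} heads  binomial {probs12[k]:.4f}  normal {approx12[k]:.4f}")
```

For n = 12 the largest gap between a binomial term and the height of the normal curve is under 0.01, and the gap shrinks further as n grows.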
It may be appreciated that as n grows larger, the number of terms in the expansion also grows larger. The graph of the distribution begins to resemble a smooth curve (Fig. 1.2), a bell-shaped symmetrical distribution that held great interest in mathematical terms but little practical utility outside of gaming. It is safe to say that no other theoretical mathematical abstraction has had such an important influence on psychology and the social sciences as that bell-shaped curve now commonly known by the name that Karl Pearson decided on - the normal distribution - although he was not the first to use the term. Pierre Laplace (1749-1827) independently derived the function and brought together much of the earlier work on probability in Théorie Analytique des Probabilités, published in 1812. It was his work, as well as contributions by many others, that interpreted the curve as the Law of Error and showed that it could be applied to variable results obtained in multiple observations. One of the first applications of the distribution outside of gaming was in the assessment of errors in astronomical observations. Later the utility of the "law" in error assessment was extended to land surveying and even to range estimation problems in artillery fire. Indeed, between 1800 and 1820 the foundations of the theory of error distribution were laid.
FIG. 1.2 The Binomial Distribution for N = 12 and the Normal Distribution
Carl Friedrich Gauss (1777-1855), perhaps the greatest mathematician of all time, also made important contributions to work in this area. He was a consultant to the governments of Hanover and of Denmark when they undertook geodetic surveys. The function that helped to rationalize the combination of observations is sometimes called the Laplace-Gaussian distribution.
Following the work of Laplace and Gauss, the development of mathematical probability theory slowed somewhat and not a great deal of progress was made until the present century. But it was during the 19th century, through the development of life insurance companies and through the growth of statistical approaches in the social and biological sciences, that the applications of probability theory burgeoned. Augustus De Morgan (1806-1871), for example, attempted to reduce the constructs of probability to straightforward rules of thumb. His work An Essay on Probabilities and on Their Application to Life Contingencies and Insurance Offices, published in 1838, is full of practical advice and is commented on by Walker (1929).
THE NORMAL DISTRIBUTION
The normal distribution was so named because many biological variables when measured in large groups of individuals, and plotted as frequency distributions, do show close approximations to the curve. It is partly for this reason that the mathematics of the distribution are used in data assessment in the social sciences and in biology. The responsibility, as well as the credit, for this extension of the use of calculations designed to estimate error or gambling expectancies into the examination of human characteristics rests with Lambert Adolphe Quetelet (1796-1874), a Belgian astronomer. In 1835 Quetelet described his concept of the average man - l'homme moyen.
L'homme moyen is Nature's ideal, an ideal that corresponds with a middle, measured value. But Nature makes errors, and in, as it were, missing the target, produces the variability observed in human traits and physical characters. More importantly, the extent and frequency of these errors often conform to the law of frequency of error - the normal distribution. John Venn (1834-1923), the English logician, objected to the use of the word error in this context: "When Nature presents us with a group of objects of every kind, it is using rather a bold metaphor to speak in this case also of a law of error" (Venn, 1888, p. 42), but the analogy was attractive to some.
Quetelet examined the distribution of the measurements of the chest girths of 5,738 Scottish soldiers, these data having been extracted from the 13th volume of the Edinburgh Medical Journal. There is no doubt that the measurements closely approximate to a normal curve. In another attempt to exemplify the law, Quetelet examined the heights of 100,000 French conscripts. Here he noticed a discrepancy between observed and predicted values:
The official documents would make it appear that, of the 100,000 men, 28,620 are of less height than 5 feet 2 inches: calculation gives only 26,345. Is it not a fair presumption, that the 2,275 men who constitute the difference of these numbers have been fraudulently rejected? We can readily understand that it is an easy matter to reduce one's height a half-inch, or an inch, when so great an interest is at stake as that of being rejected. (Quetelet, 1835/1849, p. 98)
Whether or not the allegation stated here - that short (but not too short) Frenchmen have stooped so low as to avoid military service - is true is no longer an issue. A more important point is noted by Boring (1920):
While admitting the dependence of the law on experience, Quetelet proceeds in numerous cases to analyze experience by means of it. Such a double-edged sword is a peculiarly effective weapon, and it is no wonder that subsequent investigators were tempted to use it in spite of the necessary rules of scientific warfare. (Boring, 1920, p. 11)
The use of the normal curve in statistics is not, however, based solely on the fact that it can be used to describe the frequency distribution of many observed characteristics. It has a much more fundamental significance in inferential statistics, as will be seen, and the distribution and its properties appear in many parts of this book.
Galton first became aware of the distribution from his friend William Spottiswoode, who in 1862 became Secretary of the Royal Geographical Society, but it was the work of Quetelet that greatly impressed him. Many of the data sets he collected approximated to the law and he seemed, on occasion, to be almost mystically impressed with it.
I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error." The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along. (Galton, 1889, p. 66)
This rather theological attitude toward the distribution echoes De Moivre, who, over a century before, proclaimed in The Doctrine of Chances:
Altho' chance produces irregularities, still the Odds will be infinitely great, that in the process of Time, those irregularities will bear no proportion to the recurrency of that Order which naturally results from ORIGINAL DESIGN... Such Laws, as well as the original Design and Purpose of their Establishment, must all be from without... if we blind not ourselves with metaphysical dust, we shall be led, by a short and obvious way, to the acknowledgement of the great MAKER and GOVERNOUR of all; Himself all-wise, all-powerful and good. (De Moivre, 1756/1967, pp. 251-252)
The ready acceptance of the normal distribution as a law of nature encouraged its wide application and also produced consternation when exceptions were observed. Quetelet himself admitted the possibility of the existence of asymmetric distributions, and Galton was at times less lyrical, for critics had objected to the use of the distribution, not as a practical tool to be used with caution where it seemed appropriate, but as a sort of divine rule:
It has been objected to some of my former work, especially in Hereditary Genius, that I pushed the application of the Law of Frequency of Error somewhat too far. I may have done so, rather by incautious phrases than in reality; ... I am satisfied to claim the Normal Law is a fair average representation of the observed Curves during nine-tenths of their course; ... (Galton, 1889, p. 56)6
6 Note that this quotation and the previous one from Galton are 10 pages apart in the same work!
BIOMETRICS
In 1890, Walter F. R. Weldon (1860-1906) was appointed to the Chair of Zoology at University College, London. He was greatly impressed and much influenced by Galton's Natural Inheritance. Not only did the book show him how the frequency of the deviations from a "type" might be measured, it opened up for him, and for other zoologists, a host of biometric problems. In two papers published in 1890 and 1892, Weldon showed that various measurements on shrimps might be assessed using the normal distribution. He also demonstrated interrelationships (correlations) between two variables within the individuals. But the critical factor in Weldon's contribution to the development of statistics was his professorial appointment, for this brought him into contact with Karl Pearson, then Professor of Applied Mathematics and Mechanics, a post Pearson had held since 1884. Weldon was attempting to remedy his weakness in mathematics so that he could extend his research, and he approached Pearson for help. His enthusiasm for the biometric approach drew Pearson away from more orthodox work.
A second important link was with Galton, who had reviewed Weldon's first paper on variation in shrimps. Galton supported and encouraged the work of these two younger men until his death, and, under the terms of his will, left £45,000 to endow a Chair of Eugenics at the University of London, together with the wish that the post might be offered first to Karl Pearson. The offer was made and accepted.
In 1904, Galton had offered the University of London £500 to establish the study of national eugenics. Pearson was a member of the Committee that the University set up, and the outcome was a decision to appoint the Galton Research Fellow at what was to be named the Eugenics Record Office. This became the Galton Laboratory for National Eugenics in 1906, and yet more financial assistance was provided by Galton. Pearson, still Professor of Applied Mathematics, was its Director as well as Head of the Biometrics Laboratory. This latter received much of its funding over many years from grants from the Worshipful Company of Drapers, which first gave money to the University in 1903.
Pearson's appointment to the Galton Chair brought applied statistics, biometrics, and eugenics together under his direction at University College. It cannot however be claimed absolutely that the day-to-day work of these units was driven by a common theme. Applied statistics and biometrics were primarily concerned with the development and application of statistical techniques to a variety of problems, including anthropometric investigations; the Eugenics Laboratory collected extensive family pedigrees and examined actuarial death rates. Of course Pearson coordinated all the work, and there was interchange and exchange among the staff that worked with him, but Magnello (1998, 1999) has argued that there was not a single unifying purpose in Pearson's research.
Others, notably MacKenzie (1981), Kevles (1985), and Porter (1986), have promoted the view that eugenics was the driving force behind Pearson's statistical endeavors.
Pearson was not a formal member of the eugenics movement. He did not join the Eugenics Education Society, and apparently he tried to keep the two laboratories administratively separate, maintaining separate financial accounts, for example, but it has to be recognized that his personal views of the human condition and its future included the conviction that eugenics was of critical importance. There was an obvious and persistent intermingling of statistical results and eugenics in his pronouncements. For example, in his Huxley Lecture in 1903 (published in Biometrika in 1903 and 1904), on topics that were clearly biometric, having to do with his researches on relationships between moral and intellectual variables, he ended with a plea, if not a rallying cry, for eugenics:
The mentally better stock in the nation is not reproducing itself at the same rate as it did of old; the less able, and the less energetic, are more fertile than the better stocks. ... The only remedy, if one be possible at all, is to alter the relative fertility of the good and the bad stocks in the community. ... intelligence can be aided and be trained, but no training or education can create it. You must breed it, that is the broad result for statecraft which flows from the equality in inheritance of the psychical and the physical characters in man. (Pearson, 1904a, pp. 179-180).
Pearson's contribution was monumental, for in less than 8 years, between 1893 and 1901, he published over 30 papers on statistical methods. The first was written as a result of Weldon's discovery that the distribution of one set of measurements of the characteristics of crabs, collected at the zoological station at Naples in 1892, was "double-humped." The distribution was reduced to the sum of two normal curves. Pearson (1894) proceeded to investigate the general problem of fitting observed distributions to theoretical curves. This work was to lead directly to the formulation of the χ² test of "goodness of fit" in 1900, one of the most important developments in the history of statistics.
Weldon approached the problem of discrepancies between theory and observation in a much more empirical way, tossing coins and dice and comparing the outcomes with the binomial model. These data helped to produce another line of development. In a letter to Galton, written in 1894, Weldon asks for a comment on the results of 7,000 tossings of 12 dice collected for him by a clerk at University College:
A day or two ago Pearson wanted some records of the kind in a hurry, in order to illustrate a lecture, and I gave him the record of the clerk's 7000 tosses ... on examination he rejects them, because he thinks the deviation from the theoretically most probable result is so great as to make the record intrinsically incredible. (quoted by E. S. Pearson, 1965, p. 11)
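The kind of comparison at issue in Weldon's letter, dice throws set against the binomial model, can be sketched with a goodness-of-fit calculation in Python. Everything specific in the sketch is an assumption made for illustration: the throws are simulated with a fixed seed and are not the clerk's record, a "success" is taken to be a die showing 5 or 6, and the sparse upper classes are pooled into "10 or more" so that no expected frequency is close to zero. The function names are hypothetical.

```python
import random
from math import comb


def binomial_pmf(n, p):
    """Probability of k successes in n trials, for k = 0..n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def pooled(values, cutoff):
    """Keep classes 0..cutoff-1 and pool everything from cutoff upward."""
    return list(values[:cutoff]) + [sum(values[cutoff:])]


def chi_square(observed, expected):
    """Sum of (observed - expected)**2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Assumed set-up: 7,000 throws of 12 dice, success = a 5 or a 6 (p = 1/3).
N_DICE, N_THROWS, P_SUCCESS = 12, 7000, 1 / 3

rng = random.Random(1894)  # fixed seed so the simulated record is repeatable
counts = [0] * (N_DICE + 1)
for _ in range(N_THROWS):
    successes = sum(1 for die in range(N_DICE) if rng.randint(1, 6) >= 5)
    counts[successes] += 1

# Theoretical frequencies under the binomial model.
expected = [N_THROWS * prob for prob in binomial_pmf(N_DICE, P_SUCCESS)]

# Pool the rare upper classes (10, 11, and 12 successes) into one class.
observed_pooled = pooled(counts, 10)
expected_pooled = pooled(expected, 10)

statistic = chi_square(observed_pooled, expected_pooled)
degrees_of_freedom = len(observed_pooled) - 1

print(f"chi-square = {statistic:.2f} on {degrees_of_freedom} degrees of freedom")
```

A statistic that is large relative to its degrees of freedom would suggest, as Pearson suspected of the clerk's record, that the discrepancy between observed and theoretical frequencies is greater than chance sampling fluctuation would readily produce.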
This incident set off a good deal of correspondence between Karl Pearson, F.Y. Edgeworth (1845-1926), an economist and statistician, and Weldon, the details of which are now only of minor importance. But, as Karl Pearson remarked, "Probabilities are very slippery things" (quoted by E. S. Pearson, 1965, p. 14), and the search for criteria by which to assess the differences between observed and theoretical frequencies, and whether or not they could be reasonably attributed to chance sampling fluctuations, began. Statistical research rapidly expanded into careful examination of distributions other than the normal curve and eventually into the properties of sampling distributions, particularly through the seminal work of Ronald Fisher.
In developing his research into the properties of the probability distributions of statistics, Fisher investigated the basis of hypothesis testing and the foundations of all the well-known tests of statistical significance. Fisher's assertion that p = .05 (1 in 20) is the probability that is convenient for judging whether or not a deviation is to be considered significant (i.e. unlikely to be due to chance), has profoundly affected research in the social sciences, although it should be noted that he was not the originator of the convention (Cowles & Davis, 1982a).
Of course, the development of statistical methods does not end here, nor have all the threads been drawn together. Discussion of the important contribution of W. S. Gosset ("Student," 1876-1937) to small sample work and the refinements introduced into hypothesis testing by Karl Pearson's son, Egon S. Pearson (1895-1980) and Jerzy Neyman (1894-1981) will be found in later chapters, when the earlier details have been elaborated.
Biometrics and Genetics
The early years of the biometric school were surrounded by controversy. Pearson and Weldon held fast to the view that evolution took place by the continuous selection of variations that were favorable to organisms in their environment. The rediscovery of Mendel's work in 1900 supported the concept that heredity depends on self-reproducing particles (what we now call genes), and that inherited variation is discontinuous and saltatory. The source of the development of higher types was occasional genetic jumps or mutations. Curiously enough, this was the view of evolution that Galton had supported. His misinterpretation of the purely statistical phenomenon of regression led him to the notion that a distinction had to be made between variations from the mean that regress and what he called "sports" (a breeder's term for an animal or plant variety that appears apparently spontaneously) that will not.
A champion of the position that mutations were of critical importance in the evolutionary process was William Bateson (1861-1926) and a prolonged and bitter argument with the biometricians ensued. The Evolution Committee of the Royal Society broke down over the dispute. Biometrika was founded by Pearson and Weldon, with Galton's financial support, in 1900, after the Royal Society had allowed Bateson to publish a detailed criticism of a paper submitted by Pearson before the paper itself had been issued. Britain's important scientific journal, Nature, took the biometricians' side and would not print letters from Bateson. Pearson replied to Bateson's criticisms in Biometrika but refused to accept Bateson's rejoinders, whereupon Bateson had them privately printed by the Cambridge University Press in the format of Biometrika!
At the British Association meeting in Cambridge in 1904, Bateson, then President of the Zoological Section, took the opportunity to deliver a bitter attack on the biometric school. Dramatically waving aloft the published volumes of Biometrika, he pronounced them worthless and he described Pearson's correlation tables as: "a Procrustean bed into which the biometrician fits his unanalysed data." (quoted by Julian Huxley, 1949). It is even said that Pearson and Bateson refused to shake hands at Weldon's funeral. Nevertheless, after Weldon's death the controversy cooled. Pearson's work became more concerned with the theory of statistics, although the influence of his eugenic philosophy was still in evidence, and by 1910, when Bateson became Director of the John Innes Horticultural Institute, the argument had died. However, some statistical aspects of this contentious debate predated the evolution dispute, and echoes of them - indeed, marked reverberations from them - are still around today, although of course Mendelian and Darwinian thinking are completely reconciled.
STATISTICAL CRITICISM
Statistics has been called the "science of averages," and this definition is not meant in a kindly way. The great physiologist Claude Bernard (1813-1878) maintained that the use of averages in physiology could not be countenanced:
because the true relations of phenomena disappear in the average; when dealing with complex and variable experiments, we must study their various circumstances, and then present our most perfect experiment as a type, which, however, still stands for true facts. ... averages must therefore be rejected, because they confuse while aiming to unify, and distort while aiming to simplify. (Bernard, 1865/1927, p. 135)
Now it is, of course, true that lumping measurements together may not give us anything more than a picture of the lumping together, and the average value may not be anything like any one individual measurement at all, but Bernard's ideal type fails to acknowledge the reality of individual differences. A rather memorable example of a very real confusion is given by Bernard (1865/1927):
A startling instance of this kind was invented by a physiologist who took urine from a railway station urinal where people of all nations passed, and who believed that he could thus present an analysis of average European urine! (pp. 134-135).
A less memorable, but just as telling, example is that of the social psychologist who solemnly reports "mean social class." Pearson (1906) notes that:
One of the blows to Weldon, which resulted from his biometric view of life was that his biological friends could not appreciate his new enthusiasms. They could not understand how the Museum "specimen" was in the future to be replaced by the "sample" of 500 to 1000 individuals. (p. 37)
The view is still not wholly appreciated. Many psychologists subscribe to the position that the most pressing problems of the discipline, and certainly the ones of most practical interest, are problems of individual behavior. A major criticism of the effect of the use of the statistical approach in psychological research is the failure to differentiate adequately between general propositions that apply to most, if not all, members of a particular group and statistical propositions that apply to some aggregated measure of the members of the group. The latter approach discounts the exceptions to the statistical aggregate, which not only may be the most interesting but may, on occasion, constitute a large proportion of the group.
Controversy abounds in the field of measurement, probability, and statistics, and the methods employed are open to criticism, revision, and downright rejection. On the other hand, measurement and statistics play a leading role in psychological research, and the greatest danger seems to lie in a nonawareness of the limitations of the statistical approach and the bases of their development, as well as the use of techniques, assisted by the high-speed computer, as recipes for data manipulation.
Miller (1963) observed of Fisher, "Few psychologists have educated us as rapidly, or have influenced our work as pervasively, as did this fervent, clearheaded statistician." (p. 157). Hogben (1957) certainly agrees that Fisher has been enormously influential but he objects to Fisher's confidence in his own intuitions:
This intrepid belief in what he disarmingly calls common sense ... has led Fisher ... to advance a battery of concepts for the semantic credentials of which neither he nor his disciples offer any justification en rapport with the generally accepted tenets of the classical theory of probability. (Hogben, 1957, p. 504)
Hogben also expresses a thought often shared by natural scientists when they review psychological research, that:
Acceptability of a statistically significant result of an experiment on animal behaviour in contradistinction to a result which the investigator can repeat before a critical audience naturally promotes a high output of publication. Hence the argument that the techniques work has a tempting appeal to young biologists. (Hogben, 1957, p. 27)
Experimental psychologists may well agree that the tightly controlled experiment is the apotheosis of classical scientific method, but they are not so arrogant as to suppose that their subject matter will necessarily submit to this form of analysis, and they turn, almost inevitably, to statistical, as opposed to experimental, control. This is not a muddle-headed notion, but it does present dangers if it is accepted without caution.
A balanced, but not uncritical, view of the utility of statistics can be arrived at from a consideration of the forces that shaped the discipline and an examination of its development. Whether or not this is an assertion that anyone, let alone the author of this book, can justify remains to be seen.
Yet there are Writers, of a Class indeed very different from that of James Bernoulli, who insinuate as if the Doctrine of Probabilities could have no place in any serious Enquiry; and that Studies of this kind, trivial and easy as they be, rather disqualify a man for reasoning on any other subject. Let the Reader chuse. (De Moivre, 1756/1967, p. 254)
2
Science, Psychology, and Statistics
DETERMINISM
It is a popular notion that if psychology is to be considered a science, then it most certainly is not an exact science. The propositions of psychology are considered to be inexact because no psychologist on earth would venture a statement such as this: "All stable extraverts will, when asked, volunteer to participate in psychological experiments."¹ The propositions of the natural sciences are considered to be exact because all physicists would be prepared to attest (with some few cautionary qualifications) that, "fire burns" or, more pretentiously, that "$e = mc^2$." In short, it is felt that the order in the universe, which nearly everyone (though for different reasons) is sure must be there, has been more obviously demonstrated by the natural rather than the social scientists.
¹ Not wishing to make pronouncements on the probabilistic nature of the work of others, the writer is making an oblique reference to work in which he and a colleague (Cowles & Davis, 1987) found that there is an 80% chance that stable extraverts will volunteer to participate in psychological research.
Order in the universe implies determinism, a most useful and a most vexing term, for it brings those who wonder about such things into contact with the philosophical underpinnings of the rather everyday concept of causality. No one has stated the situation more clearly than Laplace in his Essai:
Present events are connected with preceding ones by a tie based upon the evident principle that a thing cannot occur without a cause which produces it. This axiom known by the name of the principle of sufficient reason, extends even to actions which are considered indifferent; the freest will is unable without a determinative motive to give them birth; ... We ought then to regard the present state of the universe as the effect of its anterior state and as the cause of the one that is to follow. Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situations of the beings who compose it - an intelligence sufficiently vast to submit these data to analysis - it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it nothing would be uncertain and the future, as the past, would be present to its eyes. (Laplace, 1820/1951, pp. 3-4)
The assumption of determinism is simply the apparently reasonable notion that events are caused. Science discovers regularities in nature, formulates descriptions of these regularities, and provides explanations, that is to say, discovers causes. Knowledge of the past enables the future to be predicted. This is the popular view of science, and it is also a sort of working rule of thumb for those engaged in the scientific enterprise. Determinism is the central feature of the development of modern science up to the first quarter of the 20th century. The successes of the natural sciences, particularly the success of Newtonian mechanics, urged and influenced some of the giants of psychology, particularly in North America, to adopt a similar mechanistic approach to the study of behavior. The rise of behaviorism promoted the view that psychology could be a science, "like other sciences." Formulae could be devised that would allow behavior to be predicted, and a technology could be achieved that would enable environmental conditions to be so manipulated that behavior could be controlled. The variability in living things was to be brought under experimental control, a program that leads quite naturally to the notion of the stimulus control of behavior. It follows that concepts such as will or choice or freedom of action could be rejected by behavioral science. In 1913, John B. Watson (1878-1958) published a paper that became the behaviorists' manifesto. It begins, "Psychology as the behaviorist views it is a purely objective experimental branch of natural science. Its theoretical goal is the prediction and control of behavior" (Watson, 1913, p. 158). Oddly enough, this pronouncement coincided with work that began to question the assumption of determinism in physics, the undoubted leader of the natural sciences. In 1913, a laboratory experiment in Cambridge, England, provided spectroscopic proof of what is known as the Rutherford-Bohr model of the atom.
Ernest Rutherford (later Lord Rutherford, 1871-1937) had proposed that the atom was like a miniature solar system with electrons orbiting a central nucleus. Niels Bohr (1885-1962), a Danish physicist, explained that the electrons moved from one orbit to another, emitting or absorbing energy as they moved toward or away from the nucleus. The jumping of an electron from orbit to orbit appeared to be unpredictable. The totality of exchanges could only be predicted in a statistical, probabilistic fashion. That giant of modern physicists, Albert Einstein (1879-1955), whose work had helped to start the revolution in physics, was loath to abandon the concept of a completely causal universe and indeed never did entirely abandon it. In the 1920s, Einstein made the statement that has often been paraphrased as, "God does not play dice with the world." Nevertheless, he recognized the problem. In a lecture given in 1928, Einstein said:
Today faith in unbroken causality is threatened precisely by those whose path it had illumined as their chief and unrestricted leader at the front, namely by the representatives of physics ... All natural laws are therefore claimed to be, "in principle," of a statistical variety and our imperfect observation practices alone have cheated us into a belief in strict causality. (quoted by Clark, 1971, pp. 347-348)
But Einstein never really accepted this proposition, believing to the end that indeterminacy was to be equated with ignorance. Einstein may be right in subscribing ultimately to the inflexibility of Laplace's all-seeing demon, but another approach to indeterminacy was advanced by Werner Heisenberg (1901-1976), a German physicist who, in 1927, formulated his famous uncertainty principle. He examined not merely the practical limits of measurement but the theoretical limits, and showed that the act of observation of the position and velocity of a subatomic particle interfered with it so as to inevitably produce errors in the measurement of one or the other. This assertion has been taken to mean that, ultimately, the forces in our universe are random and therefore indeterminate. Bertrand Russell (1931) disagrees:
Space and time were invented by the Greeks, and served their purpose admirably until the present century. Einstein replaced them by a kind of centaur which he called "space-time," and this did well enough for a couple of decades, but modern quantum mechanics has made it evident that a more fundamental reconstruction is necessary. The Principle of Indeterminacy is merely an illustration of this necessity, not of the failure of physical laws to determine the course of nature. (pp. 108-109)
The important point to be aware of is that Heisenberg's principle refers to the observer and the act of observation and not directly to the phenomena that are being observed. This implies that the phenomena have an existence outside their observation and description, a contention that, by itself, occupies philosophers. Nowhere is the demonstration that technique and method shape the way in which we conceptualize phenomena more apparent than in the physics of light. The progress of events in physics that led to the view that light was both wave and particle, a view that Einstein's work had promoted, began to dismay him when it was used to suggest that physics would have to abandon strict continuity and causality. Bohr responded to Einstein's dismay:
You, the man who introduced the idea of light as particles! If you are so concerned with the situation in physics in which the nature of light allows for a dual interpretation, then ask the German government to ban the use of photoelectric cells if you think that light is waves, or the use of diffraction gratings if light is corpuscular. (quoted by Clark, 1971, p. 253)
The parallels in experimental psychology are obvious. That eminent historian of the discipline, Edwin Boring, describes a colloquium at Harvard when his colleague, William McDougall, who: "believed in freedom for the human mind - in at least a little residue of freedom - believed in it and hoped for as much as he could save from the inroads of scientific determinism," and he, a determinist, achieved, for Boring, an understanding:
McDougall's freedom was my variance. McDougall hoped that variance would always be found in specifying the laws of behavior, for there freedom might still persist. I hoped then - less wise than I think I am now (it was 31 years ago) - that science would keep pressing variance towards zero as a limit. At any rate this general fact emerges from this example: freedom, when you believe it is operating, always resides in an area of ignorance. If there is a known law, you do not have freedom. (Boring, 1957, p. 190)
Boring was really unshakable in his belief in determinism, and that most influential of psychologists, B. F. Skinner, agrees with the necessity of assuming order in nature:
It is a working assumption which must be adopted at the very start. We cannot apply the methods of science to a subject matter which is assumed to move about capriciously. Science not only describes, it predicts. It deals not only with the past but with the future ... If we are to use the methods of science in the field of human affairs, we must assume that behavior is lawful and determined. (Skinner, 1953, p. 6)
Carl Rogers is among those who have adopted as a fundamental position the view that individuals are responsible, free, and spontaneous. Rogers believes that, "the individual chooses to fulfill himself by playing a responsible and voluntary part in bringing about the destined events of the world" (Rogers, 1962, quoted by Walker, 1970, p. 13). The use of the word destined in this assertion somewhat spoils the impact of what most take to be the individualistic and humanistic approach that is espoused by Rogers. Indeed, he has maintained that the concepts of scientific determinism and personal choice can peacefully coexist in the way in which the particle and wave theories of light coexist. The theories are true but incompatible. These few words do little more than suggest the problems that are faced by the philosopher of science when he or she tackles the concept of method in both the natural and the social sciences. What basic assumptions can we make? So often the arguments presented by humanistic psychologists have strong moral or even theological undertones, whereas those offering the determinist's view point to the regularities that exist in nature - even human nature - and aver that without such regularities, behavior would be unpredictable. Clearly, beings whose behavior was completely unpredictable would have been unable to achieve the degree of technological and social cooperation that marks the human species. Indeed, individuals of all philosophical persuasions tend to agree that someone whose behavior is generally not predictable needs some sort of treatment. On the other hand, the notion of moral responsibility implies freedom. If I am to be praised for my good works and blamed for my sins, a statement that "nature is merely unfolding as it should" is unlikely to be accepted as a defense for the latter, and unlikely to be advanced by me as a reason for the former. One way out of the impasse, it is suggested, is to reject strict "100 percent" determinism and to accept statistical determinism. "Freedom" then becomes part of the error term in statistical manipulations. Grünbaum (1952) considers the arguments against both strict determinism and statistical determinism, arguments based on the complexity of human behavior, the concept of moral choice, and the assignment of responsibility, assertions that individuals are unique and that therefore their actions are not generalizable in the scientific sense, and that human beings via their goal-seeking behavior themselves determine the future.
He concludes, "Since the important arguments against determinism which we have considered are without foundation, the psychologist need not be deterred in his quest and can confidently use the causal hypothesis as a principle, undaunted by the caveat of the philosophical indeterminist" (Grünbaum, 1952, p. 676). Feigl (1959) insists that freedom must not be confused with the absence of causality, and causal determination must not be confused with coercion or compulsion or constraint. "To be free means that the chooser or agent is an essential link in the chain of causal events and that no extraneous compulsion - be it physical, biological, or psychological - forces him to act in a direction incompatible with his basic desires or intentions" (Feigl, 1959, p. 116). To some extent, and many modern thinkers would say to a large extent (and Feigl agrees with this), philosophical perplexities can be clarified, if not entirely resolved, by examining the meaning of the terms employed in the debate rather than arguing about reality. Two further points might be made. The first is that variability and uncertainty in observations in the natural as well as the social sciences require a statistical approach in order to reveal broad regularities, and this applies to experimentation and observation now, whatever philosophical stance is adopted. If the very idea of regularity is rejected, then systematic approaches to the study of the human condition are irrelevant. The second is that, at the end of the discussion, most will smile with Dr Samuel Johnson when he said, "I know that I have free will and there's an end on it."
PROBABILISTIC AND DETERMINISTIC MODELS
The natural sciences set great store by mathematical models. For example, in the physics of mass and motion, potential energy is given by PE = mgh, where m is mass, g is acceleration due to gravity, and h is height. Newton's second law of motion states that F = ma, where F is force, m is mass, and a is acceleration. These mathematical functions or models may be termed deterministic models because, given the values on the right-hand side of the equation, the construct on the left-hand side is completely determined. Any variability that might be observed in, for example, Force for a measured mass and a given acceleration is due only to measurement error. Increasing the precision of measurement using accurate instrumentation and/or superior technique will reduce the error in F to very small margins indeed.²
² For those of you who have been exposed to physics, it must be admitted that although Newton's laws were more or less unchallenged for 200 years the models he gave us are not definitions but assumptions within the Newtonian system, as the great Austrian physicist, Ernst Mach, pointed out. Einstein's theory of relativity showed that they are not universally true. Nevertheless, the distinction between the models of the natural and the social and biological sciences is, for the moment, a useful one.
Some psychologists, notably Clark Hull in learning theory and Raymond B. Cattell in personality theory and measurement, approached their work with the aim (one might say the dream!) of producing parallel models for psychological constructs. Econometricians similarly search for models that will describe the market with precision and reliability. For social and biological scientists there is a persistent and enduring challenge that makes their disciplines both fascinating and frustrating. That challenge is the search for means to assess and understand the variability that is inherent in living systems and societies. In univariate statistical analysis we are encompassing those procedures where there is just one measured or dependent variable (the variable that appears on the left-hand side of the equals sign in the function shown next) and one or more independent or predictor variables: those that are seen on the right-hand side of the equation. The function is known as the general linear model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n + e$
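The contrast between a deterministic model and the general linear model can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the text: it assumes a single predictor, so that the model reduces to y = b0 + b1*x + e, it invents the parameter values (2.0 and 0.5) and the spread of the error term, and the helper names `simulate` and `fit_line` are hypothetical. With the error term switched off, least squares recovers the parameters essentially exactly, as one would expect of a relation like F = ma. With random error present, the estimates are close to, but not equal to, the generating values, and they differ from one sample to the next.

```python
import random

def simulate(beta0, beta1, xs, error_sd, rng):
    """Generate y = beta0 + beta1 * x + e for each x, with e ~ Normal(0, error_sd)."""
    return [beta0 + beta1 * x + rng.gauss(0.0, error_sd) for x in xs]

def fit_line(xs, ys):
    """Ordinary least-squares estimates (b0, b1) for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Illustrative values only (assumptions, not from the text).
BETA0, BETA1 = 2.0, 0.5
xs = [float(x) for x in range(1, 41)]  # predictor, treated as measured without error

# Deterministic case: no error term, so y is fully determined by x.
exact = fit_line(xs, simulate(BETA0, BETA1, xs, 0.0, random.Random(1)))

# Probabilistic case: two samples from the same model give different estimates.
sample_a = fit_line(xs, simulate(BETA0, BETA1, xs, 1.0, random.Random(1)))
sample_b = fit_line(xs, simulate(BETA0, BETA1, xs, 1.0, random.Random(2)))

print("no error:", exact)
print("sample A:", sample_a)
print("sample B:", sample_b)
```

Fixing the predictor values and varying only the random seed isolates the one thing that distinguishes the two kinds of model here, namely the error term.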
The independent variables are those that are chosen and/or manipulated by the investigator. They are assumed to be related to or have an effect on the dependent variable. In this model the independent variables, $x_1, x_2, x_3, \dots, x_n$, are assumed to be measured without error; $\beta_0, \beta_1, \beta_2, \beta_3, \dots, \beta_n$ are unknown parameters. $\beta_0$ is a constant equivalent to the intercept in the simple linear model, and the others are the weights of the independent variables affecting, or being used to estimate, a given observation $y$. Finally, $e$ is the random error associated with the particular combination of circumstances: those chosen variables, treatments, individual difference factors, and so on. But these models are probabilistic models. They are based on samples of observations from perhaps a variety of populations and they may be tested under the hypothesis of chance. Even when they pass the test of significance, like other statistical outcomes they are not necessarily completely reliable. The dependent variable can, at best, be only partly determined by such models, and other samples from other populations, perhaps using differently defined independent variables, may, and sometimes do, give us different conclusions.
SCIENCE AND INDUCTION
It is common to trace the Western intellectual tradition to two fountainheads, Greek philosophy, particularly Aristotelian philosophy, and Judaic/Christian theology. Science began with the Greeks in the sense that they set out its commonly accepted ground rules. Science proceeds systematically. It gives us a knowledge of nature that is public and demonstrable and, most importantly, open to correction. Science provides explanations that are rational and, in principle, testable, rather than mystical or symbolic or theological.
The cold rationality that this implies has been tempered by the Judaic/Christian notion of the compassionate human being as a creature constructed in the image of God, and by the belief that the universe is the creation of God and as such deserves the attention of the beings that inhabit it. It is these streams of thought that give us the debate about intellectual values and scientific responsibility and sustain the view that science cannot be metaphysically neutral nor value free. But it is not true to say that these traditions have continuously guided Western thought. Christianity took some time to become established. Greek philosophy and science disappeared under the pragmatic technologists of Rome. Judaism, weakened by the loss of its home and the persecution of its adherents, took refuge in the refinement and interpretation of its ancient doctrines. When Christianity did become the intellectual shelter for the thinkers of Europe, it embraced the view that God revealed what He wished to reveal and that nature everywhere was symbolic of a general moral truth known only to Him. These ideas, formalized by the first great Christian philosopher, St. Augustine (354-430), persisted up to, and beyond, the Renaissance, and simplistic versions may be seen in the expostulations of fundamentalist preachers today. For the best part of 1,000 years scientific thinking was not an important part of intellectual advance. Aristotle's rationalism was revived by "the schoolmen," of whom the greatest was Thomas Aquinas (1225?-1274), but the influence of science and the shaping of the modern intellectual world began in the 17th century.
The modern world, so far as mental outlook is concerned, begins in the seventeenth century. No Italian of the Renaissance would have been unintelligible to Plato or Aristotle; Luther would have horrified Thomas Aquinas, but would not have been difficult for him to understand. With the seventeenth century it is different: Plato and Aristotle, Aquinas and Occam, could not have made head or tail of Newton. (Russell, 1946, p. 512)
This is not to deny the work of earlier scholars. Leonardo da Vinci (1452-1519) vigorously propounded the importance of experience and observation, and enthusiastically wrote of causality and the "certainty" of mathematics. Francis Bacon (1561-1626) is a particular example of one who expressed the importance of system and method in the gaining of new knowledge, and his contribution, although often underestimated, is of great interest to psychologists. Hearnshaw (1987) gives an account of Bacon that shows how his ideas can be seen in the foundation and progress of experimental and inductive psychology. He notes that: "Bacon himself made few detailed contributions to general psychology as such, he saw more clearly than anyone of his time the need for, and the potentialities of, a psychology founded on empirical data, and capable of being applied to 'the relief of man's estate'" (Hearnshaw, 1987, p. 55). A turning point for modern science arrives with the work of Copernicus (1473-1543). His account of the heliocentric theory of our planetary system was published in the year of his death and had little impact until the 17th century. The Copernican theory involved no new facts, nor did it contribute to mathematical simplicity. As Ginzburg (1936) notes, Copernicus reviewed the existing facts and came up with a simpler physical hypothesis than that of Ptolemaic theory which stated that the earth was the center of the universe:
The fact that PTOLEMY and his successors were led to make an affirmation in violence to the facts as then known shows that their acceptance of the belief in the immobility of the earth at the centre of the universe was not the result of incomplete knowledge but rather the result of a positive prejudice emanating from non-scientific considerations. (Ginzburg, 1936, p. 308)
This statement is one that scientists, when they are wearing their scientists' hats, would support and elaborate upon, but an examination of the state of the psychological sciences could not fully sustain it. Nowhere is our knowledge more incomplete than in the study of the human condition, and nowhere are our interpretations more open to prejudice and ideology. The study of differences between the sexes, the nature-nurture issue in the examination of human personality, intelligence, and aptitude, the sociobiological debate on the interpretation of the way in which societies are organized, all are marked by undertones of ethics and ideology that the scientific purist would see as outside the notion of an autonomous science. Nor is this a complete list. Science is not so much about facts as about interpretations of observation, and interpretations as well as observations are guided and molded by preconceptions. Ask someone to "observe" and he or she will ask what it is that is to be observed. To suggest that the manipulation and analysis of numerical data, in the sense of the actual methods employed, can also be guided by preconceptions seems very odd. Nevertheless, the development of statistics was heavily influenced by the ideological stance of its developers. The strength of statistical analysis is that its latter-day users do not have to subscribe to the particular views of the pioneers in order to appreciate its utility and apply it successfully. A common view is that science proceeds through the process of induction. Put simply, this is the view that the future will resemble the past. The occurrence of an event A will lead us to expect an event B if past experience has shown B always following A. The general principle that B follows A is quickly accepted. Reasoning from particular cases to general principles is seen as the very foundation of science. The concepts of causality and inference come together in the process of induction.
The great Scottish philosopher David Hume (1711-1776) threw down a challenge that still occupies the attention of philosophers:
As to past experience, it can be allowed to give direct and certain information of those precise objects only, and that precise period of time, which fell under its cognizance: but why this experience should be extended to future times, and to other objects, which for aught we know, may be only in appearance similar; this is the main question on which I would insist. These two propositions are far from being the same, I have found that such an object has always been attended with such an effect and I foresee that other objects, which are, in appearance, similar, will be attended with similar effects. I shall allow, if you please, that the one proposition may justly be inferred from the other: I know, in fact, that it always is inferred. But if you insist that the inference is made by a chain of reasoning, I desire you to produce that reasoning. (Hume, 1748/1951, pp. 33-34)
The arguments against Hume's assertion that it is merely the frequent conjunction or sequencing of two events that leads us to a belief that one causes the other have been presented in many forms and this account cannot examine them all. The most obvious counter is that Hume's own assertion invokes causality. Contiguity in time and space causes us to assume causality. Popper (1962) notes that the idea of repetition based on similarity as the basis of a belief in causality presents difficulties. Situations are never exactly the same. Similar situations are interpreted as repetitions from a particular point of view - and that point of view is a system of "expectations, anticipations, assumptions, or interests" (Popper, 1962, p. 45). In psychological matters there is the additional factor of volition. I wish to pick up my pen and write. A chain of nervous and muscular and cognitive processes ensues and I do write. The fact that human beings can and do control their future actions leads to a situation where a general denial of causality flies in the face of common sense. Such a denial invites mockery. Hume's argument that experience does not justify prediction is more difficult to counter. The course of natural events is not wholly predictable, and the history of the scientific enterprise is littered with the ruins of theories and explanations that subsequent experience showed to be wanting. Hume's skepticism, if it was accepted, would lead to a situation where nothing could be learned from experience and observation. The history of human affairs would refute this, but the argument undoubtedly leads to a cautious approach. Predicting the future becomes a probabilistic exercise and science is no longer able to claim to be the way to certainty and truth. Using the probability calculus as an aid to prediction is one thing; using it to assess the value of a particular theory is another. Popper (1962) regards statements about theories having a high degree of probability as misconceptions.
Theories can be invoked to explain various phenomena and good theories are those that stand up to severe test. But, Popper argues, corroboration cannot be equated with mathematical probability:
All theories, including the best, have the same probability, namely zero. That an appeal to probability is incapable of solving the riddle of experience is a conclusion first reached long ago by David Hume ... Experience does not consist in the mechanical accumulation of observations. Experience is creative. It is the result of free, bold and creative interpretations, controlled by severe criticism and severe tests. (Popper, 1962, pp. 192-193)
INFERENCE
Induction is, and will continue to be, a large problem for philosophical discussion. Inference can be narrowed down. Although the terms are sometimes used in the same sense and with the same meaning, it is useful to reserve the term inference for the making of explicit statements about the properties of a wider universe that are based on a much narrower set of observations. Statistical inference is precisely that, and the discussion just presented leads to the argument that all inference is probabilistic and therefore all inferential statements are statistical. Statistical inference is a way of reasoning that presents itself as a mathematical solution to the problem of induction. The search for rules of inference from the time of Bernoulli and Bayes to that of Neyman and Pearson has provided the spur for the development of mathematical probability theory. It has been argued that the establishing of a set of recipes for data manipulation has led to a situation where researchers in the social sciences "allow statistics to do the thinking for them." It has been further argued that psychological questions that do not lend themselves to the collection and manipulation of quantitative data are neglected or ignored. These criticisms are not to be taken lightly, but they can be answered. In the first place, statistical inference is only a part of formal psychological investigation. An equally important component is experimental design. It is the lasting contribution of Ronald Fisher, a mathematical statistician and a champion of the practical researcher, that showed us how the formulation of intelligent questions in systematic frameworks would produce data that, with the help of statistics, could provide intelligent answers. In the second place, the social sciences have repeatedly come up with techniques that have enabled qualitative data to be quantified. In experimental psychology two broad strategies have been adopted for coping with variability.
The experimental analytic approach sets out boldly to contain or to standardize as many of the sources of variability as possible. In the micro-universe of the Skinner box, shaping and observing the rat's behavior depend on a knowledge of the antecedent and present conditions under which a particular piece of behavior may be observed. The second approach is that of statistical inference. Experimental psychologists control (in the sense of standardize or equalize) those variables that they can control, measure what they wish to measure with a degree of precision, assume that noncontrolled factors operate randomly, and hope that statistical methods will tease out the "effects" from the "error." Whatever the strategy, experimentalists will agree that the knowledge they obtain is approximate. It has also been generally assumed that this approximate science is an interim science. Probability is part of scientific method but not part of knowledge. Some writers have rejected this view. Reichenbach (1938), for example, sought to devise a formal probability logic in which judgments of the truth or falsity of propositions are replaced by the notion of weight. Probability belongs to a class of events. Weight refers to a single event, and a single event can belong to many classes:
Suppose a man forty years old has tuberculosis; ... Shall we consider ... the frequency of death within the class of men forty years old, or within the class of tubercular people? ... We take the narrowest class for which we have reliable statistics ... we should take the class of tubercular men of forty ... the narrower the class the better the determination of weight ... a cautious physician will even place the man in question within a narrower class by making an X-ray; he will then use as the weight of the case, the probability of death belonging to a condition of the kind observed on the film. (Reichenbach, 1938, pp. 316-317)
This is a frequentist view of probability, and it is the view that is implicit in statistical inference. Reichenbach's thesis should have an appeal for experimental psychologists, although it is not widely known. It reflects, in formal terms, the way in which psychological knowledge is reported in the journals and textbooks, although whether or not the writers and researchers recognize this may be debated. The weight of a given proposition is relative to the state of our knowledge, and statements about particular individuals and particular behaviors are prone to error. It is not that we are totally ignorant, but that many of our classes are too broad to allow for substantial weight to be placed on the evidence. Popper (1959) takes issue with Reichenbach's attempts to extend the relative frequency view of probability to include inductive probability. Popper, with Hume, maintains that a theory of induction is impossible:
We shall have to get accustomed to the idea that we must not look upon science as a "body of knowledge", but rather as a system of hypotheses; that is to say, as a system of guesses or anticipations ... of which we are never justified in saying that we know that they are "true" or "more or less certain" or even "probable". (Popper, 1959, p. 317)
Now Popper admits only that a system is scientific when it can be tested by experience. Scientific statements are tested by attempts to refute or falsify them. Theories that withstand severe tests are corroborated by the tests, but they are not proven, nor are they even made more probable. It is difficult to gainsay Popper's logic and Hume's skepticism. They are food for philosophical thought, but scientists who perhaps occasionally worry about such things will put them aside, if only because working scientists are practical people. The principle of induction is the principle of science, and the fact that Popper and Hume can
shout from the philosophical sidelines that the "official" rules of the game are irrational and that the "real" rules of the game are not fully appreciated, will not stop the game from being played.
Statistical inference may be defined as the use of methods based on the rules of chance to draw conclusions from quantitative data. It may be directly compared with exercises where numbered tickets are drawn from a bag of tickets with a view to making statements about the composition of the bag, or where a die or a coin is tossed with a view to making statements about its fairness. Suppose a bag contains tickets numbered 1, 2, 3, 4, and 5. Each numeral appears on the same, very large, number of tickets. Now suppose that 25 tickets are drawn at random and with replacement from the bag and the sum of the numbers is calculated. The obtained sum could be as low as 25 and as high as 125, but the expected value of the sum will be 75, because each of the numerals should occur on one fifth of the draws or thereabouts. The sum should be 5(1 + 2 + 3 + 4 + 5) = 75. In practice, a given draw will have a sum that departs from this value by an amount above or below it that can be described as chance error. The likely size of this error is given by a statistic called the standard error, which is readily computed from the formula $\sigma/\sqrt{n}$ for the mean of the draws (or $\sigma\sqrt{n}$ for their sum), where $\sigma$ is the standard deviation of the numbers in the bag. Leaving aside, for the moment, the problem of estimating $\sigma$ when, as is usual, the contents of the bag are unknown, all classical statistical inferential procedures stem from this sort of exercise. The real variability in the bag is given by the standard deviation, and the chance variability in the sums of the numbers drawn is given by the standard error.
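The ticket-drawing exercise can be checked numerically. The following is a minimal Python sketch, not part of the original text: the variable names, the random seed, and the number of simulated repetitions are assumptions chosen only for illustration. It computes the expected sum, the standard deviation of the bag, and the standard error of both the mean and the sum of 25 draws, and then compares the latter with the spread of simulated sums.

```python
import math
import random
import statistics

tickets = [1, 2, 3, 4, 5]   # each numeral is equally common in the bag
n_draws = 25                # tickets drawn at random, with replacement

bag_mean = sum(tickets) / len(tickets)
# Standard deviation of the numbers in the bag (population form).
sigma = math.sqrt(sum((t - bag_mean) ** 2 for t in tickets) / len(tickets))

expected_sum = n_draws * bag_mean       # 25 * 3 = 75
se_mean = sigma / math.sqrt(n_draws)    # chance error of the mean of the draws
se_sum = sigma * math.sqrt(n_draws)     # chance error of the sum of the draws

# Repeat the exercise many times (seed and count are illustrative assumptions).
rng = random.Random(0)
sums = [sum(rng.choice(tickets) for _ in range(n_draws)) for _ in range(20000)]
simulated_sd = statistics.pstdev(sums)

print("expected sum:", expected_sum)
print("sigma, SE of mean, SE of sum:", sigma, se_mean, se_sum)
print("mean and SD of simulated sums:", statistics.mean(sums), simulated_sd)
```

The simulated sums should cluster around 75, with a spread close to the computed standard error of the sum; how close depends on the number of repetitions.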
STATISTICS IN PSYCHOLOGY
The use of quantitative methods in the study of mental processes begins with Gustav Fechner (1801-1887), who set himself the problem of examining the relationship between stimulus and sensation. In 1860 he published Elemente der Psychophysik, in which he describes his invention of a psychophysical law that describes the relationship between mind and body. He developed methods of measuring sensation based on mathematical and statistical considerations, methods that have their echoes in present-day experimental psychology. Fechner made use of the normal law in his development of the method of constant stimuli, applying it in the Gaussian sense as a way of dealing with error and uncontrolled variation.
Fechner's basic assumptions and the conclusions he drew from his experimental investigations have been shown to be faulty. Stevens' work in the 1950s and the later developments of signal detection theory have overtaken the work of the 19th century psychophysicists, but the revolutionary nature of Fechner's methods profoundly influenced experimental psychology. Boring (1950)
devotes a whole chapter of his book to the work of Fechner.
Investigations of mental inheritance and mental testing began at about the same time with Galton, who took the normal law of error from Quetelet and made it the centerpiece of his research. The error distribution of physics became a description of the distribution of values about a value that was "most probable." Galton laid the foundations of the method of correlation that was refined by Karl Pearson, work that is examined in more detail later in this volume. At the turn of the century, Charles Spearman (1863-1945) used the method to define mental abilities as factors. When two apparently different abilities were shown to be correlated, Spearman took this as evidence for the existence of a general factor G, a factor of general intelligence, and factors that were specific to the different abilities. Correlational methods in psychology were dominant for almost the whole of the first half of this century and the techniques of factor analysis were honed during this period. Chapter 11 provides a review of its development, but it is worth noting here that Dodd (1928) reviewed the considerable literature that had accumulated over the 23 years since Spearman's original work, and Wolfle (1940) pushed this labor further. Wolfle quotes Louis Thurstone on what he takes to be the most important use of factor analysis:
Factor analysis is useful especially in those domains where basic and fruitful concepts are essentially lacking and where crucial experiments have been difficult to conceive. . . . They enable us to make only the crudest first map of a new domain. But if we have scientific intuition and sufficient ingenuity, the rough factorial map of a new domain will enable us to proceed beyond the factorial stage to the more direct forms of psychological experimentation in the laboratory. (Thurstone, 1940, pp. 189-190)
The interesting point about this statement is that it clearly sees factor analysis as a method of data exploration rather than an experimental method. As Lovie (1983) points out, Spearman's approach was that of an experimenter using correlational techniques to confirm his hypothesis, but from 1940 on, that view of the methods of factor analysis has not prevailed.
Of course, the beginnings of general descriptive techniques crept into the psychological literature over the same period. Means and probable errors are commonly reported, and correlation coefficients are also accompanied by estimates of their probable error. And it was around 1940 that psychologists started to become aware of the work of R. A. Fisher and to adopt analysis of variance as the tool of experimental work. It can be argued that the progression of events that led to Fisherian statistics also led to a division in empirical psychology, a split between correlational and experimental psychology.
Cronbach (1957) chose to discuss the "two disciplines" in his APA presidential address. He notes that in the beginning:
All experimental procedures were tests, all tests were experiments . . . the statistical comparison of treatments appeared only around 1900 . . . Inference replaced estimation: the mean and its probable error gave way to the critical ratio. The standardized conditions and the standardized instruments remained, but the focus shifted to the single manipulated variable, and later, following Fisher, to multivariate manipulation. (Cronbach, 1957, p. 674)
Although there have been signs that the two disciplines can work together, the basic situation has not changed much over 30 years. Individual differences are error variance to the experimenter; it is the between-groups or treatment variance that is of interest. Differential psychologists look for variations and relationships among variables within treatment conditions. Indeed, variation in the situation here leads to error.
It may be fairly claimed that these fundamental differences in approach have had the most profound effect on psychology. And it may be further claimed that the sophistication and success of the methods of analysis that are used by the two camps have helped to formalize the divisions. Correlation and ANOVA have led to multiple regression analysis and MANOVA, and yet the methods are based on the same model, the general linear model. Unfortunately, statistical consumers are frequently unaware of the fundamentals, frightened away by the mathematics, or, bored and frustrated by the arguments on the rationale of the probability calculus, they avoid investigation of the general structure of the methods. When these problems have been overcome, the face of psychology may change.
3
Measurement
IN RESPECT OF MEASUREMENT
In the late 19th century the eminent scientist William Thomson, Lord Kelvin (1824-1907), remarked:
I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science whatever the matter might be. (William Thomson, Lord Kelvin, 1891, p. 80)
This expression of the paramount importance of measurement is part of our scientific tradition. Many versions of the same sentiment, for example, that of Galton, noted in chapter 1, and that of S. S. Stevens (1906-1973), whose work is discussed later in this chapter, are frequently noted with approval. Clearly, measurement bestows scientific respectability, a state of affairs that does scant justice to the work of people like Harvey, Darwin, Pasteur, Freud, and James, who, it will be noted, if they are to be labeled "scientists," are biological or behavioral scientists. The nature of the data and the complexity of the systems studied by these men are quite different in quality from the relatively simple systems that were the domain of the physical scientist. This is not, of course, to deny the difficulty of the conceptual and experimental questions of modern physics, but the fact remains that, in this field, problems can often be dealt with in controlled isolation.
It is perhaps comforting to observe that, in the early years, there was a
skepticism about the introduction of mathematics into the social sciences. Kendall (1968) quotes a writer in the Saturday Review of November 11, 1871, who stated:
If we say that G represents the confidence of Liberals in Mr Gladstone and D the confidence of Conservatives in Mr Disraeli and x, y the number of those parties; and infer that Mr Gladstone's tenure of office depends upon some equation involving dG/dx, dD/dy, we have merely wrapped up a plain statement in a mysterious collection of letters. (Kendall, 1968, p. 271)
And for George Udny Yule, the most level-headed of the early statisticians:
Measurement does not necessarily mean progress. Failing the possibility of measuring that which you desire, the lust for measurement may, for example, merely result in your measuring something else - and perhaps forgetting the difference - or in your ignoring some things because they cannot be measured. (Yule, 1921, pp. 106-107)
To equate science with measurement is a mistake. Science is about systematic and controlled observations and the attempt to verify or falsify those observations. And if the prescription of science demanded that observations must be quantifiable, then the natural as well as the social sciences would be severely retarded. The doubts about the absolute utility of quantitative description expressed so long ago could well be pondered on by today's practitioners of experimental psychology.
Nevertheless, the fact of the matter is that the early years of the young discipline of psychology show, with some notable exceptions, a longing for quantification and, thereby, acceptance. In 1885 Joseph Jacobs reviewed Ebbinghaus's famous work on memory, Ueber das Gedächtnis. He notes "If science be measurement it must be confessed that psychology is in a bad way" (Jacobs, 1885, p. 454). Jacobs praises Ebbinghaus's painstaking investigations and his careful reporting of his measurements:
May we hope to see the day when school registers will record that such and such a lad possesses 36 British Association units of memory power or when we shall be able to calculate how long a mind of 17 "macaulays" will take to learn Book ii of Paradise Lost? If this be visionary, we may at least hope for much of interest and practical utility in the comparison of the varying powers of different minds which can now at last be laid down to scale. (Jacobs, 1885, p. 459)
The enthusiasm of the mental measurers of the first half of the 20th century reflects the same dream, and even today, the smile that Jacobs' words might bring to the faces of hardened test constructors and users contains a little of the old yearning. The urge to quantify our observations and to impose sophisticated statistical manipulations on them is a very powerful one in the social sciences.
It is of critical importance to remember that sloppy and shoddy measurement cannot be forgiven or forgotten by presenting dazzling tables of figures, clean and finely-drawn graphs, or by statistical legerdemain.
Yule (1921), reviewing Brown and Thomson's book on mental measurement, comments on the problem in remarks that are not untypical of the misgivings expressed, on occasion, by statisticians and mathematicians when they see their methods in action:
Measurement. O dear! Isn't it almost an insult to the word to term some of these numerical data measurements? They are of the nature of estimates, most of them, and outrageously bad estimates often at that. And it should always be the aim of the experimenter not to revel in statistical methods (when he does revel and not swear) but steadily to diminish, by continual improvement of his experimental methods, the necessity for their use and the influence they have on his conclusions. (Yule, 1921, pp. 105-106)
The general tenor of this criticism is still valid, but the combination of experimental design and statistical method introduced a little later by Sir Ronald Fisher provided the hope, if not the complete reality, of statistical as opposed to strict experimental control. The modern statistical approach more readily recognizes the intrinsic variability in living matter and its associated systems. Furthermore, Yule's remarks were made before it became clear that all acts of observation contain irreducible uncertainty (as noted in chap. 2).
Nearly 40 years after Yule's review, Kendall (1959) gently reminded us of the importance of precision in observation and that statistical procedures cannot replace it. In an acute and amusing parody, he tells the story of Hiawatha. It is a tragic tale. Hiawatha, a "mighty hunter," was an abysmal marksman, although he did have the advantage of having majored in applied statistics. Partly relying on his comrades' ignorance of the subject, he attempted to show that his patently awful performance in a shooting contest was not significantly different from that of his fellows. Still, they took away his bow and arrows:
In a corner of the forest
Dwells alone my Hiawatha
Permanently cogitating
On the normal law of error.
Wondering in idle moments
Whether an increased precision
Might perhaps be rather better
Even at the risk of bias
If thereby one, now and then, could
Register upon the target.
(Kendall, 1959, p. 24)
SOME FUNDAMENTALS
Measurement is the application of mathematics to events. We use numbers to designate objects and events and the relationships that obtain between them. On occasion the objects are quite real and the relationships immediately comprehensible; dining-room tables, for example, and their dimensions, weights, surface areas, and so on. At other times, we may be dealing with intangibles such as intelligence, or leadership, or self-esteem. In these cases our measurements are descriptions of behavior that, we assume, reflects the underlying construct. But the critical concern is the hope that measurement will provide us with precise and economical descriptions of events in a manner that is readily communicated to others. Whatever one's view of mathematics with regard to its complexities and difficulty, it is generally regarded as a discipline that is clear, orderly, and rational. The scientist attempts to add clarity, order, and rationality to the world about us by using measurement.
Measurement has been a fundamental feature of human civilization from its very beginnings. Division of labor, trade, and barter are aspects of our condition that separate us from the hunters and gatherers who were our forebears. Trade and commerce mean that accounting practices are instituted and the "worth" of a job or an artifact has to be labeled and described. When groups of individuals agreed that a sheep could fetch three decent-sized spears and a couple of cooking-pots, the species made a quantum leap into a world of measurement. Counting, making a tally, represents the simplest form of measurement. Simple though it is, it requires that we have devised an orderly and determinate number system.
Development of early societies, like the development of children, must have included the mastery of signs and symbols for differences and sameness and, particularly, for oneness and twoness.
Most primitive languages at least have words for "one," "two," and "many," and modern languages, including English, have extra words for one and two (single, sole, lone, couple, pair, and so on). Trade, commerce, and taxation encouraged the development of more complex number systems that required more symbols. The simple tally recorded by a mark on a slate may be made more comprehensible by altering the mark at convenient groupings, for example, at every five units. This system, still followed by some primitive tribes, as well as by psychologists when constructing, by hand, frequency tables from large amounts of data, corresponds with a readily available and portable counting aid - the fingers of one hand.
It is likely that the familiar decimal system developed because the human hands together have 10 digits. However, vigesimal systems, based on 20, are known, and language once again recognizes the utility of 20 with the word score in English and quatre-vingt for 80 in French. Contrary to general belief, the decimal system is not the easiest to use arithmetically, and it is unlikely that
decimal schemes will replace all other counting systems. Eggs and cakes will continue to be sold in dozens rather than tens, and the hours about the clock will still be 12. This is because 12 has more integral fractional parts than 10; that is, you can divide 12 by more numbers and get whole numbers and not fractions (which everyone finds hard) than you can 10. Viewed in this way, the now-abandoned British monetary system of 12 pennies to the shilling and 20 shillings to the pound does not seem so odd or irrational.
Systems of number notation and counting have a base or radix. Base 5 (quinary), 10 (decimal), 12 (duodecimal), and 20 (vigesimal) have been mentioned, but any base is theoretically possible. For many scientific purposes, binary (base 2) is used because this system lies at the heart of the operations of the electronic computer. Its two symbols, 0 and 1, can readily be reproduced in the off and on modes of electrical circuitry. Octal (base 8) and hexadecimal (base 16) will also be familiar to computer users. The base of a number system corresponds to the number of symbols that it needs to express a number, provided that the system is a place system. A decimal number, say 304, means 3 × 100 plus 0 × 10 plus 4 × 1.
The symbol for zero signifies an empty place. The invention of zero, the earliest undoubted occurrence of which is in India over 1,000 years ago but which was independently used by the Mayas of Yucatan, marks an important step forward in mathematical notation and arithmetical operation. The ancient Babylonians, who developed a highly advanced mathematics some 4,000 years ago, had a system with a base of 60 (a base with many integral fractions) that did not have a zero. Their scripts did not distinguish between, say, 125 and 7,205, and which one is meant often has to be inferred from the context.
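The points about integral fractional parts and place notation can be made concrete with a few lines of code. This is an illustrative Python sketch only; the helper names `divisors`, `to_base`, and `without_zero` are hypothetical and do not come from the text. It lists the whole-number divisors of 10, 12, and 60, expands a number into place-value digits for a chosen base, and shows why a base-60 script with no symbol for zero cannot separate 125 from 7,205.

```python
def divisors(n):
    """Whole numbers that divide n exactly (its 'integral fractional parts')."""
    return [d for d in range(1, n + 1) if n % d == 0]


def to_base(number, base):
    """Place-value digits of a non-negative integer, most significant first."""
    if number == 0:
        return [0]
    digits = []
    while number > 0:
        digits.append(number % base)   # value held in the current place
        number //= base                # move on to the next higher place
    return digits[::-1]


def without_zero(digits):
    """What remains of a numeral when empty places are simply not written."""
    return [d for d in digits if d != 0]


print(divisors(10), divisors(12), divisors(60))
print(to_base(304, 10))                        # 3 x 100, 0 x 10, 4 x 1
print(to_base(125, 60), to_base(7205, 60))     # 2 x 60 + 5 and 2 x 3600 + 0 x 60 + 5
print(without_zero(to_base(125, 60)) == without_zero(to_base(7205, 60)))
```

Dropping the empty place leaves both numbers written with the same two digits, which is the ambiguity that context had to resolve.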
The absence of a zero in Roman numerals may explain why Rome is not remembered for its mathematicians, and the relative sophistication of Greek mathematics leads some historians to believe that zero may have been invented in the Greek world and thence transmitted to India.
Scales of Measurement
Using numbers to count events, to order events, and to express the relationship between events, is the essence of measurement. These activities have to be carried out according to some prescribed rule. S. S. Stevens (1951) in his classic piece on mathematics and measurement defines the latter as "the assignment of numerals to objects or events according to rules" (p. 1). This definition has been criticized on the reasonable grounds that it apparently does not exclude rules that do not help us to be informative, nor rules that ensure that the same numerals are always assigned to the same events under the same conditions. Ellis (1968) has pointed out that some such rule as, "Assign the first number that comes into your head to each of the objects on the table in turn,"
must be excluded from the definition of measurement if it is to be determinative and informative. Moreover, Ellis notes that a rule of measurement must allow for different numerals, or ranges of numerals, to be assigned to different things, or to the same things under different conditions. Rules such as "Assign the number 3 to everything" are degenerate rules. Measurement must be made on a scale, and we only have a scale when we have a nondegenerate, informative, determinative rule.
For the moment, the historical narrative will be set aside in order to delineate and comment on the matter. Stevens distinguished four kinds of scales of measurement, and he believes that all practical common scales fall into one or other of his categories. These categories are worth examining and their utility in the scientific enterprise considered.
The Nominal Scale
The nominal scale, as such, does not measure quantities. It measures identity and difference. It is often said that the first stage in a systematic empirical science is the stage of classification. Like is grouped with like. Events having characteristics in common are examined together. The ancient Greeks classified the constitution of nature into earth, air, fire, and water. Animal, vegetable, or mineral are convenient groupings. Mendeleev's (1834-1907) periodic table of the elements in chemistry, and plant and animal species classification in biology (the Systema Naturae of the great botanist Carl von Linné, known as Linnaeus [1707-1778]), and the many typologies that exist in psychology are further examples.
Numbers can, of course, be used to label events or categories of events. Street numbers, or house numbers, or numbers on football shirts "belong" to particular events, but there is, for example, no quantitative significance between player number 10 and player number 4, on a hockey team, in arithmetical terms. Player number 10 is not 2.5 times player number 4. Such arithmetical rules cannot be applied to the classificatory exercise.
However, it is frequently the case that a tally, a count, will follow the construction of a taxonomy.
Clearly, classifications form a large part of the data of psychology. People may be labeled Conservative, Liberal, Democrat, Republican, Socialist, and so on, on the variable of "political affiliation," or urban, suburban, rural, on the variable of "location of residence," and we could think of dozens, if not scores, of others.
The Ordinal Scale
The essential relationship that characterizes the ordinal scale is greater than (symbolized >) or less than (symbolized <) . . .
. . . Q, p(E | A) = 1, and p{(B + C) | A} = p(B | A) + p(C | A). For readers of a mathematical inclination who wish to see more of the development of this approach (it does get a little more difficult), Kolmogorov's book is concise and elegant.
6 Distributions
When Graunt and Halley and Quetelet made their inferences, they made them on the basis of their examination of frequency distributions. Tables, charts, and graphs - no matter how the information is displayed - all can be used to show a listing of data, or classifications of data, and their associated frequencies. These are frequency distributions. By extension, such depictions of the frequency of occurrence of observations can be used to assess the expectation of particular values, or classes of values, occurring in the future. Real frequency distributions can then be used as probability distributions. In general, however, the probability distributions that are familiar to the users of statistical techniques are theoretical distributions, abstractions based on a mathematical rule, that match, or approximate, distributions of events in the real world. When bodies of data are described, it is the graph and the chart that are used. But the theoretical distributions of statistics and probability theory are described by the mathematical rules or functions that define the relationships between data, both real and hypothetical, and their expected frequencies or probabilities.
Over the last 300 years or so, the characteristics of a great many theoretical distributions, all of which have been found to have some practical utility in one situation or another, have been examined. The following discussion is limited to three distributions that are familiar to users of basic statistics in psychology. An account of some fundamental sampling distributions is given later.
THE BINOMIAL DISTRIBUTION
In the years 1665-1666, when Isaac Newton was 23 and had just earned his degree, his Cambridge college (Trinity) was closed because of the plague. Newton went home to Woolsthorpe in Lincolnshire and began, in peace and leisure, a scientific revolution. These were the years in which Newton developed
some of the most fundamental and important of his ideas: universal gravitation, the composition of light, the theory of fluxions (the calculus), and the binomial theorem.
The binomial coefficients for integral powers had been known for many centuries, but fractional powers were not considered until the work of John Wallis (1616-1703), Savilian Professor of Geometry at Oxford, and the most influential of Newton's immediate English predecessors. However, expansions of expressions such as $(x - x^2)^{1/2}$ were achieved by Newton early in 1665. He announced his discovery of the binomial theorem in 1676 in letters written to the Secretary of the Royal Society, although he never formally published it nor did he provide a proof. Newton proceeded from earlier work of Wallis, who published the theorem, with credit to Newton, in 1685. The problem was to find the area under the curve with ordinates $(1 - x^2)^n$. When $n$ is zero the first two terms are $x - 0\left(\frac{x^3}{3}\right)$ and when $n$ is 1 they are $x - 1\left(\frac{x^3}{3}\right)$. Newton, using the method of interpolation employed so much by Wallis, reasoned that when $n$ was $\frac{1}{2}$ the corresponding terms should be $x - \frac{1}{2}\left(\frac{x^3}{3}\right)$. He arrived at the series
$$x - \frac{\tfrac{1}{2}x^3}{3} - \frac{\tfrac{1}{8}x^5}{5} - \frac{\tfrac{1}{16}x^7}{7} - \cdots$$
and then discovered that the same result could be obtained by deriving, and subsequently integrating, the binomial expansion of $(1 - x^2)^{1/2}$. The interesting and important point to be noted is that Newton's discovery was not made by considering the binomial coefficients of Pascal's triangle but by examining the analysis of infinite series, a discovery of much greater generality and mathematical significance. Figure 6.1 shows the binomial distribution for n = 7.
FIG. 6.1 The Binomial Distribution for N = 7
Newton's discovery of the calculus in his "golden years" at Woolsthorpe establishes him as its originator, but it was Gottfried Wilhelm Leibniz (1646-1716), the German philosopher and mathematician, who has priority of publication, and it is pretty well established that the discoveries were independent. However, a bitter quarrel developed over the claims to priority of discovery and allegations were made that, on a visit to London in 1673, Leibniz could have seen the manuscript of Newton's De Analysi Aequationes Numero Terminorum Infinitas, which, though written in 1669, was not published until 1711. Abraham De Moivre was among those appointed by the Royal Society in 1712 to report on the dispute. De Moivre made extensive use of the method in his own work, and it was his Approximatio, first printed and circulated to some friends in 1733, that links the binomial to what we now call the normal distribution. The Approximatio is included in the second (1738) and third (1756) editions of the Doctrine.
It should be mentioned that a Scottish mathematician, James Gregory (1638-1675), working at the time (1664-1668) in Italy, derived the binomial expansion and produced important work on the mathematics of infinite series, discovered quite independently of Newton.
THE POISSON DISTRIBUTION
Before the structure of the normal distribution is examined, the work of Simeon-Denis Poisson (1781-1840) on a useful special case of the binomial will be described. The Ecole Polytechnique was founded in Paris in 1794. It was the model for many later technical schools, and its methods inspired the production of many student texts in mathematics and engineering which are the forerunners of present-day textbooks. Among the brilliant mathematicians of the Ecole during the earlier years of the 19th century was Poisson. His name is a familiar label in equations and constants in calculus, mechanics, and electricity. He was passionately devoted to mathematics and to teaching, and published over 400 works. Among these was Recherches sur la Probabilité des Jugements in 1837. This contains the Poisson Distribution, sometimes called Poisson's law of large numbers. It was noted earlier that as $n$ in $(P + Q)^n$ increases, the binomial distribution tends to the normal distribution. Poisson considered the case where
as $n$ increases toward infinity, $P$ decreases toward zero, and $nP$ remains constant. The resulting distribution has a remarkable application.
Data collected by insurance companies on relatively rare accidents, say, people trapping their fingers in bathroom doors, indicates that the probability of this event happening to any one individual is very low, in fact near zero. However, a certain number of such accidents (X) is reported every year, and the number of these accidents varies from year to year. Over a number of years a statistical regularity is apparent, a regularity that can be described by Poisson's distribution. If we set X at k, an integer, then
$$P(X = k) = \frac{e^{-\lambda}\lambda^{k}}{k!}$$
where $\lambda$ is any positive number, $e$ is the constant 2.7183..., and $k!$ is factorial $k$.
Although the distribution is not commonly to be found in the basic statistics texts in psychology, it is used in the social sciences and it does have a surprising range of applications. It has been used to fit distributions in, for example, quality control (defects per number of units produced), numbers of patients suffering from certain specific diseases, earthquakes, wrong-number telephone connections, the daily number of hits by flying bombs in London during World War II, misprints in books, and many others.
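Poisson's limiting case can be illustrated numerically. The sketch below is a hedged Python example, not taken from the text: it assumes the standard Poisson form, probability of k events equal to e^(-λ) λ^k / k!, with λ standing for the constant product nP, and the values n = 10,000 and P = 0.0002 are arbitrary choices made for illustration. It compares binomial probabilities for a large n and a small P with the corresponding Poisson probabilities.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events when the mean count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Illustrative (assumed) values: many trials, a very small probability per trial.
n, p = 10_000, 0.0002
lam = n * p   # the product nP that is held constant in Poisson's limit

for k in range(7):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, lam))

# Largest disagreement between the two sets of probabilities for k = 0..10.
largest_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))
print("largest gap:", largest_gap)
```

With an event this rare per trial, the two columns should agree closely, which is why the Poisson form serves for counts of rare accidents.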
FIG. 6.2
The Poisson Distribution applied to Alpha Emissions (Rutherford & Geiger, 1910)
Poisson attempted to extend the possible utility of probability theory, for he applied it to testimony and to legal decisions. These applications received much criticism but Poisson greatly valued them.
Poisson formally discussed the concepts of a random quantity and cumulative distribution functions, and these are significant theoretical contributions. But his name and work in probability does not occupy much space in the literature, perhaps because he was overshadowed by famous contemporaries such as Laplace and Gauss. Sheynin (1978) has given us a comprehensive review of his work in the area. An example of a Poisson distribution is given in Figure 6.2.
THE NORMAL DISTRIBUTION
The binomial and Poisson distributions stand apart from the normal distribution because they are applied to discrete frequency data. The invention of the calculus provided mathematics with a tool that allowed for the assessment of probabilities in continuous distributions. The first demonstration of integral approximation, to the limiting case of the binomial expansion, was given by De Moivre. In the Approximatio, De Moivre begins by acknowledging the work of James Bernoulli:
Altho' the Solution of Problems of Chance often requires that several Terms of the Binomial $(a + b)^n$ [this is modern notation] be added together, nevertheless in very high Powers the thing appears so laborious, and of so great difficulty, that few people have undertaken that Task; for besides James and Nicholas Bernoulli, two great Mathematicians, I know of no body that has attempted it; in which, tho' they have shown very great skill, and have the praise which is due to their Industry, yet some things were farther required; for what they have done is not so much an Approximation as the determining very wide limits, within which they demonstrated that the Sum of Terms was contained. (De Moivre, 1756/1967, 3rd Ed., p. 243)
De Moivre proceeds to show how he arrived at the expression of the ratio of the middle term to the sum of all the terms in the expansion of $(1 + 1)^n$ when n is a very high power. His answer was $2/(B\sqrt{n})$, where "B represents the Number of which the Hyperbolic Logarithm is 1 - 1/12 + 1/360 - 1/1260 + 1/1680, &c." He acknowledges the help of James Stirling who had found that "B did denote the Square-root of the Circumference of a Circle whose Radius is Unity, so that if that Circumference be called c, the Ratio of the middle Term to the Sum of all the Terms will be expressed by $2/\sqrt{nc}$" (De Moivre, 1756, 3rd Ed., p. 244). De Moivre had thus obtained (in modern notation) the expression

\[ \frac{Y_0}{\sum Y} = \frac{2}{\sqrt{2\pi n}} \]

for large n, where $Y_0$ is the middle term.
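He also gives the logarithm of the ratio of the middle term to any term distant from it by an interval l as

\[ (m + l - \tfrac{1}{2})\log(m + l - 1) + (m - l + \tfrac{1}{2})\log(m - l + 1) - 2m\log m + \log\frac{m + l}{m}, \]

where $m = \tfrac{1}{2}n$, and concludes, in the first of nine corollaries numbered 1 through 6 and 8 through 10, 7 having been omitted from the numbering, that, "if m or ½n be a Quantity infinitely great, then the Logarithm of the Ratio, which a Term distant from the middle by the Interval l, has to the middle Term, is $-2ll/n$" (p. 245). This is merely the expression that $Y_l = Y_0 e^{-2l^2/n}$, for large n. The second corollary obtains the "Sum of the Terms intercepted between the Middle, and that whose distance from it . . . denoted by l", in modern terms, the sum of $Y_0 + Y_1 + Y_2 + Y_3 + \ldots + Y_l$, as

\[ \frac{2}{\sqrt{2\pi n}}\left(l - \frac{2l^3}{1 \cdot 3n} + \frac{4l^5}{2 \cdot 5n^2} - \frac{8l^7}{6 \cdot 7n^3} + \ldots\right), \]

which is the expansion of the integral

\[ \frac{2}{\sqrt{2\pi n}} \int_0^l e^{-2x^2/n}\,dx. \]

When l is expressed as $s\sqrt{n}$, and s interpreted as ½, the sum becomes

\[ \frac{2}{\sqrt{c}}\left(\frac{1}{2} - \frac{1}{3 \cdot 4} + \frac{1}{2 \cdot 5 \cdot 8} - \frac{1}{6 \cdot 7 \cdot 16} + \frac{1}{24 \cdot 9 \cdot 32} - \ldots\right), \]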
... which converges so fast, that by help of no more than seven or eight Terms, the Sum required may be carried to six or seven places of Decimals: Now that Sum will be found to be 0.427812, independently from the common Multiplicator $2/\sqrt{c}$, and therefore to the Tabular Logarithm of 0.427812, which is 9.6312529, adding the Logarithm of $2/\sqrt{c}$, viz. 9.9019400, the sum will be 19.5331929, to which answers the number 0.341344. (De Moivre, 1756, 3rd Ed., p. 245)
This familiar final figure is the area under the curve of the normal distribution between the mean (which is, of course, also the middle value) and an ordinate one standard deviation from the mean. In the third corollary De Moivre says:

And therefore, if it was possible to take an infinite number of Experiments, the Probability that an Event which has an equal number of Chances to happen or fail, shall neither appear more frequently than $\tfrac{1}{2}n + \tfrac{1}{2}\sqrt{n}$ times, nor more rarely than $\tfrac{1}{2}n - \tfrac{1}{2}\sqrt{n}$ times, will be expressed by the double Sum of the number exhibited in the second Corollary, that is, by 0.682688, and consequently the Probability of the contrary . . . will be 0.317312, those two Probabilities together compleating Unity, which is the measure of Certainty. (De Moivre, 1756, 3rd Ed., p. 246)¹
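De Moivre's arithmetic in the second and third corollaries can be retraced numerically. The following is a minimal sketch, not De Moivre's own procedure: it assumes the series in question is the term-by-term integral of $e^{-2x^2/n}$ evaluated at $l = \tfrac{1}{2}\sqrt{n}$, and the function and variable names are illustrative only.

```python
import math

def de_moivre_series(terms: int) -> float:
    """Partial sum of 1/2 - 1/(3*4) + 1/(2*5*8) - 1/(6*7*16) + ...

    Assumed general term (k = 0, 1, 2, ...):
    (-1)^k / (k! * (2k + 1) * 2^(k + 1)),
    obtained by integrating exp(-2 x^2 / n) term by term up to sqrt(n)/2.
    """
    return sum(
        (-1) ** k / (math.factorial(k) * (2 * k + 1) * 2 ** (k + 1))
        for k in range(terms)
    )

# "no more than seven or eight Terms"
series_sum = de_moivre_series(8)

# The "common Multiplicator" 2/sqrt(c), with c = 2*pi (circumference of the unit circle).
multiplier = 2 / math.sqrt(2 * math.pi)

# Area between the mean and one standard deviation, from the series.
one_sided = multiplier * series_sum

# The same area from the error function, for comparison.
exact_one_sided = 0.5 * math.erf(1 / math.sqrt(2))

print("series sum:          ", series_sum)
print("mean to one s.d.:    ", one_sided)
print("within one s.d.:     ", 2 * one_sided)
print("erf-based comparison:", 2 * exact_one_sided)
```

Eight terms already agree with the error-function value to well beyond the six places De Moivre reports; the small discrepancy in his final figure is consistent with working from logarithm tables.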
$\tfrac{1}{2}\sqrt{n}$ is what today we call the standard deviation. De Moivre did not name it but he did, in Corollary 6, say that $\sqrt{n}$ "will be as it were the Modulus by which we are to regulate our Estimation" (De Moivre, 1756, 3rd Ed., p. 248). In fact what De Moivre does is to expand the exponential and to integrate from 0 to $s\sigma$. In Corollary 6 De Moivre notes that if l is interpreted as $\sqrt{n}$ rather than $\tfrac{1}{2}\sqrt{n}$, then the series does not converge so fast and that more and more terms would be required for a reasonable approximation as l becomes a greater proportion of $\sqrt{n}$,

... for which reason I make use in this Case of the Artifice of Mechanic Quadratures, first invented by Sir Isaac Newton . . . ; it consists in determining the Area of a Curve nearly, from knowing a certain number of its Ordinates A, B, C, D, E, F, &c. placed at equal Intervals. (De Moivre, 1756, 3rd Ed., p. 247)
He uses just 4 ordinates for his quadrature and finds, in effect, that the area between $\pm 2\sigma$, or $\tfrac{1}{2}n \pm \sqrt{n}$, is 0.95428, and that the area in what we now call the tails is 0.04572. The true value of the tail area is a little less than this but it is, nevertheless, familiar. These results can be extended to the expansion of $(a + b)^n$ where a and b are not equal.

If the Probabilities of happening and failing be in any given Ratio of inequality, the Problems relating to the Sum of the Terms of the Binomial $(a + b)^n$ will be solved with the same facility as those in which the Probabilities of happening and failing are in a Ratio of Equality. (De Moivre, 1756/1967, 3rd Ed., p. 250)
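The four-ordinate quadrature mentioned above can be imitated in a few lines. This is a sketch under stated assumptions rather than a reconstruction of De Moivre's working: the text says only that four equally spaced ordinates were used, so the choice of Newton's three-eighths rule, and of the interval from one to two standard deviations, are assumptions made here for illustration.

```python
import math

def normal_density(z: float) -> float:
    """Standard normal ordinate at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def three_eighths_rule(f, a: float, b: float) -> float:
    """Newton's three-eighths rule: four equally spaced ordinates on [a, b]."""
    h = (b - a) / 3
    return 3 * h / 8 * (f(a) + 3 * f(a + h) + 3 * f(a + 2 * h) + f(b))

# Area from the mean to one standard deviation (the series result of Corollary 2).
inner = 0.5 * math.erf(1 / math.sqrt(2))

# Assumed step: quadrature over the stretch from one to two standard deviations.
outer = three_eighths_rule(normal_density, 1.0, 2.0)

central_area = 2 * (inner + outer)   # area between -2 and +2 standard deviations
tail_area = 1 - central_area         # area in the two tails
exact_central = math.erf(math.sqrt(2))

print("quadrature, central area:", central_area)
print("quadrature, tail area:   ", tail_area)
print("erf-based central area:  ", exact_central)
```

Under these assumptions the quadrature lands very close to the 0.95428 quoted in the text and slightly below the error-function value, so the tail area is slightly overstated, as the text observes.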
In Corollary 9, De Moivre in effect, and in modern terms, introduces $\sqrt{npq}$, the expression we use today for the standard deviation of the normal approximation to the binomial distribution. The sum and substance of the Approximatio is that it gives, for the first time, the function that was rediscovered much later, the function that dominates so-called classical statistical inference - the normal distribution - which in modern terminology is given by the density function

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}. \]
¹ The value for the proportion of area between $\pm 1\sigma$ is 0.6826894, so that De Moivre was out by one unit in the sixth decimal place.
The normal distribution is shown in Figure 6.3. De Moivre's philosophical position is revealed in the sections headed "Remark I" in the 1738 edition of the Doctrine and an additional, and much longer, "Remark II" in 1756. De Moivre sets his work in the philosophical context of an ordered determinate universe. His notion of Original Design (see the quotation in chapter 1) is a notion that persisted at least down to Quetelet. A powerful deity reveals the grand design through statistical averages and stable statistical ratios. Chance produces irregularities. As Pearson remarked:

There is much value in the idea of the ultimate laws being statistical laws, though why the fluctuations should be attributed to a Lucretian 'Chance', I cannot say. It is not an exactly dignified conception of the Deity to suppose him occupied solely with first moments and neglecting second and higher moments! (Pearson, 1978, p. 160)
and elsewhere:

The causes which led De Moivre to his "Approximatio" or Bayes to his theorem were more theological and sociological than mathematical, and until one recognizes that the post-Newtonian English mathematicians were more influenced by Newton's theology than by his mathematics, the history of science in the eighteenth century - in particular that of the scientists who were members of the Royal Society - must remain obscure. (Pearson, 1926, p. 552)
FIG. 6.3
The Normal Distribution
It is interesting that this is precisely the sort of analysis that has been brought to bear on the work of Galton and Karl Pearson himself, save that it is the philosophy of eugenics that influenced their work, rather than Christian theology. In Remark II, De Moivre takes up Arbuthnot's argument for the ratio of male to female births, which was discussed in chapter 4, defending the argument against the criticisms that had been advanced by Nicholas Bernoulli, who had noted that a chance distribution of the actual male/female birth ratio would be found if the hypothesized ratio (i.e., the ratio under what we would now call the null hypothesis) had been taken to be 18:17 rather than 1:1. But De Moivre insists:

This Ratio once discovered, and manifestly serving to a wise purpose, we conclude the Ratio itself, or if you will the Form of the Die, to be an Effect of Intelligence and Design. As if we were shewn a number of Dice, each with 18 white and 17 black faces, which is Mr. Bernoulli's supposition, we should not doubt but that those Dice had been made by some Artist; and that their form was not owing to Chance, but was adapted to the particular purpose he had in View. (De Moivre, 1756/1967, 3rd Ed., p. 253)
With the greatest respect to De Moivre, this was clearly not Arbuthnot's argument, and De Moivre's view that he might have said it is somewhat specious. Like Quetelet's use of the normal curve many years later, De Moivre's view is a prejudgment and all findings must be made to fit it. Karl Pearson (1978) makes essentially the same point, but it is apparently a point that he did not recognize in his own work.
7 Practical Inference
Some of the philosophical questions surrounding induction and inference were dealt with in chapter 2. Here the foundations of practical inference are considered.

INVERSE PROBABILITY AND THE FOUNDATIONS OF INFERENCE

The first exercises in statistical inference arose from a consideration of statistical summaries such as those found in mortality tables. The development of theories of inference from the standpoint of the implications of mathematical theory can be dated from the work of Thomas Bayes (1702-1761), an English Nonconformist clergyman whose ministry was in Tunbridge Wells. Bayes was recognized as a very good mathematician, although he published very little, and he was elected to the Royal Society in 1742 (see Barnard, 1958, for a biographical note). Curiously enough, the paper for which he is remembered was communicated to the Royal Society by his friend Richard Price¹ more than 2 years after his death, and the celebrated forms of the theorem that bear his name, although they follow from the essay, do not actually appear in the work. An Essay Towards Solving a Problem in the Doctrine of Chances (Bayes, 1763) is still the subject of much discussion and controversy both as to its contents and implications and as to how much of its import was contributed by its editor, Richard Price. The problem that Bayes addressed is stated by Price in the letter accompanying his submission of the essay:

Mr De Moivre . . . has . . . after Bernoulli, and to a greater degree of exactness, given rules to find the probability there is, that if a very great number of trials be made concerning any event, the proportion of the number of times it will happen, to the number of times it will fail in those trials, should differ less than by small assigned limits from the proportion of the probability of its happening to the probability of its failing in one single trial. But I know of no person who has shewn how to deduce the solution of the converse problem to this; namely, "the number of times an unknown event has happened and failed being given, to find the chance that the probability of its happening should lie somewhere between any two named degrees of probability." (Bayes, 1763, pp. 372-373)

¹ Price was also a Unitarian Church minister. His church still stands in Newington Green in north London and is the oldest Nonconformist church building still being so used in London. Next door and abutting the church is a licensed betting shop. It has been noted that Bayesian analysis, resting, as it does, on the notion of conditional probability, is akin to gambling.
A demonstration of some simple rules of mathematical probability, using a frequency model, will help to derive and illustrate Bayes' Theorem. The probability of drawing a red card from a standard deck of playing cards is 26/52 or p(R) = 1/2. The probability of drawing a picture card from the deck is 12/52 or p(P) = 3/13. The probability of drawing a red picture card is 6/52 or p(R & P) = 3/26. What is the probability of drawing either a red card or a picture card? The answer is, of course, 32/52 or p(R or P) = 8/13. Note that:

p(R or P) = p(R) + p(P) - p(R & P), [8/13 = (1/2 + 3/13) - 3/26].

This is known as the addition rule. Suppose that you draw a card from the deck, but you are not allowed to see it. What is the probability that it is a red picture card? We can calculate the answer to be 6/52. Now suppose that you are told that it is a red card. What now is the probability of it being a picture card? The probability of drawing a red card is 26/52. We also can figure out that if the card drawn is a red card then the probability of it also being a picture card is 6/26. In fact,

p(R & P) = p(R)[p(P|R)], [6/52 = 26/52(6/26)].

The term p(P|R) symbolizes the conditional probability of P, that is, the probability of P given that R has occurred, and the expression denotes the multiplication rule. Note that p(R & P) is the same as p(P & R), and that this is equal to p(P)[p(R|P)], or 6/52 = 12/52(6/12). From this fact we see that:

\[ p(P|R) = \frac{p(P)\,[p(R|P)]}{p(R)}, \]

which is the simplest form of Bayes' Theorem. In current terminology, the left-hand side of this equation is termed the posterior probability, the first term on the right-hand side the prior probability, and the ratio p(R|P)/p(R), the likelihood ratio. From the addition rule we can also show that the probability of a red card is the sum of the probabilities of a red picture card and a red number card minus the probability of a picture number card (which does not exist!), or:
p(R) = p(R & P) + p(R & N) - p(P & N). But p(P & N) = 0 because the events picture card and number card are mutually exclusive. So Bayes' Theorem may be written:

\[ p(P|R) = \frac{p(P)\,[p(R|P)]}{p(P)\,[p(R|P)] + p(N)\,[p(R|N)]}. \]
The fact that this formula is arithmetically correct may be checked by substituting the values we can obtain from the known distribution of cards in the standard deck. However, this does not demonstrate the alleged utility of the theorem for statistical inference. For that we must turn to another example after substituting D (for Data) in place of R, and $H_1$ (for Hypothesis 1) in place of P, and $H_2$ (for Hypothesis 2) in place of N, where $H_1$ and $H_2$ are two mutually exclusive and exhaustive hypotheses. We have:

\[ p(H_1|D) = \frac{p(H_1)\,[p(D|H_1)]}{p(H_1)\,[p(D|H_1)] + p(H_2)\,[p(D|H_2)]}. \]
Bayes' Theorem apparently provides a means of assessing the probability of a hypothesis given the data (or outcome), which is the inverse of assessing the probability of an outcome given a hypothesis (or rule), and of course we may envisage more than two hypotheses. In the original essay, Bayes demonstrated his construct using as an example the probability of balls coming to rest on one or other of the parts of a plane table. Laplace, who in 1774 arrived at essentially the same result as Bayes, but provided a more generalized analysis, used the problem of the probability of drawing a white ball from an urn containing unknown proportions of black and white balls given that a sampling of a particular ratio has been drawn. Traditionally, writers on Bayes make heavy use of "urn problems," and tradition is followed here with a simple example. Phillips (1973) gives a comprehensive account of Bayesian procedures and shows how they can be applied to data appraisal in the social sciences.

Suppose we are presented with two identical urns W and B. W contains 70 white and 30 black balls, and B contains 40 white and 60 black balls. From one of the urns we are allowed to draw 10 balls, and we find that 6 of them are white and 4 of them are black. Is the urn that we have chosen more likely to be W or more likely to be B? Presumably most of us would opt for W. What is the probability that it was W? $p(H_1|D)$ is the probability of W given the data, and $p(H_2|D)$ is the probability of B given the data. Now the probability of drawing a white ball from W is 7/10 and from B it is 4/10. The probability of the data given $H_1$ [$p(D|H_1)$] is $(0.7)^6(0.3)^4$, and $p(D|H_2)$ is $(0.4)^6(0.6)^4$. Now we apply Bayes' Theorem:

\[ p(H_1|D) = \frac{p(H_1)(0.7)^6(0.3)^4}{p(H_1)(0.7)^6(0.3)^4 + p(H_2)(0.4)^6(0.6)^4}. \]
In order to complete the calculation, we have to have values for $p(H_1)$ and $p(H_2)$ - the prior probabilities. And here is the controversy. The objectivist view is that these probabilities are unknown and unknowable because the state of the urn that we have chosen is fixed. There are no grounds for saying that we have chosen W or B. The personalist would say that we can always state some degree of belief. If we are completely ignorant about which urn was chosen, then both $p(H_1)$ and $p(H_2)$ can be expressed as 0.5. This is a formal statement of Laplace's Principle of Insufficient Reason, or Bayes' Postulate. If we put these values in our equation, $p(H_1|D)$ works out to be 0.64, which matches common sense. This value could then be used to revise $p(H_1)$ and $p(H_2)$ and further data collected. For its proponents, the strength of Bayesian methods lies in the claim that they provide formal procedures for revising many kinds of opinion in the light of new data in a direct and straightforward way, quite unlike the inferential procedures of Fisherian statistics. The most important point that has to be accepted, however, is the justification for the assignment of prior probabilities. Although in this simple example there might seem to be little difficulty, in more complex situations, where probabilities, based on relative frequency,² hunches, or on opinion, cannot be assigned precisely or readily, the picture is far less clear. Furthermore, the principle of the equal distribution of ignorance has itself come under a great deal of philosophical attack. "It is rather generally believed that he [Bayes] did not publish because he distrusted his postulate, and thought his scholium defective. If so he was correct" (Hacking, 1965, p. 201).

² Presumably it could be argued that our urn with its unknown ratio of black and white balls is one of an infinite distribution of urns with differing make-ups.
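The arithmetic of the card and urn examples can be checked directly. The sketch below is illustrative only: the function name `posterior` and the variable names are hypothetical, the two-hypothesis form of Bayes' Theorem is assumed, and the equal priors of 0.5 are the Principle of Insufficient Reason values discussed in the text.

```python
from fractions import Fraction

def posterior(prior_h1, likelihood_h1, likelihood_h2):
    """Two-hypothesis Bayes' Theorem: p(H1|D), with p(H2) taken as 1 - p(H1)."""
    prior_h2 = 1 - prior_h1
    numerator = prior_h1 * likelihood_h1
    return numerator / (numerator + prior_h2 * likelihood_h2)

# Card check: P = picture card, N = number card, R = red card.
# p(P) = 12/52, p(R|P) = 6/12, p(R|N) = 20/40; the result should equal 6/26.
card_posterior = posterior(Fraction(12, 52), Fraction(6, 12), Fraction(20, 40))

# Urn example: 6 white and 4 black balls drawn.
likelihood_w = Fraction(7, 10) ** 6 * Fraction(3, 10) ** 4   # p(D|H1), urn W
likelihood_b = Fraction(4, 10) ** 6 * Fraction(6, 10) ** 4   # p(D|H2), urn B
urn_posterior = posterior(Fraction(1, 2), likelihood_w, likelihood_b)

# Revising opinion: the first posterior becomes the prior for a second,
# identical sample (a purely hypothetical second draw).
urn_posterior_revised = posterior(urn_posterior, likelihood_w, likelihood_b)

print("p(P|R) for the cards:", card_posterior)
print("p(W|data):           ", float(urn_posterior))
print("after a second draw: ", float(urn_posterior_revised))
```

Exact fractions are used so that the card check reproduces 6/26 without rounding; the urn posterior agrees with the 0.64 quoted in the text, and a repeated sample of the same composition pushes it higher.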
FISHERIAN INFERENCE
When the justification for the probabilistic basis of inference in the sense of revising opinion on the basis of data was thought of at all, it was the Bayesian approach that held sway until this century. Its foundations had come under attack primarily on the grounds that probability in any practical sense of the word must be based on relative frequency of observations and not on degrees of belief. Venn (1888) was perhaps the most insistent spokesman for this view. Sir Ronald Fisher (1890-1962), certainly the most influential statistician of all time, set out to replace it. In 1930 he published a paper that supposedly set out his notion of fiducial probability, claiming, "Inverse probability has, I believe, survived so long in spite of its unsatisfactory basis, because its critics have until recent times put forward nothing to replace it as a rational theory of learning by experience" (Fisher, 1930, p. 531). There is no doubt that Fisher's methods, and the contributions of Neyman and Pearson that have been grafted on to them, have provided us with a set of inferential procedures. There seems to be considerable doubt as to whether he provided us with a coherent non-Bayesian theory. Fisher himself asserted that the concept of the likelihood function was fundamental to his new approach and distinguished it from Bayesian probability. Kendall (1963) in his obituary of Fisher says this:

It appears to me that, at this point [1922], his ideas were not very well thought out. Certainly his exposition of them was obscure. But, in retrospect, it becomes plain that he was thinking of a probability function $f(x, \theta)$ in two different ways: as the probability distribution of x for given $\theta$, and as giving some sort of permissible range of $\theta$ for an observed x.

To this latter he gave the name of 'fiducial probability distribution' (later to be known as a fiducial distribution) and in doing so began a long train of confusion; for it is not a probability distribution to anyone who rejects Bayes's approach, and indeed, may not be a distribution of anything. Fisher nevertheless manipulated it as if it were, and thereafter maintained an attitude of rather contemptuous surprise towards anyone who was perverse enough to fail in understanding his argument.... The position on both sides has been restated ad nauseam, without much attempt at reconciliation or, as I think, without an explicit recognition of the real point, which is that a man's attitude towards inference, like his attitude towards religion, is determined by his emotional make-up, not by reason or mathematics. (Kendall, 1963, p. 4)

Richard von Mises (1957) is baffled also:

Fisher introduces the term [likelihood] in order to denote something different from probability. As he fails to give a definition for either word, i.e., he does not indicate how the value of either is to be determined in a given case, we can only try to derive the intended meaning by considering the context in which he uses these words. I do not understand the many beautiful words used by Fisher and his followers in support of the likelihood theory. The main argument, namely, that p [the probability of the hypothesis] is not a variable but an "unknown constant," does not mean anything to me. (von Mises, 1957, pp. 157-158)
It is tempting to leave it at that, but some attempt at capturing the flavor of Fisher's position must be made. Clearly, if one has knowledge of a distribution of probabilities of events, then that knowledge can be used to establish the probability of an event that has not yet been observed, for example, the probability that the next roll of two dice will produce a 7 (p = .167). What of the situation where an event has been observed - the roll did produce a 7 - can we say anything about the plausibility of an event with p = .167 having occurred? This is a decidedly odd sort of question because the event has indeed occurred! Before the draw from a lottery with 1,000,000 tickets is made, the probability of my winning is .000001. It will be difficult to convince me after the draw, as I clutch the winning ticket, that what has happened is impossible, or even very, very unlikely to have occurred by chance, and if you continue to insist I shall merely keep on showing you my ticket. Fisher, in 1930, puts the situation in this way:

There are two different measures of rational belief appropriate to different cases. Knowing the population we can express our incomplete knowledge of, or expectation of, the sample in terms of probability; knowing the sample we can express our incomplete knowledge of the population in terms of likelihood. (Fisher, 1930, p. 532)
Likelihood then is a numerical measure of rational belief different from probability. Whether or not the logic of the situation is understood, all users of statistical methods will recognize the reasoning as crucial to both estimation and hypothesis testing. The method of maximum likelihood that Fisher propounds (although it had been put forward by Daniel Bernoulli in 1777; see Kendall, 1961) justifies the choice of population parameter. The method simply says that the best estimate of the parameter is the value that maximizes the probability of the observations, or that what has occurred is the most likely thing that could have occurred. A number of writers, for example, Hogben (1957), have stated that this assumption cannot be justified without an appeal to inverse probability, so that Fisher did not succeed in detaching inference from Bayes' Theorem. Nevertheless, the basic notion is that we can find the likelihood of a particular population parameter, say the correlation coefficient R, by defining it as a value that is proportional to the probability that from a population with that value we have obtained a sample with an observed value r.
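The idea that the best estimate is the parameter value that maximizes the probability of the observations can be illustrated with a simple grid search. This is a minimal sketch, not Fisher's own derivation: it borrows the 6 white balls in 10 draws from the earlier urn example, treats the proportion of white balls as the unknown parameter, and the function name and the grid of candidate values are assumptions made for illustration.

```python
from math import comb

def binomial_likelihood(p: float, successes: int, trials: int) -> float:
    """Probability of the observed count, viewed as a function of the parameter p."""
    return comb(trials, successes) * p ** successes * (1 - p) ** (trials - successes)

# Candidate parameter values 0.00, 0.01, ..., 1.00 (an arbitrary illustrative grid).
candidates = [i / 100 for i in range(101)]

# Observed data: 6 white balls in 10 draws.
best_estimate = max(candidates, key=lambda p: binomial_likelihood(p, 6, 10))

# Likelihoods are meaningful relative to one another: compare the two urns
# of the earlier example (70% white against 40% white).
likelihood_ratio = binomial_likelihood(0.7, 6, 10) / binomial_likelihood(0.4, 6, 10)

print("maximum likelihood estimate:", best_estimate)
print("likelihood ratio, W against B:", likelihood_ratio)
```

The maximizing value coincides with the sample proportion, and the ratio of the two likelihoods is the same quantity that drives the Bayesian calculation when the priors are equal, which is one way of seeing why writers such as Hogben doubted that the two approaches had been fully separated.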
There is no doubt that Fisher's argument will continue to be controversial and that many attempts to resolve the ambiguities will be made. Dempster (1964) is among those who have entered the fray, and Hacking's (1965) rationale has resulted in one well-known statistician (Bartlett, 1966) proposing that the resulting theory be renamed "the Fisher-Hacking theory of fiducial inference."
BAYES OR p < .05?

Criticism of Fisherian methods arrived almost as soon as they began to be adopted. In more recent years, criticism of null hypothesis significance testing (NHST) appears to have become an 'area' in its own right. Berkson (1938) probably led the way, when he noted that:

if the number of observations is extremely large - for instance on the order of 200,000 - the chi-square P will be small beyond any usual level of significance ... For we may assume that it is practically certain that any series of real observations does not follow a normal curve with absolute exactitude in all respects. (p. 526)
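Berkson's observation about very large samples is easy to reproduce. The sketch below does not use his chi-square test of fit to a normal curve; it substitutes a simpler normal-approximation test of a single proportion, and the observed proportion of 0.505 and the sample sizes are hypothetical values chosen only to make the point.

```python
import math

def two_sided_p_value(observed: float, null_value: float, n: int) -> float:
    """Two-sided P for a proportion, using the normal approximation to the binomial."""
    standard_error = math.sqrt(null_value * (1 - null_value) / n)
    z = (observed - null_value) / standard_error
    # P(|Z| > |z|) for a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))

# The same trivial departure from an exact null hypothesis of 0.5 ...
p_small_sample = two_sided_p_value(0.505, 0.5, 200)
# ... judged with Berkson's "order of 200,000" observations.
p_large_sample = two_sided_p_value(0.505, 0.5, 200_000)

print("n = 200:     P =", p_small_sample)
print("n = 200,000: P =", p_large_sample)
```

With 200 observations the departure is nowhere near any conventional level of significance; with 200,000 the identical departure gives a P far below .001, although the size of the effect has not changed.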
Berkson was expressing in essence what many, many other writers in the following 60 years have observed: that when $H_0$ is expressed as an exact null hypothesis (zero difference or no relationship) then very small deviations from this (dare one say it?) pedantic view will be declared significant! Some of the more recently-expressed doubts have been brought together by Henkel and Morrison (1970), but the papers they collected together were most certainly not the last word. Indeed, Hunter (1997) called for a ban on NHST, and reported that no less a body than the American Psychological Association has a "committee looking into whether the use of the significance test should be discouraged" (p. 6)! The main criticisms, endlessly repeated, are easily listed. NHST does not offer any way of testing the alternative or research hypothesis; the null hypothesis is usually false and when differences or relationships are trivial, large samples will lead to its rejection; the method discourages replication and encourages one-shot research; the inferential model depends on assumptions about hypothetical populations and data that cannot be verified; and there are more. Some of the criticisms are valid and need to be faced more carefully, even more boldly than they are, while some seem to be 'straw men' set up to be blown down. For example, the fact that we only reject $H_0$ and do not test $H_1$ is not particularly satisfying, but the notion that inference is faulty because in some way it rests on characteristics of populations not observed or not yet observed is merely stating a fact of inference! Moreover, most of the criticism would be diluted, if not eliminated, if greater attention were paid to parameters of a test other than alpha level, that is to say, effect size, sample size, and power. Even Fisher's most supportive and ardent colleague, Frank Yates (1964) said:

The most commonly occurring weakness in the application of Fisherian methods is, I think, undue emphasis on tests of significance and failure to recognize that in many types of experimental work estimates of the treatment effects, together with estimates of the errors to which they are subject are the quantities of primary interest. To some extent Fisher himself is to blame for this. Thus in The Design of Experiments he wrote: "Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (p. 320)
Yates goes on to say that the null hypothesis, as usually expressed, is "certainly untrue" and that:

such experiments [variety and fertilizer trials] are in fact undertaken with the different purpose of assessing the magnitude of the effects ... [Fisher] did not ... sufficiently emphasise the point in his writings. Scientists were thus encouraged to expect final and definitive answers ... some of them, indeed, came to regard the achievement of a significant result as an end in itself. (p. 320)
Although a number of modifications of, or alternatives to, NHST have been suggested (see, for example, confidence intervals, discussed in chapter 13, p. 199) by far the most popular of the suggested replacements (rather than salvage operations for the Fisherian model) is Bayesian analysis. The claim is usually made that Bayesian methods are concerned with the alternative hypothesis, that they encourage replication, that, in fact, they reflect more clearly the traditional 'scientific method.' Curiously enough, it is also often claimed that despite the so-called subjectivity of Bayesian priors, a Bayesian analysis will arrive at the same conclusion as a 'classical' analysis. This leaves the journeyman psychological researcher with nothing but the tired protest that nothing has been offered as a reward for changing from the familiar routines!
8 Sampling and Estimation
RANDOMNESS AND RANDOM NUMBERS

In common parlance, to choose "at random" is to choose without bias, to make the act of choosing one without purpose even though the eventual outcome may be used for a decision. The emphasis here is on the act rather than the outcome. It is certainly not true to say that ordinary folk accept that random choice means absence of design in the outcome. Politicians, examining the polls, have been known to remark that the result was "in the cards." Primitive notions of fate, demons, guardian angels, and modern appeals to the "will of God" often lie behind the drawing of lots and the tossing of coins to "decide" if Mary loves John, or to take one course of action rather than another. It is absence of design in the manner of choosing that is important. The point must not be belabored, but everyday notions of chance are still construed, even by the most sophisticated, in ways that are not too far removed from the random element in divination and sortilege practiced by the oracles and priests who were consulted by our remote, and not-so-remote, ancestors. And, as already noted, perhaps one of the reasons for the delay in the development of the probability calculus arose from a reluctance to attempt to "second-guess" the gods.

The concepts of randomness and probability are, therefore, inextricably intertwined in statistics. The difficulties inherent in defining probability, which were discussed earlier, are once more presented in an examination of randomness. It is commonly thought that everyone knows what randomness is. The great statistician Jerzy Neyman (1894-1981) states, "The method of random sampling consists, as it is known, in taking at random elements from the population which it is intended to study. The elements compose a sample which is then studied" (Neyman, 1934, p. 567). It is unlikely that this definition would be given a high grade by most teachers of statistics. Later in this paper and, this inadequate definition notwithstanding, it is a paper of central and enormous importance, Neyman does note that random sampling means that each element in the population must have an equal chance of being chosen, but it is not uncommon for writers on the method to ignore both this simple directive and instructions as to the means of its implementation.

Of course, the use of the words "equal chance" in the definition brings us back to what we mean by chance and probability. For most purposes we fall back on the notion of long-run relative frequency, rather than leaving these constructs as undefined ideas that make intuitive sense. Practical tests of randomness rest on the examination of the distribution of events in long series. It is also the case, as Kendall and Babington Smith (1938) point out, that random sampling procedures may follow a purposive process. For example, the number $\pi$ is not a random number, but its calculation generates a series of digits that may be random. These authors are among the earliest to set out the bases of random sampling from a straightforward practical viewpoint, and they were writing a mere 60 years ago. The concept of a formal random sample is a modern one; concerns about its place and importance in scientific inference and significance testing paralleled the development of methods of experimental design and technical approaches to the appraisal of data. Nevertheless, informal notions of randomness that imply lack of partiality and "choice by chance" go back many centuries. Stigler (1977) researched the procedure known as "the trial of the Pyx," a sampling inspection procedure that has been in existence at the Royal Mint in London for almost eight centuries. Over a period of time, a coin would be taken daily from the batch that had been minted and placed in a box called the Pyx. At intervals, sometimes separated by as much as 3 or 4 years, the Pyx was opened and the contents checked and assayed in order to ensure that the coinage met the specifications laid down by the Crown.
Stigler quotes Oresme on the procedure followed in 1280:

When the Master of the Mint has brought the pence, coined, blanched and made ready, to the place of trial, e.g. the Mint, he must put them all at once on the counter which is covered with canvas. Then, when the pence have been well turned over and thoroughly mixed by the hands of the Master of the Mint and the Changer, let the Changer take a handful in the middle of the heap, moving round nine or ten times in one direction or the other, until he has taken six pounds. He must then distribute these two or three times into four heaps, so that they are well mixed. (Stigler, 1977, p. 495)
The Master of the Mint was allowed a margin of error called the remedy and had to make good any deficit that was discovered. Although mathematical statistics played no part in these tests, and if they had they would have been more precise, the procedure itself mirrors modern practice.¹
The trial of the Pyx even in the Middle Ages consisted of a sample being drawn, a null hypothesis (the standard) to be tested, a two-sided alternative, and a test statistic and a critical region (the total weight of the coins and the remedy). The problem even carried with itself a loss function which was easily interpretable in economic terms. (Stigler, 1977, p. 499)
Random selections may be made from real, existent universes, for example, the population of Ontario, or from hypothetical universes, for example, individuals who, over a period of time, have taken a particular drug as part of a clinical trial. In the latter case, to talk of "selection" in any real sense of the word is stretching credulity, but we do use the results based on the individuals actually examined to make inferences about the potential population that may be given the drug. In the same way, the samples of convenience that are used in psychological research are hardly ever selected randomly in the formal sense. Undergraduate student volunteers are not labeled as automatically constituting random samples, but they are often assumed to be unbiased with respect to the dependent variables of interest, an assumption that has produced much criticism. This latter statement emphasizes the fact that a sampling method, whatever it is, relates to the universe under study and the particular dependent variable or variables of interest. A questionnaire asking about attitudes to health and fitness given to members of the audience at, say, a symphony concert may well be generalizable to the population of the city, but the same instrument given to the annual convention of a weight watchers' club would not. These statements seem to be so obvious and yet, as we shall see, overlooking possible sources of bias either by accident or design has led to some expensive fiascos. Unfortunately, the method of sampling can never be assuredly independent of the variable under study. Kendall and Babington Smith (1938) note that "The assumption of independence must therefore be made with more or less confidence on a priori grounds. It is part of the hypothesis on which our ultimate expression of opinion is based" (p. 152).
Kendall and Babington Smith comment on the use of "random" digits in random sampling, and it is worth examining these applications because most of the present-day statistical cookbooks at least pay lip-service to the procedures by including tables of random numbers and some instruction as to their use. Individual units in the population are numbered in some convenient way, and
¹ Stigler notes that the remedy was specified on a per pound basis and that all the coin weights were combined. This, together with the central limit theorem, almost guarantees that the Master would not exceed the remedy.
then numbers, taken from the tables, are matched with the individuals to select the sample. This procedure may result in a sample of individuals numbered 1, 2, 3, 4, 5, 6, 7, 8, 9, or 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, groupings that may invite comment because they follow immediately recognizable orders but that nevertheless could be generated by a random selection. The fact that a random selection produces a grouping that looks to be biased or non-representative led to a great deal of debate in the 1920s and 1930s. The sequence 1, 4, 2, 7, 9, has the appearance of being random but the sequence 1, 4, 2, 7, 9, 1, 4, 2, 7, 9, has not. Finite sequences of random numbers are therefore only locally random. Even the famous tables of one million random digits produced by the RAND Corporation (1965) can only, strictly speaking, be regarded as locally random, for it may be that the sequence of one million was about to repeat itself. Random sequences may also be "patchy." Yule (1938b), for example, after examining Tippett's tables, which had been constructed by taking numbers at random from census reports, gained the impression that they were rather patchy and proceeded to apply further tests that gave some support to his view. Tippett (1925) used his tables (not then published) to examine experimentally his work on the distribution of the range. Simple tests for local randomness are outlined by Kendall and Babington Smith, tests that Tippett's tables had passed, and although much more extensive appraisals can be made today by using computers, these prescriptions illustrate the sort of criteria that should be applied. Each digit should appear approximately the same number of times; no digit should tend to be followed by any particular digit; there are certain expectations in blocks of digits with regard to the occurrence of three, four, or five digits that are all the same; there are certain expectations regarding the gaps in the sequence between digits that are the same. Tests of this sort do not exhaust the many that can be applied.
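The first two of these criteria can be sketched in a few lines. What follows is a simplified illustration, not a reproduction of Kendall and Babington Smith's own procedures: the function names are hypothetical, and the serial statistic uses overlapping adjacent pairs, which are not independent, so a chi-square reference distribution for it is only approximate.

```python
from collections import Counter


def frequency_chi2(digits):
    """Chi-square statistic for 'each digit appears about equally often'.

    Under uniformity it is referred to chi-square with 9 degrees of freedom.
    """
    expected = len(digits) / 10
    counts = Counter(digits)
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


def serial_chi2(digits):
    """Chi-square-type statistic for 'no digit tends to follow another'.

    Counts the overlapping adjacent pairs and compares them with a uniform
    expectation over the 100 possible ordered pairs (a simplified version).
    """
    pairs = Counter(zip(digits, digits[1:]))
    expected = (len(digits) - 1) / 100
    return sum(
        (pairs.get((a, b), 0) - expected) ** 2 / expected
        for a in range(10)
        for b in range(10)
    )


if __name__ == "__main__":
    repeating = [1, 4, 2, 7, 9] * 200   # the repeating sequence from the text
    cyclic = list(range(10)) * 100      # every digit equally often, fixed order
    print(frequency_chi2(repeating), serial_chi2(repeating))
    print(frequency_chi2(cyclic), serial_chi2(cyclic))
```

The cyclic series 0, 1, ..., 9 repeated passes the frequency criterion perfectly and fails the serial one badly, which is one way of seeing why no single test is sufficient.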
Whitney (1984) has recently noted that "It has been said that more time has been spent generating and testing random numbers than using them" (p. 129).
COMBINING OBSERVATIONS
The notion of using a measure of the average as an adequate and convenient summary or description of a number of data is an accepted part of everyday discourse. We speak of average incomes and average prices, of average speeds and average gas consumption, of average men and average women. We are referring to some middling, nonextreme value that we take as a fair representation of our observations. The easy use of the term does not reflect the logical problems associated with the justification for the use of particular measures of the average. The term itself, we find from the Oxford English Dictionary, refers, among other things, to notions of sharing labor or risk, so
old forms of the word refer to work done by tenants for a feudal superior or to shared risks among merchants for losses at sea. Modern conceptions of the word include the notion of a sharing or evening out over a range of disparate values. A large series of observations or measurements of the same phenomenon produces a distribution of values. Given the assumption that a single true value does, in fact, exist, the presence of different values in the series shows that there are errors. The assumption that over-estimations are as likely as under-estimations would provide support for the use of the middle value of the series, the median, as representing the true value. The assumption that the value we observe most frequently is likely the true value justifies the use of the mode, and the "evening out" of the different values is seen in the use of the arithmetic mean. It is this latter measure that is now the statistic of choice when observations are combined, and there are a number of strands in our history that have contributed to the justification for its use. The employment of the arithmetic mean and the use of the law of error have sound mathematical underpinnings in the Principle of Least Squares, which will be considered later. First, however, a somewhat critical look at the use of the mean is in order.
The Arithmetic Mean
In the 1755 Philosophical Transactions, and in a revision in Miscellaneous Tracts of 1757, Thomas Simpson argues the case for the arithmetic mean in An Attempt to Show the Advantage arising by Taking the Mean Of a Number of Observations in Practical Astronomy. These are valuable contributions that discuss, for the first time, measurement errors in the context of probability and point the way toward the idea of a law of facility of error. Simpson (1755) complains that "some persons, of considerable note, have been of opinion, and even publickly maintained, that one single observation, taken with due care, was as much to be relied upon as the Mean of a great number" (pp. 82-83).
In the revision, Simpson (1757) states as axioms that positive and negative errors are equally probable and that there are assignable limits within which errors can be taken to fall. He also diagrams the law of error as an isosceles triangle and shows that the mean is nearer to the true value than a single random observation. The claim of the arithmetic mean to be the best representation of a large body of data is often justified by appeal to the principle of least squares and the law of error. This is high theory, and in appealing to it there is a danger of overlooking the logic of the use of the mean. Simply, as John Venn (1891), who was mentioned in chapter 1, puts it:
Why do we resort to averages at all? How can a single introduction of our own, and that a fictitious one, possibly take the place of the many values which were actually given to us? And the answer surely is, that it can not possibly do so; the one thing cannot take the place of the other for purposes in general, but only for this or that specific purpose. (pp. 429-430)
This seemingly obvious statement is one that has frequently been ignored in statistics in the social sciences. Venn points out the different kinds of averages that can be used for different purposes and notes cases where the use of any sort of average is misleading. Edgeworth (1887) had provided an attempt to examine the mathematical justifications for the different averages and Venn and many others refer to this treatment. Edgeworth's paper also examines the validity of the least squares principle in this context. Venn illustrates his argument with some straightforward examples. If two people reckoned the distance of Cambridge from London to be 50 and 60 miles, in the absence of any information that would lead us to suspect either of the measures, one would guess that 55 miles was the probable distance. However, if one person said that someone they knew lived in Oxford and another that the individual lived in Cambridge, the most probable location would not be at some place in between. In the latter case, in the absence of any other information, one would have a chance at arriving at the truth by choosing at random. Edgeworth's papers on the best mean represent some of his most useful work. A particular service is rendered by his distinction between real or objective means and fictitious or subjective means. The former arise when we use the arithmetic mean as the true value underlying a group of measurements that are subject to error; the latter is a description of a set.
The mean of observations is a cause, as it were the source from which diverging errors emanate. The mean of statistics is a description, a representative of the group, that quantity which, if we must in practice put one quantity for many, minimises the error unavoidably attending such practice. Observations are different copies of one original; statistics are different originals affording one 'generic portrait.' (Edgeworth, 1887, p. 139)
This formal distinction is clear. However, because the mathematics of the analysis of errors and the manipulations of modern statistics rest on the same principles, the logic of inference is sometimes clouded. It is Quetelet who brought the mathematics of error estimation in physical measurement into the assessment of the dispersion of human characteristics. Clearly, something of the sort had been done before in the examination of mortality tables for insurance purposes, but we see Quetelet making a direct statement that only partially recognizes Edgeworth's later distinction:
Everything occurs then as though there existed a type of man, from which all other men differed more or less. Nature has placed before our eyes living examples of what theory shows us. Each people presents its mean, and the different variations from this mean which may be calculated a priori. (Quetelet, 1835/1849, p. 96)
We noted in chapter 1 that it was Quetelet's view that the average value of mental and moral, as well as physical, characteristics represented the ideal type, l'homme moyen (the average man) for which Nature was constantly striving. Quetelet's own pronouncements were somewhat inconsistent in that he sometimes promoted the view of the "average being" as a universal biological type and at other times suggested that the average differed across groups and circumstances. Nevertheless, his methods were the forerunners of work that attempts to establish "norms" in biology, anthropology, and the psychology of individual differences. One of the requirements of a "good" psychometric test is that it be accompanied by norms. These norms provide standards for comparisons across individual test scores and, just as Quetelet's characterization of the average as the ideal type aroused opposition and controversy, so the establishment of national norms and sub-group norms and racial norms produces heated debate today.
The Principle of Least Squares
In its best-known form, this famous principle states that the sum of squared differences of observations from the mean of those observations is a minimum; that is to say, it is smaller than the sum of squared differences from any other reference point. Legendre (1752-1833) announced the Principle of Least Squares in 1805, but in Theoria Motus Corporum Coelestium, published in 1809, Gauss (1777-1855), discussing the method, refers to his work on it in 1795 (when he was a 17-year-old student preparing for his university studies). This claim to priority upset Legendre and led to some bitter dispute. Today the method is most frequently associated with Gauss, who is often identified as the greatest mathematician of all time. That the method very quickly became a topic of much commentary and discussion may be demonstrated by the fact that in the 70 or so years following Legendre's publication no less than 193 authors produced 72 books, 23 parts of books, and 313 memoirs relating to it.
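The principle in its best-known form is easy to check numerically. The observations in this small sketch are hypothetical; it simply compares the sum of squared differences taken about the mean with the sums taken about nearby reference points.

```python
def sum_sq(observations, reference):
    """Sum of squared differences of the observations from a reference point."""
    return sum((x - reference) ** 2 for x in observations)


observations = [50, 60, 55, 52, 58]            # hypothetical measurements
mean = sum(observations) / len(observations)   # arithmetic mean

if __name__ == "__main__":
    for reference in (mean - 2, mean - 1, mean, mean + 1, mean + 2):
        print(reference, sum_sq(observations, reference))
```

For these figures the sum about the mean is the smallest of the five, and each departure of the reference point from the mean adds the number of observations multiplied by the squared departure.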
Merriman (1877) provides a list of these titles together with some notes. Faced with such sources, to say nothing of the derivations, papers, commentaries, and monographs published in the last 110 years, the following represents perhaps one of a dozen ways of commenting on its origins. Using the mean to combine a set of independent observations was a technique that had been used in the 17th century. Gauss later examined the problem of selecting from a number of possible ways of combining data the one that produced the least possible uncertainty about the "true value." Gauss noted
that in the combination of astronomical observations the mathematical treatment depended on the method of combination. He approached the problem from the standpoint that the method should lead to the cancellation of random errors of measurement and that, as there was no reason to prefer one method over another, the arithmetic mean should be chosen. Having appreciated that the "best," in the sense of "most probable," value could not be known unless the distribution of errors was known, he turned to an examination of the distribution, which gave the mean as the most probable value. Gauss's approach was based on practical considerations, and because the procedures he examined did produce workable solutions in astronomical and geodetic observations, the method was vindicated. In fact, the principle of least squares, as Gauss himself noted, can be considered independently of the calculus of probabilities. If we have n observations $X_1, X_2, X_3, \ldots, X_n$ from a population with a mean $\mu$, what is the least squares estimate of $\mu$? It is the value of $\mu$ that minimizes
$$\sum_{i=1}^{n} (X_i - \mu)^2 .$$
Now:
$$\sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} \left[(X_i - \bar{X}) + (\bar{X} - \mu)\right]^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2 ,$$
where $\bar{X}$ is the mean of the n observations.
Clearly the right-hand side of the last expression is at a minimum when $\bar{X} = \mu$, which demonstrates the principle. This easily obtained result provides a rationale for estimating the population mean from the sample mean that is intuitively sensible. The law of error enters the picture when we consider the arithmetic mean as the most probable result. In this case we find that the law is in fact given by the normal distribution, often referred to as the Laplace-Gaussian distribution. It has been stated on more than one occasion that Laplace assumed the normal law in arriving at the mean to provide what he described as the most advantageous combination of observations. Laplace certainly considered the case when positive and negative errors are equally likely, and these cases rest on the error law being what we now call "normal" in form. The threads of the argument are not always easy to disentangle, but one of the better accounts, for those who are prepared to grapple with a little mathematics, was given by Glaisher as long ago as 1872. The crucial point is that of the rationale for the two fundamental constructs of statistics,
the mean, $\bar{X}$, and the variance, $\sum (X - \bar{X})^2 / n$. The essential fact is easily seen. Given that a distribution of observations is normal in form, and given that we know the mean and the variance of the distribution, then the distribution is completely and uniquely specified. All its properties can be ascertained given this information. In the context of statistics in the social sciences, both the normal law and the least squares principle are best understood in the context of the linear model. The model encompasses these constructs and brings together in a formal sense the mathematics of the combination of observations developed for use in error estimation and mathematical statistics as exemplified by analysis of variance and regression analysis.
Representation and Bias
The earliest examples of the use of samples in social statistics we have seen in the work of Graunt, Petty, Halley, and the early actuaries. These samples were neither random nor representative, and mistaken inferences were plentiful. In any event, it was not until much later, when attempts were made to collect information on populations, inferential exercises repeated, and the results compared, that critical examinations of the techniques employed could be made. Stephan (1948) reports that Sir Frederick Morton Eden estimated the population of Great Britain in 1800 to be about 9,000,000. This estimate, which was based on the number of births and the average number of inhabitants in each house (a number that was obtained by sampling), was confirmed by the first British census in 1801. Earlier attempts to estimate populations had been made in France, and Laplace in 1802 made an attempt to do so that followed a scheme he had devised and published in 1786, a scheme that included a probability measure of the precision of the estimate. Specifically, Laplace averred that the odds were 1,161 to 1 that the error would not reach half a million. Westergaard (1932) provides some more details of these exercises.
Elsewhere in Europe and in the United States the 19th century saw various censuses conducted, as well as attempts to estimate the size of the population from samples. In the United States the Constitution provided for a census every 10 years in order to determine Congressional representation, but a Bill introduced into the British Parliament in 1753 to establish an annual census was defeated in the House of Lords. It appears that the probability mathematicians and the rising group of political arithmeticians never joined forces in the 19th century. The latter favored the institution of complete censuses and, generally speaking, were not mathematicians, and the former were scientists who had many other problems
to test their mettle. Almost 100 years passed before scientific sampling procedures were properly investigated. Stephan (1948) lists four areas where modern sampling methods could have been used to advantage: agricultural crop and livestock estimates, economic statistics, social surveys and health surveys, and public opinion polls. The last will be considered here in a little more detail because it is in this area that the accuracy of forecasts is so often and so quickly assessed and breakdowns in sampling procedures detected with the benefit of hindsight. The Raleigh Star in Raleigh, North Carolina, conducted "straw votes" as early as 1824, covering political meetings and trying to discover the "sense of the people." By the turn of the century, many newspapers in the United States were regularly conducting opinion polls, a common method being merely to invite members of the public to clip out a ballot in the paper and to mail it in. The same basic procedure was followed by all the publications. Then the large circulation magazines, notably Literary Digest, began to mail out ballots to very large numbers of people, sometimes as many as 11,000,000. In 1916 this publication correctly predicted the election of Wilson to the presidency and from then until 1936 enjoyed consistent and much admired success. Its predictions were very accurate. For example, in 1932 its estimate of the popular vote for each candidate in the presidential election came to within 1.4% of the actual outcome. In 1936 came disaster. For years Literary Digest had conducted polls on all kinds of issues, mailing out millions of ballots, at considerable expense, to telephone subscribers and automobile owners. In 1936 the magazine predicted that Alfred M. Landon would win the presidency on the basis of the return of over 2,300,000 replies from over 10,000,000 mailed ballots. The record shows that Franklin D. Roosevelt won the presidency with one of the largest majorities in American presidential history.
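A mailing list confined to telephone subscribers and automobile owners is a sampling frame, and a frame can differ systematically from the electorate it is meant to represent. The following minimal simulation is a sketch only: the population size, the share of voters in the frame, and the levels of support are all hypothetical and are not estimates of the 1936 figures. It contrasts a very large sample drawn from a restricted frame with a much smaller simple random sample drawn from the whole population.

```python
import random


def build_population(size=100_000, affluent_share=0.30,
                     support_affluent=0.40, support_other=0.65):
    """Return a list of (in_frame, favors_candidate) pairs.

    Every parameter is hypothetical and chosen only for illustration.
    """
    n_aff = round(size * affluent_share)
    n_oth = size - n_aff
    aff_yes = round(n_aff * support_affluent)
    oth_yes = round(n_oth * support_other)
    return ([(True, True)] * aff_yes + [(True, False)] * (n_aff - aff_yes)
            + [(False, True)] * oth_yes + [(False, False)] * (n_oth - oth_yes))


def support_share(voters):
    """Proportion of the given voters who favor the candidate."""
    return sum(1 for _, favors in voters if favors) / len(voters)


rng = random.Random(1936)                    # fixed seed for repeatability
population = build_population()
frame = [v for v in population if v[0]]      # only voters on the mailing list

true_share = support_share(population)
big_biased = support_share(rng.sample(frame, 20_000))
small_random = support_share(rng.sample(population, 2_000))

if __name__ == "__main__":
    print("true share:", true_share)
    print("20,000 drawn from the restricted frame:", big_biased)
    print("2,000 drawn at random from everyone:", small_random)
```

Under these assumptions the larger sample settles closely on the wrong value, because increasing the number of replies reduces chance error but does nothing about a frame that excludes part of the population.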
The reasons for this disastrous mistake are now easy to see. Prior to 1936, preference for the two major political parties in the United States was not related to level of income. In that year it seems that it was. The telephone subscribers and car owners (and in 1936 these were the rather more affluent) who had received the Literary Digest ballots were, in the main, Republicans. In 1937 the magazine ceased publication. Crum (1931) and Robinson (1932, 1937) gave commentaries on some of these early polls. Fortune magazine fared no better in 1948, underestimating the vote for the Democrats and Harry Truman by close to 12% and, of course, failing to pick the winner. In 1936, with a much, much smaller sample than that of Literary Digest (less than 5,000), it had forecast Roosevelt's vote to within 1%. The explanation for its failure in 1948 rests with the swing of both decided and undecided voters between the September poll and the November election and failure to correct for a geographic sampling bias. Parten (1966) has some
commentary on these and other polls. One of the most successful polling organizations, the American Institute of Public Opinion, headed by George Gallup, began its work in 1933. But the Gallup poll also predicted a win for Dewey over Truman in 1948, and many people have seen the Life photograph of a victorious president holding the Chicago Daily Tribune with its famous Type I error headline, "Dewey defeats Truman." The result produced one of the first claims that the polls influenced the outcome, inducing complacency in the Republicans and a small turnout of their supporters, leading to the defeat of the Republican candidate. Today there is much controversy over the conducting and the use of polls. They will survive because in general they are correct much more often than not. Their success is due to the development of refined sampling techniques.
SAMPLING IN THEORY AND IN PRACTICE
Chang (1976) gives a quite thorough review of inferential processes and sampling theory, and it is worth sketching in some of its development in the context of survey sampling. However, from the standpoint of statistics and sampling in psychology, there is no doubt but that the rationale of sampling procedures for hypothesis testing rather than parameter estimation is of greater import. The two are related. The early political arithmeticians held to the view that statistical ratios - for example, males to females, average number of children per family, and so on - were approximately constant, and, as a result, proceeded to draw inferences from figures collected in a single town or parish to whole regions and even countries. The early 19th century saw the introduction of the law of error by Gauss and Laplace and an awareness that variability was an important consideration in the assessment of data.
Populations are now defined by two parameters, the mean and the variance. One of the earliest attempts to put a measure of precision onto a sampling exercise was that of Laplace in 1802. The sample he used was not random, although he appears to have assumed that it was. Communes distributed across France were chosen to balance out climatic differences. Those having mayors known for their "zeal and intelligence" were also selected so that the data would be the most precise. Laplace also assumed that birth rate was homogeneous across the French population, exactly the sort of unwarranted assumption that was made by the early political arithmeticians. Nevertheless, Laplace estimated the population total from his figures and, appealing to the central limit theorem (which he had discussed in 1783), approximated the distribution of estimation errors to the normal curve. Survey sampling for all kinds of social investigations owes a great deal to Sir Arthur Lyon Bowley (1869-1957). In his time he was recognized as a
pioneer in the definition of sampling techniques, and his methods and assumptions were the subject of much debate. In the event, some of his approaches were found to be defective, but his work focussed attention on the problem. In 1926 he summarized the theory of sampling, and a short paper of 1936 outlines the application of sampling to economic and social problems. He served on numerous official committees that investigated the economic state of the nation, social effects of unemployment and so on, and worked directly on many surveys. Maunder (1972) has written a memoir that points up Bowley's contributions, contributions that were somewhat overshadowed by the work of his contemporaries Pearson, Fisher, and Neyman. He was a calm and courteous man, enormously concerned with social issues, and he occupied, at the London School of Economics, the first Chair devoted to statistics in the social sciences. Bowley (1936) defines the sampling problem simply:
We are here concerned . . . with the investigation of the numerical structure of an actual and limited universe, or "population" which is the better word for our purpose. Our problems are quite definitely to infer the population from the sample. The problem is strictly analogous to that of estimating the proportion of the various colours of balls in a limited urn on the basis of one or more trial draws. (pp. 474-475)
In the early years of this century, Bowley began to examine both the practice and theory of survey sampling. His work helped to highlight the utility of probability sampling of one form or another. Systematic selection was adopted and advocated by A. N. Kiaer (1838-1919), Director of the Norwegian Bureau of Statistics, in his examinations of census data, but the majority of influential statisticians, represented by the International Statistical Institute, rejected sampling, pressing for complete enumeration. It took almost 30 years for the utility and benefits of the methods to be appreciated. Seng (1951) and Kruskal and Mosteller (1979) give accounts of this most interesting period in statistical history. The latter authors give a translation and paraphrase of the remarks of Georg von Mayer, Professor at the University of Munich, on Kiaer's work on the representative method, which was presented at a meeting of the Institute in Berne in 1895:
I regard as most dangerous the point of view found in his work. I understand that representative samples can have some value, but it is a value restricted to terrain already illuminated by full coverage. One cannot replace by calculation the real observation of facts. A sample provides statistics for the units actually observed, but not true statistics for the entire terrain. It is especially dangerous to propose representative sampling in the midst of an assembly of statisticians. Perhaps for legislative or administrative goals sampling may have uses - but one must never forget that it cannot replace a complete survey. It is necessary to add that there is among us these days a current in the minds of mathematicians that would, in many ways, have us calculate rather than observe. We must remain firm and say: no calculations when observations can be made. (von Mayer, quoted by Kruskal & Mosteller, 1979, pp. 174-175)
Oddly enough, Kiaer's work is not mathematical in the sense that modern methods of parameter estimation are mathematical. At the time those methods were not fully delineated nor understood. Kiaer aimed, by a variety of techniques, to produce a miniature of the population, although he noted as early as 1899 the necessity for the investigation of both the practical and theoretical aspects of his methods. At a meeting in 1901 (a report of which was published in 1903), Kiaer returned to the theme and it was in a discussion of his contribution that L. von Bortkiewicz suggested that the "calculus of probabilities" could be used to test the efficacy of sampling. By establishing how much of a difference between sample and population could be obtained accidentally and checking whether or not an observed difference lay outside those limits, the representativeness of the sample could be decided on. Bortkiewicz did not, apparently, formulate all the necessary tests, and others had employed this method, but he seems to have been the first to draw the attention of practicing statisticians to the possibilities. In 1903, Kiaer must have thought that the sampling argument was won, for a subcommittee of the International Statistical Institute proposed a resolution at the Berlin meeting:
The Committee, considering that the correct application of the representative method, in a certain number of cases, can furnish exact and detailed observations from which the results can be generalized, within certain limits, recommends its use, provided that in the publication of the results the conditions under which the selection of the observation units is made are completely specified. The question will be kept on the agenda, so that a report may be presented in the next session on the application of the method in practice and on the value of the results arrived at. (quoted by Seng, 1951, p. 230)
What is more, a discussant at the meeting, the French statistician Lucien March, returned to the ideas that had been put forward by Bortkiewicz and outlined some of the basics of probability sampling (see Kruskal & Mosteller, 1979, for a short summary of this presentation). The way ahead seemed clear. In fact the question was, for all intents and purposes, shelved for more than 20 years, and it was not until 1925, at the Rome session of the Institute, that the advantages of the sampling method were fully recognized. This was in no small way due to the theoretical work of Bowley. Bowley had suggested in his Presidential address to the Economic Science and Statistical Section of the British Association as early as 1906 that a systematic approach to the problem
of sampling would bear fruit:
In general, two lines of analysis are possible: we may find an empirical formula (with Professor Karl Pearson) which fits this class of observations [Bowley is referring to data that may not be normally distributed], and by evaluating the constants determine an appropriate curve of frequency, and hence allot the chances of possible differences between our observation and the unknown true value; or we may accept Professor Edgeworth's analysis of the causes which would produce his generalised law of great numbers, and determine a priori or by experiment whether this universal law may be expected or is to be found in the case in question. (Bowley, 1906, pp. 549-550)
Edgeworth's method is based on the Central Limit Theorem, and Bowley explains its utility clearly and simply:
If quantities are distributed according to almost any curve of frequency, . . . the average of successive groups of . . . these conform to a normal curve (the more and more closely as n is increased) whose standard deviation diminishes in inverse ratio to the number in each sample . . . If we can apply this method . . . , we are able to give not only a numerical average, but a reasoned estimate for the real physical quantity of which the average is a local or temporary instance. (Bowley, 1906, p. 550)
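The behaviour Bowley describes can be seen in a short simulation. This is a sketch under stated assumptions: the exponential distribution is an arbitrary choice of a markedly skewed "curve of frequency," and the sample sizes and number of replications are hypothetical. Under the usual form of the theorem the standard deviation of the averages falls in proportion to the square root of the number in each sample.

```python
import random
import statistics


def sd_of_sample_means(n, replications, rng):
    """Standard deviation of the means of repeated samples of size n.

    Observations come from an exponential distribution with mean 1 and
    standard deviation 1, chosen here only because it is strongly skewed.
    """
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(replications)
    ]
    return statistics.pstdev(means)


if __name__ == "__main__":
    rng = random.Random(1906)        # fixed seed for repeatability
    for n in (4, 16, 64):
        # Theory suggests values near 1 / sqrt(n): 0.5, 0.25, 0.125
        print(n, round(sd_of_sample_means(n, 4000, rng), 3))
```

A histogram of the simulated means for the larger sample sizes would also look far more symmetrical than the parent distribution, which is the other half of Bowley's point.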
The procedure demands random sampling:
The chances are the same for all the items of the groups to be sampled, and the way they are taken is absolutely independent of their magnitude. It is frequently impossible to cover a whole area as the census does, . . . but it is not necessary. We can obtain as good results as we please by sampling, and very often quite small samples are enough; the only difficulty is to ensure that every person or thing has the same chance of inclusion in the investigation. (Bowley, 1906, pp. 551-553)
THE THEORY OF ESTIMATION
Over the next 20 years, Bowley and his associates completed a number of surveys, and his theoretical researches produced The Measurement of the Precision Attained in Sampling in 1926. This paper formed part of the report of the International Statistical Institute, which recommended and drew attention to the methods of random selection and purposive selection: "A number of groups of units are selected which together yield nearly the same characteristics as the totality" (p. 2). The report does not directly address the method of stratified sampling, even though the technique had been in general use. This procedure received close attention from Neyman in his paper of 1934. Bowley had attempted to present a theory of purposive sampling in his 1926 report.
A distinctive feature of this method, according to Bowley, was that it was a case of cluster sampling. It was assumed that the quantity under investigation was correlated with a number of characters, called controls, and that the regression of the cluster means of the quantity on those of each control was linear. Clusters were to be selected in such a way that average of each control computed from the chosen clusters should (approximately) equal its population mean. It was hoped that, due to the assumed correlations between controls and the quantity under investigation, the above method of selection would result in a representative sample with respect to the quantity under investigation. (Chang, 1976, pp. 305-306)
Unfortunately, a practical test of the method (Gini, 1928) proved unsatisfactory and Neyman's analysis concluded that it was neither a consistent nor an efficient procedure. As Neyman pointed out, the problem of sampling is the problem of estimation. The first forays into the establishment of a theory had been made by Fisher (1921a, 1922b, 1925b), but the manner of sampling had received little attention from him. The method of maximum likelihood, which rested entirely on the properties of the distribution of observations, gave the most efficient estimate. Any appeal to the properties of the a priori distribution, the Bayesian approach, was rejected by Fisher. Neyman attempted to clarify the situation:
We are interested in characteristics of a certain population, say, $\pi$, ... it has been usually assumed that the accurate solution of such a problem requires the knowledge of probabilities a priori attached to different admissible hypotheses concerning the values of the collective characters [the parameters] of the population $\pi$. (Neyman, 1934, p. 561)
He then turns to Bowley's work, noting that when the population $\pi$ is known, then questions about the sort of samples that it could produce can be answered from "the safe ground of classical theory of probability" (p. 561). The second question involves the determination, when we know the sample, of the probabilities a posteriori to be ascribed to hypotheses concerning the populations. Bowley's conclusions are based:
on some quite arbitrary hypotheses concerning the probabilities a priori, and Professor Bowley accompanies his results with the following remark: "It is to be emphasized that the inference thus formulated is based on assumptions that are difficult to verify and which are not applicable in all cases." (Neyman, 1934, p. 562)
Neyman then suggests that Fisher's approach (that involving the notion of fiducial probability, although Neyman does not use the term) "removes the difficulties involved in the lack of knowledge of the a priori probability law" (p. 562). He further suggests that these approaches have been misunderstood, due, he thinks, to Fisher's condensed form of explanation and difficult method of attacking the problem:
The form of the solution consists in determining certain intervals which I propose to call confidence intervals ..., in which we may assume are contained the values of the estimated characters of the population, the probability of the error in a statement of this sort being equal to or less than $1 - \varepsilon$, where $\varepsilon$ is any number $0 < \varepsilon < 1$, chosen in advance. The number $\varepsilon$ I call the confidence coefficient. (Neyman, 1934, p. 562)
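Neyman's statement is about the long-run behavior of the interval-making procedure, and that can be sketched numerically. The example below is an assumption-laden illustration, not Neyman's own derivation: it supposes a normal population with known standard deviation, uses the familiar interval of the sample mean plus or minus z standard errors, and counts how often the interval contains the true mean. The parameter `level` plays the role of Neyman's confidence coefficient $\varepsilon$; all names and numerical values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def coverage(mu=50.0, sigma=10.0, n=25, level=0.95, trials=4000, seed=7):
    """Proportion of intervals (mean +/- z * sigma / sqrt(n)) that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        mean = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if mean - half_width <= mu <= mean + half_width:
            hits += 1
    return hits / trials


# The observed proportion should lie close to the chosen coefficient.
print(coverage(level=0.95))
```

The proportion of erroneous statements, one minus the printed value, should be close to $1 - \varepsilon$, which is the property Neyman describes.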
Neyman's comments on Fisher's ability to explain his view produced, in the discussion of his paper, the first (mild) reaction from Fisher. Subsequent reactions to Neyman's work and that of his collaborator, Egon Pearson, became increasingly vitriolic. The report reads:
Dr Fisher thought Dr Neyman must be mistaken in thinking the term fiducial probability [Neyman had used the term "confidence coefficient"] had led to any misunderstanding; he had not come upon any signs of it in the literature. When Dr Neyman said "it really cannot be distinguished from the ordinary concept of probability," Dr Fisher agreed with him ... He qualified it from the first with the word fiducial ... Dr Neyman qualified it with the word confidence. The meaning was evidently the same, and he did not wish to deny that confidence could be used adjectivally. They were all too familiar with it, as Professor Bowley had reminded them, in the phrase "confidence trick." (discussion on Dr Neyman's paper, 1934, p. 617)
From the standpoint of the familiar statistical procedures found in our texts, this paper is important for its treatment of confidence intervals and its emphasis on the importance of random sampling. It extended estimation from so-called point estimation, the use of a sample value to infer a population value, to interval estimation, which assesses the probability of a range of values. Neyman demonstrates the use of the Markov² method for deriving the best linear unbiased estimators. It also contains other important ideas, in particular, a discussion of the methods of stratified sampling and appropriate statistical models for it. Neyman's paper marks a new era in both the method and theory of sampling, although, at the time, it was its treatment of the problem of estimation that received the most attention. In a sense it complemented and supplemented the work of Ronald Fisher that was going on at Rothamsted, but it became evident that Fisher did not quite see it that way.
² Andrei Andreyevich Markov (or Markoff, 1856-1922) is best known for his studies of the probabilities of linked chains of events. Markov chains have been used in a variety of social and biological studies in the last 30 or 40 years. But Markov made many contributions to probability theory. If we have a non-negative random variable X, then regardless of its distribution, for any positive number c (i.e., c > 0), the probability, $P_X(X > c\mu)$, that the random variable X is greater than c times its expected value $\mu_X = \mu$ does not exceed 1/c. That is, $P_X(X > c\mu) \leq 1/c$. This is known as the Markov Inequality. Markov was a student of Pafnuti L. Tchebycheff (sometimes spelled Chebychev or Chebichev, 1821-1894), who formulated the Tchebycheff Inequality, which states that $P_X(X \leq \mu - d\sigma \text{ or } X \geq \mu + d\sigma) \leq 1/d^2$, where X is a random variable with expected value $\mu$ and variance $\sigma^2$, and d > 0. This result was independently arrived at by the French mathematician I. J. Bienaymé (1796-1876). These inequalities are important in the development of the descriptions of the properties of probability distributions.
THE BATTLE FOR RANDOMIZATION
There is no doubt that the requirement that samples should be randomly drawn was thought of by the survey-makers as a protection against selection bias. And there is also no doubt that when sample size is large it affords such protection, but not, it must be stressed, a guarantee.
In agricultural research, it had long been recognized that reduction of experimental error was of critical importance. At Rothamsted two methods were available: repeating the experiments over many years, and multiplying the number of plots on a field. Mercer and Hall (1911) discuss the problem in considerable detail and give suggestions for arranging the plots so that they may be "scattered." This was the approach that was abandoned, although not immediately, when Fisher started his important work at the Station. Eventually, for Fisher and his coworkers the argument for randomization had a quite different motive from that of trying to obtain a representative sample, one that is crucial for an appreciation of the use of statistics in psychology. Fisher, although a brilliant mathematician, was a practical statistician, and his approach to statistics can only be understood through his work on the design of experiments and the analysis of the resultant data. The core of Fisher's argument rests on the contention that the value of an experiment depends on the valid estimation of error, an argument that everyone would agree with. But how was the estimate to be made?
In nearly all systematic arrangements of replicated plots care is taken to put the unlike plots as close together as possible, and the like plots consequently as far apart as possible, thus introducing a flagrant violation of the conditions upon which a valid estimate is possible. One way of making sure that a valid estimate of error will be obtained is to arrange the plots deliberately at random. The estimate of error is valid, because, if we imagine a large number of different results obtained by different random arrangements, the ratio of the real to the estimated error, calculated afresh for each of these arrangements, will be actually distributed in the theoretical distribution by which the significance of the result is tested. Whereas if a group of arrangements is chosen such that the real errors in this group are on the whole less than those appropriate to random arrangements, it has now been demonstrated that the errors, as estimated, will, in such a group, be higher than is usual in random arrangements, and that, in consequence, within such a group, the test of significance is vitiated. (Fisher, 1926b, pp. 506-507)
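The contrast Fisher draws can be illustrated with a toy simulation. The sketch below is a hypothetical construction, not Fisher's demonstration: it assumes a strip of 10 plots whose fertility rises by one unit per plot, two treatments with no real effect, and an ordinary pooled two-sample t test at the 5% level (critical value 2.306 for 8 degrees of freedom). Under the alternating systematic layout the unlike plots sit side by side, the estimated error is much larger than the real error of the comparison, and the test almost never rejects; under random arrangement the rejection rate stays near the nominal level.

```python
import random
from statistics import fmean, variance

T_CRIT = 2.306  # two-sided 5% point of t with 8 degrees of freedom


def t_statistic(a, b):
    """Pooled-variance two-sample t for two groups of equal size."""
    n = len(a)
    pooled = (variance(a) + variance(b)) / 2
    return (fmean(a) - fmean(b)) / (2 * pooled / n) ** 0.5


def systematic(rng):
    """Alternating layout: unlike plots are always neighbors."""
    return list("ABABABABAB")


def randomized(rng):
    """Five plots per treatment, allotted at random."""
    labels = list("AAAAABBBBB")
    rng.shuffle(labels)
    return labels


def rejection_rate(arrange, trials=4000, seed=3):
    """Share of null experiments declared significant under a layout rule."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Assumed field: fertility rises one unit per plot, plus small noise;
        # neither treatment has any real effect.
        yields = [plot + rng.gauss(0, 0.5) for plot in range(10)]
        labels = arrange(rng)
        a = [y for y, lab in zip(yields, labels) if lab == "A"]
        b = [y for y, lab in zip(yields, labels) if lab == "B"]
        if abs(t_statistic(a, b)) > T_CRIT:
            rejections += 1
    return rejections / trials


print("systematic:", rejection_rate(systematic))
print("randomized:", rejection_rate(randomized))
```

A test that rejects far less often than its stated level is no longer testing at that level, which is one reading of Fisher's remark that the test of significance is vitiated.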
The contribution of Fisher that is overwhelmingly important is the development of the t and z tests and the general technique of analysis of variance. The essence of these procedures is that they provide estimates of error in the observations and the application of tests of statistical significance. These methods were not immediately recognized as being useful for larger-scale sample surveys, and it was partly the work of Neyman (mentioned earlier) and others in the mid-1930s that, ironically, introduced them to this area.
Arguments about randomized versus systematic designs began in the mid-1920s. Mostly they revolved around the issue of what to do when there was an unwanted assignment of treatments to experimental units, that is, when the assignment had a pattern that the researcher either knew or suspected might confound the treatments. Fisher argued very strongly against the use of systematic designs, on the basis of theory, but his argument was not wholly consistent. Some had suggested that if a random design produced a pattern then it should be discarded and another random assignment drawn. Of course, the subjectivity introduced by this sort of procedure is precisely that of the deliberate balancing of the design. And how many draws might one be allowed?
Most experimenters on carrying out a random assignment of plots will be shocked to find out how far from equally the plots distribute themselves ... if the experimenter rejects the arrangement arrived at by chance as altogether "too bad," or in other ways "cooks" the arrangement to suit his preconceived ideas, he will either (and most probably) increase the standard error as estimated from the yields; or, if his luck or his judgment is bad, he will increase the real errors while diminishing his estimate of them. (Fisher, 1926b, pp. 509-510)
But even Fisher never quite escaped the difficulty. Savage (1962) talked with him:
"What would you do," I had asked, "if, drawing a Latin Square at random for an experiment, you happened to draw a Knut Vik square?" Sir Ronald said he thought he would draw again and that, ideally, a theory explicitly excluding regular squares should be developed. (Savage, 1962, p. 88)
Students and teachers cursing their way through statistical theory and practice should take some comfort from the inconsistencies expressed by the master.
The debate reached its height in argument between "Student" and Fisher. "Student" consistently advocated systematic arrangements. In a letter to Egon Pearson (Pearson, 1939a) written shortly before his death in 1937, "Student" comments on work by Tedin (1931) that had examined the outcomes when systematic, as opposed to random, Latin squares were used in experimental designs. The "Knight's move" Latin square he prefers above all others: "It is interesting as an illustration of what actually happens when we depart from artificial randomisation: I would Knight's move every time!" (quoted by E. S. Pearson, 1939a, p. 248).³
Over the previous year a series of papers, letters, and letters of rebuttal had come forth from "Student" and Fisher. "Student" was adamant to the end, and Fisher reiterated his claim that valid error estimates cannot be computed in arranged designs and that in such cases the test of significance is made ineffective. Picard (1980) describes and discusses the argument and also examines the contributions of Pearson and Yates, who had succeeded Fisher at Rothamsted. Others entered the debate. Jeffreys (1939) is puzzled:
Reading "Student's" paper [of 1937] and Fisher's Design of Experiments I find myself in almost complete agreement with both; and I should therefore have expected them to agree with each other. But it seems to me that "Student" is wrong in regarding Fisher as an advocate of extreme randomness, and possibly Fisher has not sufficiently emphasized the amount of system in his methods. (Jeffreys, 1939, p. 5)
Jeffreys makes the point that omitting to take account of relevant information makes avoidable errors:
The best procedure is to design the work so as to determine it [the error] as accurately as possible and not to leave it to chance whether it can be determined at all .... The hypothesis is a considered proposition ... The argument is inductive and not deductive; it is not dealt with by considering an estimable error that has nothing to do with it. (Jeffreys, 1939, p. 5)
As ever, Jeffreys' argument is a paragon of logic, and it notes that Fisher's advice to balance or eliminate the larger systematic effects as accurately as possible and randomize the rest "sums up the situation very well" (p. 7). This is the prescription that the design of experiments follows today. E. S. Pearson (1938b) attempted to expand on and to clarify "Student's" stand, but he clearly understood the view of the opposition. Nevertheless, he concluded that balanced layouts could give some slight advantage.
³ An illustration of "Two Knight's moves" would be
D E A B C
B C D E A
E A B C D
C D E A B
A B C D E
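The layout in the footnote can be checked mechanically. The following sketch is an illustration with hypothetical helper names: it confirms that the 5 x 5 arrangement is a Latin square (every letter once in each row and each column) and shows the regularity that makes it systematic, namely that each row is the row above it moved two columns to the right, the knight's step of one down and two across. The last row is read here as ABCDE, on the assumption that "ABODE" in the extracted text is a scanning error.

```python
# Rows of the footnote's square; last row assumed to be ABCDE.
SQUARE = ["DEABC", "BCDEA", "EABCD", "CDEAB", "ABCDE"]


def is_latin_square(rows):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(rows)
    symbols = set(rows[0])
    if len(symbols) != n or any(len(r) != n for r in rows):
        return False
    columns = ["".join(r[i] for r in rows) for i in range(n)]
    return all(set(line) == symbols for line in rows + columns)


def row_shifts(rows):
    """Columns by which each row is cyclically shifted right to give the next.

    Returns None for a pair of rows that is not a cyclic shift at all.
    """
    shifts = []
    for upper, lower in zip(rows, rows[1:]):
        k = lower.index(upper[0])
        shifted = upper[-k:] + upper[:-k] if k else upper
        shifts.append(k if shifted == lower else None)
    return shifts


print(is_latin_square(SQUARE))
print(row_shifts(SQUARE))
```

A constant shift between successive rows is exactly the kind of regular pattern that a randomly drawn Latin square would be unlikely to show.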
Yates (1939), in a lengthy paper, also goes over the whole of "Student's" views on the matter, but his conclusion supports the essence of Fisher's views:
The conclusion is reached that in cases where Latin square designs can be used, and in many cases where randomized blocks have to be employed, the gain in accuracy with systematic arrangements is not likely to be sufficiently great to outweigh the disadvantages to which systematic designs are subject. On the other hand, systematic arrangements may in certain cases give decidedly greater accuracy than randomized blocks, but it appears that in such cases the use of the modern devices of confounding, quasi-factorial designs, or split-plot Latin squares which are much more satisfactory statistically, are likely to give a similar gain in accuracy. (Yates, 1939, p. 464)
This brings us to the approaches of the present day. The realization that sampling was important in psychological research, and that its techniques had been much neglected, was presented to the discipline by McNemar (1940a). In an extensive discussion, he points out situations that are still with us:
One wonders, for instance, how many psychometric test scores for policeman, firemen, truck drivers et al. have been interpreted by the clinician in terms of college sophomore norms. In psychological research we are more frequently interested in making an inference regarding the likeness or difference of two differentially defined universes, such as two racial groups, or an experimental vs. a control group. The writer ventures the guess that at least 90% of the research in psychology involves such comparisons. It is not only necessary to consider the problem of sampling in the case of experimental and control groups, but also convenient from the viewpoint of both good experimentation and sound statistics to do so. (McNemar, 1940a, p. 335)
This paper, which, regrettably, is not among those most widely cited in the psychological literature, should be required reading for all those embarking on research in any aspect of the social sciences. Its closing remarks contain a prediction that has been most certainly fulfilled and whose content is dealt with shortly:
The applicability in psychology of certain of Professor R. A. Fisher's designs should be examined. Eventually, the analysis of variance will come into use in psychological research. (McNemar, 1940a, p. 363)
For the moment, no more needs to be said.
9 Sampling Distributions
Large sets of elementary events are commonly called populations or universes in statistics, but the set theory term sample space is perhaps more descriptive. The term population distribution refers to the distribution of the values of the possible observations in the sample space. Although the characteristics or parameters of the population (e.g., the mean, $\mu$, or the standard deviation, $\sigma$) are of both practical and theoretical interest, these values are rarely, if ever, known precisely. Estimates of the values are obtained from corresponding sample values, the statistics. Clearly, for a sample of a given size drawn randomly from a sample space, a distribution of values of a particular summary statistic exists. This simple statement defines a sampling distribution. In statistical practice it is the properties of these distributions that guide our inferences about properties of populations of actual or potential observations. In chapter 6 the binomial, the Poisson, and the normal distributions were discussed. Now that sampling has been examined in some detail, three other distributions and the statistical tests associated with them are reviewed.
THE CHI SQUARE DISTRIBUTION
The development of the $\chi^2$ (chi-square) test of "goodness-of-fit" represents one of the most important breakthroughs in the history of statistics, certainly as important as the development of the mathematical foundations of regression. The fact that both creations are attributable to the work of one man,¹ Karl Pearson, is impressive attestation to his role in the discipline. There are a number of routes by which the test can be approached, but the path that has been followed thus far is continued here. This path leads directly to the work of Pearson and Fisher, who did not make use, and, it seems, were in general unaware, of the earlier work on goodness-of-fit by mathematicians in Europe. Before looking at the development of the test of goodness-of-fit the structure of the chi-square distribution itself is worth examining. Figure 9.1 shows two chi-square distributions.
¹ MacKenzie (1981) gives a brief account of Arthur Black (1851-1893), a tragic figure who, on his death, left a lengthy, and now lost, manuscript, Algebra of Animal Evolution, which was sent to Karl Pearson. "Pearson started to read it, but realized immediately that it discussed topics very similar to those he was working on, and decided not to read it himself but to send it to Francis Galton for his advice" (p. 99). Of great interest is that buried among Black's notebooks, which have survived, is a derivation of the chi-square approximation to the multinomial distribution.
Fig. 9.1 Chi-Square for 4 and 10 Degrees of Freedom
Given a normally distributed population of scores Y with a mean $\mu$ and a variance $\sigma^2$, suppose that samples of size n = 1 are drawn from the distribution and each score is converted to its corresponding standard score z. Then $z = (Y - \mu)/\sigma$, and $\chi^2_1 = z^2 = (Y - \mu)^2/\sigma^2$ defines the chi-square distribution with one degree of freedom. If samples of n = 2 are drawn, then $\chi^2_2$ is given by $(Y_1 - \mu)^2/\sigma^2 + (Y_2 - \mu)^2/\sigma^2$. In fact if n independent measures are taken randomly from a normal distribution with mean $\mu$ and variance $\sigma^2$, $\chi^2_n$ is defined as the sum of the squared standardized scores:
$$\chi^2_n = \sum_{i=1}^{n} z_i^2 = \sum_{i=1}^{n} \frac{(Y_i - \mu)^2}{\sigma^2}$$
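The definition of chi-square as a sum of squared standard scores can be tried out directly. The sketch below is illustrative only: the population mean and standard deviation (100 and 15) and the function name are arbitrary assumptions. It draws n normal scores, standardizes and squares them, and adds them up, many times over. A standard property of the resulting distribution, stated here as a known fact rather than something derived in the text, is that its mean equals the degrees of freedom n and its variance equals 2n.

```python
import random
from statistics import fmean, pvariance


def chi_square_draws(df, mu=100.0, sigma=15.0, draws=20000, seed=11):
    """Simulated chi-square values: sums of df squared standard scores."""
    rng = random.Random(seed)
    values = []
    for _ in range(draws):
        total = 0.0
        for _ in range(df):
            z = (rng.gauss(mu, sigma) - mu) / sigma  # standard score
            total += z ** 2
        values.append(total)
    return values


for df in (1, 4, 10):
    values = chi_square_draws(df)
    # Columns: degrees of freedom, simulated mean, simulated variance
    print(df, round(fmean(values), 2), round(pvariance(values), 2))
```

The means near 4 and 10 for the last two rows correspond to the two curves named in the caption of Figure 9.1.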
The sum of squared standard scores is the essence of the distribution² that Pearson used in his formulation of the test of goodness-of-fit. Why was such a test so necessary?
Games of chance and observational errors in astronomy and surveying were subject to the random processes that led scientists in the 18th and early 19th centuries to the examination of error distributions. The quest was for a sound mathematical basis for the exercise of estimating true values. Simpson introduced a triangular distribution and Daniel Bernoulli in 1777 suggested a semicircular one. In the absence of empirical data, these distributions, established on a priori grounds, were somewhat arbitrarily regarded as having no more and no less claim to accuracy and utility. But the 19th century saw the normal law established. It had powerful credentials because of the fame of its two main developers, Gauss and Laplace. Starting from the assumption that the arithmetic mean represented the true value, Gauss showed that the error distribution was normal. Laplace, starting from the view that every individual observation arises from a very great number of independently acting random factors (the essence of the central limit theorem), came to the same result. Gauss's proof of the method of least squares further established the importance of the normal distribution, and when, in 1801, he used the initial data collected from observations on a new planet, Ceres,³ to accurately predict where it would be observed later in the year, these procedures, as well as Gauss's reputation, were firmly established in astronomy. In astronomy as well as in biology and social science, the Laplace-Gaussian distribution was indeed law; it was indeed regarded as normal. This prescription led to Quetelet's use of it as a "double-edged sword" (see chapter 1) and led to many astronomers using it as a reason to reject observations that were considered to be doubtful, for example Peirce (1852).
Quetelet's procedure for establishing the "fit" of the normal curve was the same as that of the early astronomers. The tabulation, and later the graphing, of observed and expected frequencies led to their being compared by nothing more than visual inspection (see, e.g., Airy, 1861).
² A history of the mathematics of the $\chi^2$ distribution would include the development of the gamma function by the French mathematician I. J. Bienaymé (1796-1876), who, in the 1850s, found a statistic that is equivalent to the Pearson $\chi^2$ in the context of least squares theory. Pearson was apparently not aware of his work; nor were F. R. Helmert and E. Abbe, who, in the 1860s and 1870s, also arrived at the $\chi^2$ distribution for the sum of squares of independent normal variates. Long after the test had become commonly used, von Mises (1919) linked Bienaymé's work to the Pearson $\chi^2$. Details of this aspect of the test and distribution's history are given by Lancaster (1966), Sheynin (1966), and Chang (1973).
³ Ceres was the first discovered "planetoid" in the asteroid belt. Gauss also determined the orbit of Pallas, another planetoid.
There was some dissent. Egon Pearson (1965) notes:
As a reaction to this view among astronomers I remember how Sir Arthur Eddington in his Cambridge lectures about 1920 on the Combination of Observations used to quote the remark that "to say that errors must obey the normal law means taking away the right of the free-born Englishman to make any error he darn well pleases!" (p. 6)
Karl Pearson's first statistical paper (1894) was on the problem of interpreting a bimodal distribution as two normal distributions, the problem that had arisen as a result of Weldon's discovery that the distribution of the relative frontal breadths of his sample of Naples crabs was a double-humped curve. This paper introduced the method of moments as a means of fitting a theoretical curve to a set of observations. As Egon Pearson (1965) states it:
The question "does a Normal curve fit this distribution and what does this mean if it does not?" was clearly prominent in their discussions. There were three obvious alternatives;
(a) The discrepancy between theory and observation is no more than might be expected to arise in random sampling.
(b) The data are heterogeneous, composed of two or more Normal distributions.
(c) The data are homogeneous, but there is real asymmetry in the distribution of the variable measured.
The conclusion (c) may have been hard to accept, such was the prestige surrounding the Normal law. (p. 9)
It appears from Yule's lecture notes⁴ that Karl Pearson probably was employing a procedure that used the ratio of an absolute deviation from expectation to its standard error to examine the record of 7,000 (actually 7,006) tosses of 12 dice made by Mr Hull, a clerk at University College. This was the record (see chapter 1) that Weldon said Karl Pearson had rejected as "intrinsically incredible." Yule's notes also contain an empirical measure of goodness of fit that Egon Pearson says may be set down roughly as $R = \sum|O - T| / \sum T$, where $|O - T|$ is the absolute value of the difference between the observed and theoretical frequency and $\sum T$ the total theoretical frequency, though it should be mentioned that the actual notes do not contain the formula in this form. This expression of mean absolute error was in use during the latter years of the 19th century, and Bowley used it in the first edition of his textbook in 1902.
⁴ These notes are reproduced in Biometrika's Miscellanea (Yule, 1938a).
Karl Pearson's second statistical paper (1895) on asymmetrical frequency curves occupied the attention of the biologists, but the question of the biological meaning of skewed distributions was not one that, at the time, was in the forefront of Pearson's thoughts. Interestingly enough, Pearson did not use the mean absolute error as a test of fit in any of his work. His preoccupation was with the development of a theory of correlation, and it was in this context that he solved the goodness-of-fit problem. The 1895 paper and two supplements that followed in 1901 and 1916b introduced a comprehensive system of frequency curves that pointed a way to sampling distributions that are central to the use of statistical tests, but it was a way that Pearson himself did not fully develop.
Pearson's (1900a) seminal paper begins, "The object of this paper is to investigate a criterion of the probability on any theory of an observed system of errors, and to apply it to the determination of goodness of fit in the case of frequency curves" (p. 157). Pearson takes a system of deviations $x_1, x_2, \ldots, x_n$ from the means of n variables with standard deviations $\sigma_1, \sigma_2, \ldots, \sigma_n$ and with correlations $r_{12}, r_{13}, r_{23}, \ldots, r_{n-1,n}$ and derives $\chi^2$ as "the equation to a generalized 'ellipsoid,' all over the surface of which the frequency of the system of errors or deviations $x_1, x_2, \ldots, x_n$ is constant" (p. 157).
It was Pearson's derivation of the multivariate normal distribution that formed the basis of the $\chi^2$ test. Large x's represent large discrepancies between theory and observation and in turn would give large values of $\chi^2$. But $\chi^2$ can be made to become a test statistic by examining the probability of a system of errors occurring with a frequency as great or greater than the observed system. Pearson had already obtained an expression for the multivariate normal surface, and he here gives the probability of n errors when the ellipsoid, mentioned earlier, is squeezed to become a sphere, which is the geometric representation of zero correlation in the multivariate space, and shows how one arrives at the probability for a given value of $\chi^2$.
When we compare Pearson's mathematics with Hamilton Dickson's examination of Galton's elliptic contours, seen in the scatterplot of two correlated variables while Galton was waiting for a train, we see how far mathematical statistics had advanced in just 15 years.
Pearson considers an (n + 1)-fold grouping with observed frequencies, $m_1', m_2', m_3', \ldots, m_n', m_{n+1}'$, and theoretical frequencies known a priori, $m_1, m_2, m_3, \ldots, m_n, m_{n+1}$. $\sum m = \sum m' = N$ is the total frequency, and $e = m' - m$ is the error. The total error $\sum e$ (i.e., $e_1 + e_2 + e_3 + \ldots + e_{n+1}$) is zero. The degrees of freedom, as they are now known, follow; "Hence only n of the n + 1 errors are variables; the n + 1th is determined when the first n are known, and in using formula (ii) [Pearson's basic $\chi^2$ formula] we treat only of n variables" (Pearson, 1900a, pp. 160-161). Starting with the standard deviation for the random variation of error and the correlation between random errors, Pearson uses a complex trigonometric transformation to arrive at a result "of very great simplicity":
$$\chi^2 = \sum \frac{(m' - m)^2}{m} = \sum \frac{e^2}{m}$$
Chi-square (the statistic) is the weighted sum of squared deviations.
Pearson (1904b, 1911) extended the use of the test to contingency tables and to the two-sample case and in 1916(a) presented a somewhat simpler derivation than the one given in 1900a, a derivation that acknowledges the work of Soper (1913). The first 20 years of the 20th century brought increasing recognition that the test was of the greatest importance, starting with Edgeworth, who wrote to Pearson, "I have to thank you for your splendid method of testing my mathematical curves of frequency. That $\chi^2$ of yours is one of the most beautiful of the instruments that you have added to the Calculus" (quoted by Kendall, 1968, p. 262). And even Fisher, who by that time had become a fierce critic: "The test of goodness of fit was devised by Pearson, to whose labours principally we now owe it, that the test may readily be applied to a great variety of questions of frequency distribution" (Fisher, 1922b, p. 597).
In the next chapter the argument that arose between Pearson and Yule on the assumption of continuous variation underlying the categories in contingency tables is discussed. When Fisher's modifications and corrections to Pearson's theory were accepted, it was Yule who helped to spread the word on the interpretation of the new idea of degrees of freedom.
The goodness-of-fit test is readily applied when the expected frequencies based on some hypothesis are known. For example, hypothesizing that the expected distribution is normal with a particular mean and standard deviation enables the expected frequency of any value to be quickly calculated. Today $\chi^2$ is perhaps more often applied to contingency tables, where the expected values are computed from the observed frequencies. This now routine procedure forms the basis of one of the most bitter disputes in statistics in the 1920s. In 1916 Pearson examined the question of partial contingency.
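The goodness-of-fit calculation for a hypothesized normal distribution can be sketched in a few lines. Everything specific in the example is an assumption made for illustration: the class boundaries, the observed counts, and the hypothesized mean of 100 and standard deviation of 15 are invented, and the helper names are hypothetical. Expected frequencies come from the normal curve with the stated mean and standard deviation, and the statistic is the sum over classes of (observed - expected)² / expected. Because the mean and standard deviation are fixed in advance rather than estimated from the data, only the total N constrains the frequencies, so the degrees of freedom are the number of classes minus one.

```python
from statistics import NormalDist


def expected_normal_counts(edges, mean, sd, total):
    """Expected class frequencies under a normal curve with fixed mean and sd.

    `edges` are the interior class boundaries; the outer classes are open-ended.
    """
    dist = NormalDist(mean, sd)
    cuts = [0.0] + [dist.cdf(edge) for edge in edges] + [1.0]
    return [total * (hi - lo) for lo, hi in zip(cuts, cuts[1:])]


def chi_square(observed, expected):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


edges = [85, 100, 115]        # hypothetical class boundaries
observed = [30, 72, 68, 30]   # hypothetical counts, N = 200
expected = expected_normal_counts(edges, mean=100, sd=15, total=sum(observed))
statistic = chi_square(observed, expected)
df = len(observed) - 1        # only the total N is fixed

print([round(e, 2) for e in expected], round(statistic, 3), df)
```

The statistic would then be referred to the chi-square distribution with `df` degrees of freedom; a small value, as here, indicates that the discrepancies are of the size expected from random sampling.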
The fixed n in a goodness-of-fit test imposes a constraint on the frequencies in the categories; only k - 1 categories are free to vary. Pearson realized that in the case of contingency tables additional linear constraints were placed on the group frequencies, but he argued that these constraints did not allow for a reduction in the number of variables in the case where the theoretical distribution was estimated from the observed frequencies. Other questions had also been raised.
Raymond Pearl (1879-1940), an American researcher who was at the Biometric Laboratory in the mid-1910s, pointed out some problems in the application of $\chi^2$ in 1912, noting that some hypothetical data he presents clearly show an excellent "fit" between observed and expected frequency but that the value of $\chi^2$ was infinite! Pearson, of course, replied, but Pearl was unmoved.
I have earlier pointed out other objections to the $\chi^2$ test ... I have never thought it necessary to make any rejoinder to Pearson's characteristically bitter reply to my criticism, nor do I yet. The $\chi^2$ test leads to this absurdity. (Pearl, 1917, p. 145)
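The kind of difficulty Pearl describes is easy to reproduce with invented numbers. The figures below are hypothetical and are not Pearl's data; the helper name is also an assumption. Each class contributes (observed - expected)² / expected to the statistic, so a single class with a very small expected frequency can dominate the total even when observed and expected counts agree closely by eye, and an expected frequency of exactly zero makes the contribution undefined (infinite).

```python
def chi_square(observed, expected):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [500, 499, 1]  # hypothetical counts, N = 1000

# The last class holds one observation throughout; only its expected
# frequency shrinks, with the difference moved into the second class.
for tiny in (1.0, 0.1, 0.01, 0.001):
    expected = [500.0, 500.0 - tiny, tiny]
    print(tiny, round(chi_square(observed, expected), 2))
```

The printed statistic grows without limit as the small expected frequency approaches zero, although the first two classes fit almost perfectly in every case.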
Pearl repeats the argument that essentially notes that in cases where there are small expected frequencies in the cells of the table, the value of $\chi^2$ can be grossly inflated. Karl Pearson (1917), in a reproof that illustrates his disdain for those he believed had betrayed the Biometric school, responds to Pearl: "Pearl ... provides a striking illustration of how the capable biologist needs a long continued training in the logic of mathematics before he ventures into the field of probability" (p. 429).
Pearl had in fact raised quite legitimate questions about the application of the $\chi^2$ test, but in the context of Mendelian theory, to which Pearson was steadfastly opposed. A close reading of Pearl's papers perhaps reveals that he had not followed all Pearson's mathematics, but the questions he had raised were germane. Pearson's response is the apotheosis of mathematical arrogance that, on occasion, frightens biologists and social scientists today:
Shortly Dr Pearl's method is entirely fallacious, as any trained mathematician would have informed Dr Pearl had he sought advice before publication. It is most regrettable that such extensions of biometric theory should be lightly published, without any due sense of responsibility, not solely in biological but in psychological journals. It can only bring biometry into contempt as a science if, professing a mathematical foundation, it yet shows in its manifestations most inadequate mathematical reasoning. (Pearson, 1917, p. 432)
Social scientists beware! In 1916 Ronald Fisher, then a schoolmaster, raised his standard and made his first foray into what was to become a war. The following correspondence is to be found in E. S. Pearson (1968):
Dear Professor Pearson, There is an article by Miss Kirstine Smith in the current issue of Biometrika which, I think, ought not to pass without comment. I enclose a short note upon it. Miss Kirstine Smith proposes to use the minimum value of χ² as a criterion to determine the best form of the frequency curve; ... It should be observed that χ² can only be determined when the material is grouped into arrays, and that its value
depends upon the manner in which it is grouped. There is ... something exceedingly arbitrary in a criterion which depends entirely upon the manner in which the data happens to be grouped. (pp. 454-455)
Pearson replied:
Dear Mr Fisher, I am afraid that I don't agree with your criticism of Froken K. Smith (she is a pupil of Thiele's and one of the most brilliant of the younger Danish statisticians). . . . your argument that χ² varies with the grouping is of course well known ... What we have to determine, however, is with given grouping which method gives the lowest χ². (p. 455)
Pearson asks for a defense of Fisher's argument before considering its publication. Fisher's expanded criticism received short shrift. After thanking Fisher for a copy of his paper on Mendelian inheritance (Fisher, 1918), he hopes to find time for it,⁵ but points out that he is "not a believer in cumulative Mendelian factors as being the solution of the heredity puzzle" (p. 456). He then rejects Fisher's paper.
Also I fear that I do not agree with your criticism of Dr Kirstine Smith's paper and under present pressure of circumstances must keep the little space I have in Biometrika free from controversy which can only waste what power I have for publishing original work. (p. 456)
Egon Pearson thinks that we can accept his father's reasons for these rejections: the pressure of war work, the suspension of many of his projects, the fact that he was over 60 years old with much work unfinished, and his memory of the strain that had been placed on Weldon in the row with Bateson. But if he really was trying to shun controversy by refusing to publish controversial views, then he most certainly did not succeed. Egon Pearson's defense is entirely understandable but far too charitable. Pearson had censored Fisher's work before and appears to have been trying to establish his authority over the younger man and over statistics. Even his offer to Fisher of an appointment at the Galton Laboratory might be viewed in this light. Fisher Box (1978) notes that these experiences influenced Fisher in his refusal of the offer, in the summer of 1919, recognizing that, "nothing would be taught or published at the Galton laboratory without the approval of Pearson" (p. 82). The last straw for Fisher came in 1920 when he sent a paper on the probable error of the correlation coefficient to Biometrika:
Dear Mr Fisher, Only in passing through Town today did I find your communication of August 3rd. I am very sorry for the delay in answering ... As there has been a delay of three weeks already, and as I fear if I could give full attention to your paper, which I cannot at the present time, I should be unlikely to publish it in its present form ... I would prefer you published elsewhere ... I am regretfully compelled to exclude all that I think erroneous on my own judgment, because I cannot afford controversy. (quoted by E. S. Pearson, 1968, p. 453)
⁵ Here Pearson is dissembling. Fisher's paper was originally submitted to the Royal Society and, although it was not rejected outright, the Chairman of the Sectional Committee for Zoology "had been in communication with the communicator of the paper, who proposed to withdraw it." Karl Pearson was one of the two referees. Norton and Pearson (1976) describe the event and publish the referees' reports.
Fisher never again submitted a paper to Biometrika, and in 1922(a) tackled the χ² problem:
This short paper with all its juvenile inadequacies, yet did something to break the ice. Any reader who feels exasperated by its tentative and piecemeal character should remember that it had to find its way to publication past critics who, in the first place, could not believe that Pearson's work stood in need of correction, and who, if this had to be admitted, were sure that they themselves had corrected it. (Fisher's comment in Bennett, 1971, p. 336)
Fisher notes that he is not criticizing the general adequacy of the χ² test but that he intends to show that:
the value of n' with which the table should be entered is not now equal to the number of cells but to one more than the number of degrees of freedom in the distribution. Thus for a contingency table of r rows and c columns we should take n' = (c - 1)(r - 1) + 1 instead of n' = cr. This modification often makes a very great difference to the probability (P) that a given value of χ² should have been obtained by chance. (Fisher, 1922a, p. 88)
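The size of the "very great difference" can be illustrated for the smallest contingency table. The sketch below compares the tail probability of a hypothetical χ² value of 5.0 in a 2 x 2 table when the table is entered with n' = cr = 4 (3 degrees of freedom) and with Fisher's n' = 2 (1 degree of freedom). The closed-form tail functions for 1 and 3 degrees of freedom are standard results that are not given in the text, and the function names are illustrative only.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_df3(x):
    """Upper-tail probability of chi-square with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


rows, cols = 2, 2
n_prime_pearson = rows * cols                 # n' = cr = 4, that is, 3 df
n_prime_fisher = (cols - 1) * (rows - 1) + 1  # n' = 2, that is, 1 df

chi_sq = 5.0  # hypothetical observed value, chosen only for illustration
p_fisher = chi2_sf_df1(chi_sq)
p_pearson = chi2_sf_df3(chi_sq)

print(n_prime_pearson, n_prime_fisher)
print(round(p_fisher, 4), round(p_pearson, 4))
```

Under these assumptions the same value of χ² is judged improbable on Fisher's reckoning and unremarkable on Pearson's.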
It should be noted that Pearson entered the tables using n' = ν + 1, where n' is the number of variables (i.e., categories) and ν is what we now call degrees of freedom. The modern tables are entered using ν = (c - 1)(r - 1). The use of n' to denote sample size and n to denote degrees of freedom, even though in many writings n was also used for sample size, sometimes leads to frustrating reading in these early papers.
It is clear that Pearson did not recognise that in all cases linear restrictions imposed upon the frequencies of the sampled population, by our methods of reconstructing that population, have exactly the same effect upon the distribution of χ² as have restrictions placed upon the cell contents of the sample. (Fisher, 1922a, p. 92)
Pearson was angered and contemptuously replied in the pages of Biometrika. He reiterates the fundamentals of his 1900 paper and then says:
The process of substituting sample constants for sampled population constants does not mean that we select out of possible samples of size n, those which have precisely the same values of the constants as the individual sample under discussion. ... In using the constants of the given sample to replace the constants of the sampled population, we in nowise restrict the original hypothesis of free random samples tied down only by their definite size. We certainly do not by using sample constants reduce in any way the random sampling degrees of freedom. The above re-description of what seem to me very elementary considerations would be unnecessary had not a recent writer in the Journal of the Royal Statistical Society appeared to have wholly ignored them ... the writer has done no service to the science of statistics by giving it broad-cast circulation in the pages of the Journal of the Royal Statistical Society. (K. Pearson, 1922, p. 187)
And on and on, never referring to Fisher by name but only as "my critic" or "the writer in the Journal of the Royal Statistical Society," until the final assault when he accuses Fisher of a disregard for the nature of the probable error:
I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill; he must either destroy himself, or the whole theory of probable errors, for they are invariably based on using sample values for those of the sampled population unknown to us. For example here is an argument for Don Quixote of the simplest nature ... (K. Pearson, 1922, p. 191)
The editors of the Journal of the Royal Statistical Society turned tail and ran, refusing to publish Fisher's rejoinder. Fisher vigorously protested, but to no avail, and he resigned from the Society. There were other questions about Pearson's paper that he dealt with in Bowley's journal Economica in 1923 and in the Journal of the Royal Statistical Society (Fisher, 1924a), but the end came in 1926 when, using data tables that had been published in Biometrika, Fisher calculated the actual average value of χ² which he had proved earlier should theoretically be unity and which Pearson still maintained should be 3. In every case the average was close to unity, in no case near to 3 . . .. There was no reply. (Fisher Box, 1978, p. 88, commenting on Fisher, 1926a)
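The disagreement over whether the average should be 1 or 3 can be checked by simulation. The following is only a sketch under invented conditions: 2 x 2 tables are generated with independent row and column classifications (the probabilities, table size, and number of tables are arbitrary choices, not Fisher's data), and expected frequencies are estimated from the margins of each sample, as they were in the disputed cases.

```python
import random


def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table with expectations taken from its margins."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return None  # expected frequencies are undefined for an empty margin
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def mean_chi_square(n_tables=3000, n_obs=100, p_row=0.4, p_col=0.3, seed=1):
    """Average chi-square over simulated tables with no true association."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_tables):
        counts = [0, 0, 0, 0]
        for _ in range(n_obs):
            in_row1 = rng.random() < p_row
            in_col1 = rng.random() < p_col  # drawn independently of the row
            counts[2 * in_row1 + in_col1] += 1
        value = chi_square_2x2(*counts)
        if value is not None:
            values.append(value)
    return sum(values) / len(values)


average = mean_chi_square()
print(round(average, 2))
```

With these settings the average should settle near 1, the number of degrees of freedom on Fisher's reckoning, rather than near 3.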
THE t DISTRIBUTION
W. S. Gosset was a remarkable man, not the least because he managed to maintain reasonably cordial relations with both Pearson and Fisher, and at the same time. Nor did he avoid disagreeing with them on various statistical issues. He was born in 1876 at Canterbury in England, and from 1895 to 1899 was
at New College, Oxford, where he took a degree in chemistry and mathematics. In 1899 Gosset became an employee of Arthur Guinness, Son and Company Ltd., the famous manufacturers of stout. He was one of the first of the scientists, trained either at Oxford or Cambridge, that the firm had begun hiring (E. S. Pearson, 1939a). His correspondence with Pearson, Fisher, and others shows him to have been a witty and generous man with a tendency to downplay his role in the development of statistics. Compared with the giants of his day he published very little, but his contribution is of critical importance. As Fisher puts it in his Statistical Methods for Research Workers:
The study of the exact distributions of statistics commences in 1908 with "Student's" paper The Probable Error of the Mean. Once the true nature of the problem was indicated, a large number of sampling problems were within reach of mathematical solution. (Fisher, 1925/1970, p. 23)
The brewery had a policy on publishing by its employees that obliged Gosset to publish his work under the nom de plume "Student." In essence, the problem that "Student" tackled was the development of a statistical test that could be applied to small samples. The nature of the process of brewing, with its variability in temperature and ingredients, means that it is not possible to take large samples over a long run. In a letter to Fisher in 1915, in which he thanks Fisher for the Biometrika paper that begins the mathematical solution to the small sample problem, "Student" says:
The agricultural (and indeed almost any) Experiments naturally required a solution of the mean/S.D. problem, and the Experimental Brewery which concerns such things as the connection between analysis of malt or hops, and the behaviour of the beer, and which takes a day to each unit of the experiment, thus limiting the numbers, demanded an answer to such questions as, "If with a small number of cases I get a value r, what is the probability that there is really a positive correlation of greater value than (say) .25?" (quoted by E. S. Pearson, 1968, p. 447)
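Why small samples posed a problem can be seen by treating the sample standard deviation as if it were the population value and consulting normal tables. The simulation below is a minimal sketch under stated assumptions: the measurements are normally distributed, the sample sizes (4 and 50) and the 1.96 multiplier are arbitrary illustrative choices, and the standard deviation is computed with divisor n. It counts how often the interval mean ± 1.96 s/√n actually contains the population mean.

```python
import math
import random


def coverage(n, n_samples, z_crit=1.96, mu=0.0, sigma=1.0, seed=2):
    """Proportion of samples whose normal-theory interval contains mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(xs) / n
        # Sample standard deviation with divisor n (an assumption of this sketch).
        s = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
        if abs(mean - mu) < z_crit * s / math.sqrt(n):
            hits += 1
    return hits / n_samples


coverage_small = coverage(n=4, n_samples=20000)
coverage_large = coverage(n=50, n_samples=5000)
print(round(coverage_small, 3), round(coverage_large, 3))
```

The nominal figure is 95%. With samples of 50 the shortfall is slight, but with samples of 4 the interval misses the population mean far more often than the normal tables suggest.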
Egon Pearson (1939a) notes that in his first few years at Guinness, Gosset was making use of Airy's Theory of Errors (1861), Lupton's Notes on Observations (1898), and Merriman's The Method of Least Squares (1884). In 1904 he presented a report to his firm that stated clearly the utility of the application of statistics to the work of the brewery and pointed up the particular difficulties that might be encountered. The report concludes:
We have been met with the difficulty that none of our books mentions the odds, which are conveniently accepted as being sufficient to establish any conclusion, and it might be of assistance to us to consult some mathematical physicist on the matter. (quoted by Pearson, 1939a, p. 215)
A meeting was in fact arranged with Pearson, and this took place in the summer of 1905. Not all of Gosset's problems were solved, but a supplement to his report and a second report in late August 1905 produced many changes in the statistics used in the brewery. The standard deviation replaced the mean error, and Pearson's correlation coefficient became an almost routine procedure in examining relationships among the many factors involved with brewing. But one feature of the work concerned Gosset: "Correlation coefficients are usually calculated from large numbers of cases, in fact I have found only one paper in Biometrika of which the cases are as few in number as those at which I have been working lately" (quoted by Pearson, 1939a, p. 217). Gosset expressed doubt about the reliability of the probable error formula for the correlation coefficient when it was applied to small samples. He went to London in September 1906 to spend a year at the Biometric Laboratory. His first paper, published in 1907, derives Poisson's limit of the binomial distribution and applies it to the error in sampling when yeast cells are counted in a haemacytometer. But his most important work during that year was the preparation of his two papers on the probable error of the mean and of the correlation coefficient, both of which were published in 1908.
The usual method of determining that the mean of the population lies within a given distance of the mean of the sample, is to assume a normal distribution about the mean of the sample with a standard deviation equal to s/√n, where s is the standard deviation of the sample, and to use the tables of the probability integral. But, as we decrease the value of the number of experiments, the value of the standard deviation found from the sample of experiments becomes itself subject to increasing error, until judgements reached in this way become altogether misleading. ("Student," 1908a, pp. 1-2)
"Student" sets out what the paper intends to do:
I. The equation is determined of the curve which represents the frequency distribution of standard deviations of samples drawn from a normal population.
II. There is shown to be no kind of correlation between the mean and the standard deviation of such a sample.
III. The equation is determined of the curve representing the frequency distribution of a quantity z, which is obtained by dividing the distance between the mean of the sample and the mean of the population by the standard deviation of the sample.
IV. The curve found in I. is discussed.
V. The curve found in III. is discussed.
VI. The two curves are compared with some actual distributions.
VII. Tables of the curves found in III. are given for samples of different size.
VIII and IX. The tables are explained and some instances are given of their use.
X. Conclusions. ("Student," 1908a, p. 2)
"Student" did not provide a proof for the distribution of z. Indeed, he first examined this distribution, and that of s, by actually drawing samples (of size 4) from measurements, made on 3,000 criminals, taken from data used in a paper by Macdonell (1901). The frequency distributions for s and z were thus directly obtained, and the mathematical work came later. There has been comment over the years on the fact that "Student's" mathematical approach was incomplete, but this should not detract from his achievements. Welch (1958) maintains that:
The final verdict of mathematical statisticians will, I believe, be that they have lasting value. They have the rare quality of showing us how an exceptional man was able to make mathematical progress without paying too much regard to the rules. He fortified what he knew with some tentative guessing, but this was backed by subsequent careful testing of his results. (pp. 785-786)
In fact "Student" had given to future generations of scientists, in particular social and biological scientists, a new and powerful distribution. The z test, which was to become the t test,⁶ led the way for all kinds of significance tests, and indeed influenced Fisher as he developed that most useful of tools, the analysis of variance. It is also the case that the 1908 paper on the probable error of the mean ("Student," 1908a) clearly distinguished between what we now call sample statistics and population parameters, a distinction that is absolutely critical in modern-day statistical reasoning. The overwhelming influence of the biometricians of Gower Street can perhaps partly account for the fact that "Student's" work was ignored. In 1939 Fisher said that "Student's" work was received with "weighty apathy," and, as late as 1922, Gosset, writing to Fisher, and sending a copy of his tables, said, "You are the only man that's ever likely to use them!" (quoted by Fisher Box, 1981, p. 63). Pearson was, to say the least, suspicious of work using small samples. It was the assumption of normality of the sampling distribution that led to problems, but the biometricians never used small samples, and "only naughty brewers take n so small that the difference is not of the order of the probable error!" (Pearson writing to Gosset, September 1912, quoted by Pearson, 1939a, p. 218). Some of "Student's" ideas had been anticipated by Edgeworth as early as 1883 and, as Welch (1958) notes, one might speculate as to what Gosset's reaction would have been had he been aware of this work. Gosset's paper on the probable error of the correlation coefficient ("Student," 1908b) dealt with the distribution of the r values obtained when sampling from a population in which the two variables were uncorrelated, that is, R = 0.⁷ This endeavor was again based on empirical sampling distributions constructed from the Macdonell data and a mathematical curve fitted afterwards. With characteristic flair, he says that he attempted to fit a Pearson curve to the "no correlation" distribution and came up with a Type II curve. "Working from $y = y_0(1 - x^2)^0$ for samples of 4, I guessed the formula $y = y_0(1 - x^2)^{(n-4)/2}$ and proceeded to calculate the moments" ("Student," 1908b, p. 306). He concludes his paper by hoping that his work "may serve as illustrations for the successful solver of the problem" (p. 310). And indeed it did, for Fisher's paper of 1915 showed that "Student" had guessed correctly. Fisher had been in correspondence with Gosset in 1912, sending him a proof that appealed to n-dimensional space of the frequency distribution of z. Gosset wanted Pearson to publish it.
Dear Pearson, I am enclosing a letter which gives a proof of my formulae for the frequency distribution of z (= x/s), ... Would you mind looking at it for me; I don't feel at home in more than three dimensions even if I could understand it otherwise. It seemed to me that if it's all right perhaps you might like to put the proof in a note. It's so nice and mathematical that it might appeal to some people. (quoted by E. S. Pearson, 1968, p. 446)
⁶ Eisenhart (1970) concludes that the shift from z to t was due to Fisher and that Gosset chose t for the new statistic. In their correspondence Gosset used t for his own calculations and x for those of Fisher.
⁷ R was Gosset's symbol for the population correlation coefficient. ρ (the Greek letter rho) appears to have been first used for this value by Soper (1913). The important point is that a different symbol was used for sample statistic and population parameter.
In fact the proof was not published then, but again approaching the mathematics through a geometrical representation, Fisher derived the sampling distribution of the correlation coefficient, and this, together with the derivation of the z distribution, was published in the 1915 paper. The sampling distribution of r was, of course, of interest to Pearson and his colleagues; after all, r was Pearson's statistic. Moreover, the distribution follows a remarkable system of curves, with a variety of shapes that depart greatly from the normal, depending on n and the value of the true unknown correlation coefficient R, or, as it is now generally known, ρ. Pearson was anxious to translate theory into numbers, and the computation of the distribution of r was commenced and published as the "co-operative study" of Soper, Young, Cave, Lee, and K. Pearson in 1917. Although Pearson had said that he would send Fisher the proofs of the paper (the letter is quoted in E. S. Pearson, 1965), there is, apparently, no record that he in fact received them. E. S. Pearson (1965) suggests that Fisher did not know "until late in the day" of the criticism of his particular approach in a section of the study, "On the Determination of the 'Most Likely' Value of the Correlation in the Sampled Population." Egon Pearson argues that his father's criticism, for presumably the elder Pearson had taken a major role in the writing of the "co-operative study," was a misunderstanding based on Fisher's failure to adequately define what he meant by "likelihood" in 1912 and the failure to make clear that it was not based on the Bayesian principle of inverse probability. Fisher Box (1978) says, "their criticism was as unexpected as it was unjust, and it gave an impression of something less than scrupulous regard for a new and therefore vulnerable reputation" (p. 79).
Fisher's response, published in Metron in 1921(a), was the paper, mentioned earlier, that Pearson summarily rejected because he could not afford controversy. In this paper Fisher makes use of the r = tanh z transformation, a transformation that had been introduced, a little tentatively, in the 1915 paper. Its immense utility in transforming the complex system of distributions of r to a simple function of z, which is almost normally distributed, made the laborious work of the co-operative study redundant. In a paper published in 1919, Fisher examines the data on resemblance in twins that had been studied by Edward L. Thorndike in 1905. This is the earliest example of the application of Fisherian statistics to psychological data, for the traits examined were both mental and physical. Fisher looked at the question of whether there were any differences between the resemblances of twins in different traits and here uses the z transformation:
When the resemblances have been expressed in terms of the new variable, a correlation table may be constructed by picking out every pair of resemblances between the same twins in different traits. The values are now centered symmetrically about a mean at 1.28, and the correlation is found to be -.016 ± .048, negative but quite insignificant. The result entirely corroborates THORNDIKE'S conclusions as to the specialization of resemblance. (Fisher, 1919, p. 493)
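The effect of the r = tanh z transformation can be sketched by simulation. The example below assumes a bivariate normal population with an arbitrary ρ of .6 and samples of 20 (both values are illustrative choices). It compares the skewness of the simulated r values with that of z = tanh⁻¹ r, and compares the standard deviation of z with 1/√(n - 3), a standard textbook approximation that is not stated in the text above.

```python
import math
import random


def sample_r(n, rho, rng):
    """Correlation coefficient of one sample of n pairs with population rho."""
    xs, ys = [], []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(u)
        ys.append(rho * u + math.sqrt(1.0 - rho ** 2) * v)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_sd_skew(values):
    """Mean, standard deviation, and moment coefficient of skewness."""
    k = len(values)
    m = sum(values) / k
    m2 = sum((v - m) ** 2 for v in values) / k
    m3 = sum((v - m) ** 3 for v in values) / k
    return m, math.sqrt(m2), m3 / m2 ** 1.5


rng = random.Random(3)
n, rho = 20, 0.6  # hypothetical sample size and population correlation
r_values = [sample_r(n, rho, rng) for _ in range(5000)]
z_values = [math.atanh(r) for r in r_values]

mean_r, sd_r, skew_r = mean_sd_skew(r_values)
mean_z, sd_z, skew_z = mean_sd_skew(z_values)
print(round(skew_r, 2), round(skew_z, 2))
print(round(sd_z, 3), round(1.0 / math.sqrt(n - 3), 3))
```

Under these assumptions the r values are markedly skewed, whereas the z values are nearly symmetrical with a spread that depends almost entirely on n.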
These manipulations and the development of the same general approach to the distribution of the intraclass correlation coefficient in the 1921 paper are important for the fundamentals of analysis of variance. From the point of view of the present-day student of statistics, Fisher's (1925a) paper on the applications of the t distribution is undoubtedly the most comprehensible. Here we see, in familiar notation and most clearly stated, the utility of the t distribution, a proof of its "exactitude ... for normal samples" (p. 92), and the formulae for testing the significance of a difference between means and the significance of regression coefficients.
Finally the probability integral with which we are concerned is of value in calculating the probability integral of a wider class of distributions which is related to "Student's" distribution in the same manner as that of χ² is related to the normal distribution.
This wider class of distributions appears (i) in the study of intraclass correlations (ii) in the comparison of estimates of the variance, or of the standard deviation from normal samples (iii) in testing the goodness of fit of regression lines (iv) in testing the significance of a multiple correlation, or (v) of a correlation ratio. (Fisher, 1925a, pp. 102-103)
These monumental achievements were realized in less than 10 years after Gosset's mixture of mathematical conjecture, intuition, and the practical necessities of his work led the way to the t distribution. From 1912, Gosset and Fisher had been in correspondence (although there were some lengthy gaps), but they did not actually meet until September 1922, when Gosset visited Fisher at Harpenden. Fisher Box (1981) describes their relationship and reproduces excerpts from their letters. At the end of World War I, in 1918, Gosset did not even know how Fisher was employed, and when he learned that Fisher had been a schoolmaster for the duration of the war and was looking for a job wrote, "I hear that Russell [the head of the Rothamsted Experimental station] intends to get a statistician soon, when he gets the money I think, and it might be worth while to keep your ears open to news from Harpenden" (quoted by Fisher Box, 1981, p. 62). In 1922, work began on the computation of a new set of t tables using values of t = z√(n - 1) rather than z, and the tables were entered with the appropriate degrees of freedom rather than n. Fisher and Gosset both worked on the new tables, and after delays, and fits and starts, and the checking of errors, the tables
Fig. 9.2 The Normal Distribution and the t Distribution for df = 10 and df = 4
were published by "Student" in Metron in 1925. Figure 9.2 compares two t distributions to the normal distribution. Fisher's "Applications" paper, mentioned earlier, appeared in the same volume, but it had, in fact, been written quite early in 1924. At that time Fisher was completing his 1925 book and needed tables. Gosset wanted to offer the tables to Pearson and expressed doubts about the copyright, because the first tables had been published in Biometrika. Fisher went ahead and computed all the tables himself, a task he completed later in the year (Fisher Box, 1981). Fisher sent the "Applications" paper to Gosset in July 1924. He says that the note is:
larger than I had intended, and to make it at all complete should be larger still, but I shall not have time to make it so, as I am sailing for Canada on the 25th, and will not be back till September. (quoted by Fisher Box, 1981, p. 66)
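The relation t = z√(n - 1) used for the new tables can be verified numerically. In the sketch below the data and the hypothesized population mean are invented for illustration, and it is assumed that the standard deviation in z is computed with divisor n whereas the modern t uses the estimate with divisor n - 1; the function names are illustrative.

```python
import math


def student_z(xs, mu):
    """Distance of the sample mean from mu in units of s (divisor n)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return (mean - mu) / s


def modern_t(xs, mu):
    """The familiar t: (mean - mu) over the estimated standard error."""
    n = len(xs)
    mean = sum(xs) / n
    s_hat = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return (mean - mu) / (s_hat / math.sqrt(n))


sample = [1.2, 0.4, -0.3, 2.1, 0.9, 1.5]  # hypothetical measurements
mu_0 = 0.0                                # hypothetical population mean

z = student_z(sample, mu_0)
t = modern_t(sample, mu_0)
df = len(sample) - 1  # the new tables are entered with n - 1, not n

print(round(z, 4), round(t, 4), round(z * math.sqrt(df), 4))
```

The two statistics carry the same information; the change lay in the scaling and in entering the tables with degrees of freedom.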
The visit to Canada was made to present a paper (Fisher, 1924b) at the International Congress of Mathematics, meeting that year in Toronto. This paper discusses the interrelationships of χ², z, and t. Fisher was only 35 years old, and yet the foundations of his enormous influence on statistics were now securely laid.
THE F DISTRIBUTION
The Toronto paper did not, in fact, appear until almost 4 years after the meeting, by which time the first edition of Statistical Methods for Research Workers (1925) had been published. The first use of an analysis of variance technique was reported earlier (Fisher & Mackenzie, 1923), but this paper was not encountered by many outside the area of agricultural research. It is possible that if the mathematics of the Toronto paper had been included in Fisher's book then much of the difficulty that its first readers had with it would have been reduced or eliminated, a point that Fisher acknowledged. After a general introduction on error curves and goodness-of-fit, Fisher examines the χ² statistic and briefly discusses his (correct) approach to degrees of freedom. He then points out that if a number of quantities x₁, ..., xₙ are distributed independently in the normal distribution with unit standard deviation, then χ² = Σx² is distributed as "the Pearsonian measure of goodness of fit." In fact, Fisher uses S(x²) to denote the latter expression, but here, and in what follows on the commentary on this paper, the more familiar modern notation is employed. Fisher refers to "Student's" work on the error curve of the standard deviation of a small sample drawn from a normal distribution and shows its relation to χ², namely $\chi^2 = (n - 1)s^2/\sigma^2$,
where n - 1 is the number of degrees of freedom (one less than the number in the sample) and s² is the best estimate from the sample of the true variance σ². For the general z distribution, Fisher first points out that s₁² and s₂² are misleading estimates of σ₁² and σ₂² when sample sizes, drawn from normal distributions, are small:
The only exact treatment is to eliminate the unknown quantities σ₁ and σ₂ from the distribution by replacing the distribution of s by that of log s, and so deriving the distribution of log s₁/s₂. Whereas the sampling errors in s₁ are proportional to σ₁ the sampling errors of log s₁ depend only upon the size of the sample from which s₁ was calculated. (Fisher, 1924b, p. 808)
then z will be distributed about log σ₁/σ₂ as mode, in a distribution which depends wholly upon the integers n₁ and n₂. Knowing this distribution we can tell at once if an observed value of z is or is not consistent with any hypothetical value of the ratio σ₁/σ₂. (Fisher, 1924b, p. 808)
The cases for infinite and unit degrees of freedom are then considered. In the latter case the "Student" distributions are generated.
In discussing the accuracy to be ascribed to the mean of a small sample, "Student" took the revolutionary step of allowing for the random sampling variation of his estimate of the standard error. If the standard error were known with accuracy the deviation of an observed value from expectation (say zero), divided by the standard error would be distributed normally with unit standard deviation; but if for the accurate standard deviation we substitute an estimate based on n - 1 degrees of freedom we have
consequently the distribution of t is given by putting n₁ [degrees of freedom] = 1 and substituting z = ½ log t². (Fisher, 1924b, pp. 808-809)
For the case of an infinite number of degrees of freedom, the t distribution becomes the normal distribution. In the final sections of the paper the r = tanh z transformation is applied to the intraclass correlation, an analysis of variance summary table is shown, and z = logₑ(s₁/s₂) is the value that may be used to test the significance of the intraclass correlation. These basic methods are also shown to lead to tests of significance for multiple correlation and η (eta), the correlation ratio. The transition from z to F was not Fisher's work. In 1934, George W. Snedecor, in the first of a number of texts designed to make the technique of analysis of variance intelligible to a wider audience, defined F as the ratio of the larger mean square to the smaller mean square taken from the summary table, using the formula z = ½ logₑ F. Figure 9.3 shows two F distributions. Snedecor was Professor of Mathematics and Director of the Statistical Laboratory at Iowa State College in the United States. This institution was largely responsible for promoting the value of the modern statistical techniques in North America. Fisher himself avoided the use of the symbol F because it was not used by P. C. Mahalanobis, an Indian statistician who had visited Fisher at Rothamsted in 1926, and who had tabulated the values of the variance ratio in 1932, and thus established priority. Snedecor did not know of the work of Mahalanobis, a clear-thinking and enthusiastic worker who became the first Director of the Indian Statistical Institute in 1932. Today it is still occasionally the case that the ratio is referred to as "Snedecor's F" in honor of both Snedecor and Fisher.
Fig 9.3 The F Distributions for 1,5 and 8,8 Degrees of Freedom
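The link between Fisher's z and Snedecor's F amounts to a change of scale, as a small calculation shows. The mean squares below are invented for illustration and the function names are hypothetical; the formulas are those given above, z = logₑ(s₁/s₂) and z = ½ logₑ F.

```python
import math


def variance_ratio(ms_a, ms_b):
    """F as the ratio of the larger mean square to the smaller."""
    return max(ms_a, ms_b) / min(ms_a, ms_b)


def z_from_f(f):
    """Fisher's z from the variance ratio: z = (1/2) log_e F."""
    return 0.5 * math.log(f)


def f_from_z(z):
    """The inverse relation: F = exp(2z)."""
    return math.exp(2.0 * z)


# Hypothetical mean squares from an analysis of variance summary table.
ms_between, ms_within = 24.0, 6.0

f_ratio = variance_ratio(ms_between, ms_within)
z = z_from_f(f_ratio)

# The same z, obtained directly from the two standard deviations.
z_direct = math.log(math.sqrt(ms_between) / math.sqrt(ms_within))

print(f_ratio, round(z, 4), round(z_direct, 4), round(f_from_z(z), 4))
```

Because the logarithm is monotonic, a test based on F leads to the same decision as a test based on z; Snedecor's form simply spares the reader the logarithms.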
THE CENTRAL LIMIT THEOREM
The transformation of numerical data to statistics and the assessment of the probability of the outcome being a chance occurrence, or the assessment of a probable range of values within which the outcome will fall, is a reasonable general description of the statistical inferential strategy. The mathematical foundations of these procedures rest on a remarkable theorem, or rather a set of theorems, that are grouped together as the Central Limit Theorem. Much of the work that led to this theorem has already been mentioned but it is appropriate to summarize it here. In the most general terms, the problem is to discover the probability distribution of the cumulative effect of many independently acting, and very small, random effects. The central limit theorem brings together the mathematical propositions that show that the required distribution is the normal distribution. In other words, a summary of a subset of random observations X₁, X₂, ..., Xₙ, say their sum ΣX = X₁ + X₂ + ... + Xₙ or their mean, (ΣX)/n, has a sampling distribution that approaches the shape of the normal distribution. Figure 9.4 is an illustration of a sampling distribution of means. This chapter has described distributions of other statistical summaries, but it will have been noted that, in one way or another, those distributions have links with the normal distribution.
Fig. 9.4 A Sampling Distribution of Means (n = 6). 500 draws from the numbers 1 through 10.
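A sampling distribution of the kind shown in Figure 9.4 is easily generated. The sketch below follows the figure caption (500 means, each of n = 6 values from the numbers 1 through 10) but assumes that the numbers are drawn with replacement and with equal probability, which the caption does not specify; the seed and function name are arbitrary.

```python
import math
import random


def sample_means(n_draws=500, n=6, low=1, high=10, seed=4):
    """Means of n_draws samples, each of n integers drawn with replacement."""
    rng = random.Random(seed)
    return [sum(rng.randint(low, high) for _ in range(n)) / n
            for _ in range(n_draws)]


means = sample_means()
grand_mean = sum(means) / len(means)
sd_of_means = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / len(means))

# Values implied by theory for this population of ten equally likely numbers.
population_mean = 5.5
population_sd = math.sqrt((10 ** 2 - 1) / 12)
expected_sd_of_means = population_sd / math.sqrt(6)

# Share of means within one standard deviation of the grand mean.
within_one_sd = sum(abs(m - grand_mean) <= sd_of_means for m in means) / len(means)

print(round(grand_mean, 2), round(sd_of_means, 2), round(expected_sd_of_means, 2))
print(round(within_one_sd, 2))
```

Although the parent population is flat, the means pile up around 5.5 in roughly the proportions expected of a normal curve, with a spread close to the population standard deviation divided by √6.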
The history of the development of the theorem has been brought together in a useful and eminently readable little book by Adams (1974). The story begins with the definition of probability as long-run relative frequency, the ratio of the number of ways an event can occur to the total number of possible events, given that the events are independent and equally likely. The most important contribution of James Bernoulli's Ars Conjectandi is the first limit theorem of probability theory. The logic of Bernoulli's approach has been the subject of some debate, and we know that he himself wrestled with it. Hacking (1971) states:
No one writes dispassionately about Bernoulli. He has been fathered with the first subjective concept of probability, and with a completely objective concept of probability as relative frequency determined by trials on a chance set-up. He has been thought to favour an inductive theory of probability akin to Carnap's. Yet he is said to anticipate Neyman's confidence interval technique of statistical inference, which is quite opposed to inductive probabilities. In fact Bernoulli was, like so many of us, attracted to many of these seemingly incompatible ideas, and he was unsure where to rest his case. He left his book unpublished. (pp. 209-210)
But Bernoulli's mathematics are not in question. When the probability \(p\) of an event \(E\) is unknown and a sequence of \(n\) trials is observed and the proportion of occurrences of \(E\) is \(E_n\), then Bernoulli maintained that an "instinct of nature" causes us to use \(E_n\) as an estimate of \(p\). Bernoulli's Law of Large Numbers shows that for any small assigned amount \(\epsilon\), the probability that \(|p - E_n| < \epsilon\) increases to 1 as \(n\) increases indefinitely:
It is concluded that in 25,550 trials it is more than one thousand times more likely that the r/t [the ratio of what Bernoulli calls "fertile" events to the total, that is, the event of interest here called E] will fall between 31/50 and 29/50 than that r/t will fall outside these limits. (Bernoulli, 1713/1966, p. 65)
The above results hold for known p. If p is 3/5, then we can be morally certain that ... the deviation ... will be less than 1/50. But Bernoulli's problem is the inverse of this. When p is unknown, can his analysis tell when we can be morally certain that some estimate of p is right? That is the problem of statistical inference. (Hacking, 1971, p. 222)
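Bernoulli's figure can be checked against the normal approximation that later grew out of De Moivre's work. The sketch below is not Bernoulli's own method; it assumes independent trials with known p = 3/5 and uses the normal curve to approximate the chance that the observed proportion falls within 1/50 of p. The function name is invented for the illustration.

```python
import math


def prob_within(p, n, eps):
    """Normal approximation to P(|E_n - p| < eps) for n independent trials.

    Returns (probability inside the limits, probability outside them).
    """
    se = math.sqrt(p * (1 - p) / n)         # standard error of the proportion
    z = eps / se                            # half-width in standard-error units
    outside = math.erfc(z / math.sqrt(2))   # two-tailed area beyond +/- z
    return 1 - outside, outside


if __name__ == "__main__":
    inside, outside = prob_within(3 / 5, 25_550, 1 / 50)
    print("odds of falling inside the limits:", inside / outside)
```

On this approximation the odds in favour of the proportion falling between 29/50 and 31/50 are far greater than one thousand to one, which suggests that Bernoulli's 25,550 trials was a very cautious requirement.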
The determination of \(p\) was of course the problem tackled by De Moivre. He used the integral of \(e^{-x^2}\) to approximate it, but as Adams (1974) and others have noted, there is no direct evidence that implies that De Moivre thought of what is now called the normal distribution as a probability distribution as such. Simpson introduced the notion of a probability distribution of observations, and others, notably Joseph Lagrange (1736-1813) and Daniel Bernoulli
(1700-1782), who was James's nephew, elaborated laws of error. The late 18th century saw the culmination of this work in the development of the normal law of frequency of error and its application to astronomical observations by Gauss. It was Laplace's memoir of 1810 that introduced the central limit theorem, but the nub of his discovery was described in nonmathematical language in his famous Essai published as the introduction to the third edition of Théorie Analytique des Probabilités in 1820:
The general problem consists in determining the probabilities that the values of one or several linear functions of the errors of a very great number of observations are contained within any limits. The law of the possibility of the errors of observations introduces into the expressions of these probabilities a constant, whose value seems to require the knowledge of this law, which is almost always unknown. Happily this constant can be determined from the observations. There often exists in the observations many sources of errors: ... The analysis which I have used leads easily, whatever the number of the sources of error may be, to the system of factors which gives the most advantageous results, or those in which the same error is less probable than in any other system. I ought to make here an important remark. The small uncertainty that the observations, when they are not numerous, leave in regard to the values of the constants ... renders a little uncertain the probabilities determined by analysis. But it almost always suffices to know if the probability, that the errors of the results obtained are comprised within narrow limits, approaches closely to unity; and when it is not, it suffices to know up to what point the observations should be multiplied, in order to obtain a probability such that no reasonable doubt remains ... The analytic formulae of probabilities satisfy perfectly this requirement; ... They are likewise indispensable in solving a great number of problems in the natural and moral sciences. (Laplace, 1820, pp. 192-195 in the translation by Truscott & Emory, 1951)
These are elegant and clear remarks by a genius who succeeded in making the labors of many years intelligible to a wide readership. Adams (1974) gives a brief account of the final formal development of the abstract Central Limit Theorem. The Russian mathematician Alexander Lyapunoff (1857-1918), a pupil of Tchebycheff, provided a rigorous mathematical proof of the theorem. His attention was drawn to the problem when he was preparing lectures for a course in probability theory, and his approach was a triumph. His methods and insights led, in the 1920s, to many valuable contributions and even more powerful theorems in probability mathematics.
10
Comparisons, Correlations and Predictions
COMPARING MEASUREMENTS
Since the time of De Moivre, the variables that have been examined by workers in the field of probability have expressed measurements as multiples of a variety of basic units that reflect the dispersion of the range of possible scores. Today the chosen units are units of standard deviation, and the scores obtained are called standard scores or z scores. Karl Pearson used the term standard deviation and gave it the symbol \(\sigma\) (the lower case Greek letter sigma) in 1894, but the unit was known (although not in its present-day form) to De Moivre. It corresponds to that point on the abscissa of a normal distribution such that an ordinate erected from it would cut the curve at its point of inflection, or, in simple terms, the point where the curvature of the function changes from concave to convex. Sir George Airy (1801-1892) named \(\sigma\sqrt{2}\) the modulus (although this term had been used, in passing, for \(\sqrt{n}\) by De Moivre as early as 1733) and described a variety of other possible measures, including the probable error, in 1875.¹ This latter was the unit chosen by Galton (whose work is discussed later), although he objected strongly to the name:
It is astonishing that mathematicians, who are the most precise and perspicacious of men, have not long since revolted against this cumbrous, slip-shod, and misleading phrase. ... Moreover the term Probable Error is absurd when applied to the subjects
1 The probable error is defined as one half of the quantity that encompasses the middle 50% of a normal distribution of measurements. It is equivalent to what is sometimes called the semi-interquartile range (that portion of the distribution between the first quartile or the twenty-fifth percentile, and the third quartile or the seventy-fifth percentile, divided by two). The probable error is 0.67449 times the standard deviation. Nowadays, everyone follows Pearson (1894), who wrote, "I have always found it more convenient to work with the standard deviation than with the probable error or the modulus, in terms of which the error function is usually tabulated" (p. 88).
now in hand, such as Stature, Eye-colour, Artistic Faculty, or Disease. I shall therefore usually speak of Prob. Deviation. (Galton, 1889, p. 58)
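The relation between the two units discussed above, the standard deviation and the probable error, can be sketched in a few lines. The example assumes a normal distribution, for which the probable error is the distance from the mean to the 75th percentile; the function names and the measurement used in the demonstration are hypothetical and are there only for illustration.

```python
from statistics import NormalDist

# Ratio of the probable error to the standard deviation for a normal
# distribution: the 75th percentile of the standard normal curve.
PE_PER_SD = NormalDist().inv_cdf(0.75)


def standard_score(x, mean, sd):
    """Express a measurement as a deviation from the mean in SD units (a z score)."""
    return (x - mean) / sd


def probable_error(sd):
    """Probable error of a normal distribution with standard deviation sd."""
    return PE_PER_SD * sd


def in_probable_errors(x, mean, sd):
    """Express a measurement as a deviation from the mean in probable-error units."""
    return (x - mean) / probable_error(sd)


if __name__ == "__main__":
    print(round(PE_PER_SD, 5))  # 0.67449
    # Hypothetical measurement: 72 units, with mean 68 and SD 2.5.
    print(standard_score(72, 68, 2.5))
    print(in_probable_errors(72, 68, 2.5))
```

A deviation always looks larger in probable-error units than in standard-deviation units, by a factor of about 1.48, because the probable error is the smaller unit.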
This objection reflects Galton's determination, and that of his followers, to avoid the use of the concept of error in describing the variation of human characteristics. It also foreshadows the well-nigh complete replacement of probable error with standard deviation, and law of frequency of error with normal distribution, developments that reflect philosophical dispositions rather than mathematical advance. Figure 10.1 shows the relationship between standard scores and probable error.
Perhaps a word or two of elaboration and example will illustrate the utility of measurements made in units of variability. It may seem trite to make the statement that individual measurements and quantitative descriptions are made for the purpose of making comparisons. Someone who pays $1,000 for a suit has bought an expensive suit, as well as having paid a great deal more than the individual who has picked up a cheap outfit for only $50. The labels "expensive" and "cheap" are applied because the suit-buying population carries around with it some notion of the average price of suits and some notion of the range of
FIG. 10.1 The Normal Distribution - Standard Scores and the Probable Error
prices of suits. This obvious fact would be made even more obvious by the reaction to an announcement that someone had just purchased a new car for $1,000. What is a lot of money for a suit suddenly becomes almost trifling for a brand new car. Again this judgment depends on a knowledge, which may not be at all precise, of the average price, and the price range, of automobiles. One can own an expensive suit and a cheap car and have paid the same absolute amount for both, although it must be admitted that such a juxtaposition of purchases is unlikely! The point is that these examples illustrate the fundamental objective of standard scores, the comparison of measurements. If the mean price of cars is $9,000 and the standard deviation is $2,500 then our $1,000 car has an equivalent z score of ($1,000 - $9,000)/$2,500 or -3.20. If suit prices have a mean of $350 and a standard deviation of $150, then the $1,000 suit has an equivalent z score of ($1,000 - $350)/$150 or +4.33. We have a very cheap car and a very, very expensive suit. It might be added that, at the time of writing, these figures were entirely hypothetical.
These simple manipulations are well-known to users of statistics, and, of course, when they are applied in conjunction with the probability distributions of the measurements, they enable us to obtain the probability of occurrence of particular scores or particular ranges of scores. They are intuitively sensible. A more challenging problem is that of the comparison of sets of pairs of scores and of determining a quantitative description of the relationship between them. It is a problem that was solved during the second half of the 19th century.
GALTON'S DISCOVERY OF "REGRESSION"
Sir Francis Galton (his work was recognized by the award of a knighthood in 1909) has been described as a Victorian genius. If we follow Edison's oft-quoted definition² of this condition then there can be no quarrel with the designation, but Galton was a man of great flair as well as great energy and his ideas and innovations were many and varied.
In a long life he produced over 300 publications, including 17 books. But, in one of those odd happenings in the history of human invention, it was Galton's discovery and subsequent misinterpretation of a statistical artifact that marks the beginning of the technique of correlation as we now know it.
1860 was the year of the famous debate on Darwin's work between Thomas Huxley (1825-1895) and Bishop Samuel Wilberforce (1805-1873). This encounter, which held the attention of the nation, took place at the meeting of the British Association (for the Advancement of Science) held that year at Oxford. Galton attended the meeting, and although we do not know what role he had at it, we do know that he later became an ardent supporter of his cousin's theories.
2 In 1932, Thomas Alva Edison said that "Genius is one percent inspiration and ninety-nine per cent perspiration."
The Origin of Species, he said, "made a marked epoch on my development, as it did in that of human thought generally" (Galton, 1908, p. 287).
Cowan (1977) notes that, although Galton had made similar assertions before, the impact of The Origin must have been retrospective, describing his initial reaction to the book as, "pedestrian in the extreme." She also asserts that, "Galton never really understood the argument for evolution by natural selection, nor was he interested in the problem of the creation of new species" (Cowan, 1977, p. 165). In fact, it is apparent that Galton was quite selective in using his cousin's work to support his own view of the mechanisms of heredity.
The Oxford debate was not just a debate about evolution. MacKenzie (1981) describes Darwin's book as the "basic work of Victorian scientific naturalism" (p. 54), the notion of the world, and the human species and its works, as part of rational scientific nature, needing no recourse to the supernatural to explain the mysteries of existence. Naturalism has its origins in the rise of science in the 17th and 18th centuries, and its opponents expressed their concern, because of its attack, implicit and overt, on traditional authority. As one 19th century writer notes, for example:
Wider speculations as to morality inevitably occur as soon as the vision of God becomes faint; when the Almighty retires behind second causes, instead of being felt as an immediate presence, and his existence becomes the subject of logical proof. (Stephen, 1876, Vol II, p. 2).
The return to nature espoused by writers such as Jean Jacques Rousseau (1712-1778) and his followers would have rid the world of kings and priests and aristocrats whose authority rested on tradition and instinct rather than reason, and thus, they insisted, have brought about a simple "natural" state of society. MacKenzie (1981), citing Turner (1978), adds a very practical note to this philosophy. The battle was not just about intellectual abstractions but about who should have authority and control and who should enjoy the material advantages that flow from the possession of that authority. Scientific naturalism was the weapon of the middle class in its struggle for power and authority based on intellect and merit and professional elitism, and not on patronage or nobility or religious affiliation.
These ideas, and the new biology, certainly turned Galton away from religion, as well as providing him with an abiding interest in heredity. Forrest (1974) suggests that Galton's fascination for, and work on, heredity coincided with the realization that his own marriage would be infertile. This also may have been a factor in the mental breakdown that he suffered in 1866. "Another possible precipitating factor was the loss of his religious faith which left him with no compensatory philosophy until his programme for the eugenic improvement of mankind became a future article of faith" (Forrest, 1974, p. 85).
During the years 1866 to 1869 Galton was in generally poor health, but he collected the material for, and wrote, one of his most famous books, Hereditary Genius, which was published in 1869. In this book and in English Men of Science, published in 1874, Galton expounds and expands upon his view that ability, talent, intellectual power, and accompanying eminence are innately rather than environmentally determined. The corollary of this view was, of course, that agencies of social control should be established that would encourage the "best stock" to have children, and to discourage those whom Galton described as having the smallest quantities of "civic worth" from breeding. It has been mentioned that Galton was a collector of measurements to a degree that was almost compulsive, and it is certain that he was not happy with the fact that he had had to use qualitative rather than quantitative data in support of his arguments in the two books just cited. MacKenzie (1981) maintains that:
the needs of eugenics in large part determined the content of Galton's statistical theory.... If the immediate problems of eugenics research were to be solved, a new theory of statistics, different from that of the previously dominant error theorists had to be constructed. (MacKenzie, 1981, p. 52)
Galton embarked on a comparative study of the size and weight of sweet pea³ seeds over two generations, but, as he later remarked, "It was anthropological evidence that I desired, caring only for the seeds as means of throwing light on heredity in man. I tried in vain for a long and weary time to obtain it in sufficient abundance" (Galton, 1885a, p. 247).
Galton began by weighing, and measuring the diameters of, thousands of sweet pea seeds. He computed the mean and the probable error of the weights of these seeds and made up packets of 10 seeds, each of the seeds being exactly the same weight. The smallest packet contained seeds weighing the mean minus three times the probable error, the next the mean minus twice the probable error, and so on up to packets containing the largest seeds weighing the mean plus three times the probable error. Sets of the seven packets were sent to friends across the length and breadth of Britain with detailed instructions on how they were to be planted and nurtured. There were two crop failures, but the produce of seven harvests provided Galton with the data for a Royal Institution lecture, "Typical Laws of Heredity," given in 1877. Complete data for Galton's experiment are not available, but he observed what he stated to be a simple law that connected parent and offspring seeds. The offspring of each of the parental
3 In Memories of My Life (1908), Galton says that he determined on experimenting with sweet peas in 1885 and that the suggestion had come to him from Sir Joseph Hooker (the botanist, 1817-1911) and Darwin. But Darwin had died in 1882 and the experiments must have been suggested and were begun in 1875. Assuming that this is not just a typographical error, perhaps Galton was recalling the date of his important paper on regression in hereditary stature.
weight categories had weights that were what we would now call normally distributed, and the probable error (we would now calculate the standard deviation) was the same. However, the mean weight of each group of offspring was not as extreme as the parental weight. Large parent seeds produced larger than average seeds, but mean offspring weight was not as large as parental weight. At the other extreme small parental seeds produced, on average, smaller offspring seeds, but the mean of the offspring was found not to be as small as that of the parents. This phenomenon Galton termed reversion. Seed diameters, Galton noted, are directly proportional to their weight, and show the same effect:
By family variability is meant the departure of the children of the same or similarly descended families, from the ideal mean type of all of them. Reversion is the tendency of that ideal mean filial type to depart from the parent type, "reverting" towards what may be roughly and perhaps fairly described as the average ancestral type. If family variability had been the only process in simple descent that affected the characteristics of a sample, the dispersion of the race from its mean ideal type would indefinitely increase with the number of generations; but reversion checks this increase, and brings it to a standstill. (Galton, 1877, p. 291)
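The pattern Galton describes, offspring means that are less extreme than the parental value together with the same spread within every group, can be reproduced with a small simulation. This is a sketch under stated assumptions rather than a reconstruction of Galton's data, which are not fully available: offspring deviations are generated as r times the parental deviation plus normally distributed "family variability", the value r = 1/3 is chosen only for illustration, and the seven parental classes are placed at -3 to +3 units of variability.

```python
import random
import statistics


def simulate_reversion(r=1 / 3, per_class=2000, seed=1):
    """Return {parental deviation: (mean offspring deviation, offspring SD)}.

    All deviations are in units of the population's variability. The residual
    SD is chosen so that the offspring generation as a whole has the same
    spread as the parental generation.
    """
    rng = random.Random(seed)
    residual_sd = (1 - r ** 2) ** 0.5
    results = {}
    for parent in range(-3, 4):  # seven parental classes
        offspring = [
            r * parent + residual_sd * rng.gauss(0, 1)
            for _ in range(per_class)
        ]
        results[parent] = (statistics.mean(offspring), statistics.stdev(offspring))
    return results


if __name__ == "__main__":
    for parent, (mean, sd) in simulate_reversion().items():
        print(f"parent {parent:+d}: offspring mean {mean:+.2f}, SD {sd:.2f}")
```

Each group mean lies about one third of the way from the population mean to the parental value, while the spread within the groups is practically identical, which is the "simple law" in outline.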
In the 1877 paper, Galton gives a measure of reversion, which he symbolized r, and arrives at a number of the basic properties of what we now call regression. Some years passed before Galton returned to the topic, but during those years he devised ways of obtaining the anthropometric data he wanted. He offered prizes for the most detailed accounts of family histories of physical and mental characteristics, character and temperament, occupations and illnesses, height and appearance, and so on, and in 1884 he opened, at his own expense, an anthropometric laboratory at the International Health Exhibition. For a small sum of money, members of the public were admitted to the laboratory. In return the visitors received a record of their various physical dimensions, measures of strength, sensory acuities, breathing capacity, color discrimination, and judgments of length. 9,337 persons were measured, of whom 4,726 were adult males and 1,657 adult females. At the end of the exhibition, Galton obtained a site for the laboratory at the South Kensington Museum, where data continued to be collected for close to 8 more years. These data formed part of a number of papers. They were, of course, not entirely free from errors due to apparatus failure, the circumstances of the data recording, and other factors that are familiar enough to experimentalists of the present day. Some artifacts might have been introduced in other ways. Galton (1884) commented, "Hardly any trouble occurred with the visitors, though on some few occasions rough persons entered the laboratory who were apparently not altogether sober" (p. 206).
In 1885, Galton's Presidential Address to the Anthropological Section of
FIG. 10.2 Plate from Galton's 1885a paper (Journal of the Anthropological Institute).
FIG. 10.3 Plate from Galton's 1885a paper (Journal of the Anthropological Institute).
the British Association (1885b), meeting that year in Aberdeen, Scotland, discussed the phenomenon of what he now termed regression toward mediocrity in human hereditary stature. An extended paper (1885a) in the Journal of the Anthropological Institute gives illustrations, which are reproduced here.
First it may be noted that Galton used a measure of parental heights that he termed the height of the "mid-parent." He multiplied the mother's height by 1.08 and took the mean of the resulting value and the father's height to produce the mid-parent value.⁴ He found that a deviation from mediocrity of one unit of height in the parents was accompanied by a deviation, on average, of about only two-thirds of a unit in the children (Fig. 10.2). This outcome paralleled what he had observed in the sweet pea data.
When the frequencies of the (adult) children's measurements were entered into a matrix against mid-parent heights, the data being "smoothed" by computing the means of four adjacent cells, Galton noticed that values of the same frequency fell on a line that constituted an ellipse. Indeed, the data produced a series of ellipses all centered on the mean of the measurements. Straight lines drawn from this center to points on the ellipse that were maximally distant (the points of contact of the horizontal and vertical tangents - the lines YN and XM in Fig. 10.3) produce the regression lines, ON and OM, and the slopes of these lines give the regression values of 2/3 and 1/3.
The elliptic contours, which Galton said he noticed when he was pondering on his data while waiting for a train, are nothing more than the contour lines that are produced from the horizontal sections of the frequency surface generated by two normal distributions (Fig. 10.4). The time had now come for some serious mathematics.
All the formulae for Conic Sections having long since gone out of my head, I went on my return to London to the Royal Institution to read them up.
Professor, now Sir James, Dewar, came in, and probably noticing signs of despair on my face, asked me what I was about; then said, "Why do you bother over this? My brother-in-law, J. Hamilton Dickson of Peterhouse loves problems and wants new ones. Send it to him." I did so, under the form of a problem in mechanics, and he most cordially helped me by working it out, as proposed, on the basis of the usually accepted and generally justifiable Gaussian Law of Error. (Galton, 1908, pp. 302-303)
I may be permitted to say that I never felt such a glow of loyalty and respect towards the sovereignty and magnificent sway of mathematical analysis as when his answer
4 Galton (1885a) maintains that this factor "differs a very little from the factors employed by other anthropologists, who, moreover, differ a trifle between themselves; anyhow it suits my data better than 1.07 or 1.09" (p. 247). Galton also maintained (and checked in his data) "that marriage selection takes little or no account of shortness or tallness ... we may therefore regard the married folk as couples picked out of the general population at haphazard" (1885a, pp. 250-251) - a statement that is not only implausible to anyone who has casually observed married couples but is also not borne out by reasonably careful investigation.
FIG. 10.4 Frequency Surfaces and Ellipses (From Yule & Kendall, 14th Ed., 1950)
reached me, confirming, by purely mathematical reasoning, my various and laborious statistical conclusions with far more minuteness than I had dared to hope, for the original data ran somewhat roughly, and I had to smooth them with tender caution. (Galton, 1885b, p. 509)
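The geometry that Galton sent to Dickson can be sketched in modern notation. What follows is a reconstruction, not Dickson's own working: it assumes a bivariate normal frequency surface, writes \(x\) and \(y\) for deviations from the two means, and introduces the symbols \(\sigma_x\), \(\sigma_y\) and \(\rho\) (standard deviations and correlation), none of which appear in the original papers.

```latex
% A contour of equal frequency (the constant c > 0 picks out one ellipse):
\frac{x^2}{\sigma_x^2} - \frac{2\rho\,xy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2} = c

% Differentiating implicitly with respect to x:
\frac{2x}{\sigma_x^2}
  - \frac{2\rho}{\sigma_x\sigma_y}\left(y + x\,\frac{dy}{dx}\right)
  + \frac{2y}{\sigma_y^2}\,\frac{dy}{dx} = 0

% Points of contact of the horizontal tangents (dy/dx = 0):
x = \rho\,\frac{\sigma_x}{\sigma_y}\,y

% Points of contact of the vertical tangents (dx/dy = 0), by symmetry:
y = \rho\,\frac{\sigma_y}{\sigma_x}\,x
```

The two loci are straight lines through the center, and they are the regression of \(x\) on \(y\) and of \(y\) on \(x\) respectively. The product of the two slopes is \(\rho^2\), so regression values of 2/3 and 1/3 would correspond to \(\rho^2 = 2/9\), or a correlation of about 0.47 between mid-parent and offspring stature.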
Now Galton was certainly not a mathematical ignoramus, but the fact that one of statistics' founding fathers sought help for the analysis of his momentous discovery may be of some small comfort to students of the social sciences who sometimes find mathematics such a trial. Another breakthrough was to come, and again, it was to culminate in a mathematical analysis, this time by Karl Pearson, and the development of the familiar formula for the correlation coefficient.
In 1886, a paper on "Family Likeness in Stature," published in the Proceedings of the Royal Society, presents Hamilton Dickson's contribution, as well as data collected by Galton from family records. Of passing interest is his use of the symbol w for "the ratio of regression" (short-lived, as it turns out) as he details correlations between pairs of relatives.
Galton's work had led him to a method of describing the relationship between parents and offspring and between other relatives on a particular characteristic by using the regression slope. Now he applied himself to the task of quantifying the relationship between different characteristics, the sort of data collected at the anthropometric laboratory. It dawned on him in a flash of insight that if each characteristic was measured on a scale based on its own variability (in other words, in what we now call standard scores), then the regression coefficient could be applied to these data. It was noted in chapter 1 that the location of this illumination was perhaps not the place that Galton recalled in his memoirs (written when he was in his eighties) so that the commemorative tablet that Pearson said the discovery deserved will have to be sited carefully.
Before examining some of the consequences of Galton's inspiration, the phenomenon of regression is worth a further look.
Arithmetically it is real enough, but, as Forrest (1974) puts it:
It is not that the offspring have been forced towards mediocrity by the pressure of their mediocre remote ancestry, but a consequence of a less than perfect correlation between the parents and their offspring. By restricting his analysis to the offspring of a selected parentage and attempting to understand their deviations from the mean Galton fails to account for the deviation of all offspring. ... Galton's conclusion is that regression is perpetual and that the only way in which evolutionary change can occur is through the occurrence of sports. (p. 206)⁵
5 In this context the term "sport" refers to an animal or a plant that differs strikingly from its species type. In modern parlance we would speak of "mutations." It is ironic that the Mendelians, led by William Bateson (1861-1926), used Galton's work in support of their argument that evolution was discontinuous and saltatory, and that the biometricians, led by Pearson and Weldon, who held fast to the notion that continuous variation was the fountainhead of evolution, took inspiration from the same source.
Wallis and Roberts (1956) present a delightful account of the fallacy and give examples of it in connection with company profits, family incomes, mid-term and final grades, and sales and political campaigns. As they say:
Take any set of data, arrange them in groups according to some characteristic, and then for each group compute the average of some second characteristic. Then the variability of the second characteristic will usually appear to be less than that of the first characteristic. (p. 263)
In statistics the term regression now means prediction, a point of confusion for many students unfamiliar with its history. Yule and Kendall (1950) observe:
The term "regression" is not a particularly happy one from the etymological point of view, but it is so firmly embedded in statistical literature that we make no attempt to replace it by an expression which would more suitably express its essential properties. (p. 213)
The word is now part of the statistical arsenal, and it serves to remind those of us who are involved in its application of an important episode in the history of the discipline.
GALTON'S MEASURE OF CO-RELATION
On December 20th, 1888, Galton's paper, Co-relations and Their Measurement, Chiefly from Anthropometric Data, was read before the Royal Society of London. It begins:
"Co-relation or correlation of structure" is a phrase much used in biology, and not least in that branch of it which refers to heredity, and the idea is even more frequently present than the phrase; but I am not aware of any previous attempt to define it clearly, to trace its mode of action in detail, or to show how to measure its degree. (Galton, 1888a, p. 135)
He goes on to state that the co-relation between two variable organs must be due, in part, to common causes, that if variation was wholly due to common causes then co-relation would be perfect, and if variation "were in no respect due to common causes, the co-relation would be nil" (p. 135). His aim then is to show how this co-relation may be expressed as a simple number, and he uses as an illustration the relationship between the left cubit (the distance between the elbow of the bent left arm and the tip of the middle finger) and stature, although he presents tables showing the relationships between a variety of other physical measurements. His data are drawn from measurements made on 350 adult males at the anthropometric laboratory, and then, as now, there were missing data. "The exact number of 350 is not preserved throughout, as injury
to some limb or other reduced the available number by 1, 2, or 3 in different cases" (Galton, 1888a, p. 137).
After tabulating the data in order of magnitude, Galton noted the values at the first, second, and third quartiles. One half the value obtained by subtracting the value at the first from the value at the third quartile gives him Q, which, he notes, is the probable error of any single measure in the series, and the value at the second quartile is the median. For stature he obtained a median of 67.2 inches and a Q of 1.75, and for the left cubit, a median of 18.05 inches and a Q of 0.56. It should be noted that although Galton calculated the median, he refers to it as being practically the mean value, "because the series run with fair symmetry" (p. 137). Galton clearly recognized that these manipulations did not demand that the original units of measurement be the same:
It will be understood that the Q value is a universal unit applicable to the most varied measurements, such as breathing capacity, strength, memory, keenness of eyesight, and enables them to be compared together on equal terms notwithstanding their intrinsic diversity. (Galton, 1888a, p. 137)
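Galton's two summary values, the median and Q, and the conversion of a measurement into Q units, can be sketched as follows. The sketch takes Q to be the semi-interquartile range, as in the footnote on the probable error, and relies on the quartile convention of Python's statistics.quantiles, which need not match the way Galton read quartiles from his tabulated series. The function names, the seven-value series, and the stature of 70.7 inches are hypothetical; only the median of 67.2 and the Q of 1.75 come from the text.

```python
import statistics


def median_and_q(values):
    """Return (median, Q), where Q is half the distance between the quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q2, (q3 - q1) / 2


def in_q_units(x, median, q):
    """Express a measurement as a deviation from the median in units of Q."""
    return (x - median) / q


if __name__ == "__main__":
    # A small made-up series, used only to show the calculation.
    print(median_and_q([1, 2, 3, 4, 5, 6, 7]))
    # A hypothetical stature of 70.7 inches against the reported median and Q.
    print(in_q_units(70.7, 67.2, 1.75))
```

Because the result is a pure number, a stature and a cubit expressed in this way can be set side by side even though one is measured on a scale several times larger than the other.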
Perhaps in an unconscious anticipation of the inevitable universality of the metric system, he also records his data on physical dimensions in centimeters. Figure 10.5 reproduces the data (Table III in Galton's paper) from which the closeness of the co-relation between stature and cubit was calculated.
A graph was plotted of stature, measured in deviations from \(M_s\) in units of \(Q_s\), against the mean of the corresponding left cubits, again measured as deviations, this time from \(M_c\) in units of \(Q_c\) (column A against column B in the table). Between the same axes, left cubit was plotted as a deviation from \(M_c\) measured in units of \(Q_c\) against the mean of corresponding statures measured as deviations from \(M_s\) in units of \(Q_s\) (columns C and D in the table). A line is then drawn to represent "the general run" of the plotted points:
It is here seen to be a straight line, and it was similarly found to be straight in every other figure drawn from the different pairs of co-related variables that I have as yet tried. But the inclination of the line to the vertical differs considerably in different cases. In the present one the inclination is such that a deviation of 1 on the part of the subject [the ordinate values], whether it be stature or cubit, is accompanied by a mean deviation on the part of the relative [the values of the abscissa], whether it be cubit or stature, of 0.8. This decimal fraction is consequently the measure of the closeness of the co-relation. (Galton, 1888a, p. 140)
Galton also calculates the predicted values from the regression line. He takes what he terms the "smoothed" (i.e., read from the regression line) value for a given deviation measure in units of \(Q_c\) or \(Q_s\), multiplies it by \(Q_c\) or \(Q_s\), and adds the result to the mean \(M_c\) or \(M_s\). For example, +1.30(0.56) + 18.05 = 18.8. In modern terms, he computes \(z'(s) + \bar{X} = X'\). It is illuminating to recompute and
FIG. 10.5 Galton's Data 1888. Proceedings of the Royal Society.
to replot Galton's data and to follow his line of statistical reasoning.
Finally, Galton returns to his original symbol r, to represent degree of co-relation, the symbol which we use today, and redefines \(f = \sqrt{1 - r^2}\) as "the Q value of the distribution of any system of x values, as \(x_1, x_2, x_3\), &c., round the mean of all of them, which we may call X" (p. 144), which mirrors our modern-day calculation of the standard error of estimate, and which Galton had obtained in 1877.
This short paper is not a complete account of the measurement of correlation. Galton shows that he has not yet arrived at the notion of negative correlation nor of multiple correlation, but his concluding sentences show just how far he did go:
Let y = the deviation of the subject, whichever of the two variables may be taken in that capacity; and let \(x_1, x_2, x_3\), &c., be the corresponding deviations of the relative, and let the mean of these be X. Then we find: (1) that y = rX for all values of y; (2) that r is the same whichever of the two variables is taken for the subject; (3) that r is always less than 1; (4) that r measures the closeness of co-relation. (Galton, 1888a, p. 145)
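The arithmetic in this part of Galton's paper is easily recomputed. The sketch below uses only figures quoted above (the smoothed deviation of +1.30, the cubit Q of 0.56, the cubit median of 18.05, and r = 0.8); the function names are invented for the illustration and are not Galton's terms.

```python
import math


def predicted_value(smoothed_deviation, q, median):
    """Turn a deviation read from the regression line (in Q units) back into raw units."""
    return smoothed_deviation * q + median


def residual_spread(r):
    """Galton's f = sqrt(1 - r^2): the spread about the predicted value, in Q units."""
    return math.sqrt(1 - r ** 2)


if __name__ == "__main__":
    cubit = predicted_value(1.30, 0.56, 18.05)
    print(round(cubit, 1))                 # 18.8, as in Galton's example
    print(round(residual_spread(0.8), 3))  # 0.6
```

With r = 0.8 the values of one measure still scatter about their predicted value with a Q of six tenths of the unselected Q, which is the quantity now computed as the standard error of estimate.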
Chronologically, Galton's final contribution to regression and heredity is his book Natural Inheritance, published in 1889. This book was completed several months before the 1888 paper on correlation and contains none of its important findings. It was, however, an influential book that repeats a great deal of Galton's earlier work on the statistics of heredity, the sweet pea experiments, the data on stature, the records of family faculties, and so on. It was enthusiastically received by Walter Weldon (1860-1906), who was then a Fellow of St John's College, Cambridge, and University Lecturer in Invertebrate Morphology, and it pointed him toward quantitative solutions to problems in species variation that had been occupying his attention. This book linked Galton with Weldon, and Weldon with Pearson, and then Pearson with Galton, a concatenation that began the biometric movement.
THE COEFFICIENT OF CORRELATION
In June 1890, Weldon was elected a Fellow of the Royal Society, and later that year became Jodrell Professor of Zoology at University College, London. In March of the same year the Royal Society had received the first of his biometric papers that describes the distribution of variations in a number of measurements made on shrimps. A Marine Biological Laboratory had been constructed at Plymouth 2 years earlier, and since that time Weldon had spent part of the year there collecting measurements of the physical dimensions of these creatures and their organs. The Royal Society paper had been sent to Galton for review, and with his help the statistical analyses had been reworked. This marked the
10. COMPARISONS, CORRELATIONS AND PREDICTIONS
beginning of Weldon's friendship with Galton and a concentration on biometric work that lasted for the rest of his life. In fact, the paper does not present the results of a correlational analysis, although apparently one had been carried out.
I have attempted to apply to the organs measured the test of correlation given by Mr. Galton ... and the result seems to show that the degree of correlation between two organs is constant in all the races examined; Mr. Galton has, in a letter to myself, predicted this result. A result of this kind is, however, so important to the general theory of heredity, that I prefer to postpone a discussion of it until a larger body of evidence has been collected. (Weldon, 1890, p. 453)
The year 1892 saw these analyses published. Weldon begins his paper by summarizing Galton's methods. He then describes the measurements that he made, presents extensive tables of his calculations, the degree of correlation between the pairs of organs, and the probable error of the distributions ($Q\sqrt{1 - r^2}$). The actual calculation of the degree of correlation departs somewhat from Galton's method, although, because Galton was in close touch with Weldon, we can assume that it had his blessing. It is quite straightforward and is here quoted in full:
(1.) . . . let all those individuals be chosen in which a certain organ, A, differs from its average size by a fixed amount, Y; then, in these individuals, let the deviations of a second organ, B, from its average be measured. The various individuals will exhibit deviations of B equal to x_1, x_2, x_3, ..., whose mean may be called x_m. The ratio x_m/Y will be constant for all values of Y. In the same way, suppose those individuals are chosen in which the organ B has a constant deviation, X; then, in these individuals, y_m, the mean deviation of the organ A, will have the same ratio to X, whatever may be the value of X.
(2.) The ratios x_m/Y and y_m/X are connected by an interesting relation. Let Q_a represent the probable error of distribution of the organ A about its average, and Q_b that of the organ B; then
$\frac{x_m}{Y} \cdot \frac{Q_a}{Q_b} = \frac{y_m}{X} \cdot \frac{Q_b}{Q_a} =$ a constant.
So that by taking a fixed deviation of either organ, expressed in terms of its probable error, and by expressing the mean associated deviation of the second organ in terms of its probable error, a ratio may be determined, whose value becomes ± 1 when a change in either organ involves an equal change in the other, and 0 when the two organs are quite independent. This constant, therefore, measures the "degree of correlation" between the two organs. (Weldon, 1892, p. 3)
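Weldon's verbal recipe, a fixed deviation of one organ and the mean associated deviation of the other, each expressed in units of its own probable error, can be sketched as follows. The function name and all of the numbers are hypothetical and chosen only so that the two directions of the calculation can be compared; they are not Weldon's data.

```python
def weldon_ratio(mean_associated_dev, fixed_dev, q_fixed, q_associated):
    """Ratio of the mean associated deviation to the fixed deviation,
    each divided by the probable error of its own organ.
    """
    if fixed_dev == 0 or q_fixed == 0 or q_associated == 0:
        raise ValueError("deviation and probable errors must be non-zero")
    return (mean_associated_dev / q_associated) / (fixed_dev / q_fixed)


# Hypothetical organs: A has probable error 2.0, B has probable error 1.0.
Q_A, Q_B = 2.0, 1.0

# Individuals chosen with A deviating by Y = 4.0 show a mean B deviation of 1.2.
from_a = weldon_ratio(mean_associated_dev=1.2, fixed_dev=4.0,
                      q_fixed=Q_A, q_associated=Q_B)

# Individuals chosen with B deviating by X = 2.0 show a mean A deviation of 2.4.
from_b = weldon_ratio(mean_associated_dev=2.4, fixed_dev=2.0,
                      q_fixed=Q_B, q_associated=Q_A)

print(from_a, from_b)
```

In this made-up case both directions give the same constant, which is the property Weldon relies on when he treats the ratio as the degree of correlation.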
In 1893, more extensive calculations were reported on data collected from two large (each of 1,000 adult females) samples of crabs, one from the Bay of Naples and the other from Plymouth Sound. In this work, Weldon (1893) computes the mean, the mean error, and the modulus. His greater mathematical sophistication in this work is evident. He states:
The probable error is given below, instead of the mean error, because it is the constant which has the smallest numerical value of any in general use. This property renders the probable error more convenient than either the mean error, the modulus, or the error of mean squares, in the determination of the degree of correlation which will be described below. (Weldon, 1893, pp. 322-323)
Weldon found that in the Naples specimens the distribution of the "frontal breadth" produced what he terms an "asymmetrical result." This finding he hoped might arise from the presence in the sample of two races of individuals. He notes that Karl Pearson had tested this supposition and found that it was likely. Pearson (1906) states in his obituary of Weldon that it was this problem that led to his (Pearson's) first paper in the Mathematical Contributions to the Theory of Evolution series, received by the Royal Society in 1893. Weldon defined r and attempted to name it for Galton:
a measure of the degree to which abnormality in one organ is accompanied by abnormality in a second. It becomes ±1 when a change in one organ involves an equal change in the other, and 0 when the two organs are quite independent. The importance of this constant in all attempts to deal with the problems of animal variation was first pointed out by Mr. Galton ... the constant ... may fitly be known as "Galton's function." (Weldon, 1893, p. 325)
The statistics of heredity and a mutual interest in the plans for the reform of the University of London (which are described by Pearson in Weldon's obituary) drew Pearson and Weldon together, and they were close friends and colleagues until Weldon's untimely death. Weldon's primary concern was to make his discipline, particularly as it related to evolution, a more rigorous science by introducing statistical methods. He realized that his own mathematical abilities were limited, and he tried, unsuccessfully, to interest Cambridge mathematical colleagues in his endeavor. His appointment to the University College Chair brought him into contact with Pearson, but, in the meantime, he attempted to remedy his deficiencies by an extensive study of mathematical probability. Pearson (1906) writes:
Of this the writer feels sure, that his earliest contributions to biometry were the direct results of Weldon's suggestions and would never have been carried out without his inspiration and enthusiasm. Both were drawn independently by Galton's Natural Inheritance to these problems. (p. 20)
Pearson's motivation was quite different from that of Weldon. MacKenzie (1981) has provided us with a fascinating account of Pearson's background, philosophy, and political outlook. He was allied with the Fabian socialists 6; he held strong views on women's rights; he was convinced of the necessity of adopting rational scientific approaches to a range of social issues; and his advocacy of the interests of the professional middle-class sustained his promotion of eugenics.
His originality, his real transformation rather than re-ordering of knowledge, is to be found in his work in statistical biology, where he took Galton's insights and made out of them a new science. It was the work of his maturity - he started it only in his mid-thirties - and in it can be found the flowering of most of the major concerns of his youth. (MacKenzie, 1981, pp. 87-88)
It can be clearly seen that Pearson was not merely providing a mathematical apparatus for others to use ... Pearson's point was essentially a political one: the viability, and indeed superiority to capitalism, of a socialist state with eugenically-planned reproduction. The quantitative statistical form of his argument provided him with convincing rhetorical resources. (MacKenzie, 1981, p. 91)
MacKenzie is at pains to point out that Pearson did not consciously set out to found a professional middle-class ideology. His analysis confines itself to the view that here was a match of beliefs and social interests that fostered Pearson's unique contribution, and that this sort of sociological approach may be used to assess the work of such exceptional individuals. Pearson did not seek to become the leader of a movement; indeed, the compromises necessary for such an aspiration would have been anathema to what he saw as the role of the scientist and the intellectual.
However one assesses the operation of the forces that molded Pearson's work, it is clear that, from a purely technical standpoint, the introduction, at this juncture, of an able, professional mathematician 7 to the field of statistical methods brought about a rapid advance and a greatly elevated sophistication. The third paper in the Mathematical Contributions series was read before the Royal Society in November 1895. It is an extensive paper which deals with, among other things, the general theory of correlation. It contains a number of historical misinterpretations and inaccuracies that Pearson (1920) later attempted to rectify, but for today's users of statistical methods it is of crucial importance, for it presents the familiar deviation score formula for the coefficient of correlation. It might be mentioned here that the latter term had been introduced for r by F. Y. Edgeworth (1845-1926) in an impossibly difficult-to-follow paper published in 1892. Edgeworth was Drummond Professor of Political Economy at Oxford from 1891 until his retirement in 1922, and it is of some interest to note that he had tried to attract Pearson to mathematical economics, but without success. Pearson (1920) says that Edgeworth was also recruited to correlation by Galton's Natural Inheritance, and he remained in close touch with the biometricians over many years.
Pearson's important paper (published in the Philosophical Transactions of the Royal Society in 1896) warrants close examination. He begins with an introduction that states the advantages and limitations of the statistical approach, pointing out that it cannot give us precise information about relationships between individuals and that nothing but means and averages and probabilities with regard to large classes can be dealt with.
On the other hand, the mathematical theory will be of assistance to the medical man by answering, inter alia, in its discussion of regression the problem as to the average effect upon the offspring of given degrees of morbid variation in the parents. It may enable the physician, in many cases, to state a belief based on a high degree of probability, if it offers no ground for dogma in individual cases. (Pearson, 1896, p. 255)
6 The Fabian Society (the Webbs and George Bernard Shaw were leading members) took its name from Fabius, the Roman general who adopted a strategy of defense and harassment in Rome's war with Hannibal and avoided direct confrontations. The Fabians advocated gradual advance and reform of society, rather than revolution.
7 Pearson placed Third Wrangler in the Mathematical Tripos at Cambridge in 1879. The "Wranglers" were the mathematics students at Cambridge who obtained First Class Honors - the ones who most successfully "wrangle" with math problems. This method of classification was abandoned in the early years of the 20th century.
Pearson goes on to define the mean, median, and mode, the normal probability distribution, correlation, and regression, as well as various terms employed in selection and heredity. Next comes a historical section, which is examined later in this chapter. Section 4 of Pearson's paper examines the "special case of two correlated organs." He derives what he terms the "well-known Galtonian form of the frequency for two correlated variables," and says that r is the "GALTON function or coefficient of correlation" (p. 264). However, he is not satisfied that the methods used by Galton and Weldon give practically the best method of determining r, and he goes on to show by what we would now call the maximum likelihood method that $S(xy)/(n\sigma_1\sigma_2)$ is the best value (today we replace S by Σ for summation). This expression is familiar to every beginning student in statistics. As Pearson (1896) puts it, "This value presents no practical difficulty in calculation, and therefore we shall adopt it" 8 (p. 265). It is now well-known that we have done precisely that.
8 Pearson (1896, p. 265) notes "that S(xy) corresponds to the product-moment of dynamics, as $S(x^2)$ to the moment of inertia." This is why r is often referred to as the "product-moment coefficient of correlation."
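The deviation score formula named above, the sum of the products of the deviations divided by n times the two standard deviations, can be written out directly. The Python sketch below is a minimal illustration with a hypothetical function name and made-up data; it assumes that the standard deviations are computed with divisor n, which is what makes the expression reduce to the usual product-moment value.

```python
import math


def product_moment_r(xs, ys):
    """Compute S(xy) / (n * sigma_1 * sigma_2) from raw scores.

    x and y in S(xy) are deviations from their own means; the standard
    deviations use divisor n (an assumption stated in the lead-in).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    s_xy = sum(a * b for a, b in zip(dev_x, dev_y))      # product-moment
    sigma_1 = math.sqrt(sum(a * a for a in dev_x) / n)
    sigma_2 = math.sqrt(sum(b * b for b in dev_y) / n)
    if sigma_1 == 0 or sigma_2 == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return s_xy / (n * sigma_1 * sigma_2)


# Made-up data: the deviations are (-1, 0, 1) and (-1, 1, 0).
print(round(product_moment_r([1, 2, 3], [1, 3, 2]), 3))  # 0.5
```

For a perfectly linear increasing relation the function returns 1 (up to floating-point rounding), and for a perfectly linear decreasing one it returns -1.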
There follows the derivation of the standard deviation of the coefficient of correlation, $(1 - r^2)/\sqrt{n(1 - r^2)}$, which Pearson translates to the probable error, $0.674506(1 - r^2)/\sqrt{n(1 - r^2)}$. These statistics are then used to rework Weldon's shrimp and crab data, and the results show that Weldon was mistaken in assuming constancy of correlation in local races of the same species. Pearson's exhaustive re-examination of Galton's data on family stature also shows that some of the earlier conclusions were in error. The details of these analyses are now of narrow historical interest only, but Pearson's general approach is a model for all experimenters and users of statistics. He emphasizes the importance of sample size in reducing the probable error, mentions the importance of precision in measurement, cautions against general conclusions from biased samples, and treats his findings with admirable scientific caution.
Pearson introduces $V = (\sigma/m)100$, the coefficient of variation, as a way of comparing variation, and shows that "the significance of the mutual regressions of ... two organs are as the squares of their coefficients of variation" (p. 277). It should also be noted that, in this paper, Pearson pushes much closer to the solution of problems associated with multiple correlation.
This almost completes the account of a historical perspective of the development of the Pearson coefficient of correlation as it is widely known and used. However, the record demands that Pearson's account be looked at in more detail and the interpretation of measures of association as they were seen by others, most notably George Udny Yule (1871-1951), who made valuable contributions to the topic, be examined.
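Two of the quantities mentioned above lend themselves to a short illustration: the conversion of a standard deviation to a probable error, and the coefficient of variation. The Python sketch below uses hypothetical function names and made-up measurements; the multiplier 0.674506 is the constant quoted in the text, and the standard deviation is assumed to be computed with divisor n.

```python
import math

# Multiplier quoted in the text for turning a standard deviation
# into a probable error.
PROBABLE_ERROR_FACTOR = 0.674506


def probable_error(standard_deviation):
    """Probable error corresponding to a given standard deviation."""
    return PROBABLE_ERROR_FACTOR * standard_deviation


def coefficient_of_variation(values):
    """V = 100 * sigma / m, with sigma computed using divisor n."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    m = sum(values) / n
    if m == 0:
        raise ValueError("V is undefined when the mean is zero")
    sigma = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return 100.0 * sigma / m


# Made-up measurements of two organs on very different scales.
organ_small = [8.0, 10.0, 12.0]
organ_large = [80.0, 100.0, 120.0]
print(coefficient_of_variation(organ_small))
print(coefficient_of_variation(organ_large))
print(probable_error(1.0))
```

Because V expresses the spread as a percentage of the mean, the two made-up organs give the same value even though their raw standard deviations differ by a factor of ten, which is the sense in which V allows variation to be compared.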
CORRELATION - CONTROVERSIES AND CHARACTER
Pearson's (1896) paper includes a section on the history of the mathematical foundations of correlation. He says that the fundamental theorems were "exhaustively discussed" by Bravais in 1846. Indeed, he attributes to Bravais the invention of the GALTON function while admitting that "a single symbol is not used for it" (p. 261). He also states that $S(xy)/(n\sigma_1\sigma_2)$ "is the value given by Bravais, but he does not show that it is the best" (p. 265). In examining the general theorem of a multiple correlation surface, Pearson refers to it as Edgeworth's Theorem. Twenty-five years later he repudiates these statements and attempts to set his record straight. He avers that:
They have been accepted by later writers, notably Mr Yule in his manual of statistics, who writes (p. 188): "Bravais introduced the product-sum, but not a single symbol for a coefficient of correlation. Sir Francis Galton developed the practical method, determining his coefficient (Galton's function as it was termed at first) graphically. Edgeworth developed the theoretical side further and Pearson introduced the product-sum formula."
Now I regret to say that nearly the whole of the above statements are hopelessly incorrect. (Pearson, 1920, p. 28)
Now clearly, it is just not the case "that nearly the whole of the above statements are hopelessly incorrect." In fact, what Pearson is trying to do is to emphasize the importance of his own and Galton's contribution and to shift the blame for the maintenance of historical inaccuracies to Yule, with whom he had come to have some serious disagreement. The nature of this disagreement is of interest and can be examined from at least three perspectives. The first is that of eugenics, the second is that of the personalities of the antagonists, and the third is that of the fundamental utility of, and the assumptions underlying, measures of association. However, before these matters are examined, some brief comment on the contributions of earlier scholars to the mathematics of correlation is in order.
The final years of the 18th century and the first 20 years of the 19th were the period in which the theoretical foundations of the mathematics of errors of observation were laid. Laplace (1749-1827) and Gauss (1777-1855) are the best-known of the mathematicians who derived the law of frequency of error and described its application, but a number of other writers also made significant contributions (see Walker, 1929, for an account of these developments). These scholars all examined the question of the probability of the joint occurrence of two errors, but "None of them conceived of this as a matter which could have application outside the fields of astronomy, physics, and geodesy or gambling" (Walker, 1929, p. 94). In fact, put simply, these workers were interested solely in the mathematics associated with the probability of the simultaneous occurrence of two errors in, say, the measurement of the position of a point in a plane or in three dimensions. They were clearly not looking for a measure of a possible relationship between the errors and certainly not considering the notion of organic relationships among directly measured variables. Indeed, astronomers and surveyors sought to make their basic measurements independent.
Having said this, it is apparent that the mathematical formulations that were produced by these earlier scientists are strikingly similar to those deduced by Galton and Hamilton Dickson. Auguste Bravais (1811-1863), who had careers as a naval officer, an astronomer and physicist, perhaps came closest to anticipating the correlation coefficient; indeed, he even uses the term correlation in his paper of 1846. Bravais derived the formula for the frequency surface of the bivariate normal distribution and showed that it was a series of concentric ellipses, as did Galton and Hamilton Dickson 40 years later. Pearson's acknowledgment of the work of Bravais led to the correlation coefficient sometimes being called the Bravais-Pearson coefficient.
When Pearson came, in 1920, to revise his estimation of Bravais' role, he described his own early investigations of correlation and mentioned his lectures on the topic to research students at University College. He says:
I was far too excited to stop to investigate properly what other people had done. I wanted to reach new results and apply them. Accordingly I did not examine carefully either Bravais or Edgeworth, and when I came to put my lecture notes on correlation into written form, probably asked someone who attended the lectures to examine the papers and say what was in them. Only when I now come back to the papers of Bravais and Edgeworth do I realise not only that I did grave injustice to others, but made most misleading statements which have been spread broadcast by the text-book writers. (Pearson, 1920, p. 29)
The "Theory of Normal Correlation" was one of the topics dealt with by Pearson when he started his lectures on the "Theory of Statistics" at University College in the 1894-1895 session, "giving two hours a week to a small but enthusiastic class of two students - Miss Alice Lee, Demonstrator in Physics at the Bedford College, and myself" (Yule, 1897a, p. 457). One of these enthusiastic students, Yule, published his famous textbook, An Introduction to the Theory of Statistics, in 1911, a book that by 1920 was in its fifth edition. Later in the 1920 paper, Pearson again deals curtly with Yule. He is discussing the utility of a theory of correlation that is not dependent on the assumptions of the bivariate normal distribution and says, "As early as 1897 Mr G. U. Yule, then my assistant, made an attempt in this direction" (Pearson, 1920, p. 45). In fact, Yule (1897b), in a paper in the Journal of the Royal Statistical Society, had derived least squares solutions to the correlation of two, three, and four variables. This method is a compelling demonstration of the appropriateness of Pearson's formula for r under the least squares criterion. But Pearson is not impressed, or at least in 1920 he is not impressed:
Are we not making a fetish of the method of least squares as others made a fetish of the normal distribution? ... It is by no means clear therefore that Mr Yule's generalisation indicates the real line of future advance. (Pearson, 1920, p. 45)
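The link between least squares and the product-moment formula in the two-variable case can be checked numerically. The Python sketch below is an illustration only, not Yule's derivation: it fits an ordinary least squares line to standardized scores and compares the slope with r. The function names and the data are hypothetical, and no assumption about the bivariate normal distribution is involved.

```python
import math


def standardize(values):
    """Convert raw scores to z scores (standard deviation with divisor n)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]


def least_squares_slope(xs, ys):
    """Slope of the line minimizing the squared errors of y predicted from x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_xx = sum((x - mx) ** 2 for x in xs)
    return s_xy / s_xx


def product_moment_r(xs, ys):
    """Mean product of paired z scores."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]   # made-up scores
ys = [2.0, 1.0, 4.0, 7.0]

r = product_moment_r(xs, ys)
slope_y_on_x = least_squares_slope(standardize(xs), standardize(ys))
slope_x_on_y = least_squares_slope(standardize(ys), standardize(xs))
print(r, slope_y_on_x, slope_x_on_y)
```

On standardized scores both regression slopes coincide with r, so the coefficient can be reached from the least squares criterion alone.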
This, to say the least, cold approach to Yule's work (and there are many other examples of Pearson's overt invective) was certainly not evident over the 10 years that span the turn of the 19th to the 20th centuries. During what Yule described as "the old days" he spent several holidays with Pearson, and even when their personal relationship had soured, Yule states that in nonintellectual matters Pearson remained courteous and friendly (Yule, 1936), although one cannot help but feel that here we have the words of an essentially kind and gentle man writing an obituary notice of one of the fathers of his chosen discipline.
What was the nature of this disagreement that so aroused Pearson's wrath? In 1900, Yule developed a measure of association for nominal variables, the frequencies of which are entered into the cells of a contingency table. Yule presents very simple criteria for such measures of association, namely, that they should be zero when there is no relationship (i.e., the variables are independent), +1 when there is complete dependence or association, and -1 when there is a complete negative relationship. The illustrative example chosen is that of a matrix formed from cells labeled as follows:
                    Survived (A)    Died (α)
Vaccinated (B)      AB              αB
Unvaccinated (β)    Aβ              αβ
and Yule devised a measure, Q (named, it appears, for Quetelet), that satisfies the stated criteria, and is given by
$Q = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}$
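Yule's Q, in the form in which it is usually given today (the difference of the two cross-products of the cell frequencies divided by their sum), is easy to compute. The Python sketch below uses a hypothetical function name and invented frequencies; the argument names follow the cell labels of the table, with Greek letters spelled out.

```python
def yules_q(ab, alpha_b, a_beta, alpha_beta):
    """Yule's Q for a two-by-two table of frequencies.

    ab          survived and vaccinated
    alpha_b     died and vaccinated
    a_beta      survived and unvaccinated
    alpha_beta  died and unvaccinated
    """
    concordant = ab * alpha_beta
    discordant = a_beta * alpha_b
    if concordant + discordant == 0:
        raise ValueError("Q is undefined when both cross-products are zero")
    return (concordant - discordant) / (concordant + discordant)


# Invented frequencies illustrating the three criteria stated in the text.
print(yules_q(10, 20, 30, 60))   # independence: 0.0
print(yules_q(50, 0, 0, 50))     # complete association: 1.0
print(yules_q(0, 50, 50, 0))     # complete negative association: -1.0

# An invented intermediate case: survival is more common among the vaccinated.
print(round(yules_q(90, 10, 60, 40), 3))  # 0.714
```

Note that the calculation uses nothing but the four counts; no underlying continuous distribution is assumed.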
Yule's paper had been "received" by the Royal Society in October 1899, and "read" in December of the same year. It was described by Pearson (1900b), in a paper that examines the same problem, as "Mr Yule's valuable memoir" (p. 1). In his paper, Pearson undertakes to investigate "the theory of the whole subject" (p. 1) and arrives at his measure of association for two-by-two contingency table frequencies, an index that he called the tetrachoric coefficient of correlation. He examines other possible measures of association, including Yule's Q, but he considers them to be merely approximations to the tetrachoric coefficient. The crucial difference between Yule's approach and that of Pearson is that Yule's criteria for a measure of association are empirical and arithmetical, whereas the fundamental assumption for Pearson was that the attributes whose frequencies were counted in fact arose from an underlying continuous bivariate normal distribution. The details of Pearson's method are somewhat complex and are not examined here. There were a number of developments from this work, including Pearson's derivation, in 1904, of the mean square contingency and the contingency coefficient. All these measures demanded the assumption of an underlying continuous distribution, even though the variables, as they were considered, were categorical. Battle commenced in late 1905 when Yule criticized Pearson's assumptions in a paper read to the Royal Society (Yule, 1906). Biometrika's readers were soon to see, "Reply to Certain Criticisms of Mr G. U. Yule" (Pearson, 1907) and, after Yule's discussion of his indices appeared in the first edition of his textbook, were treated to David Heron's exhortation, "The Danger of Certain Formulae Suggested as Substitutes for the Correlation Coefficient" (Heron, 1911). These were stirring statistical times, marked by swingeing attacks as the biometricians defended their position:
If Mr Yule's views are accepted, irreparable damage will be done to the growth of modern statistical theory ... we shall term Mr Yule's latest method of approaching the problem of relationship of attributes the method of pseudo-ranks ... we ... reply to certain criticisms, not to say charges, Mr Yule has made against the work of one or both of us. (Pearson & Heron, 1913, pp. 159-160)
Articulate sniping and mathematical bombardment were the methods of attack used by the biometricians. Yule was more temperate but nevertheless quite firm in his views:
All those who have died of small-pox are all equally dead: no one of them is more dead or less dead than another, and the dead are quite distinct from the survivors. The introduction of needless and unverifiable hypotheses does not appear to me to be a desirable proceeding in scientific work. (Yule, 1912, pp. 611-612)
Yule, and his great friend Major Greenwood, poked private fun at the opposition. Parts of a fantasy sent to Yule by Greenwood in November 1913 are reproduced here:
Extracts from The Times, April 1925
G. Udny Yule, who had been convicted of high treason on the 7th ult, was executed this morning on a scaffold outside Gower St. Station. A short but painful scene occurred on the scaffold. As the rope was being adjusted, the criminal made some observation, imperfectly heard in the press enclosure, the only audible words being "the normal coefficient is —." Yule was immediately seized by the Imperial guard and gagged. Up to the time of going to press the warrant for the apprehension of Greenwood had not been executed, but the police have what they regard to be an important clue. During the usual morning service at St. Paul's Cathedral, which was well attended, the carlovingian creed was, in accordance with an imperial rescript, chanted by the choir. When the solemn words, "I believe in one holy and absolute coefficient of four-fold correlation" were uttered a shabbily dressed man near the North door shouted "balls." Amid a scene of indescribable excitement, the vergers armed with several volumes of Biometrika made their way to the spot. (Greenwood, quoted by MacKenzie, 1981, pp. 176-177)
The logical positions of the two sides were quite different. For Pearson it was absolutely necessary to preserve the link with interval-level measurement where the mathematics of correlation had been fully specified.
Mr Yule ... does not stop to discuss whether his attributes are really continuous or discrete, or hide under discrete terminology true continuous variates. We see under such class indices as "death" or "recovery", "employment" or "non-employment" of mother, only measures of continuous variates ... (p. 162)
The fog in Mr Yule's mind is well illustrated by his table ... (p. 226)
Mr Yule is juggling with class-names as if they represented real entities, and his statistics only a form of symbolic logic. No knowledge of a practical kind ever came out of these logical theories. (p. 301) (Pearson and Heron, 1913)
Yule is attacked on almost every one of this paper's 157 pages, but for him the issue was quite straightforward. Techniques of correlation were nothing more and nothing less than descriptions of dependence in nominal data. If different techniques gave different answers, a point that Pearson and his followers frequently raised, then so be it. The mean, median, and mode give different answers to questions about the central tendency of a distribution, but each has its utility.
Of course the controversy was never resolved. The controversy can never be resolved, for there is no absolute, "right" answer. Each camp started with certain basic assumptions, and a reconciliation of their views was not possible unless one or both sides were to have abrogated those assumptions, or unless it could have been shown with scientific certainty that only one side's assumptions were viable. For Yule it was a situation that he accepted. In his obituary notice of Pearson he says, "Time will settle the question in due course" (p. 84). In this he was wrong, because there is no longer a question. The pragmatic practitioners of statistics in the present day are largely unaware that there ever even was a question.
The disagreement highlights the personality differences of the protagonists. Pearson could be described as a difficult man. He held very strong views on a variety of subjects, and he was always ready to take up his pen and write scathing attacks on those whom he perceived to be misguided or misinformed. He was not ungenerous, and he devoted immense amounts of time and energy to the work of his students and fellow researchers, but it is likely that tact was not one of his strong points and any sort of compromise would be seen as defeat. In 1939, Yule commented on Pearson's polemics, noting that, in 1914, Pearson said that "Writers rarely ... understand the almost religious hatred which arises in the true man of science when he sees error propagated in high places" (p. 221).
Surely neither in the best type of religion nor in the best type of science should hatred enter in at all. ... In one respect only has scientific controversy perforce improved since the seventeenth century. If A disagrees with B's arguments, dislikes his personality and is annoyed by the cock of his hat, he can no longer, failing all else, resort to abuse of B's latinity. (Yule, 1939, p. 221)
Yule had respect and affection for "K.P." even though he had distanced himself from the biometricians of Gower Street (where the Biometric Laboratory was located). He clearly disliked the way in which Pearson and his followers closed ranks and prepared for combat at the sniff of criticism. In some respects we can understand Pearson's attitudes. Perhaps he did feel isolated and beset. Weldon's death in 1906 and Galton's in 1911 affected him greatly. He was the colossus of his field, and yet Oxford had twice (in 1897 and 1899) turned down his applications for chairs, and in 1901 he applied, again unsuccessfully, for the chair of Natural Philosophy at Edinburgh. He felt the pressure of monumental amounts of work and longed for greater scope to carry out his research. This was indeed to come with grants from the Drapers' Company, which supported the Biometric Laboratory, and Galton's bequest, which made him Professor of Eugenics at University College in 1911. But more controversy, and more bitter battles, this time with Fisher, were just over the horizon.
Once more, we are indebted to MacKenzie, who so ably puts together the eugenic aspects of the controversy. It is his view that this provides a much more adequate explanation than one derived from the examination of a personality clash. It is certainly unreasonable to deny that this was an important factor.
Yule appears to have been largely apolitical. He came from a family of professional administrators and civil servants, and he himself worked for the War Office and for the Ministry of Food during World War I, work for which he received the C.B.E. (Commander of the Order of the British Empire) in 1918. He was not a eugenist, and his correspondence with Greenwood shows that his attitude toward the eugenics movement was far from favorable. He was an active member of the Royal Statistical Society, a body that awarded him its highest honor, the Guy Medal in gold, in 1911. The Society attracted members who were interested in an ameliorative and environmental approach to social issues - debates on vaccination were a continuing fascination. Although Major Greenwood, Yule's close friend, was at first an enthusiastic member of the biometric school, his career, as a statistician in the field of public health and preventive medicine, drew him toward the realization that poverty and squalor were powerful factors in the status and condition of the lower classes, a view that hardly reflected eugenic philosophy.
For Pearson, eugenics and heredity shaped his approach to the question of correlation, and the notion of continuous variation was of critical importance.
His notion of correlation, as a function allowing direct prediction from one variable to another, is shown to have its roots in the task that correlation was supposed to perform in evolutionary and eugenic prediction. It was not adequate simply to know that offspring characteristics were dependent on ancestral characteristics: this dependence had to be measured in such a way as to allow the prediction of the effects of natural selection, or of conscious intervention in reproduction. To move in the direction indicated here, from prediction to potential control over evolutionary processes, required powerful and accurate predictive tools: mere statements of dependence would be inadequate. (MacKenzie, 1981, p. 169)
MacKenzie's sociohistorical analysis is both compelling and provocative,
but at least two points, both of which are recognized by him, need to be made. The first is that this view of Pearson's motivations is contrary to his earlier expressed views of the positivistic nature of science (Pearson, 1892), and second, the controversy might be placed in the context of a tightly knit academic group defending its position. To the first MacKenzie says that practical considerations outweighed philosophical ones, and yet it is clear that practical demands did not lead the biometricians to form or to join political groups that might have made their aspirations reality. The second view is mundane and realistic. The discipline of psychology has seen a number of controversies in its short history. Notable among them is the connectionist (stimulus-response) versus cognitive argument in the field of learning theory. The phenomenological, the psychodynamic, the social-learning, and the trait theorists have arrayed themselves against each other in a variety of combinations in personality research. Arguments about the continuity-discontinuity of animals and humankind are exemplified perhaps by the controversy over language acquisition in the higher primates. All these debates are familiar enough to today's students of psychology, and it is not inconceivable to view the correlation debate as part of the system of academic "rows" that will remain as long as there is freedom for intellectual controversy and scientific discourse.

But the Yule-Pearson debate and its implications are not among those discussed in university classrooms across the globe. For a multitude of reasons, not the least of which was the horror of the negative eugenics espoused by the German Nazis, and the demands of the growing discipline of psychology for quantitative techniques that would help it deal with its subject matter, very few if any of today's practitioners and researchers in the social sciences think of biometrics when they think of statistics. They busily get on with their analyses and, if they give thanks to Pearson and Galton at all, they remember them for their statistical insights rather than for their eugenic philosophy.
11 Factor Analysis
FACTORS

A feature of the ideal scientific method that is always hailed as most admirable is parsimony. To reduce a mass of facts to a single underlying factor or even to just a few concepts and constructs is an ongoing scientific concern. In psychology, the statistical technique known as factor analysis is most often associated with the attempt to comprehend the structure of intelligence and the dimensions of personality. The intellectual struggles that accompany discussion of these matters are a very long way from being over.

Of all the statistical techniques discussed in this book, factor analysis may be justly claimed as psychology's own. As Lawley and Maxwell (1971) and others have noted, some of the early controversies that swirled around the methods employed arose from arguments about psychological, rather than mathematical, matters. Indeed, Lawley and Maxwell suggest that the controversies "discouraged the interest shown by mathematicians in the theoretical problems involved" (p. 1). In fact, the psychological quarrels stem from a much older philosophical debate that is most often traced to Francis Bacon, and his assertion that "the mere orderly arrangement of data would make the right hypothesis obvious" (Russell, 1961, p. 529). Russell goes on:

The thing that is achieved by the theoretical organization of science is the collection of all subordinate inductions into a few that are very comprehensive - perhaps only one. Such comprehensive inductions are confirmed by so many instances that it is thought legitimate to accept, as regards them, an induction by simple enumeration. This situation is profoundly unsatisfactory. (p. 530)
To be just a little more concrete, in psychology:

We recognize the importance of mentality in our lives and wish to characterize it, in part so that we can make the divisions and distinctions among people that our cultural and political systems dictate. We therefore give the word "intelligence" to this wondrously complex and multifaceted set of human capabilities. This shorthand symbol is then reified and intelligence achieves its dubious status as a unitary thing. (Gould, 1981, p. 24)
To be somewhat trite, it is clear that among employed persons there is a high correlation between the size of a monthly paycheck and annual salary, and it is easy to see that both variables are a measure of income - the "cause" of the correlation is clear and obvious. It is therefore tempting not only to look for the underlying bases of observed interrelationships among variables but to endow such findings with a substance that implies a causal entity, despite the protestations of those who assert that the statistical associations do not, by themselves, provide evidence for its existence. The refined search for these statistical distillations is factor analysis, and it is important to distinguish between the mathematical bases of the methods and the interpretation of the result of their application.

Now the correlation between monthly paycheck and annual salary will not be perfect. Interest payments, profits, gifts, even author's royalties, will make r less than +1, but presumably no one would argue with the proposition that if one requires a reasonable measure of income and material well-being, either variable is adequate and the other thereby redundant.

Galton (1888a), in commenting on Alphonse Bertillon's work that was designed to provide an anthropometric index of identification, in particular the identification of criminals, points to the necessity of estimating the degree of interdependence among the variables employed, as well as the importance of precision in measurement and of not making measurement classifications too wide. There is little or nothing to be gained from including in the measurements a number of variables that are highly correlated, when one would suffice.
A somewhat different line of reasoning is to derive from the intercorrelations of the measured variables a set of components that are uncorrelated. The number of components thus derived would be the same as the number of variables. The components are ordered so that the initial ones account for more of the variance than the later ones that are listed in the outcome. This is the method of principal component analysis, for which Pearson (1901) is usually cited as the original inspiration and Hotelling (1933, 1935) as the developer and refiner, although there is no indication that Hotelling was influenced by Pearson's work. In fact, Edgeworth (1892, 1893) had suggested a scheme for generating a function containing uncorrelated terms that was derived from correlated measurements. Macdonell (1901) was also an early worker in the field of "criminal anthropometry." He acknowledges in his paper that Pearson has pointed out to him a method of arriving at ideal characteristics that "would be given if we calculated the seven [there were seven measures] directions of uncorrelated variables, that is, the principal axes of the correlation 'ellipsoid'" (p. 209).

The method, as Lawley and Maxwell (1971) have emphasized, although it has links with factor analysis, is not to be taken as a variety of factor analysis. The latter is a method that attempts to distill down or concentrate the covariances in a set of variables to a much smaller number of factors. Indeed, none other than Godfrey Thomson (1881-1955), a leading British factor theorist from the 1920s to the 1940s, maintained in a letter to D. F. Vincent of Britain's National Institute of Industrial Psychology that "he did not regard Pearson's 'lines of closest fit' as anything to do with factor analysis" (noted by Hearnshaw, 1979, p. 176). The reason for the ongoing association of Pearson's name with the birth of factor analysis is to be found in the role played by another leading British psychologist, Sir Cyril Burt (1883-1971), in the writing and the rewriting of the history of factor analysis, an episode examined later in this chapter.

In essence, the method of principal components may be visualized by imagining the variables measured - the tests - as points in space. Tests that are correlated will be close together in clusters, and tests that are not related will be further away from the cluster. Axes or vectors may be projected into this space through the clusters in a fashion that allows for as much of the variance as possible to be accounted for. Geometrically this projection can take place in only three dimensions, but algebraically an n-dimensional space can be conceptualized.
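The idea of uncorrelated components ordered by the variance they account for can be illustrated numerically. The following is only a minimal sketch in Python, not a procedure taken from Pearson or Hotelling: it assumes a hypothetical correlation matrix `R` for three tests and uses a standard Jacobi rotation routine (the helper names `jacobi_eigen` and `quad` are invented for this illustration) to find the principal axes.

```python
import math

def jacobi_eigen(matrix, max_sweeps=100, tol=1e-24):
    """Eigenvalues and eigenvectors of a symmetric matrix by Jacobi rotations.

    Returns (eigenvalue, eigenvector) pairs sorted from largest to smallest.
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) element.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # columns p and q: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rows p and q: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors: V <- V J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    pairs = [(a[i][i], [v[k][i] for k in range(n)]) for i in range(n)]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs

def quad(u, m, w):
    """Covariance u' M w between two weighted composites of the tests."""
    return sum(u[i] * m[i][j] * w[j] for i in range(len(u)) for j in range(len(w)))

# Hypothetical correlations among three tests (illustrative values only).
R = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]

components = jacobi_eigen(R)
eigenvalues = [lam for lam, _ in components]
shares = [lam / len(R) for lam in eigenvalues]  # proportion of total variance

for lam, share in zip(eigenvalues, shares):
    print(f"variance {lam:.3f}  share {share:.1%}")
```

There are as many components as tests, their variances sum to the number of tests, and any two different components have zero covariance, which is the sense in which the derived components are uncorrelated.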
THE BEGINNINGS

In 1904 Charles Spearman (1863-1945) published two papers (1904a, 1904b) in the same volume of the American Journal of Psychology, then the leading English language journal in psychology. The first of these, a general account of correlation methods, their strengths and weaknesses, and the sources of error that may be introduced that might "dilate" or "constrict" the results, included criticism of Pearson's work.

In his Huxley lecture (sponsored by the Anthropological Institute of Great Britain and Ireland) of 1903, Pearson reported on a great amount of data that had been collected by teachers in a number of schools. The physical variables measured included health, hair and eye color, hair curliness, athletic power, head length, breadth, and height, and a "cephalic index." The psychological characteristics were assertiveness, vivacity, popularity, introspection, conscientiousness, temper, ability, and handwriting. The average correlations (omitting athletic power) reported between the measures of brother, sister, and brother-sister pairs are extraordinarily similar, ranging from .51 to .54.

We are forced, I think literally forced, to the general conclusion that the physical and psychical characters in men are inherited within broad lines in the same manner and with the same intensity. (Pearson, 1904a, p. 156)
Spearman felt that the measurements taken could have been affected by "systematic deviations," "attenuation" produced by the suggestion that the teacher's judgments were not infallible, and by lack of independence in the judgements, finally maintaining that:

When we further consider that each of these physical and mental characteristics will have quite a different amount of such error (in the former this being probably quite insignificant) it is difficult to avoid the conclusion that the remarkable coincidences announced between physical and mental heredity can hardly be more than mere accidental coincidence. (Spearman, 1904a, p. 98)
But Pearson's statistical procedures were not under fire, for later he says:

If this work of Pearson has thus been singled out for criticism, it is certainly from no desire to undervalue it. ... My present object is only to guard against premature conclusions and to point out the urgent need of still further improving the existing methodics of correlational work. (Spearman, 1904a, p. 99)
Pearson was not pleased. When the address was reprinted in Biometrika in 1904 he included an addendum referring to Spearman's criticism that added fuel to the fire:

The formula invented by Mr Spearman for his so-called "dilation" is clearly wrong ... not only are his formulae, especially for probable errors erroneous, but he quite misunderstands and misuses partial correlation coefficients. (p. 160)
It might be noted that neither man assumed that the correlations, whatever they were, might be due to anything other than heredity, and that Spearman largely objected to the data and possible deficiencies in its collection and Pearson largely objected to Spearman's statistics. The exchange ensured that the opponents would never even adequately agree on the location of the battlefield, let alone resolve their differences.

Spearman was to become, successively, Reader and head of a psychological laboratory at University College, London, in 1907, Grote Professor of the Philosophy of Mind and Logic in 1911, and Professor of Psychology in 1928 until his retirement in 1931. He was therefore a colleague of Pearson's in the same college in the same university during almost the whole of the latter's tenure of the Chair of Eugenics. Their interests and their methods had a great deal in common, but they never collaborated and they disliked each other. They clashed on the matter of Spearman's rank-difference correlation technique, and that and Spearman's new criticism of Pearson and Pearson's reaction to it began the hostilities that continued for many years, culminating in Pearson's waspish, unsigned review (1927) of Spearman's (1927) book The Abilities of Man. Pearson dismisses the book as "distinctly written for the layman" (p. 181) and claims that "what Professor Spearman considers proofs are not proofs . . . With the failure of Chapter X, that is, 'Proof that G and S exist,' the very backbone disappears from the body of Prof. Spearman's work" (p. 183).

Nowhere in this book does Pearson's name appear! In his 1904 paper, Spearman acknowledges that Pearson had given the name the method of "product moments" to the calculation of the correlation coefficient, but, in a footnote in his book, Spearman (1927), commenting again on correlation measures, remarks that:

easily foremost is the procedure which was substantially given in a beautiful memoir of Bravais and which is now called that of "product moments." Such procedures have been further improved by Galton who invented the device - since adopted everywhere - of representing all grades of interdependence by a single number, the "coefficient," which ranges from unity for perfect correlation down to zero for entire absence of it. (p. 56)
So much for Pearson! The second paper of 1904 had produced Spearman's initial conclusion that:

The . . . observed facts indicate that all branches of intellectual activity have in common one fundamental function (or group of functions), whereas the remaining or specific elements of the activity seem in every case to be wholly different from that in all the others. (Spearman, 1904b, p. 284)
This is Spearman's first general statement of the two-factor theory of intelligence, a theory that he was to vigorously expound and defend. Central to this early work was the notion of a hierarchy of intelligences. In examining data drawn from the measurement of children from a "High Class Preparatory School for Boys" on a variety of tests, Spearman starts by adjusting the correlation coefficients to eliminate irrelevant influences and errors using partial correlation techniques and then ordering the coefficients from highest to lowest. He observed a steady decrease in the values from left to right and from top to bottom in the resulting table. The sample was remarkably small (only 33), but Spearman (1904b) confidently and boldly proclaims that his method has demonstrated the existence of "General Intelligence," which, by 1914, he was labeling g. He made no attempt at this time to analyze the nature of his factor, but he notes:

An important practical consequence of this universal Unity of the Intellectual Function, the various actual forms of mental activity constitute a stable interconnected Hierarchy according to the different degrees of intellective saturation. (p. 284)

            Classics   French   English   Math   Discrim   Music
Classics       -        0.83     0.78     0.70    0.66     0.63
French        0.83       -       0.67     0.67    0.65     0.57
English       0.78      0.67      -       0.64    0.54     0.51
Math          0.70      0.67     0.64      -      0.45     0.51
Discrim       0.66      0.65     0.54     0.45     -       0.40
Music         0.63      0.57     0.51     0.51    0.40      -

FIG 11.1 Spearman's hierarchical ordering of correlations
This work may be justly claimed to include the first factor analysis. The intercorrelations are shown in Figure 11.1. The hierarchical order of the correlations in the matrix is shown by the tendency for the coefficients in a pair of columns to bear the same ratio to one another throughout that column.

Of course, the technique of correlation that began with Galton (1888b) is central to all forms of factor analysis, but, more particularly, Spearman's work relied on the concept of partial correlation, which allows us to examine the relationships between, for example, two variables when a third is held constant. The method gave Spearman the means of making mathematical the notion that the correlation of a specific test with a general factor was common to all tests of intellectual functioning. The remaining portion of the variance, error excepted, is specific and unique to the test that measures the variable. This is the essence of what came to be known as the two-factor theory.

The partial correlation of variables 1 and 2 when a third, 3, is held constant is given by

\[ r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \]

If there are two variables a and b and g is constant and the only cause of the correlation \(r_{ab}\) between a and b is g, then \(r_{ab.g}\) would be zero and hence

\[ r_{ab} - r_{ag} r_{bg} = 0, \quad \text{that is,} \quad r_{ab} = r_{ag} r_{bg} \]

If a is set to equal b then

\[ r_{aa} = r_{ag}^2 \]

This latter represents the variance in the variable a that is accounted for by g and leads us to the communalities in a matrix of correlations.

If now we were to take four variables a, b, c, and d, and to consider the correlation of a and b with c and d, then

\[ r_{ac} r_{bd} = r_{ag} r_{cg} r_{bg} r_{dg} = r_{bc} r_{ad} \]

So that

\[ \frac{r_{ac}}{r_{bc}} = \frac{r_{ad}}{r_{bd}} \]

and therefore,

\[ r_{ac} r_{bd} - r_{bc} r_{ad} = 0 \]

The left-hand side of this latter equation is Spearman's famous tetrad difference, and his proof of it is given in an appendix to his 1921 book. The term tetrad difference first appears in the mid 1920s (see e.g., Spearman & Holzinger, 1925). So, a matrix of correlations such as

\[ \begin{array}{c|cc} & c & d \\ \hline a & r_{ac} & r_{ad} \\ b & r_{bc} & r_{bd} \end{array} \]

generates the tetrad difference \(r_{ac} r_{bd} - r_{bc} r_{ad}\) and, without being too mathematical, this is the value of the minor determinant of order two. When all these minor determinants are zero, the matrix is of rank one. Moreover, the correlations can be explained by one general factor.

Wolfle (1940) has noted that confusion about what the tetrad difference meant or implied persisted over a good many years. It was sometimes thought that when a matrix of correlations satisfied the tetrad equation, every individual measurement of every variable could only be divided into two independent parts. Spearman insisted that what he maintained was that this division could
be made, not that it was the only possible division. Nevertheless, pronouncements about the tetrad equation by Spearman were frequently followed by assertions such as, "The one part has been called the 'general factor' and denoted by the letter g . . . The second part has been called the 'specific factor' and denoted by the letter s" (Spearman, 1927, p. 75). Moreover, Spearman's frequent references to "mental energy" gave the impression that g was "real," even though, especially when challenged, he denied that it was anything more than a useful, mathematical, explanatory construction. Spearman made ambiguous statements about g, but it seems that he was convinced that the two-factor theory was paramount and the tetrad difference formed an important part of his argument. Not only that, Spearman was satisfied that Garnett's earlier "proof" (1920) of the two-factor theory had effectively dealt with any criticism:

There is another important limitation to the division of the variables into factors. It is that the division into general and specific factors all mutually independent can be effected in one way only; in other words, it is unique. (Spearman, 1927, p. vii)
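The vanishing of the partial correlation and of the tetrad differences under a single general factor can be checked numerically. The sketch below is an illustration only: the loadings on g are hypothetical values chosen for the example, the function names are invented here, and the four observed correlations at the end are the Classics, French, English, and Math entries of Figure 11.1.

```python
import math

def partial_corr(r12, r13, r23):
    """Correlation of variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

def tetrad(r, a, b, c, d):
    """Tetrad difference r_ac * r_bd - r_bc * r_ad."""
    return r[a][c] * r[b][d] - r[b][c] * r[a][d]

# Hypothetical correlations of four tests with g (assumed for illustration).
loadings = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
names = list(loadings)

# If g is the only cause of correlation, r_xy = r_xg * r_yg for x != y.
r = {x: {y: 1.0 if x == y else loadings[x] * loadings[y] for y in names}
     for x in names}

# Holding g constant removes the correlation between a and b.
r_ab_given_g = partial_corr(r["a"]["b"], loadings["a"], loadings["b"])

# The three distinct tetrad differences among four variables all vanish.
one_factor_tetrads = [
    tetrad(r, "a", "b", "c", "d"),
    tetrad(r, "a", "c", "b", "d"),
    tetrad(r, "a", "d", "b", "c"),
]

# Observed correlations from Figure 11.1 (Classics, French, English, Math).
obs = {
    "Classics": {"English": 0.78, "Math": 0.70},
    "French": {"English": 0.67, "Math": 0.67},
}
observed_tetrad = tetrad(obs, "Classics", "French", "English", "Math")

print("r_ab.g:", r_ab_given_g)
print("one-factor tetrads:", one_factor_tetrads)
print("observed tetrad (Figure 11.1):", observed_tetrad)
```

With real data the tetrad differences are never exactly zero, which is why the question of their sampling distribution became a point of contention.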
In 1933 Brown and Stephenson attempted a comprehensive test of the theory using an initial battery of 22 tests on a sample of 300 boys aged between 10 and 10½ years. But they "purified" the battery by dropping tests that for one reason or another did not fit and found support for the two-factor theory. Wolfle's (1940) review of factor analysis says of this work: "Whether one credits the attempt with success or not, all that it proves is this; if one removes all tetrad differences which do not satisfy the criterion, the remaining ones do satisfy it" (p. 9).

Pearson and Moul (1927), in a lengthy paper, attempt a dissection of Spearman's mathematics, in particular examining the sampling distribution of the tetrads and whether or not it can be justified as being close to the normal distribution. They conclude: "the claim of Professor Spearman to have effected a Copernican revolution in psychology seems at present premature" (p. 291). However, they do state that even though they believe the theory of general and specific factors is "too narrow a structure to form a frame for the great variety of mental abilities," they believe that it should and could be tested more adequately.

Spearman's annoyance with Pearson's view of his work was not so troubling as the pronouncements on it made by the American mathematician E. B. Wilson (1879-1964), Professor of Vital Statistics at Harvard's School of Public Health. The episode has been meticulously researched and a comprehensive account given by Lovie and Lovie (1995). Spearman met Wilson at Harvard during a visit to the United States late in 1927. Wilson had read The Abilities of Man and had formed the opinion that its author's mathematics were wanting and had tried to explain to him that it was possible to "get the tetrad differences to vanish, one and all, in so many ways that one might suspect that the resolution into a general factor was not unique" (quoted by Lovie & Lovie, 1995, p. 241). Wilson subsequently reviewed Spearman's book for Science. It is generally a friendly review. Wilson describes the book as "an important work," written "clearly, spiritedly, suggestively, in places even provocatively," and his words concentrate on the mathematical appendix. Wilson admits that his review is "lop-sided," but he maintains that Spearman has missed some of the logical implications of the mathematics, and in particular that his solutions are indeterminate. But he leaves a loophole for Spearman:

Do g_x, g_y, ... whether determined or undeterminable represent the intelligence of x, y, ...? The author advances a deal of argument and of statistics to show that they do. This is for psychologists, not for me to assess. (Wilson, 1928, p. 246)
Wilson did not want to destroy the basic philosophy or to undermine completely the thrust of Spearman's work. Later in 1929, Wilson published a more detailed mathematical critique (1929b) and a correspondence between him and Spearman ensued, lasting at least until 1933. The Lovies have analyzed these exchanges and show that a solution, or, as they term it, "an uneasy compromise" was reached. A "socially negotiated" solution saw Spearman accepting Wilson's critique, Garnett modifying his earlier "proof" of the two-factor theory, and Wilson offering ways in which the problems might be at least modified, if not overcome. More specifically, the idea of a partial indeterminacy in g was suggested.

REWRITING THE BEGINNINGS

In 1976, the world of psychology, and British psychology in particular, was shaken by allegations published in the Sunday Times by Oliver Gillie, to the effect that Sir Cyril Burt (1883-1971), Spearman's successor as Professor of Psychology at University College, had faked a large part of the data for his research on twins and, among other labels in a series of articles, called Burt "a plagiarist of long standing." Burt was a prominent figure in psychology in the United Kingdom and his work on the heritability of intelligence, bolstered by his data on its relationship among twins reared apart, was not only widely cited, but influenced educational policy. Leslie Hearnshaw, Professor Emeritus of Psychology at the University of Liverpool, was, at that time, researching his biography (published in 1979) of Burt, and when it appeared he had not only examined the apparently fraudulent nature of Burt's data but he also reviewed the contributions of Burt to the technique of factor analysis.
The "Burt Scandal" led to a debate by the British Psychological Society, the publication of a "balance sheet" (Beloff, 1980), and a number of attempts to rehabilitate him (Fletcher, 1991; Jensen, 1992; Joynson, 1989). It must be said that Hearnshaw's view that Burt re-wrote history in an attempt to place himself rather than Spearman as the founder of factor analysis has not been viewed as seriously as the allegations of fraud. Nevertheless, insofar as Hearnshaw's words have been picked over in the attempts to at least downplay, if not discredit, his findings, and insofar as the record is important in the history and development of factor analysis, they are worth examining here.

Once more the credit must go to the Lovies (1993), who have provided a "side-by-side" comparison of Spearman's prepublication notes and comments on Burt's work and the relevant sections of the paper itself. The publication of Burt's (1909) paper on "general intelligence" was preceded by some important correspondence with Spearman, who saw the paper as offering support for his two-factor theory. In fact, Spearman re-worked the paper, and large sections of Spearman's notes on it were used substantially verbatim by Burt. As the Lovies point out, this was not blatant plagiarism on Burt's part, even though the latter's acknowledgement of Spearman's role was incomplete:

but was a kind of reciprocated self-interest about the Spearman-Burt axis at this time which allowed Burt to copy from Spearman, and Spearman to view this with comparative equanimity because the article provided such strong support for the two factor theory. (p. 315)
However, of more import as far as our history is concerned is Hearnshaw's contention that Burt, in attempting to displace Spearman, used subtle and sometimes not-so-subtle commentary to place the origins of factor analysis with Pearson's (1901) article on "principal axes" and to suggest that he (Burt) knew of these methods before his exchanges with Spearman. Indeed, Hearnshaw (1979) reports on Burt's correspondence with D. F. Vincent (noted earlier) in which Burt claimed that he had learned of Pearson's work when Pearson visited Oxford and that it was then that he and McDougall (Burt's mentor) became interested in the techniques. After Spearman died, Burt's papers repeatedly emphasize the priority of Pearson and Burt's role in the elaboration of methods that became known as factor analysis. Hearnshaw notes that Burt did not mention Pearson's 1901 work until 1947 and implies that it was the publication of Thurstone's book (1947) Multiple Factor Analysis and its final chapter on "The Principal Axes" that alerted Burt. It should also be noted that Wolfle's (1940) review, although citing 11 of Burt's papers of the late 1930s, does not mention Pearson at all.

It will be some years before the "Burt scandal" becomes no more than an historical footnote. Burt's major work The Factors of the Mind remains a significant and important contribution, not only to the debate about the nature of intelligence, but also to the application of mathematical methods to the testing of theory. Even the most vehement critics of Burt would likely agree and, with Hearnshaw, might say, "It is lamentable that he should have blotted his record by the delinquencies of his later years."

In particular, Burt's writings moved factor theorizing away from the Spearman contention that g was invariant no matter what types of tests were used - in other words, that different batteries of tests should produce the same estimates of g. The idea of a number of group factors that may be identified from sets of tests that had similar, although not identical, content - for example, verbal factors, numeracy factors, spatial factors and so on - was central in the development of Burt's writings. Overlap among the tests within each factor did not suggest that the g values were identical. Again, Spearman did not dismiss these approaches out of hand, but it is plain that he always favored the two-factor model.

THE PRACTITIONERS

Although the distinction may be somewhat simplistic, it is possible to separate the theoreticians from the applied practitioners in these early years of factor analysis. Lovie (1983) discusses the essentially philosophical and experimental drive underlying Spearman's work, as well as that of Thomson and Thurstone. Spearman's early paper (1904a) begins with an introduction that discusses the "signs of weakness" in experimental psychology; he avers that "Wundt's disciples have failed to carry forward the work in all the positive spirit of their master," and he announces a:

"correlational psychology," for the purpose of positively determining all psychical tendencies, and in particular those which connect together the so-called "mental tests" with psychical activities of greater generality and interest. (p. 205)
The work culminated in a book that set out his fundamental, if not yet watertight and complete, conceptual framework. Burt's work with its continuing premise of the heritability of intelligence also shows the need to seek support for a particular construction of mental life.

On the other hand, a number of contemporary researchers, whose contributions to factor analysis were also impressive, were fundamentally applied workers. Two, Kelley and Thomson, were in fact Professors of Education, and a third, Louis Thurstone, was a giant in the field of scaling techniques and testing. Their contributions were marked by the task of developing tests of abilities and skills, measures of individual differences. Truman L. Kelley (1884-1961), Professor of Education and Psychology at Stanford University, published his Crossroads in the Mind of Man in 1928. In his preface he welcomes Spearman's work, but in his first chapter, "Boundaries of Mental Life," he makes it clear that his work indicates that several traits are included in Spearman's g, and he emphasises these group factors throughout his book:

Mental life does not operate in a plain but in a network of canals. Though each canal may have indefinite limits in length and depth, it does not in width; though each mental trait may grow and become more and more subtle, it does not lose its character and discreteness from other traits. (p. 23)
Lovie and Lovie (1995) report on correspondence between Wilson and H. W. Holmes, Dean of the School of Education at Harvard, after Wilson had read and reviewed (1929a) Kelley's book: "If Spearman is on dangerous ground, Kelley is sitting on a volcano." Wilson's (1929a) review of Kelley again concentrates on the mathematics and again raises, in even more acute fashion, the problem of indeterminacy. It is not an unkind review, but it turns Kelley's statement (quoted above) on its head:

Mental life in respect of its resolution into specific and general factors does not operate in a network of canals but in a continuously distributed hyperplane ... no trait has any separate or discrete existence, for each may be replaced by proper linear combinations of the others proceeding by infinitesimal gradations and it is only the complex of all that has reality. Instead of writing of crossroads in the mind of man one should speak of man's trackless mental forest or tundra or jungle - mathematically mind you, according to the analysis offered. The canals or crossroads have been put in without any indication, so far as I can see, that they have been put in any other way than we put in roads across our midwestern plains, namely to meet the convenience or to suit the fancy of the pioneers. (p. 164)
Wilson, a highly competent mathematician, had obviously had a struggle with Kelley's mathematics. He concludes:

The fact of the matter is that the author after an excellent introductory discussion of the elements of the theory of the resolution into general and specific factors ... gives up the general problem entirely, throws up his hands so to speak, and proceeds to develop special methods of examining his variables to see where there seem to be specific bonds between them. (p. 160)
Kelley, perhaps a little bemused, perhaps a little rueful, and almost certainly grateful that Wilson's review had not been more scathing, responds: "The mathematical tool is sharp and I may nick, or may already have nicked myself with it. At any rate I do still enjoy the fun of whittling and of fondling such clean-cut chips as Wilson has let fall" (p. 172).

Sir Godfrey Thomson was Professor of Education at the University of Edinburgh from 1925 to 1951. He had little or no training in psychology; indeed, his PhD was in physics. After graduate studies in Strasbourg, he returned home to Newcastle, England, where he was obliged to take up a post in education to fulfill requirements that were attached to the grants he had received as a student. His work at Edinburgh was almost wholly concerned with the development of mental tests and the refining of his approach to factor analysis. In 1939 his book The Factorial Analysis of Human Ability was published, a book that established his reputation and set him firmly outside the Spearman camp. A later edition (1946) expanded his views and introduced to a wider audience the work of Louis Thurstone (1887-1955). Thomson's name is not as well known as Spearman's, nor is his work much cited. This is probably because he did not develop a psychological theory of intelligence, although he did offer explanations for his findings. He can, however, be regarded as Spearman's main British rival, and he parts company with him on three main grounds: first, that Spearman's analysis had not and could not conclusively demonstrate the "existence" of g; second, that even if the existence of g was admitted, it was misleading and dangerous to reify it so that it became a crucial psychological entity; and third, that Spearman's hierarchical model had been overtaken by multiple factor models that were at once more sophisticated and had more explanatory power.

The Spearman school of experimenters, however, tend always to explain as much as possible by one central factor.... There are innumerable other ways of explaining these same correlations ... And the final decision between them has to be made on some other grounds. The decision may be psychological.... Or the decision may be made on the ground that we should be parsimonious in our invention of "factors," and that where one general and one group factor will serve we should not invent five group factors. (Thomson, 1946, pp. 14-15)
Thomson was not saying that Spearman's view does not make sense but that it is not inevitably the only view. Spearman had agreed that g was present in all the factors of the mind, but in different amounts or "weights," and that in some factors it was small enough to be neglected. Thomson welcomed these views, "provided that g is interpreted as a mathematical entity only, and judgement is suspended as to whether it is anything more than that" (p. 240). And his verdict on two-factor theory? After commenting that the method of two factors was an analytical device for indicating their presence, and that advances in method had been made that led to multiple factor analysis, he added: "It was Professor Thurstone of Chicago who saw that one solution to the problem could be reached by a generalization of Spearman's idea of zero tetrad differences" (p. 20).
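The idea of a zero tetrad difference can be illustrated with a short sketch. It assumes, purely for illustration, four tests whose correlations arise from a single general factor, so that each correlation is the product of two loadings; the loading values, the added "group" bond of 0.2, and all function names are hypothetical and are not taken from Spearman or Thomson.

```python
# Hypothetical loadings of four tests on one general factor (illustrative values only).
LOADINGS = [0.9, 0.8, 0.7, 0.6]


def one_factor_r(i, j):
    """Correlation implied by a single common factor: r_ij = l_i * l_j."""
    return LOADINGS[i] * LOADINGS[j]


def with_group_factor_r(i, j):
    """Same model, but tests 0 and 1 share an extra (assumed) bond of 0.2."""
    extra = 0.2 if {i, j} == {0, 1} else 0.0
    return one_factor_r(i, j) + extra


def tetrad_differences(r, a, b, c, d):
    """Return the three tetrad differences that can be formed from four tests."""
    return (
        r(a, b) * r(c, d) - r(a, c) * r(b, d),
        r(a, c) * r(b, d) - r(a, d) * r(b, c),
        r(a, d) * r(b, c) - r(a, b) * r(c, d),
    )


one_factor_tetrads = tetrad_differences(one_factor_r, 0, 1, 2, 3)
group_factor_tetrads = tetrad_differences(with_group_factor_r, 0, 1, 2, 3)

print("one general factor:", [round(t, 6) for t in one_factor_tetrads])
print("with a group factor:", [round(t, 6) for t in group_factor_tetrads])
```

Under the one-factor assumption every tetrad difference vanishes (up to rounding), whereas the extra bond shared by two of the tests leaves some of them non-zero. The sketch shows only the arithmetic; it does not settle which explanation of the correlations is to be preferred, which is Thomson's point.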
THE PRACTITIONERS
Thomson himself had suggested a "sampling theory" to replace Spearman's approach:
The alternative theory to explain the zero tetrad differences is that each test calls upon a sample of bonds which the mind can form, and that some of these bonds are common to two tests and cause their correlation. (p. 45)
Thomson did not give more than a very general view of what the bonds might be, although it seems that they were fundamentally neurophysiological and, from a psychological standpoint, akin to the connections in the connectionist views of learning first advanced by Thorndike:
What the "bonds" of the mind are, we do not know. But they are fairly certainly associated with the neurones or nerve cells of our brains... Thinking is accompanied by the excitation of these neurones in patterns. The simplest patterns are instinctive, more complex ones acquired. Intelligence is possibly associated with the number and complexity of the patterns which the brain can (or could) make. (Thomson, 1946, p. 51)
And, lastly, to complete this short commentary on Thomson's efforts, his demonstration of geometrical methods for the illustration of factor theory contrasts markedly with both Spearman's and Burt's work and is much more in tune with the "rotation" methods of later contributors, notably Thurstone.
In 1939, as part of a General Meeting of the British Psychological Society, Burt, Spearman, Thomson (1939a), and Stephenson each gave papers on the factorial analysis of human ability. These papers were published in the British Journal of Psychology, together with a summing up by Thomson (1939b). Thomson points out that neither Thurstone nor any of his followers were present to defend their views and gives a very brief and favorable summary of them. He even says, citing Professor Dirac at a meeting of the Royal Society of Edinburgh:
When a mathematical physicist finds a mathematical solution or theorem which is particularly beautiful... he can have considerable confidence that it will prove to correspond to something real in physical nature. Something of the same faith seems to lie behind Prof. Thurstone's trust in the coincidence of "Simple Structure" in the matrix of factor loadings with psychological significance in the factors thus defined. (Thomson, 1939b, p. 105)
He also noted that Burt had "expressed the hope that the 'tetrad difference' of the four symposiasts would be found to vanish!" (p. 108).
Louis L. Thurstone produced his first work on "the multiple factor problem" in 1931. The Vectors of the Mind appeared in 1935 and what he himself defined
as a development and expansion of this work, Multiple Factor Analysis, in 1947. Reprints of the book were still being published long after his death. The following scheme might give some feel for Thurstone's approach. We looked earlier at the matrix of correlations that produce a tetrad difference. Now consider:
This is a minor determinant of order three. Now a new tetrad can be formed:
If this new tetrad is zero then the determinant "vanishes." This procedure can be carried out with determinants of any order and if we come to a stage when all the minors "vanish" the "rank" of the correlation matrix will be reduced accordingly. Thurstone found that tests could be analyzed into as many common factors as the reduced rank of the correlation matrix. The present account follows that of Thomson (1946, chap. 2), who gives a numerical example. The rank of the matrix of all the correlations may be reduced by inserting values in the diagonal of the matrix - the communalities. These values may be thought of as self-correlations, in which case they are all 1. But they may also be regarded as that part of the variance in the variable that is due to the common factors, in which case they have to be estimated in some fashion. They might even be merely guessed. Thurstone chose values that made the tetrads zero. If these methods seem not to be entirely satisfactory now, they were certainly not welcomed by psychologists in the 1920s, 1930s, and 1940s who were trying to come to grips with the new mathematical approaches of the factor theorists and their views of human intelligence.
Thurstone's centroid method bears some similarities to principal components analysis. The word centroid refers to a multivariate average, and Thurstone's "first centroid" may be thought of as an average of all the tests in a battery. Central to the outcome of the analysis is the production of factor loadings - values that express the relationship of the tests to the presumed underlying factors. The ways in which these loadings are arrived at are many and various, and Thurstone's centroid technique was just one early approach. However, it was an approach that was relatively easy to apply and (with some mathematics) relatively easy to understand. How the factors were interpreted psychologically was, and is, largely a matter for the researcher. Thus tests that were related that involved arithmetic or numerical
reasoning or number relationships might be labeled numerical, or numeracy. Thurstone maintained that complete interpretation of factors involved the technique of rotation. Loadings in a principal factors matrix reflect common factor variances in test scores. The matrices are arbitrary, for they can be manipulated to show different axes - different "frames of reference." We can only make scientific sense of the outcomes when the axes are rotated to pass through loadings that show an assumed factor that represents a "psychological reality." Such rotations, which may produce orthogonal (uncorrelated) axes or oblique (correlated) axes, were first carried out by a mixture of mathematics, intuition, and inspiration and have become a standard part of modern factor analysis. It is perhaps worth noting that Thurstone's early training was in engineering and that he taught geometry for a time at the University of Minnesota.
The configurational interpretations are evidently distasteful to Burt, for he does not have a single diagram in his text. Perhaps this is indicative of individual differences in imagery types which leads to differences in methods and interpretation among scientists. (Thurstone, 1947, p. ix)
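The sense in which a loading matrix is "arbitrary" can be shown with a small sketch of an orthogonal rotation. The loadings of three tests on two factors are invented for illustration, and the function names are hypothetical; this is not Thurstone's own procedure, only a demonstration that rotating the axes changes the individual loadings while leaving the communalities and the reproduced correlations untouched.

```python
import math

# Hypothetical loadings of three tests on two unrotated orthogonal factors.
LOADINGS = [(0.7, 0.5), (0.6, -0.4), (0.8, 0.1)]


def rotate(loadings, degrees):
    """Refer each test to a pair of axes turned through the given angle."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(a * c + b * s, -a * s + b * c) for a, b in loadings]


def communalities(loadings):
    """Sum of squared loadings for each test (variance due to common factors)."""
    return [a * a + b * b for a, b in loadings]


def reproduced_r(loadings, i, j):
    """Correlation between tests i and j implied by the loadings."""
    return sum(x * y for x, y in zip(loadings[i], loadings[j]))


rotated = rotate(LOADINGS, 30)

print("unrotated loadings:", LOADINGS)
print("rotated loadings:  ", [(round(a, 3), round(b, 3)) for a, b in rotated])
print("communalities before:", [round(h, 6) for h in communalities(LOADINGS)])
print("communalities after: ", [round(h, 6) for h in communalities(rotated)])
```

Because every rotation fits the correlations equally well, the choice among them has to be made on other grounds, which is why the search for an objective criterion of "simple structure" became so pressing.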
Thurstone's work eventually produced a set of seven primary mental abilities: verbal comprehension, word fluency, numerical, spatial, memory, perceptual speed, and reasoning. This kind of scheme became widely popular among psychologists in the United States, although many were disconcerted by the fact that the number of factors tended to grow. J. P. Guilford (1897-1987) devised a theoretical model that postulated 120 factors (1967), and by 1971 the claim was made that almost 100 of them had been identified.
An ongoing problem for the early researchers was the evident subjective element in all the methods. Mathematicians stepped into the picture (Wilson was one of the early ones), often at the request of the psychologists or educationists who were not generally mathematical sophisticates (Thomson, not trained as a psychologist, was an exception), but they were remarkably quick learners. The search for analytical methods for assessing "simple structure" began. Carroll (1953) was among the earliest who tackled the problem:
A criticism of current practice in multiple factor analysis is that the transformation of the initial factor matrix F to a rotated "simple structure" matrix V must apparently be accompanied by methods which allow considerable scope for subjective judgement. (p. 23)
Carroll's paper goes on to describe a mathematical method that avoids subjective decisions. Modern factor analysis had begun. From then on a great variety of methods was developed, and it is no accident that such solutions were developed coincidentally with the rise of the use of the high-speed digital
computer, which removed the computational labor from the more complex procedures. Harman (1913-1976), building on earlier work (Holzinger [1892-1954] & Harman, 1941), published Modern Factor Analysis in 1960, with a second and a third (1976) edition, and the book is a classic in the field.
In research on the structure of human personality, the giants in the area were Raymond Cattell (1905-1998) and Hans Eysenck (1916-1997). Cattell had been a student of Cyril Burt, and in his early work had introduced the concepts of fluid intelligence, which is akin to g, a broad, biologically based construct of general mental ability, and crystallized intelligence, which depends on learning and environmental experience. Later, Cattell's work concentrated on the identification and measurement of the factors of human personality (1965a), and his 16PF (16 personality factors) test is widely used. His Handbook of Multivariate Experimental Psychology (1965b) is a compendium of the multivariate approach with a number of eminent contributors. Cattell himself contributed 6 of the 27 chapters and co-authored another.
Cattell's research output, over a very long career, was massive, as was that of Eysenck. The latter's method, which he termed criterion analysis, was a use of factor analysis that attempted to confirm the existence of previously hypothesized factors. Three dimensions were eventually identified by Eysenck, and his work helped to set off one of psychology's ongoing arguments about just how many personality factors there really were. This is not the place to review this debate, but it does, once again, place factor analysis at the center of sometimes quite heated discussion about the "reality" of factors and the extent to which they reflect the theoretical preconceptions of the investigators. "What comes out is no more than what goes in," is the cry of the critics.
What is absolutely clear from this situation is that it is necessary to validate the factors by experimental investigation that stands aside from the methods used to identify the factors and avoids capitalizing on semantic similarities among the tests that were originally employed. It has to be acknowledged, of course, that both Cattell and Eysenck (1947; Eysenck & Eysenck, 1985) and their many collaborators have tried to do just that.
Recent years have seen an increase in the use of factor-analytic techniques in a variety of disciplines and settings, even apparently in the analysis of the performance of racehorses! In psychology, the sometimes heated discussion of the dangers of the reification of factors, their treatment as psychological entities, as well as arguments over, for example, just how many personality dimensions are necessary and/or sufficient to define variations in human temperament, continue. At bottom, a large part of these problems concerns the use of the technique either as a confirmatory tool for theory or as a search for new structure in our data. Unless these two quite different views are recognized at the start of discussion the debate will take on an exasperating futility that stifles all progress.
12 The Design of Experiments
THE PROBLEM OF CONTROL
When Ronald Fisher accepted the post of statistician at Rothamsted Experimental Station in 1919, the tasks that faced him were to make what he could of a large quantity of existing data from ongoing long-term agricultural studies (one had begun in 1843!) and to try to improve the effectiveness of future field trials. Fisher later described the first of these tasks as "raking over the muck heap"; the second he approached with great vigor and enthusiasm, laying as he did so the foundations of modern experimental design and statistical analysis.
The essential problem is the problem of control. For the chemist in the laboratory it is relatively easy to standardize and manipulate the conditions of a specific chemical reaction. Social scientists, biologists, and agricultural researchers have to contend with the fact that their experimental material (people, animals, plants) is subject to irregular variation that arises as a result of complex interactions of genetic factors and environmental conditions. These many variations, unknown and uncertain, make it very difficult to be confident that observed differences in experimental observations are due to the manipulations of the experimenter rather than to chance variation. The challenge of the psychological sciences is the sensitivity of behavior and experience to a multiplicity of factors. But in many respects the challenge has not been answered because the unexplained variation in our observations is generally regarded as a nuisance or as irrelevant.
It is useful to distinguish between experimental control and the controlled experiment. The former is the behaviorist's ideal, the state where some consistent behavior can be set off and/or terminated by manipulating precisely specified variables. On the other hand, the controlled experiment describes a procedure in which the effect of the manipulation of the independent variable or variables is, as it were, checked against observations undertaken in the
absence of the manipulation. This is the method that is employed and supported by the followers of the Fisherian tradition. The uncontrolled variables that affect the observations are assumed to operate in a random fashion, changing individual behavior in all kinds of ways so that when the data are averaged their effects are canceled out, allowing the effect of the manipulated variable to be seen. The assumption of randomness in the influence of uncontrolled variables is, of course, not one that is always easy to justify, and the relegation of important influences on variability to error may lead to erroneous inferences and disastrous conclusions.
Malaria is a disease that has been known and feared for centuries. It decimated the Roman Empire in its final years, it was quite widespread in Britain during the 17th century, and indeed it was still found there in the fen country in the 19th century. The names malaria, marsh fever, and paludism all reflect the view that the cause of the disease was the breathing of damp, noxious air in swamp lands. The relationship between swamp lands and the incidence of malaria is quite clear. The relationship between swamp lands and the presence of mosquitos is also clear. But it was not until the turn of the 20th century that it was realized that the mosquito was responsible for the transmission of the malarial parasite, and only 20 years earlier, in 1880, was the parasite actually observed. In 1879, Sir Patrick Manson (1844-1922), a physician whose work played a role in the discovery of the malarial cycle, presented a paper in which he suggested that the disease elephantiasis was transmitted through insect bites. The paper was received with scorn and disbelief. The evidence for the life cycle of the malarial parasite in mosquitos and human beings and its being established as the cause of the illness came in a number of ways - not the least of which was the healthy survival, throughout the malarial season, of three of Manson's assistants living in a mosquito-proof hut in the middle of the Roman Campagna (Guthrie, 1946, pp. 357-358). This episode is an interesting example of the control of a concomitant or correlated bias or effect that was the direct cause of the observations.
In psychological studies, some of the earliest work that used true experimental designs is that of Thorndike and Woodworth (1901) on transfer of training. They used "before-after" designs, control group designs, and correlational studies in their work. However, the now routine inclusion of control groups in experimental investigations in psychology does not appear to have been an accepted necessity until about 50 years ago. In fact, controlled experimentation in psychology more or less coincided with the introduction of Fisherian statistics, and the two quite quickly became inseparable. Of course it would be both foolish and wrong to imply that early empirical investigations in psychology were completely lacking in rigor, and mistaken conclusions rife. The point is that it was not until the 1920s and 1930s that the "rules" of controlled
experimentation were spelled out and appreciated in the psychological sciences. The basic rules had been in existence for many decades, having been codified by John Stuart Mill (1806-1873) in a book, first published in 1843, that is usually referred to as the Logic (1843/1872/1973). These formulations had been preceded by those of an earlier British philosopher, Francis Bacon, who made recommendations for what he thought would be sound inductive procedures.
METHODS OF INQUIRY
Mill proposed four basic methods of experimental inquiry, and the five Canons, the first of which is, "If two or more instances of the phenomenon under investigation have only one circumstance in common, the circumstance in which alone all the instances agree, is the cause (or effect) of the given phenomenon" (Mill, 1843/1872, 8th ed., p. 390). If observations a, b, and c are made in circumstances A, B, and C, and observations a, d, and e in circumstances A, D, and E, then it may be concluded that A causes a. Mill commented, "As this method proceeds by comparing different instances to ascertain in what they agree, I have termed it the Method of Agreement" (p. 390). As Mill points out, the difficulty with this method is the impossibility of ensuring that A is the only antecedent of a that is common to both instances.
The second canon is the Method of Difference. The antecedent circumstances A, B, and C are followed by a, b, and c. When A is absent only b and c are observed:
If an instance in which the phenomenon under investigation occurs, and an instance in which it does not occur, have every circumstance in common save one, that one occurring only in the former; the circumstance in which alone the two instances differ, is the effect or the cause, or an indispensable part of the cause of the phenomenon. (p. 391)
This method contains the difficulty in practice of being unable to guarantee that it is the crucial difference that has been found. As part of the way around this difficulty, Mill introduces a joint method in his third canon:
If two or more instances in which the phenomenon occurs have only one circumstance in common, while two or more instances in which it does not occur have nothing in common save the absence of that circumstance; the circumstance in which alone the two sets of instances differ, is the effect, or the cause, or an indispensable part of the cause, of the phenomenon. (p. 396)
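The logic of the first two canons can be written out as simple set operations. The sketch below reuses the circumstances A, B, C, D, and E from the examples above; the function names are hypothetical, and the code is only an illustration of the reasoning, not anything Mill himself formalized.

```python
def method_of_agreement(instances):
    """Circumstances shared by every instance in which the phenomenon occurs."""
    common = set(instances[0])
    for circumstances in instances[1:]:
        common &= set(circumstances)
    return common


def method_of_difference(with_phenomenon, without_phenomenon):
    """Circumstances present when the phenomenon occurs and absent when it does not."""
    return set(with_phenomenon) - set(without_phenomenon)


# Phenomenon a is observed under A, B, C and again under A, D, E.
agreement = method_of_agreement([{"A", "B", "C"}, {"A", "D", "E"}])

# Phenomenon a is observed under A, B, C but not under B, C alone.
difference = method_of_difference({"A", "B", "C"}, {"B", "C"})

print("agreement picks out:", agreement)
print("difference picks out:", difference)
```

Both functions return only candidates. As Mill recognized, nothing in the procedure guarantees that every relevant circumstance has been recorded, so an unlisted antecedent could still be the true cause.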
In 1881 Louis Pasteur (1822-1895) conducted a famous experiment that exemplifies the methods of agreement and difference. Some 30 farm animals
were injected by Pasteur with a weak culture of anthrax virus. Later these animals and a similar number of others that had not been so "vaccinated" were given a fatal dose of anthrax virus. Within a few days the non-vaccinated animals were dead or dying, the vaccinated ones healthy. The conclusion that was enthusiastically drawn was that Pasteur's vaccination procedure had produced the immunity that was seen in the healthy animals. The effectiveness of vaccination is now regarded as an established fact. But it is necessary to guard against incautious logic. The health of the vaccinated animals could have been due to some other fortuitous circumstance. Because it is known that some animals infected with anthrax do recover, an experimental group composed of these resistant animals could have resulted in a spurious conclusion. It should be noted that Pasteur himself recognized this as a possibility.
Mill's Method of Residues proclaims that having identified by the methods of agreement and differences that certain observed phenomena are the effects of certain antecedent conditions, the phenomena that remain are due to the circumstances that remain. "Subduct from any phenomenon such part as is known by previous inductions to be the effect of certain antecedents, and the residue of the phenomenon is the effect of the remaining antecedents" (1843/1872, 8th ed., p. 398).
Mill here uses a very modern argument for the use of the method in providing evidence for the debate on racial and gender differences:
Those who assert, what no one has shown any real ground for believing, that there is in one human individual, one sex, or one race of mankind over another, an inherent and inexplicable superiority in mental faculties, could only substantiate their proposition by subtracting from the differences of intellect which we in fact see, all that can be traced by known laws either to the ascertained differences of physical organization, or to the differences which have existed in the outward circumstances in which the subjects of the comparison have hitherto been placed. What these causes might fail to account for, would constitute a residual phenomenon, which and which alone would be evidence of an ulterior original distinction, and the measure of its amount. But the assertors of such supposed differences have not provided themselves with these necessary logical conditions in the establishment of their doctrine. (p. 429)
The final method and the fifth canon is the Method of Concomitant Variations: "Whatever phenomenon varies in any manner whenever another phenomenon varies in some particular manner, is either a cause or an effect of that phenomenon, or is connected with it through some fact of causation" (p. 401). This method is essentially that of the correlational study, the observation of covariation:
Let us suppose the question to be, what influence the moon exerts on the surface of
the earth. We cannot try an experiment in the absence of the moon, so as to observe what terrestrial phenomenon her annihilation would put an end to; but when we find that all the variations in the positions of the moon are followed by corresponding variations in the time and place of high water, the place always being either the part of the earth which is nearest to, or that which is most remote from, the moon, we have ample evidence that the moon is, wholly or partially, the cause which determines the tides. (p. 400)
Mill maintained that these methods were in fact the rules for inductive logic, that they were both methods of discovery and methods of proof. His critics, then and now, argued against a logic of induction (see chapter 2), but it is clear that experimentalists will agree with Mill that his methods constitute the means by which they gather experimental evidence for their views of nature.
The general structure of all the experimental designs that are employed in the psychological sciences may be seen in Mill's methods. The application and withholding of experimental treatments across groups reflect the methods of agreement and differences. The use of placebos and the systematic attempts to eliminate sources of error make up the method of residues, and the method of concomitant variation is, as already noted, a complete description of the correlational study. It is worth mentioning that Mill attempted to deal with the difficulties presented by the correlational study and, in doing so, outlined the basics of multiple regression analysis, the mathematics of which were not to come for many years:
Suppose, then, that when A changes in quantity, a also changes in quantity, and in such a manner that we can trace the numerical relation which the changes of the one bear to such changes of the other as take place within the limits of our observation. We may then safely conclude that the same numerical relation will hold beyond those limits. (Mill, 1843/1872, 8th ed., p. 403)
Mill elaborates on this proposition and goes on to discuss the case where a is not wholly the effect of A but nevertheless varies with it:
It is probably a mathematical function not of A alone, but of A and something else: its changes, for example, may be such as would occur if part of it remained constant, or varied on some other principle, and the remainder varied in some numerical relation to the variations of A. (p. 403)
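Mill's case of a quantity that has a constant part and a part varying "in some numerical relation" to A corresponds, in modern terms, to a straight line with an intercept and a slope. The sketch below fits such a line by ordinary least squares; it is the simple one-predictor case rather than multiple regression, the data are invented so that the constant part is 3 and the relation to A is 2, and the function name is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical observations: a = 3 (constant part) + 2 * A (part that varies with A).
A_values = [1, 2, 3, 4, 5]
a_values = [3 + 2 * A for A in A_values]

intercept, slope = least_squares_line(A_values, a_values)
print("constant part:", intercept)
print("numerical relation to A:", slope)
```

The fitted line describes the relation only within the observed range of A. Mill's confidence that the same relation "will hold beyond those limits" is an assumption about extrapolation that the arithmetic itself cannot justify.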
Mill's Logic is his principal work, and it may be fairly cast as the book that first describes both a justification for, and the methodology of, the social sciences. Throughout his works, the influence of the philosophers who were, in fact, the early social scientists, is evident. David Hartley (1705-1757) published his Observations on Man in 1749. This book is a psychology rather
than a philosophy (Hartley, 1749/1966). It systematically describes associationism in a psychological context and is the first text that deals with physiological psychology. James Mill (1773-1836), John Stuart's father, much admired Hartley, and his major work became one of the main source books that Mill the elder introduced to his son when he started his formal education at the age of 3 (when he learned Ancient Greek), although he did not get to formal logic until he was 12. Another important early influence was Jeremy Bentham (1748-1832), a reformer who preached the doctrine of Utilitarianism, the essential feature of which is the notion that the several and joint effects of pleasure and pain govern all our thoughts and actions. Later Mill rejected strict Benthamism and questioned the work of a famous and influential contemporary, Auguste Comte (1798-1857), whose work marks the foundation of positivism and of sociology. The influence of the ideas of these thinkers on early experimental psychology is strong and clear, but they will not be explored here. The main point to be made is that John Stuart Mill was an experimental psychologist's philosopher. More, he was the methodologist's philosopher. In a letter to a friend, he said:
If there is any science which I am capable of promoting, I think it is the science of science itself, the science of investigation - of method. I once heard Maurice say ... that almost all differences of opinion when analysed, were differences of method. (quoted by Robson in his textual introduction to the Logic, p. xlix)
And it is clear that all subsequent accounts of method and experimental design can be traced back to Mill.
THE CONCEPT OF STATISTICAL CONTROL
The standard design for agricultural experiments at Rothamsted in the days before Fisher was to divide a field into a number of plots. Each plot would receive a different treatment, say, a different manure or fertilizer or manure/fertilizer mixture. The plot that produced the highest yield would be taken to be the best, and the corresponding treatment considered to be the most effective. Fisher, and others, realized that soil fertility is by no means uniform across a large field and that this, as well as other factors, can affect the yields. In fact, the differences in the yields could be due to many factors other than the particular treatments, and the highest yield might be due to some chance combination of these factors. The essential problem is to estimate the magnitude of these chance factors - the errors - to eliminate, for example, the differences in soil fertility.
Some of the first data that Fisher saw at Rothamsted were the records of
daily rainfall and yearly yields from plots in the famous Broadbalk wheat field. Fertilizers had been applied to these plots, using the same pattern, since 1852. Fisher used the method of orthogonal polynomials to obtain fits of the yields over time. In his paper on these data (1921b) he describes analysis of variance (ANOVA) for the first time.
When the variation of any quantity (variate) is produced by the action of two or more independent causes, it is known that the variance produced by all the causes simultaneously in operation is the sum of the values of the variance produced by each cause separately... In Table II is shown the analysis of the total variance for each plot, divided according as it may be ascribed (i) to annual causes, (ii) to slow changes other than deterioration, (iii) to deterioration; the sixth column shows the probability of larger values for the variance due to slow changes occurring fortuitously. (Fisher, 1921b, pp. 110-111)
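The additivity of variance that Fisher describes can be checked with a small worked example. The sketch assumes two hypothetical causes whose effects are fully crossed, so that they are exactly uncorrelated; the effect values and names are invented for illustration, and exact fractions are used to avoid rounding error.

```python
from fractions import Fraction
from itertools import product


def variance(values):
    """Population variance (mean squared deviation from the mean), computed exactly."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((Fraction(v) - mean) ** 2 for v in values) / n


# Hypothetical effects of two independent causes on some quantity.
cause_1 = [-1, 0, 1]
cause_2 = [-2, 2]

# Every level of one cause occurs with every level of the other,
# so the two sets of effects are uncorrelated in the combined data.
combined = [x + y for x, y in product(cause_1, cause_2)]

print("variance due to cause 1:", variance(cause_1))
print("variance due to cause 2:", variance(cause_2))
print("variance of combined effects:", variance(combined))
```

The total variance equals the sum of the two separate variances only because the causes are independent. When the causes are correlated, a covariance term enters and the simple partition no longer holds.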
The method of data analysis that Fisher employed was ingenious and painstaking, but he realized quickly that the data that were available suffered from deficiencies in the design of their collection. Fisher set out on a new series of field trials. He divided a field into blocks and subdivided each block into plots. Each plot within the block was given a different treatment, and each treatment was assigned to each plot randomly. This, as Bartlett (1965) puts it, was Fisher's "vital principle."
When statistical data are collected as natural observations, the most sensible assumptions about the relevant statistical model have to be inserted. In controlled experimentation, however, randomness could be introduced deliberately into the design, so that any systematic variability other than [that] due to imposed treatments could be eliminated.
The second principle Fisher introduced naturally went with the first. With statistical analysis geared to the design, all variability not ascribed to the influence of treatments did not have to inflate the random error. With equal numbers of replications for the treatments each replication could be contained in a distinct block, and only variability among plots in the same block were a source of error - that between blocks could be removed. (Bartlett, 1965, p. 405)
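The layout described above, in which every treatment appears once in each block and is allotted to plots at random, can be sketched in a few lines. The treatment labels, the number of blocks, and the function name are hypothetical; the seed is fixed only so that the illustration is repeatable.

```python
import random


def randomized_block_layout(treatments, n_blocks, seed=None):
    """Return one random ordering of all treatments for each block."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        block = list(treatments)   # every treatment occurs once per block
        rng.shuffle(block)         # chance alone decides which plot gets which
        layout.append(block)
    return layout


TREATMENTS = ["A", "B", "C", "D"]  # hypothetical treatment labels
layout = randomized_block_layout(TREATMENTS, n_blocks=3, seed=1)

for number, block in enumerate(layout, start=1):
    print(f"block {number}: {block}")
```

Because each block contains a complete set of treatments, differences in fertility between blocks affect all treatments alike and can be removed from the estimate of error, while the random order within a block guards against systematic bias from position.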
The statistical analysis allowed for an even more radical break with traditional experimental methods:
No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logical and carefully thought out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed. (Fisher, 1926b, p. 511)
Fisher's "carefully thought out questionnaire" was the factorial design. All possible combinations of treatments would be applied with replications. For example, in the application of nitrogen (N), phosphate (P), and potash (K) there would be eight possible treatment combinations: no fertilizer, N, P, K, N & P, N & K, P & K, and N & P & K. Separate compact blocks would be laid out and these combinations would be randomly applied to plots within each block. This design allows for an estimation of the main effects of the basic fertilizers, the first-order interactions (the effect of two fertilizers in combination), and the second-order interaction (the effect of the three fertilizers in combination). The 1926(b) paper sets out Fisher's rationale for field experiments and was, as he noted, the precursor of his book, The Design of Experiments (1935/1966), published 9 years later. The paper is illustrated with a diagram (Fig. 12.1) of a "complex experiment with winter oats" that had been carried out with a colleague at Rothamsted (Eden & Fisher, 1927). Here 12 treatments, including absence of treatments - the "control" plots - were tested.
FIG. 12.1 Fisher's Design 1926. Journal of the Ministry of Agriculture
Any general difference between sulphate and chloride, between early and late application, or ascribable to quantity of nitrogenous manure, can be based on thirty-two comparisons, each of which is affected by such soil heterogeneity as exists between plots in the same block. To make these three sets of comparisons only, with the same accuracy, by single question methods, would require 224 plots, against our 96; but in addition many other comparisons can be made with equal accuracy, for all combinations of the factors concerned have been explored. Most important of all, the conclusions drawn from the single-factor comparisons will be given, by the variation of non-essential conditions, a very much wider inductive basis than could be obtained, by single question methods, without extensive repetitions of the experiment. (Fisher, 1926b, p. 512)
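The eight treatment combinations of the nitrogen, phosphate, and potash example, and the way a factorial layout lets every plot contribute to every main effect, can be sketched as follows. The yields are entirely hypothetical and deliberately additive (a base of 10 units, with N adding 4, P adding 2, and K adding 1), and the function names are invented for the illustration.

```python
from itertools import product

FERTILIZERS = ("N", "P", "K")


def treatment_combinations(factors):
    """All combinations of presence or absence of each factor (2 ** len(factors))."""
    combos = []
    for present in product((False, True), repeat=len(factors)):
        combos.append(tuple(f for f, on in zip(factors, present) if on))
    return combos


def main_effect(yields, factor):
    """Mean yield with the factor present minus mean yield with it absent."""
    with_factor = [y for combo, y in yields.items() if factor in combo]
    without_factor = [y for combo, y in yields.items() if factor not in combo]
    return sum(with_factor) / len(with_factor) - sum(without_factor) / len(without_factor)


combos = treatment_combinations(FERTILIZERS)

# Hypothetical additive yields, for illustration only.
GAINS = {"N": 4, "P": 2, "K": 1}
yields = {combo: 10 + sum(GAINS[f] for f in combo) for combo in combos}

for combo in combos:
    print(" & ".join(combo) or "no fertilizer", "->", yields[combo])
for fertilizer in FERTILIZERS:
    print("main effect of", fertilizer, "=", main_effect(yields, fertilizer))
```

Each main effect is estimated from all eight plots, four with the fertilizer and four without, which is the economy Fisher claims for the factorial arrangement over asking one question at a time. With real data the yields would not be additive, and the departures from additivity are what the interaction terms measure.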
The algebra and the arithmetic of the analysis are dealt with in the following chapter. The crucial point of this work is the combination of statistical analysis with experimental design. Part of the stimulus for this paper was Sir John Russell's (1926) article on field experiments, which had appeared in the same journal just months earlier. Russell's review presented the orthodox approach to field trials and advocated carefully planned, systematic layouts of the experimental plots. Sir John Russell was the Director of the Rothamsted Experimental Station, he had hired Fisher, and he was Fisher's boss, but Fisher dismissed his methodology. In a footnote in the 1926(b) paper, Fisher says:
This principle was employed in an experiment on the influence of the weather on the effectiveness of phosphates and nitrogen alluded to by Sir John Russell. The author must disclaim all responsibility for the design of this experiment, which is, however, a good example of its class. (Fisher, 1926b, p. 506)
And as Fisher Box (1978) remarks:
It is a measure of the climate of the times that Russell, an experienced research scientist who . . . had had the wisdom to appoint Fisher statistician for the better analysis of the Rothamsted experiments, did not defer to the views of his statistician when he wrote on how experiments were made. Design was, in effect, regarded as an empirical exercise attempted by the experimenter; it was not yet the domain of statisticians. (p. 153)
In fact the statistical analysis, in a sense, arises from the design. Nowadays, when ANOVA is regarded as efficient and routine, the various designs that are available and widely used are dictated to us by the knowledge that the reporting of statistical outcomes and their related levels of significance is the sine qua non of scientific respectability and acceptability by the psychological establishment. Historically, the new methods of analysis came first. The confounds, defects, and confusions of traditional designs became apparent when ANOVA was used to examine the data and so new designs were undertaken.
Randomization was demanded by the logic of statistical inference. Estimates of error and valid tests of statistical significance can only be made when the assumptions that underlie the theory of sampling distributions are upheld. Put crudely, this means that "blind chance" should not be restricted in the assignment of treatments to plots, or experimental groups. It is, however, important to note that randomization does not imply that no restrictions or structuring of the arrangements within a design are possible. Figure 12.2 shows two systematic designs, (a) a block design and (b) a Latin square design, and two randomized designs of the same type, (c) and (d). The essential difference is that chance determines the application of the various treatments applied to the plots in the latter arrangements, but the restrictions are apparent. In the randomized block and in the randomized Latin square, each block contains one replication of all the treatments.
The estimate of error is valid, because, if we imagine a large number of different results obtained by different random arrangements, the ratio of the real to the estimated error, calculated afresh for each of these arrangements, will be actually distributed in the theoretical distribution by which the significance of the result is tested. Whereas if a group of arrangements is chosen such that the real errors in this group are on the whole less than those appropriate to random arrangements, it has now been demonstrated that the errors, as estimated, will, in such a group, be higher than is usual in random arrangements, and that, in consequence, within such a group, the test of significance is vitiated. (Fisher, 1926b, p. 507)
FIG. 12.2 Experimental Designs
(a) Systematic block design (treatments on plots 1 to 5 within each block)
Block 1: A B C D E
Block 2: A B C D E
Block 3: A B C D E
Block 4: A B C D E
Block 5: A B C D E
(b) A Standard Latin Square
A B C D
B A D C
C D B A
D C A B
(c) Randomized block design (treatments on plots 1 to 5 within each block)
Block 1: D C E A B
Block 2: A D B C E
Block 3: B A E C D
Block 4: E D C A B
Block 5: B A D E C
(d) A Random Latin Square
D C B A
A B D C
C D A B
B A C D
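As a rough illustration of how chance can operate inside the Latin square restriction, the sketch below shuffles the rows and columns of the standard square shown in panel (b) of Fig. 12.2. This is an assumed, simplified procedure: permuting rows and columns always yields another Latin square, but it is not claimed to reproduce Fisher's own method of selection, and on its own it does not sample uniformly from all possible squares. The function names are hypothetical.

```python
import random

# The standard 4 x 4 square from panel (b) of Fig. 12.2.
STANDARD_SQUARE = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["C", "D", "B", "A"],
    ["D", "C", "A", "B"],
]


def is_latin(square):
    """True if every treatment occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[j] for row in square} == symbols for j in range(n))
    return len(symbols) == n and rows_ok and cols_ok


def randomize_latin_square(square, seed=None):
    """Return a new square obtained by randomly permuting rows and columns."""
    rng = random.Random(seed)
    n = len(square)
    row_order = list(range(n))
    col_order = list(range(n))
    rng.shuffle(row_order)
    rng.shuffle(col_order)
    return [[square[r][c] for c in col_order] for r in row_order]


# Hypothetical usage: one randomized arrangement for a 4 x 4 field layout.
for row in randomize_latin_square(STANDARD_SQUARE, seed=12):
    print(" ".join(row))
```

The restriction (each treatment once per row and once per column) survives the shuffle, which is the sense in which randomization and structure coexist in these designs.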
Fisher later examines the utility of the Latin square design, pointing out that it is by far the most efficient and economical for "those simple types of manurial trial in which every possible comparison is of equal importance" (p. 510). In 1925 and early 1926, Fisher enumerated the 5 x 5 and 6 x 6 squares, and in the 1926 paper he made an offer that undoubtedly helped to spread the name, and the fame, of the Rothamsted Station to many parts of the world:
The Statistical Laboratory at Rothamsted is prepared to supply these, or other types of randomized arrangements, to intending experimenters; this procedure is considered the more desirable since it is only too probable that new principles will, at their inception, be, in some detail or other, misunderstood and misapplied; a consequence for which their originator, who has made himself responsible for explaining them, cannot be held entirely free from blame. (Fisher, 1926b, pp. 510-511)
THE LINEAR MODEL
Fisher described ANOVA as a way of "arranging the arithmetic" (Fisher Box, 1978, p. 109), an interpretation with which not a few students would quarrel. However, the description does point to the fact that the components of variance are additive and that this property is an arithmetical one and not part of the calculus of probability and statistical inference as such. The basic construct that marks the culmination of Fisher's work is that of specifying values of an unknown dependent variable, $y$, in terms of a linear set of parameters, each one of which weights the several independent variables $x_1, x_2, x_3, \ldots, x_n$ that are used for prediction, together with an error component $\varepsilon$ that accounts for the random fluctuations in $y$ for particular fixed values of $x_1, x_2, x_3, \ldots, x_n$. In algebraic terms,
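Written out, a model of this kind has the familiar form below. The symbols $\beta_j$ for the weights and the inclusion of an intercept $\beta_0$ are notational assumptions made here for illustration; the description above specifies only the dependent variable, the independent variables, and the error component.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n + \varepsilon
```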
As we noted earlier, the random component in the model and the fact that it is sample-based make it a probabilistic model, and the properties of the distribution of this component, real or assumed, govern the inferences that may be made about the unknown dependent variable. Fisher's work is the crucial link between classical least squares analysis and regression analysis. As Seal (1967) notes, "The linear regression model owes so much to Gauss that we believe it should bear his name" (p. 1). However, there is little reason to suppose that this will happen. Twenty years ago Seal found that very few of the standard texts on regression, or the linear model, or ANOVA made more than a passing reference to Gauss, and the situation is little changed today. Some of the reasons for this have already been mentioned in this book. The European statisticians of the 18th and 19th centuries were concerned with vital statistics and political arithmetic, and inference and prediction in the modern sense were, generally speaking, a long way off in these fields. The mathematics of Legendre and Gauss and others on the theory of errors did not impinge on the work of the statisticians. Perhaps more strikingly, the early links between social and vital data and error theory that were made by Laplace and Quetelet were largely ignored by Karl Pearson and Ronald Fisher.
Why, then, could not the Theory of Errors be absorbed into the broader concept of statistical theory ... ? ... The original reason was Pearson's preoccupation with the multivariate normal distribution and its parameters. The predictive regression equation of his pathbreaking 'regression' paper (1896) was not seen to be identical in form and solution to Gauss's Theoria Motus (1809) model. R. A. Fisher and his associates ... were rediscovering many of the mathematical results of least squares (or error) theory, apparently agreeing with Pearson that this theory held little interest to the statistician. (Seal, 1967, p. 2)
There might be other more mundane, nonmathematical reasons. Galton and others were strongly opposed to the use of the word error in describing the variability in human characteristics, and the many treatises on the theory might thus have been avoided by the new social scientists, who were, in the main, not mathematicians. In his 1920 paper on the history of correlation, Pearson is clearly most anxious to downplay any suggestion that Gaussian theory contributed to its development. He writes of the "innumerable treatises" (p. 27) on least squares, of the lengthy analysis, of his opinion that Gauss and Bravais "contributed nothing of real importance to the problem of correlation" (p. 82), and of his view that it is not clear that a least squares generalization "indicates the real line of future advance" (p. 45). The generalization had been introduced by Yule, whom Pearson and his Gower Street colleagues clearly saw as the enemy. Pearson regarded himself as the father of correlation and regression insofar as the mathematics were concerned. Galton and Weldon were, of course, recognized as important figures, but they were not mathematicians and posed no threat to Pearson's authority. In other respects, Pearson was driven to try to show that his contributions were supreme and independent. The historical record has been traced by Seal (1967), from the fundamental work of Legendre and Gauss at the beginning of the 19th century to Fisher over 100 years later.
THE DESIGN OF EXPERIMENTS
A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. (Fisher, 1935/1966, 8th ed., p. 11)
With these words Fisher introduces the example that illustrated his view of the principles of experimentation. Holschuh (1980) describes it as "the somewhat artificial 'lady tasting tea' experiment" (p. 35), and indeed it is, but perhaps an American writer does not appreciate the fervor of the discussion on the best method of preparing cups of tea that still occupies the British! Fisher Box (1978) reports that an informal experiment was carried out at Rothamsted. A colleague, Dr B. Muriel Bristol, declined a cup of tea from Fisher on the grounds that she preferred one to which milk had first been added. Her insistence that the order in which milk and tea were poured into the cup made a difference led to a lighthearted test actually being carried out. Fisher examines the design of such an experiment. Eight cups of tea are prepared. Four of them have tea added first and four milk. The subject is told that this has been done and the cups of tea are presented in a random order. The task is, of course, to divide the set of eight into two sets of four according to the method of preparation. Because there are 70 ways of choosing a set of 4 objects from 8:
A subject without any faculty of discrimination would in fact divide the 8 cups correctly into two sets of 4 in one trial out of 70, or, more properly, with a frequency which would approach 1 in 70 more and more nearly the more often the test were repeated. . . . The odds could be made much higher by enlarging the experiment, while if the experiment were much smaller even the greatest possible success would give odds so low that the result might, with considerable probability, be ascribed to chance. (Fisher, 1935/1966, 8th ed., pp. 12-13)
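The counting behind the 1-in-70 figure is easy to check. The sketch below, an illustration rather than anything Fisher presented, computes the number of ways of choosing 4 cups from 8 and the chance of each possible number of correct identifications under pure guessing (a hypergeometric count). The function names are assumptions made for the example.

```python
from fractions import Fraction
from math import comb


def chance_of_perfect_sort(n_cups):
    """Probability of dividing the cups perfectly by guessing, half being milk-first."""
    if n_cups % 2:
        raise ValueError("the design assumes an even number of cups")
    return Fraction(1, comb(n_cups, n_cups // 2))


def prob_correct(k, n_each=4):
    """Probability that exactly k of the n_each milk-first cups are picked by guessing."""
    ways = comb(n_each, k) * comb(n_each, n_each - k)
    return Fraction(ways, comb(2 * n_each, n_each))


# How the best attainable result changes with the size of the experiment.
for cups in (4, 6, 8, 12):
    p = chance_of_perfect_sort(cups)
    print(f"{cups} cups: perfect sort by chance = {p} ({float(p):.4f})")

# Full chance distribution for the eight-cup design.
for k in range(5):
    print(f"{k} milk-first cups correct: {prob_correct(k)}")
```

With six cups the best possible result has probability 1/20, exactly 5 per cent, which is one way of seeing why a much smaller experiment could not give convincing odds.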
Fisher goes on to say that it is "usual and convenient to take 5 per cent, as a standard level of significance," (p. 13) and so an event that would occur by chance once in 70 trials is decidedly significant. The crucial point for Fisher is the act of randomization:
Apart, therefore, from the avoidable error of the experimenter himself introducing with his test treatments, or subsequently, other differences in treatment, the effects of which the experiment is not intended to study, it may be said that the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged. (Fisher, 1935/1966, 8th ed., p. 21)
This is indeed the crucial requirement. Experimental design when variable measurements are being made, and statistical methods are to be used to tease out the information from the error, demands randomization. But there are ironies here in this, Fisher's elegant account of the lady tasting tea. It has been hailed as the model for the statistical inferential approach:
It demands of the reader the ability to follow a closely reasoned argument, but it will repay the effort by giving a vivid understanding of the richness, complexity and subtlety of modern experimental method. (Newman, 1956, Vol. 3, p. 1458)
In fact, it uses a situation and a method that Fisher repudiated elsewhere. More discussion of Fisher's objections to the Neyman-Pearson approach is given later. For the moment, it might be noted that Fisher's misgivings center on the equation of hypothesis testing with industrial quality control acceptance procedures where the population being sampled has an objective reality, and that population is repeatedly sampled. However, the tea-tasting example appears to follow this model! Kempthorne (1983) highlighted problems and indicated the difficulties that so many have had in understanding Fisher's pronouncements.
In his book Statistical Methods and Scientific Inference, first published in 1956, Fisher devotes a whole chapter to "Some Misapprehensions about Tests of Significance." Here he castigates the notion that "'the level of significance' should be determined by 'repeated sampling from the same population', evidently with no clear realization that the population in question is hypothetical" (Fisher, 1956/1973, 3rd ed., pp. 81-82). He determines to illustrate "the more general effects of the confusion between the level of significance appropriately assigned to a specific test, with the frequency of occurrence of a specified type of decision" (Fisher, 1956/1973, 3rd ed., p. 82). He states, "In fact, as a matter of principle, the infrequency with which, in particular circumstances, decisive evidence is obtained, should not be confused with the force, or cogency of such evidence" (Fisher, 1956/1973, 3rd ed., p. 96).
Kempthorne (1983), whose perceptions of both Fisher's genius and inconsistencies are as cogent and illuminating as one would find anywhere, wonders if this book's lack of recognition of randomization arose because of Fisher's belated, but of course not admitted, recognition that it did not mesh with "fiduciating." Kempthorne quotes a "curious" statement of Fisher's and comments, "Well, well!"
A slightly expanded version is given here:
Whereas in the "Theory of Games" a deliberately randomized decision (1934) may often be useful to give an unpredictable element to the strategy of play; and whereas planned randomization (1935-1966) is widely recognized as essential in the selection and allocation of experimental material, it has no useful part to play in the formation of opinion, and consequently in tests of significance designed to aid the formation of opinion in the Natural Sciences. (Fisher, 1956/1973, 3rd ed., p. 102)
Kendall (1963) wishes that Fisher had never written the book, saying, "If we had to sacrifice any of his writings, [this book] would have a strong claim to priority" (p. 6). However, he did write the book, and he used it to attack his opponents. In marshaling his arguments, he introduced inconsistencies of both logic and method that have led to confusion in lesser mortals. Karl Pearson and the biometricians used exactly the same tactics. In chapter 15 the view is presented that it is this rather sorry state of affairs that has led to the historical development of statistical procedures, as they are used in psychology and the behavioral sciences, being ignored by the texts that made them available to a wider, and undoubtedly eager, audience.
13 Assessing Differences and Having Confidence
FISHERIAN STATISTICS
Any assessment of the impact of Fisher's arrival on the statistical battlefield has to recognize that his forces did not really seek to destroy totally Pearson's work or its raison d'être. The controversy between Yule and Pearson, discussed earlier, had a philosophical, not to say ideological, basis. If, at the height of the conflict, one or other side had "won," then it is likely that the techniques advocated by the vanquished would have been discarded and forgotten. Fisher's war was more territorial. The empire of observation and correlation had to be taken over by the manipulations of experimenters. Although he would never have openly admitted it - indeed, he continued to attack Pearson and his works to the very end of his life (which came 26 years after the end of Pearson's) - the paradigms and procedures he developed did indeed incorporate and improve on the techniques developed at Gower Street. The chi-square controversy was not a dispute about the utility of the test or its essential rationale, but a bitter disagreement over the efficiency and method of its application. For a number of reasons, which have been discussed, Fisher's views prevailed. He was right.
In the late 1920s and 1930s Fisher was at the height of his powers and vigorously forging ahead. Pearson, although still a man to be reckoned with, was nearly 30 years away from his best work, an old man facing retirement, rather isolated as he attacked all those who were not unquestioningly for him. Last, but by no means least, Fisher was the better mathematician. He had an intuitive flair that brought him to solutions of ingenuity and strength. At the same time, he was able to demonstrate to the community of biological and behavioral scientists, a community that so desperately needed a coherent system of data management and assessment, that his approach had enormous practical utility.
Pearson's work may be characterized as large sample and correlational, Fisher's as small sample and experimental. Fisher's contribution easily absorbs the best of Pearson and expands on the seminal work of "Student." Assessing the import and significance of the variation in observations across groups subject to different experimental treatments is the essence of analysis of variance, and a haphazard glance at any research journal in the field of experimental psychology attests to its impact.
THE ANALYSIS OF VARIANCE
The fundamental ideas of analysis of variance appeared in the paper that examined correlation among Mendelian factors (Fisher, 1918). At this time, eugenic research was occupying Fisher's attention. Between 1915 and 1920, he published half a dozen papers that dealt with matters relevant to this interest, an interest that continued throughout his life. The 1918 paper uses the term variance for $(\sigma_1^2 + \sigma_2^2)$, $\sigma_1$ and $\sigma_2$ representing two independent causes of variability, and referred to the normally distributed population.
We may now ascribe to the constituent causes fractions or percentages of the total variance which they together produce. It is desirable on the one hand that the elementary ideas at the basis of the calculus of correlations should be clearly understood, and easily expressed in ordinary language, and on the other that loose phrases about the "percentage of causation," which obscure the essential distinction between the individual and the population, should be carefully avoided. (Fisher, 1918, pp. 399-400)
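The additive property that makes this apportionment possible can be checked with exact arithmetic. In the sketch below, which uses invented values and is offered only as an illustration, two independent "causes" are combined by pairing every value of one with every value of the other; the variance of the totals equals the sum of the two variances, so each cause can be assigned a fraction of the total.

```python
from fractions import Fraction
from itertools import product


def population_variance(values):
    """Mean squared deviation from the mean, in exact rational arithmetic."""
    values = [Fraction(v) for v in values]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical deviations produced by two independent causes (invented numbers).
cause_1 = [-2, 0, 2]
cause_2 = [-1, 1]

# Independence: every value of cause 1 is paired equally often with every value of cause 2.
totals = [a + b for a, b in product(cause_1, cause_2)]

var_1 = population_variance(cause_1)
var_2 = population_variance(cause_2)
var_total = population_variance(totals)

# Fractions of the total variance ascribable to each cause.
share_1 = var_1 / var_total
share_2 = var_2 / var_total

print("variance of cause 1:", var_1)
print("variance of cause 2:", var_2)
print("variance of totals: ", var_total)
print("shares:", share_1, share_2)
```

The same bookkeeping does not work for standard deviations, which is why the squared quantity is the natural unit for apportioning variability.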
Here we see Fisher already moving away from Pearsonian correlational methods as such and appealing to the Gaussian additive model. Unlike Pearson's work, it cannot be said that the particular philosophy of eugenics directly governed Fisher's approach to new statistical techniques, but it is clear that Fisher always promoted the value of the methods in genetics research (see, e.g., Fisher, 1952). What the new techniques were to achieve was a recognition of the utility of statistics in agriculture, in industry, and in the biological and behavioral sciences, to an extent that could not possibly have been foreseen before Fisher came on the scene. The first published account of an experiment that used analysis of variance to assess the data was that of Fisher and MacKenzie (1923) on The Manurial Response of Different Potato Varieties.
Two aspects of this paper are of historical interest. At that time Fisher did not fully understand the rules of the analysis of variance - his analysis is wrong - nor the role of randomization. Secondly, although the analysis of variance is closely tied to additive models, Fisher rejects the additive model in his first analysis of variance, proceeding to a multiplicative model as more reasonable. (Cochrane, 1980, p. 17)
Cochrane points out that randomization was not used in the layout and that an attempt to minimize error used an arrangement that placed different treatments near one another. The conditions could not provide an unbiased estimate of error. Fisher then proceeds to an analysis based on a multiplicative model:
Rather surprisingly, practically all of Fisher's later work on the analysis of variance uses the additive model. Later papers give no indication as to why the product model was dropped. Perhaps Fisher found, as I did, that the additive model is a good approximation unless main effects are large, as well as being simpler to handle than the product model. (Cochrane, 1980, p. 21)
Fisher's derivation of the procedure of analysis of variance and his understanding of the importance of randomization in the planning of experiments are fully discussed in Statistical Methods for Research Workers (1925/1970), first published in 1925. This work is now examined in more detail. Over 45 years and 14 editions, the general character of the book did not change. The 14th edition was published in 1970, using notes left by Fisher at the time of his death. Expansions, deletions, and elaborations are evident over the years. Notable are Fisher's increasing recognition of the work of others and greater attention to the historical account. Fisher's concentration on his row with the biometricians as time went by is also evident. The preface to the last edition follows earlier ones in stating that the book was a product of the research needs of Rothamsted. Further:
It was clear that the traditional machinery inculcated by the biometrical school was wholly unsuited to the needs of practical research. The futile elaboration of innumerable measures of correlation, and the evasion of the real difficulties of sampling problems under cover of a contempt for small samples, were obviously beginning to make its pretensions ridiculous. (Fisher, 1970, 14th ed., p. v)
The opening sentence of the chapter on correlation in the first edition reads:
No quantity is more characteristic of modern statistical work than the correlation coefficient, and no method has been applied successfully to such various data as the method of correlation. (Fisher, 1925, 1st ed., p. 129)
and in the 14th edition:
No quantity has been more characteristic of biometrical work than the correlation coefficient, and no method has been applied to such various data as the method of correlation. (Fisher, 1970, 14th ed., p. 177)
This not-so-subtle change is reflected in the divisions in psychology that are still evident. The two disciplines discussed by Cronbach in 1957 (see chapter 2) are those of the correlational and experimental psychologists.
In his opening chapter, Fisher sets out the scope and definition of statistics. He notes that they are essential to social studies and that it is because the methods are used there that "these studies may be raised to the rank of sciences" (p. 2). The concepts of populations and parameters, of variation and frequency distributions, of probability and likelihood, and of the characteristics of efficient statistics are outlined very clearly. A short chapter on diagrams ought to be required reading, for it points up how useful diagrams can be in the appraisal of data.1
1 Scatterplots, that so quickly identify the presence of "outliers," are critical in correlational analyses. The geometrical exploration of the fundamentals of variance analysis provides insights which cannot be matched (see, e.g., Kempthorne, 1976).
The chapter on distributions deals with the normal, Poisson, and binomial distributions. Of interest is the introduction of the formula:
$$s^2 = \frac{1}{n - 1}\, S(x - \bar{x})^2$$
(Fisher uses S for summation) for variance, noting that $s$ is the best estimate of $\sigma$.
Chapter 4 deals with tests of goodness-of-fit, independence, and homogeneity, giving a complete description of the application of the $\chi^2$ tests, including Yates' correction for discontinuity and the procedure for what is now known as the Fisher Exact Test. Chapter 5 is on tests of significance, about which more is said later. Chapter 6 manages to discuss, quite thoroughly, the techniques of interclass correlation without mentioning Pearson by name except to acknowledge that the data of Table 31 are Pearson and Lee's. Failure to acknowledge the work of others, which was a characteristic of both Pearson and Fisher, and which, to some extent, arose out of both spite and arrogance, at least partly explains the anonymous presentation of statistical techniques that is to be found in the modern textbooks and commentaries.
And then chapter 7, two-thirds of the way through the book, introduces that most important and influential of methods - analysis of variance. Fisher describes analysis of variance as "the separation of the variance ascribable to one group of causes from the variance ascribable to other groups" (Fisher, 1925/1970, 14th ed., p. 213), but he examines the development of the technique from a consideration of the intraclass correlation. His example is clear and worth describing. Measurements from n' pairs of brothers may be treated in two ways in a correlational analysis. The brothers may be divided into two classes, say, the elder brother and the younger, and the usual interclass correlation on some measured variable may be calculated. When, on the other hand, the separation of the brothers into two classes is either irrelevant or impossible, then a common mean and standard deviation and an intraclass correlation may be computed. Given pairs of measurements, $x_1, x'_1;\ x_2, x'_2;\ x_3, x'_3;\ \ldots;\ x_{n'}, x'_{n'}$, the following statistics may be computed:
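With $\Sigma$ for summation and $r_i$ for the intraclass coefficient, the common-mean statistics for pairs are usually written $\bar{x} = \Sigma(x + x')/2n'$, $s^2 = [\Sigma(x - \bar{x})^2 + \Sigma(x' - \bar{x})^2]/2n'$, and $r_i = \Sigma(x - \bar{x})(x' - \bar{x})/n's^2$. The sketch below is an illustration under that assumption, with invented measurements and hypothetical function names. It also builds the symmetrical table by entering each pair twice and confirms that the ordinary product-moment correlation of the doubled entries gives the same value.

```python
import math


def intraclass_r(pairs):
    """Intraclass correlation for pairs, using one common mean and one common variance."""
    n = len(pairs)
    mean = sum(x + y for x, y in pairs) / (2 * n)
    var = sum((x - mean) ** 2 + (y - mean) ** 2 for x, y in pairs) / (2 * n)
    cross = sum((x - mean) * (y - mean) for x, y in pairs)
    return cross / (n * var)


def pearson_r(xs, ys):
    """Ordinary product-moment (interclass) correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def symmetrical_table_r(pairs):
    """Enter every pair twice, as (x, x') and (x', x), then correlate in the usual way."""
    doubled = list(pairs) + [(y, x) for x, y in pairs]
    xs, ys = zip(*doubled)
    return pearson_r(xs, ys)


# Hypothetical measurements (say, heights in inches) for five pairs of brothers.
brothers = [(71.0, 69.0), (68.0, 70.0), (66.0, 65.0), (73.0, 72.0), (67.0, 69.0)]

print("intraclass r from common mean:   ", round(intraclass_r(brothers), 6))
print("r from doubled symmetrical table:", round(symmetrical_table_r(brothers), 6))
```

Because the result does not depend on which brother is listed first, the method suits cases where the division into two classes is irrelevant or impossible.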
In the preceding equations, Fisher's S has been replaced with $\Sigma$ and $r_i$ used to designate the intraclass correlation coefficient. The computation of $r_i$ is very tedious, as the number of classes $k$ and the number of observations in each class increase. Each pair of observations has to be considered twice, $(x_1, x'_1)$ and $(x'_1, x_1)$ for example. A set of $k$ values gives $k(k - 1)$ entries in a symmetrical table. "To obviate this difficulty Harris [1913] introduced an abbreviated method of calculation by which the value of the correlation given by the symmetrical table may be obtained directly from two distributions" (Fisher, 1925/1970, 14th ed., p. 216). In fact:
Fisher goes on to discuss the sampling errors of the intraclass correlation and refers them to his z distribution. Figure 13.1 shows the effect of the transformation of r to z.
Curves of very unequal variance are replaced by curves of equal variance, skew curves by approximately normal curves, curves of dissimilar form by curves of similar form. (Fisher, 1925/1970, 14th ed., p. 218)
The transformation is given by:
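Assuming the familiar form of Fisher's transformation for correlations between pairs, $z = \frac{1}{2}\log_e[(1 + r)/(1 - r)]$, the conversion in each direction can be sketched as follows. The equation is supplied here as an assumption, and the function names are hypothetical.

```python
import math


def r_to_z(r):
    """Fisher's z for a correlation r, assuming z = 0.5 * ln((1 + r) / (1 - r))."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def z_to_r(z):
    """Inverse transformation; algebraically this is the hyperbolic tangent of z."""
    return math.tanh(z)


# Equal steps in r become increasingly large steps in z as r approaches 1,
# which is how the transformation stretches out the compressed upper range.
for r in (0.0, 0.3, 0.6, 0.9, 0.99):
    print(f"r = {r:4.2f}  ->  z = {r_to_z(r):.4f}")
```

The round trip `z_to_r(r_to_z(r))` recovers r, so results worked out on the z scale can be reported as correlations.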
FIG. 13.1 r to z transformation (from Fisher, Statistical Methods for Research Workers).
Fisher provides tables of the r to z transformation. After giving an example of the use of the table and finding the significance of the intraclass correlation, conclusions may be drawn. Because the symmetrical table does not give the best estimate of the correlation, a negative bias is introduced into the value for z and Fisher shows how this may be corrected. Fisher then shows that intraclass correlation is an example of the analysis of variance: "A very great simplification is introduced into questions involving intraclass correlation when we recognise that in such cases the correlation merely measures the relative importance of two groups of variation" (Fisher, 1925/1970, 14th ed., p. 223). Figure 13.2 is Fisher's general summary table, showing, "in the last column, the interpretation put upon each expression in the calculation of an intraclass correlation from a symmetrical table" (p. 225).
FIG. 13.2 ANOVA summary table 1 (from Fisher, Statistical Methods for Research Workers).
A quantity made up of two independently and normally distributed parts with variances A and B respectively has a total variance of (A + B). A sample of n' values is taken from the first part and different samples of k values from the second part added to them. Fisher notes that in the population from which the values are drawn, the correlation between pairs of members of the same family is:
$$\rho = \frac{A}{A + B}$$
and the values of A and B may be estimated from the set of kn' observations. The summary table is then presented again (Fig. 13.3) and Fisher points out that "the ratio between the sums of squares is altered in the ratio n' : (n' - 1), which precisely eliminates the negative bias observed in z derived by the previous method" (Fisher, 1925/1970, 14th ed., p. 227).
FIG. 13.3 ANOVA summary table 2 (from Fisher, Statistical Methods for Research Workers).
The general class of significance tests applied here is that of testing whether an estimate of variance derived from $n_1$ degrees of freedom is significantly greater than a second estimate derived from $n_2$ degrees of freedom. The significance may be assessed without calculating r. The value of z may be calculated as 1/2 log_e {(n' - 1)(kA + B) - n'(k - 1)B}. Fisher provides tables of the z distribution for the 5% and 1% points. In the later editions of the book he notes that these values were calculated from the corresponding values of the variance ratio, $e^{2z}$, and refers to tables of these values prepared by Mahalanobis, using the symbol x in 1932, and Snedecor, using the symbol F in 1934. In fact:
$$z = \tfrac{1}{2}\log_e F$$
"The wide use in the United States of Snedecor's symbol has led to the distribution being often referred to as the distribution of F" (Fisher, 1925/1970, 14th ed., p. 229). Fisher ends the chapter by giving a number of examples of the use of the method.
It should be mentioned here that the detailed history of the relationship of ANOVA to intraclass correlation is a neglected topic in almost all discussions of the procedure. A very useful reference is Haggard (1958).
The final two chapters of the book discuss further applications of analysis of variance and statistical estimation. Of most interest here is Fisher's demonstration of the way in which the technique can be used to test the linear model and the "straightness" of the regression line. The method is the link between least squares and regression analysis. Also of importance is Fisher's discussion of Latin square designs and the analysis of covariance in improving the efficiency and precision of experiments.
Fisher Box notes that the book did not receive a single good review. An example, which reflected the opinions of many, was the reviewer for the British Medical Journal:
If he feared that he was likely to fall between two stools, to produce a book neither full enough to satisfy those interested in its statistical algebra nor sufficiently simple to please those who dislike algebra, we think Mr. Fisher's fears are justified by the result. (Anonymous, 1926, p. 815)
Yates (1951) comments on these early reviews, noting that many of them expressed dismay at the lack of formal mathematical proofs and interpreted the work as though it was only of interest to those who were involved in small sample work. Whatever its reception by the reviewers, by 1950, when the book was in its 11th edition, about 20,000 copies had been sold. But it is fair to say that something like 10 years went by from the date of the original publication before Fisher's methods really started to have an effect on the behavioral sciences. Lovie (1979) traces its impact over the years 1934 to 1945. He mentions, as do others, that the early textbook writers contributed to its acceptance. Notable here are the works of Snedecor, published in 1934 and 1937. Lush (1972) quotes a European researcher who told him, "When you see Snedecor again, tell him that over here we say, 'Thank God for Snedecor; now we can understand Fisher' " (Lush, 1972, p. 225). Lindquist published his Statistical Analysis in Educational Research in 1940, and this book, too, was widely used. Even then, some authorities were skeptical, to say the least. Charles C. Peters in an editorial for the Journal of Educational Research rather condescendingly agrees that Fisher's statistics are "suitable enough" for agricultural research:
And occasionally these techniques will be useful for rough preliminary exploratory research in other fields, including psychology and education. But if educationists and psychologists, out of some sort of inferiority complex, grab indiscriminately at them and employ them where they are unsuitable, education and psychology will suffer another slump in prestige such as they have often hitherto suffered in consequence of the pursuit of fads. (Peters, 1943, p. 549)
That Peters' conclusion partly reflects the situation at that time, but somewhat misses the mark, is best evidenced by Lovie's (1979) survey, which we look at in the next chapter. Fifty or sixty years on, sophisticated designs and complex analyses are common in the literature but misapprehensions and misgivings are still to be found there. The recipes are enthusiastically applied but their structure is not always appreciated.
If the early workers relied on writers like Snedecor to help them with Fisherian applications, later ones are indebted to workers like Eisenhart (1947) for assistance with the fundamentals of the method. Eisenhart sets out clearly the importance of the assumptions of analysis of variance, their critical function in inference, and the relative consequences of their not being fulfilled. Eisenhart's significant contribution has been picked up and elaborated on by subsequent writers, but his account cannot be bettered. He delineates the two fundamentally distinct classes of analysis of variance - what are now known as the fixed and random effects models. The first of these is the most familiar to researchers in psychology. Here the task is to determine the significance of differences among treatment means: "Tests of significance employed in connection with problems of this class are simply extensions to small samples of the theory of least squares developed by Gauss and others - the extension of the theory to small samples being due principally to R. A. Fisher" (Eisenhart, 1947, pp. 3-4). The second class Eisenhart describes as the true analysis of variance. Here the problem is one of estimating, and inferring the existence of, the components of variance, "ascribable to random deviation of the characteristics of individuals of a particular generic type from the mean value of these characteristics in the 'population' of all individuals of that generic type" (Eisenhart, 1947, p. 4). The failure of the then current literature to adequately distinguish between the two methods is because the emphasis had been on tests of significance rather than on problems of estimation. But it would seem that, despite the best efforts of the writers of the most insightful of the now current texts (e.g., Hays, 1963, 1973), the distinction is still not fully applied in contemporary research. In other words, the emphasis is clearly on the assessment of differences among pairs of treatment means rather than on the relative and absolute size of variances.
Eisenhart's discussion of the assumptions of the techniques is a model for later writers. Random variation, additivity, normality of distribution, homogeneity of variance, and zero covariance among the variables are discussed in detail and their relative importance examined. This work can be regarded as a classic of its kind.
MULTIPLE COMPARISON PROCEDURES
In 1972, Maurice Kendall commented on how regrettable it was that during the 1940s mathematics had begun to "spoil" statistics. Nowhere is the shift in emphasis from practice, with its room for intuition and pragmatism, to theory and abstraction more evident than in the area of multiple comparison procedures. The rules for making such comparisons have been discussed ad nauseam, and they continue to be discussed. Among the more complete and illuminating accounts are those of Ryan (1959) and Petrinovich and Hardyck (1969). Davis and Gaito (1984) provide a very useful discussion of some of the historical background. In commenting on Tukey's (1949) intention to replace the intuitive approach (championed by "Student") with some hard, cold facts, and to provide simple and definite procedures for researchers, they say:
[This is] symptomatic of the transition in philosophy and orientation from the early use of statistics as a practical and rigorous aid for interpreting research results, to a highly theoretical subject predicated on the assumption that mathematical reasoning was paramount in statistical work. (Davis & Gaito, 1984, p. 5)
It is also the case that the automatic invoking, from the statistical packages, of any one of half a dozen procedures following an F test has helped to promote the emphasis on the comparison of treatment means in psychological research.
No one would argue with the underlying rationale of multiple comparison procedures. Given that the task is to compare treatment means, it is evident that to carry out multiple t tests from scratch is inappropriate. Over the long run it is apparent that as larger numbers of comparisons are involved, using a procedure that assumes that the comparisons are based on independent paired data sets will increase the Type I error rate considerably when all possible comparisons in a given set of means are made. Put simply, the number of false positives will increase. As Davis and Gaito (1984) point out, with $\alpha$ set at the .05 level and $H_0$ true, comparisons, using the t test, among 10 treatment means would, in the long run, lead to the difference between the largest and the smallest of them being reported as significant some 60% of the time. One of the problems here is that the range increases faster than the standard deviation as the size of the sample increases. The earliest attempts to devise methods to counteract this effect come from Tippett (1925) and "Student" (1927), and later workers referred to the "studentizing" of the range, using tables of the sampling distribution of the range/standard deviation ratio, known as the q statistic. Newman (1939) published a procedure that uses this statistic to assess the significance of multiple comparisons among treatment means. In general, the earlier writers followed Fisher, who advocated performing t tests following an analysis that produced an overall z that rejected the null hypothesis, the variance estimate being provided by the error mean square and its associated degrees of freedom.
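The long-run figure quoted from Davis and Gaito can be checked roughly with a small simulation. The sketch below is illustrative only and is not taken from any of the sources cited: the group size of 20, the number of simulated experiments, and the rounded critical value of 1.97 (about right for a two-tailed .05 test on 190 degrees of freedom) are all assumptions made for the example.

```python
import random

def largest_vs_smallest_rate(k=10, n=20, trials=2000, t_crit=1.97, seed=1):
    """Share of null experiments in which the largest and smallest of k
    treatment means differ 'significantly' by an ordinary t test."""
    rng = random.Random(seed)
    hits = 0
    for _trial in range(trials):
        means, variances = [], []
        for _group in range(k):
            # H0 is true: every group is drawn from the same normal population
            xs = [rng.gauss(0.0, 1.0) for _obs in range(n)]
            m = sum(xs) / n
            means.append(m)
            variances.append(sum((x - m) ** 2 for x in xs) / (n - 1))
        mse = sum(variances) / k          # error mean square (equal group sizes)
        se_diff = (2.0 * mse / n) ** 0.5  # standard error of a difference of two means
        t = (max(means) - min(means)) / se_diff
        if t > t_crit:                    # assumed approximate two-tailed .05 value
            hits += 1
    return hits / trials

rate = largest_vs_smallest_rate()
print(f"largest vs. smallest judged significant in {rate:.0%} of null experiments")
```

With only two groups the same function returns a rate close to the nominal 5%, which is the contrast the passage is drawing: the inflation comes from selecting the extreme pair out of many means.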
Fisher's only cautionary note comes in a discussion of the procedure to be adopted when the z test fails to reject the null hypothesis:
Much caution should be used before claiming significance for special comparisons. Comparisons, which the experiment was designed to make, may, of course, be made without hesitation. It is comparisons suggested subsequently, by a scrutiny of the results themselves, that are open to suspicion; for if the variants are numerous, a comparison of the highest with the lowest observed value, picked out from the results, will often appear to be significant, even from undifferentiated material. (Fisher, 1935/1966, 8th ed., p. 59)
Fisher is here giving his blessing to planned comparisons but does not mention that these comparisons should, strictly speaking, be orthogonal or independent. What he does say is that unforeseen effects may be taken as guides to future investigations. Davis and Gaito (1984) are at pains to point out that Fisher's approach was that of the practical researcher and to contrast it with the later emphasis on fundamental logic and mathematics.
Oddly enough, a number of procedures, springing from somewhat different rationales, all appeared on the scene at about the same time. Among the best known are those of Duncan (1951, 1955), Keuls (1952), Scheffé (1953), and Tukey (1949, 1953). Ryan (1959) examines the issues. After contending that, fundamentally, the same considerations apply to both a posteriori (sometimes called post hoc) and a priori (sometimes called planned) comparisons, and drawing an analogy with debates over one-tail and two-tail tests (to be discussed briefly later), Ryan defines the problem as the control of error rate. Per comparison error rates refer to the probability that a given comparison will be wrongly judged significant. Per experiment error rates refer not to probabilities as such, but to the frequency of incorrect rejections of the null hypothesis in an experiment, over the long run of such experiments. Finally, the so-called experimentwise error rate is a probability, the probability that any one particular experiment has at least one incorrect conclusion. The various techniques that were developed have all largely concentrated on reducing, or eliminating, the effect of the latter. The exception seems to be Duncan, who attempted to introduce a test based on the error rate per independent comparison. Ryan suggests that this special procedure seems unnecessary and Scheffé (1959), a brilliant mathematical statistician, is unable to understand its justification. Debates will continue and, meanwhile, the packages provide us with all the methods for a keypress or two.
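Ryan's three error rates can be made concrete with a few lines of arithmetic. The sketch below is a simplified illustration, not Ryan's own formulation: the function names are hypothetical, and the experimentwise formula assumes that the m comparisons are independent, which pairwise comparisons among the same set of means are not.

```python
def per_comparison_rate(alpha):
    """Probability that a single given comparison is wrongly judged significant."""
    return alpha

def per_experiment_rate(alpha, m):
    """Expected number of false rejections per experiment with m comparisons
    (a long-run frequency, not a probability; it can exceed 1)."""
    return m * alpha

def experimentwise_rate(alpha, m):
    """Probability of at least one false rejection in an experiment,
    ASSUMING the m comparisons are independent."""
    return 1.0 - (1.0 - alpha) ** m

# 45 is the number of pairwise comparisons among 10 treatment means.
for m in (1, 3, 10, 45):
    print(m,
          per_comparison_rate(0.05),
          round(per_experiment_rate(0.05, m), 2),
          round(experimentwise_rate(0.05, m), 3))
```

The per comparison rate stays at .05 however many comparisons are made, whereas the other two grow with m, which is why procedures that control the experimentwise rate demand larger differences as the number of means increases.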
For the purposes of the present discussion, the examination of multiple comparison procedures provides a case history for the state of contemporary statistics. First, it is an example, to match all examples, of the quest for rules for decision making and statistical inference that lie outside the structure and concept of the experiment itself. Ryan argues "that comparisons decided upon a priori from some psychological theory should not affect the nature of the significance tests employed for multiple comparisons" (Ryan, 1959, p. 33). Fisher believed that research should be theory driven and that its results were always open to revision. "Multiple comparison procedures" could easily replace, with only some slight modification, the subject of ANOVA in the following quotation:
The quick initial success of ANOVA in psychology can be attributed to the unattractiveness of the then available methods of analysing large experiments, combined with the appeal of Fisher's work which seemed to match, with a remarkable degree of exactness, the intellectual ethos of experimental psychology of the period, with its atheoretical and situationalist nature and its wish for more extensive experiments. (Lovie, 1979, p. 175)
Fisher would have deplored this; indeed, he did deplore it. Second, it reflects the emphasis, in today's work, on the avoidance of the Type I error. Fisher would have had mixed feelings about this. On the one hand, he rejected the notion of "errors of the second kind" (to be discussed in the next chapter); only rejection or acceptance of the null hypothesis enters into his scheme of things. On the other, he would have been dismayed - indeed, he was dismayed - by a concentration on automatic acceptance or rejection of the null hypothesis as the final arbiter in assessing the outcome of an experiment. Criticizing the ideologies of both Russia and the United States, where he felt such technological approaches were evident, he says:
How far, within such a system [Russia], personal and individual inferences from observed facts are permissible we do not know, but it may perhaps be safer ... to conceal rather than to advertise the selfish and perhaps heretical aim of understanding for oneself the scientific situation. In the U.S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions with those aimed rather at, let us say, speeding production, or saving money. (Fisher, 1955, p. 70)
Third, the multiple comparison procedure debate reflects, as has been noted, the increasingly mathematical approach to applied statistics. In the United States, a Statistical Computing Center at the State College of Agriculture at Ames, Iowa (now Iowa State University), became the first center of its kind. It was headed by George W. Snedecor, a mathematician, who suggested that Fisher be invited to lecture there during the summer session of 1931. Lush (1972) reports that academic policy at that institution was such that graduate courses in statistics were administered by the Department of Mathematics. At Berkeley, the mathematics department headed by Griffith C. Evans, who went there in 1934, was to be instrumental in making that institution a world center in statistics. Fisher visited there in the late summer of 1936 but made a very poor personal impression. Jerzy Neyman was to join the department in 1939. And, of course, Annals of Mathematical Statistics was founded at the University of Michigan in 1930. On more than one occasion during those years at the end of the 1930s, Fisher contemplated moving to the United States, and one may wonder what his influence would have been on statistical developments had he become part of that milieu.
And, finally, the technique of multiple comparisons establishes, without a backward glance, a system of statistics that is based unequivocally on a long-run relative frequency definition of probability where subjective, a priori notions at best run in parallel with the planning of an experiment, for they certainly do not affect, in the statistical context, the real decisions.
CONFIDENCE INTERVALS AND SIGNIFICANCE TESTS
Inside the laboratory, at the researcher's terminal, as the outcome of the job reveals itself, and most certainly within the pages of the journals, success means statistical significance. A very great many reviews and commentaries, some of which have been brought together by Henkel and Morrison (1970), deplore the concentration on the Type I error rate, the $\alpha$ level, as it is known, that this implies. To a lesser extent, but gaining in strength, is the plea for an alternative approach to the reporting of statistical outcomes, namely, the examination of confidence intervals. And, to an even lesser extent, judging from the journals, is the challenge that the statistical outcomes made in the assessment of differences should be translated into the reporting of strengths of effects. Here the wheel is turning full circle, back to the appreciation of experimental results in terms of a correlational analysis.²
Fisher himself seemed to believe that the notion of statistical significance was more or less self-evident. Even in the last edition of Statistical Methods, the words null and hypothesis do not appear in the index, and significance and tests of significance, meaning of, have one entry, which refers to the following:
From a limited experience, for example, of individuals of a species, ... we may obtain some idea of the infinite hypothetical population from which our sample has been drawn, and so of the probable nature of future samples.... If a second sample belies this expectation we infer that it is, in the language of statistics, drawn from a second population; that the treatment ... did in fact make a material difference.... Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first. (Fisher, 1925/1970, 14th ed., p. 41)
A few pages later, Fisher does explain the use of the tail area of the probability interval and notes that the p = .05 level is the "convenient" limit for judging significance. He does this in the context of examples of how often deviations of a particular size occur in a given number of trials - that twice the standard deviation is exceeded about once in 22 trials, and so on. There is little wonder that researchers interpreted significance as an extension of the proportion of outcomes in a long-run repetitive process, an interpretation to which Fisher objected! In The Design of Experiments, he says:
In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher, 1935/1966, 8th ed., p. 14)
² Of interest here, however, is the increasing use, and the increasing power, of regression models in the analysis of data; see, for example, Fox (1984).
Here Fisher certainly seems to be advocating "rules of procedure," again a situation which elsewhere he condemns. Of more interest is the notion that experiments might be repeated to see if they fail to give significant results. This seems to be a very curious procedure, for surely experiments, if they are to be repeated, are repeated to find support for an assertion. The problems that these statements cause are based on the fact that the null hypothesis is a statement that is the negation of the effect that the experiment is trying to demonstrate, and that it is this hypothesis that is subjected to statistical test. The Neyman-Pearson approach (discussed in chapter 15) was an attempt to overcome these problems, but it was an approach that again Fisher condemned.
It appears that Fisher is responsible for the first formal statement of the .05 level as the criterion for judging significance, but the convention predates his work (Cowles & Davis, 1982a). Earlier statements about the improbability of statistical outcomes were made by Pearson in his 1900(a) paper, and "Student" (1908a) judged that three times the probable error in the normal curve would be considered significant. Wood and Stratton (1910) recommend "taking 30 to 1 as the lowest odds which can be accepted as giving practical certainty that a difference in a given direction is significant" (p. 433). Fisher Box mentions that Fisher took a course at Cambridge on the theory of errors from Stratton during the academic year 1912-1913. Odds of 30 to 1 represent a little more than three times the probable error (P.E.) referred to the normal probability curve. Because the probable error is equivalent to a little more than two-thirds of a standard deviation, three P.E.s is almost two standard deviations, and, of course, reference to any table of the "areas under the normal curve" shows that a z score of 1.96 cuts off 5% in the two tails of the distribution.
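The conversions between probable errors, standard deviations, and tail areas used in this argument are easy to verify. The following is a minimal sketch using the standard normal distribution from Python's standard library; the helper name two_tail_area is introduced here purely for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

def two_tail_area(deviation_in_sd):
    """Proportion of the normal curve lying beyond +/- the given deviation."""
    return 2.0 * (1.0 - z.cdf(deviation_in_sd))

# The probable error is the deviation that encloses the central half of the curve.
probable_error = z.inv_cdf(0.75)
three_pe = 3.0 * probable_error

print(f"1 P.E. = {probable_error:.4f} standard deviations")
print(f"3 P.E. = {three_pe:.4f} standard deviations")
print(f"area beyond 3 P.E. (two tails): {two_tail_area(three_pe):.4f}")
print(f"area beyond 1.96 SD (two tails): {two_tail_area(1.96):.4f}")
print(f"2 SD exceeded about once in {1.0 / two_tail_area(2.0):.0f} trials")
```

The output shows the probable error at a little over two-thirds of a standard deviation and three probable errors at just over two standard deviations, in line with the rounding described in the text.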
With some little allowance for rounding, the .05 probability level is seen to have enjoyed acceptance some time before Fisher's prescription.
A test of significance is a test of the probability of a statistical outcome under the hypothesis of chance. In post-Fisherian analyses the probability is that of making an error in rejecting the null hypothesis, the so-called Type I error. It is, however, not uncommon to read of the null hypothesis being rejected at the 5% 'level of confidence', an odd inversion that endows the act of rejection with a sort of statement of belief about the outcome. Interpretations of this kind reflect, unfortunately in the worst possible way, the notions of significance tests in the Fisherian sense, and confidence intervals introduced in the 1930s by Jerzy Neyman. Neyman (1941) states that the theory of confidence intervals was established to give frequency interpretations of problems of estimation. The classical frequency interpretation is best understood in the context of the long-run relative frequency of the outcomes in, say, the rolling of a die. Actual relative frequencies in a finite run are taken to be more or less equal to the probabilities, and in the "infinite" run are equal to the probabilities. In his 1937 paper, Neyman considers a system of random variables $x_1, x_2, \ldots, x_i, \ldots, x_n$ designated $E$, and a probability law $p(E \mid \theta_1, \theta_2, \ldots, \theta_l)$ where $\theta_1, \theta_2, \ldots, \theta_l$ are unknown parameters. The problem is to establish:
single-valued functions of the $x$'s $\underline{\theta}(E)$ and $\bar{\theta}(E)$ having the property that, whatever the values of the $\theta$'s, say $\theta'_1, \theta'_2, \ldots, \theta'_l$, the probability of $\underline{\theta}(E)$ falling short of $\theta'_1$ and at the same time of $\bar{\theta}(E)$ exceeding $\theta'_1$ is equal to a number $\alpha$ fixed in advance so that $0 < \alpha < 1$,
$$P\{\underline{\theta}(E) \le \theta'_1 \le \bar{\theta}(E) \mid \theta'_1, \theta'_2, \ldots, \theta'_l\} = \alpha.$$
It is essential to notice that in this problem the probability refers to the values of $\underline{\theta}(E)$ and $\bar{\theta}(E)$ which, being single-valued functions of the $x$'s are random variables. $\theta'_1$ being a constant, the left-hand side of [the above] does not represent the probability of $\theta'_1$ falling within some fixed limits. (Neyman, 1937, p. 379)
The values $\underline{\theta}(E)$ and $\bar{\theta}(E)$ represent the confidence limits for $\theta'_1$ and span the confidence interval for the confidence coefficient $\alpha$. Care must be taken here not to confuse this $\alpha$ with the symbol for the Type I error rate; in fact this $\alpha$ is 1 minus the Type I error rate. The last sentence in the quotation from Neyman just given is very important. First, an example of a statement of the confidence interval in more familiar terms is, perhaps, in order. Suppose that measurements on a particular fairly large random sample have produced a mean of 100 and that the standard error of this mean has been calculated to be 3. The routine method of establishing the upper and lower limits of the 90% confidence interval would be to compute 100 ± 1.65(3). What has been established? The textbooks will commonly say that the probability that the population mean, $\mu$, falls within this interval is 90%, which is precisely what Neyman says is not the case. For Neyman, the confidence limits represent the solution to the statistical problem of estimating $\theta_1$ independent of a priori probabilities. What Neyman is saying is that, over the long run, confidence intervals, calculated in this way, will contain the parameter 90% of the time. It is not just being pedantic to insist that, in Neyman's terms, to say that the one interval actually calculated contains the parameter 90% of the time is mistaken. Nevertheless, that is the way in which confidence intervals have sometimes come to be interpreted and used.
A number of writers have vigorously propounded the benefits of confidence intervals as opposed to significance testing:
Whenever possible, the basic statistical report should be in the form of a confidence interval. Briefly, a confidence interval is a subset of the alternative hypotheses computed from the experimental data in such a way that for a selected confidence level $\alpha$, the probability that the true hypothesis is included in a set so obtained is $\alpha$. Typically, an $\alpha$-level confidence interval consists of those hypotheses under which the p value for the experimental outcome is larger than $1 - \alpha$.... Confidence intervals are the closest we can at present come to quantitative assessment of hypothesis-probabilities ... and are currently our most effective way to eliminate hypotheses from practical consideration - if we choose to act as though none of the hypotheses not included in a 95% confidence interval are correct, we stand only a 5% chance of error. (Rozeboom, 1960, p. 426)
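Neyman's long-run reading of the interval 100 ± 1.65(3) can be illustrated by repeated sampling. The sketch below is an assumption-laden illustration rather than anything drawn from Neyman's paper: the population mean of 100, the population standard deviation of 30, and the sample size of 100 are chosen only so that the standard error of the mean comes out near 3, as in the worked example.

```python
import random

def interval(mean, standard_error, multiplier=1.65):
    """Limits of the routine 90% interval: mean +/- 1.65 standard errors."""
    half_width = multiplier * standard_error
    return mean - half_width, mean + half_width

def long_run_coverage(mu=100.0, sigma=30.0, n=100, trials=2000, seed=7):
    """Proportion of intervals, over many samples, that contain the fixed mu."""
    rng = random.Random(seed)
    hits = 0
    for _trial in range(trials):
        xs = [rng.gauss(mu, sigma) for _obs in range(n)]
        m = sum(xs) / n
        sd = (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5
        lower, upper = interval(m, sd / n ** 0.5)
        if lower <= mu <= upper:
            hits += 1
    return hits / trials

lower, upper = interval(100.0, 3.0)
print(f"one calculated interval: {lower:.2f} to {upper:.2f}")
print(f"long-run proportion of intervals containing mu: {long_run_coverage():.3f}")
```

The 90% describes the procedure over the whole run of samples. Any single interval, such as 95.05 to 104.95, either contains the fixed parameter or it does not, which is the distinction Neyman insists upon.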
Both Ronald Fisher and Jerzy Neyman would have been very unhappy with Rozeboom's advice! It does, however, reflect once again the way in which researchers in the psychological sciences prescribe and propound rules that they believe will lead to acceptance of the findings of research. Rozeboom's paper is a thoughtful attempt to provide alternatives to the routine null hypothesis significance test and deals with the important aspect of degree of belief in an outcome.
One final point on confidence interval theory: it is apparent that some early commentators (e.g., E. S. Pearson, 1939b; Welch, 1939) believed that Fisher's "fiducial theory" and Neyman's confidence interval theory were closely related. Neyman himself (1934) felt that his work was an extension of that of Fisher. Fisher objected strongly to the notion that there was anything at all confusing about fiducial distributions or probabilities and denied any relationship to the theory of confidence intervals, which, he maintained, was itself inconsistent. In 1941 Neyman attempted to show that there is no relationship between the two theories, and here he did not pull his punches:
The present author is inclined to think that the literature on the theory of fiducial argument was born out of ideas similar to those underlying the theory of confidence intervals. These ideas, however, seem to have been too vague to crystallize into a mathematical theory. Instead they resulted in misconceptions of "fiducial probability" and "fiducial distribution of a parameter" which seem to involve intrinsic inconsistencies.... In this light, the theory of fiducial inference is simply non-existent in the same sense as, for example, a theory of numbers defined by mutually contradictory definitions. (Neyman, 1941, p. 149)
To the confused onlooker, Neyman does seem to have been clarifying one aspect of Fisher's approach, and perhaps for a brief moment of time there was a hint of a rapprochement. Had it happened, there is reason to believe that unequivocal statements from these men would have been of overriding importance in subsequent applications of statistical techniques. In fact, their quarrels left the job to their interpreters. Debate and disagreement would, of course, have continued, but those who like to feel safe could have turned to the orthodoxy of the masters, a notion that is not without its attraction.
A NOTE ON "ONE-TAIL" AND "TWO-TAIL" TESTS
In the early 1950s, mainly in the pages of Psychological Bulletin and the Psychological Review - then, as now, immensely important and influential journals - a debate took place on the utility and desirability of one-tail versus two-tail tests (Burke, 1953; Hick, 1952; Jones, 1952, 1954; Marks, 1951, 1953). It had been stated that when an experimental hypothesis had a directional component - that is, not merely that a parameter $\mu_1$ differed significantly from a second parameter $\mu_2$, but that, for example, $\mu_1 > \mu_2$ - then the researcher was permitted to use the area cut off in only one tail of the probability distribution when the test of significance was applied. Referred to the normal distribution, this means that the critical value becomes 1.65 rather than 1.96. It was argued that because most assertions that appealed to theory were directional - for example, that spaced was better than massed practice in learning, or that extraverts conditioned poorly whereas introverts conditioned well - the actual statistical test should take into account these one-sided alternatives. Arguments against the use of one-tailed tests primarily centered on what the researcher does when a very large difference is obtained, but in the unexpected direction. The temptation to "cheat" is overwhelming! It was also argued that such data ought not to be treated with the same reaction as a zero difference on scientific grounds: "It is to be doubted whether experimental psychology, in its present state, can afford such lofty indifference toward experimental surprises" (Burke, 1953, p. 385).
Many workers were concerned that a move toward one-tail tests represented a loosening or a lowering of conventional standards, a sort of reprehensible breaking of the rules, and pious pronouncements about scientific conservatism abound. Predictably, the debate led to attempts to establish the rules for the use of one-tail tests (Kimmel, 1957).
What is found in these discussions is the implicit assumption that formally stated alternative hypotheses are an integral part of statistical analysis. What is not found in these discussions is any reference to the logic of using a probability distribution for the assessment of experimental data. Put baldly and simply, using a one-tail test means that the researcher is using only half the probability distribution, and it is inconceivable that this procedure would have been acceptable to any of the founding fathers. The debate is yet another example of the eagerness of practical researchers to codify the methods of data assessment so that statistical significance has the maximum opportunity to reveal itself, but in the presence of rules that discouraged, if not entirely eliminated, "fudging."
Significance levels, or p levels, are routinely accepted as part of the interpretation of statistical outcomes. The statistic that is obtained is examined with respect to a hypothesized distribution of the statistic, a distribution that can be completely specified. What is not so readily appreciated is the notion of an alternative model. This matter is examined in the final chapter. For now, a summary of the process of significance testing given in one of the weightier and more thoughtful texts might be helpful.
1. Specification of a hypothesized class of models and an alternative class of models.
2. Choice of a function of the observations T.
3. Evaluation of the significance level, i.e., $SL = P(T > t)$, where t is the observed value of T and where the probability is calculated for the hypothesized class of models.
In most applied writings the significance level is designated by P, a custom which has engendered a vast amount of confusion. It is quite common to refer to the hypothesized class of models as the null hypothesis and to the alternative class of models as the alternative hypothesis. We shall omit the adjective "null" because it may be misleading. (Kempthorne & Folks, 1971, pp. 314-315)
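The arithmetic behind the one-tail debate, and the evaluation of a significance level as the probability of a statistic at least as large as the one observed, can be sketched for the simplest case. The code below assumes that the test statistic is referred to the standard normal distribution and that the predicted direction is positive; the function names are hypothetical and are not taken from Kempthorne and Folks.

```python
from statistics import NormalDist

z = NormalDist()  # hypothesized distribution of the statistic (standard normal)

def critical_value(alpha=0.05, tails=2):
    """Deviation that cuts off alpha in one tail, or alpha/2 in each of two."""
    return z.inv_cdf(1.0 - alpha / tails)

def significance_level(t_observed, tails=2):
    """Tail area beyond the observed statistic under the hypothesized model.
    The one-tail version counts only the predicted (positive) direction."""
    if tails == 1:
        return 1.0 - z.cdf(t_observed)
    return 2.0 * (1.0 - z.cdf(abs(t_observed)))

print(f"one-tail critical value: {critical_value(tails=1):.3f}")
print(f"two-tail critical value: {critical_value(tails=2):.3f}")

# A result of 1.8 in the predicted direction passes one rule but not the other.
print(f"SL for 1.8, one tail: {significance_level(1.8, tails=1):.4f}")
print(f"SL for 1.8, two tails: {significance_level(1.8, tails=2):.4f}")

# A very large difference in the unexpected direction is simply 'not significant'.
print(f"SL for -3.0, one tail: {significance_level(-3.0, tails=1):.4f}")
```

The last line makes Burke's objection concrete: under a strict one-tail rule an outcome three standard errors in the wrong direction is treated in the same way as a zero difference.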
14 Treatments and Effects: The Rise of ANOVA
THE BEGINNINGS
In the last chapter we considered the development of analysis of variance, ANOVA. Here the incorporation of this statistical technique into psychological methodology is examined in a little more detail. The emergence of ANOVA as the most favored method of data appraisal in psychology from the late 1930s to the 1960s represents a most interesting mix of forces. The first was the continuing need for psychology to be "respectable" in the scientific sense and therefore to seek out mathematical methods. The correlational techniques that were the norm in the 1920s and early 1930s were not enough, nor did they lend themselves to situations where experimental variables might be manipulated. The second was the growth of the commentators and textbook writers who interpreted Fisher with a minimum need of mathematics. From a trickle to a flood, the recipe books poured out over a period of about 25 years and the flow continues unabated. And the third is the emergence of explanations of the models of ANOVA using the concept of expected mean squares. These explanations, which did not avoid mathematics, but which were at a level that required little more than high school math, opened the way for more sophisticated procedures to be both understood and applied.
THE EXPERIMENTAL TEXTS
It is clear, however, that the enthusiasm that was mounting for the new methods is not reflected in the textbooks of experimental psychology that were published in those years. Considering four of the books that were classics in their own time shows that their authors largely avoided or ignored the impact of Fisher's statistics. Osgood's Method and Theory in Experimental Psychology (1953) mentions neither Fisher nor ANOVA, neither Pearson nor correlation. Woodworth and Schlosberg's Experimental Psychology (1954) ignores Pearson by name, and the authors state, "To go thoroughly into correlational analysis lies beyond the scope of this book" (p. 39). The writers, however, show some recognition of the changing times, admitting that the "old standard 'rule of one variable'" (p. 2) does not mean that no more than one factor may be varied. The experimental design must allow, however, for the effects of single variables to be assessed and also the effect of possible interactions. Fisher's Design of Experiments is then cited, and Underwood's (1949) book is referenced and recommended as a source of simple experimental designs. And that is it! Hilgard's chapter in Stevens' Handbook of Experimental Psychology (1951) not only gives a brief description of ANOVA in the context of the logic of the control group and factorial design but also comments on the utility of matched groups designs, but other commentaries, explanations, and discussions of ANOVA methods do not appear in the book. Stevens himself recognizes the utility of correlational techniques but admits that his colleague Frederick Mosteller had to convince him that he had been overly conservative in claiming that "rank order correlation does not apply to ordinal scales because the derivation of the formula for this correlation involves the assumption that the differences between successive ranks are equal" (p. 27). An astonishing claim.
Underwood's (1949) book is a little more encouraging, in that he does recognize the importance of statistics. His preface clearly states how important it is to deal with both method and content. His own teaching of experimental psychology required that statistics be an integral part of it: "I believe that the factual subject matter can be comprehended readily without a statistical knowledge, but a full appreciation of experimental design problems requires some statistical thinking" (p. v).
But, in general,it is fair to saythat many psychological researchers were not in tune with the statistical methods that were appearing."Statistics"seemsto havebeen seenas anecessaryevil! Indeed, thereis more thana hint of the same mindset in today'stexts.An informal surveyof introductory textbooks published in the last 10 years showsa depressingly high incidence of statistics being relegated to an appendix and of sometimesshrill claims that theycan be understood withoutrecourseto mathematics.In using statistics such as the t ratio and more comprehensive analyses such as ANOVA, the necessityof randomizationis always emphasized. Researchers in the social and educational areasof psychology realized that such a requirementwhenit cameto assigning participantsto "treatments"wasjust not possible. Levelsof ability, socioeconomic groupings, age, sex, and so oncannot be directly manipulated. When methodologists such as Campbelland Stanley (1963), authorities in the classification and comparison of experimental designs, showed that meaningful
analyses could be achieved through the use of "found groups" and what is often called quasi-experimental designs, the potential of ANOVA techniques widened considerably. The argument was, and is, advanced that unless the experimenter can control for all relevant variables, alternative explanations for the results other than the influence of the independent variable can be found - the so-called correlated biases. The skeptic might argue that, in addition to the fact that statistical appraisals are by definition probabilistic and therefore uncertain, direct manipulation of the independent variable does not assuredly guard against mistaken attribution of its effects.
And, very importantly, the commentaries, such as they were, on experimental methods have to be seen in the light of the dominant force, in American psychology at least, of behaviorism. For example, Brennan's textbook History and Systems of Psychology (1994) devotes two chapters and more than 40 pages out of about 100 pages to Twentieth-Century Systems (omitting Eastern Traditions, Contemporary Trends and the Third Force Movement). This is not to be taken as a criticism of Brennan: indeed, his book has run to several editions and is an excellent and readable text.1 The point is that from Watson's (1913) paper until well on into the 1960s, experimental psychology was, for many, the experimental analysis of behavior - latterly within a Skinnerian framework. Sidman's (1960) book Tactics of Scientific Research: Evaluating Experimental Data in Psychology is most certainly not about statistical analysis!
Science is presumably dedicated to stamping out ignorance, but statistical evaluation of data against a baseline whose characteristics are determined by unknown variables constitutes a passive acceptance of ignorance. This is a curious negation of the professed aims of science. More consistent with those aims is the evaluation of data by means of experimental control. (p. 45)
In general, the experimental psychologists of this ilk eschewed statistical approaches apart from descriptive means, frequency counts, and straightforward assessments of variability.
1 Perhaps one might be permitted to ask Brennan to include a chapter on the history and influence of statistics!
THE JOURNALS AND THE PAPERS
Rucci and Tweney (1980) have carried out a comprehensive analysis of the use of ANOVA from 1925 to 1950, concentrating mainly, but not exclusively, on American publications. They identify the earliest research to use ANOVA as a paper by Reitz (1934). The author checked for homogeneity of variance, gave the method for computing z, and noted its relationship with η². An early paper
that uses a factorial design is that of Baxter (1940). The author, at the University of Minnesota, acknowledges the statistical help given to him by Palmer Johnson, and he also credits Crutchfield (1938) with the first application of a factorial design to a psychological study. Baxter states that his aim is "an attempt to show how factorial design, as discussed by Fisher and more specifically by Yates . . . , has been applied to a study of reaction time" (p. 494). The proposed study presented for illustration deals with reaction time and examined three factors: the hand used by the participant, the sensory modality (auditory or visual), and discrimination (a single stimulus, two stimuli, three stimuli). Baxter explains how the treatment combinations could be arranged and shows how partial confounding (which results in some possible interactions becoming untestable) is used to reduce subject fatigue. The paper is altogether a remarkably clear account of the structure of a factorial design. And Baxter followed through by reporting (1942) on the outcome of a study which used this design.
Rucci and Tweney examined 6,457 papers in six American psychological journals. They find that from 1935 to 1952 there is a steady rise in the use of ANOVA, which is paralleled by a rise in the use of the t ratio. The rise became a fall during the war years, and the rise became steeper after the war. It is suggested that the younger researchers, who would have become acquainted with the new techniques in their pre-war graduate school training, would have been eligible for military service and the "old-timers" who were left used older techniques, such as the critical ratio. It is not the case that the use of correlational procedures diminished. Rucci and Tweney's analysis indicates that the percentage of articles using such methods remained fairly steady throughout the period examined. Their conclusion is that ANOVA filled the void in experimental psychology but did not displace Cronbach's (see chapter 2, p. 35) other discipline of scientific psychology. Nevertheless, the use of ANOVA was not established until quite late and only surpassed its pre-war use in 1950. Overall, Rucci and Tweney's analysis leads, as they themselves state, to the view that "ANOVA was incorporated into psychology in logical and orderly steps" (p. 179), and their introduction avers that "It took less than 15 years for psychology to incorporate ANOVA" (p. 166). They do not claim - indeed they specifically deny - that the introduction of these techniques constitutes a paradigm shift in the Kuhnian sense.
An examination of the use of ANOVA from 1934 to 1945 by a British historian of science, Lovie (1979), gives a somewhat different view from that offered by Rucci and Tweney, who, unfortunately, do not cite this work, for it is an altogether more insightful appraisal, which comes to somewhat different conclusions:
The work demonstrates that incorporating the technique [ANOVA] into psychology
was a long and painful process. This was due more to the substantive implications of the method than to the purely practical matters of increased arithmetical labour and novelty of language. (p. 151)
Lovie also shows that ANOVA can be regarded, as he puts it, as "one of the midwives of contemporary experimental psychology" (p. 152). A decidedly non-Kuhnian characterization, but close enough to an indication of a paradigm shift! Three case studies described by Lovie show the shift, over a period from 1926 to 1932, from informal, as it were, "let-me-show-you" analysis through a mixture of informal and statistical analysis to a wholly statistical analysis. It was the design of the experiment that occupied the experimenter as far as method was concerned, and the meaning of the results could be considered to be "a scientifically acceptable form of consensual bargaining between the author and the reader on the basis of non-statistical common-sense statements about the data" (p. 155). When ANOVA was adopted, the old conceptualizations of the interpretations of data died hard, and the results brought forth by the method were used selectively and as additional support for outcomes that the old verbal methods might well have uncovered. Lovie's valuable account and his examination of the early papers provide detailed support for the reasons why many of the first applications of ANOVA were trivial and why more enlightened and effective uses were slow to emerge in any quantity.
A series of papers by Carrington (1934, 1936, 1937) in the Proceedings of the Society for Psychical Research report on results given by one-way ANOVA. The papers deal with the quantitative analysis of trance states and show that the pioneers in systematic research in parapsychology were among the earliest users of the new methods. Indeed, Beloff (1993) makes this interesting point:
Statistics, after all, have always been a more critical aspect of experimental parapsychology than they have been for psychophysics or experimental psychology precisely because the results were more likely to be challenged. There is, indeed, some evidence that parapsychology acted as a spur to statisticians in developing their own discipline and in elaborating the concept of randomness. (p. 126)
It is worth noting that some of the most well-known of the early rigorous experimental psychologists, for example, Gardner Murphy and William McDougall, were interested in the paranormal, and it was the latter who, leaving Harvard for Duke University, recruited J. B. Rhine, who helped to found, and later directed, that institution's parapsychology laboratory.
Of course, there were several psychological and educational researchers who vigorously supported the "new" methods. Garrett and Zubin (1943) observe that ANOVA had not been used widely in psychological research. These authors
make the point that:
Even as recently as 1940, the author of a textbook [Lindquist] in which Fisher's methods are applied to problems in educational research, found it necessary to use artificial data in several instances because of the lack of experimental material in the field. (p. 233)
These authors also offer the opinion that the belief that the methods deal with small samples had influenced their acceptance - a belief, nay, a fact, that has led others to point to it as a reason why they were welcomed! They also worried about "the language of the farm," "soil fertility, weights of pigs, effectiveness of manurial treatments and the like" (p. 233). Hearnshaw, the eminent historian of psychology, also mentions the same difficulty as a reason for his view that "Fisher's methods were slow to percolate into psychology both in this country [Britain] and in America" (1964, p. 226). Although it is the case that agricultural examples are cited in Fisher's books, and phrases that include the words "blocks," "plots," and "split-plots" still seem a little odd to psychological researchers, it is not true to say that the world of agriculture permeates all Fisher's discussions and explanations, as Lovie (1979) has pointed out. Indeed, the early chapter in The Design of Experiments on the mathematics of a lady tasting tea is clearly the story of an experiment in perceptual ability, which is about as psychological as one could get.
It must be mentioned that Grant (1944) produced a detailed criticism of Garrett and Zubin's paper, in particular taking them to task for citing examples of reports where ANOVA is wrongly used or inappropriately applied. From the standpoint of the basis of the method, Grant states that in the Garrett and Zubin piece:
The impression is given . . . that the primary purpose of analysis of variance is to divide the total variance into two or more components (the mean squares) which are to be interpreted directly as the respective contributions to the total variance made by the experimental variables and experimental error. (pp. 158-159)
He goes on to say that the purpose of ANOVA is to test the significance of variation and, while agreeing that it is possible to estimate the proportional contribution of the various components, "the process of estimation must be clearly differentiated from the test of significance" (p. 159). These are valid criticisms, but it must be said that such confusion as some commentaries may bring reflects Fisher's (1934) own assertion, in remarks on a paper given by Wishart (1934) at a meeting of the Royal Statistical Society, that ANOVA "is not a mathematical theorem, but rather a convenient method of arranging the arithmetic" (p. 52). In addition the estimation of treatment effects is considered
by many researchers to be critical (and see chapter 7, p. 83). What is most telling about Grant's remarks, made more than 50 years ago, is that they illustrate a criticism of data appraisal using statistics that persists to this day, that is, the notion that the primary aim of the exercise is to obtain significant results and that reporting effect size is not a routine procedure. Also of interest is Grant's inclusion of expressions for the expected values of the mean squares, of which more later.
A detailed appraisal of the statistical content of the leading British journal has not been carried out, but a review of that publication over 10-year periods shows the rise in the use of statistical techniques of increasing variety and sophistication. The progression is not dissimilar from that observed by Rucci and Tweney. The first issue of the British Journal of Psychology appeared in 1904 under the distinguished co-editorships of James Ward (1843-1925) and W. H. R. Rivers (1864-1922). Some papers in that volume made use of such statistics as were generally available: averages, medians, mean variation, standard deviation, and the coefficient of variation. Statistics were there at the start. By 1929-1930, correlational techniques appeared fairly regularly in the journal. Volume 30, in 1940, contained 29 papers and included the use of χ², the mean, the probable error, and factor analysis. There were no reports that used ANOVA, nor did any of the 14 papers published in the 1950 volume (Vol. 41). But χ² and the t ratio still found a place. Fitt and Rogers (1950), the authors of the paper that used the t ratio for the difference between the means, felt it necessary to include the formula and cited Lindquist (1940) as the source. The researchers of the 1950s who were beginning to use ANOVA followed the appearance of a significant F value with post hoc multiple t tests. The use of multiple comparison procedures (chapter 13, p. 195) had not yet arrived on the data analysis scene.
Six of the 36 papers published in 1960 included ANOVA, one of which was an ANOVA by ranks. Kendall's Tau, the t ratio, correlation, partial correlation, factor analysis, the Mann-Whitney U test - all were used. Statistics had indeed arrived and ANOVA was in the forefront. A third of the papers (17 out of 52, to be more precise!) in the 1970 issue made use of ANOVA in the appraisal of the data. Also present were Pearson correlation coefficients, the Wilcoxon T, Tau, Spearman's rank difference correlation, the Mann-Whitney U, the phi coefficient, and the t ratio.
Educational researchers figured largely in the early papers that used ANOVA, both as commentators on the method and in applying it to their own research. Of course, many in the field were quite familiar with the correlational techniques that had led to factor analysis and its controversies and mathematical complexities. There had been more than 40 years of study and discussion of skill and ability testing and the nature of intelligence - the raison d'être of factor analysis - and this was central to the work of those in the field. They were ready,
even eager, to come to grips with methods that promised an effective way to use the multi-factor approach in experiments, an approach that offered the opportunity of testing competing theories more systematically than before. And it was from the educational researcher's standpoint that Stanley (1966) offered an appraisal of the influence of Fisher's work "thirty years later." He makes the interesting point that although there were experimenters with "considerable 'feel' for designing experiments, . . . perhaps some of them did not have the technical sophistication to do the corresponding analyses." This judgment is based on the fact that 5 of the 21 volunteered papers submitted to the Division of Educational Psychology of the American Psychological Association annual convention in 1965 had to be returned to have more complex analysis carried out. This may have been due to lack of knowledge, it may have been due to uncertainty about the applicability of the techniques, but it also reflects perhaps the lingering appeal of straightforward and even subjective analysis that the researchers of 20 years before favored, as we noted earlier in the discussion of Lovie's appraisal. It is also Stanley's view that "the impact of the Fisherian revolution in the design and analysis of experiments came slowly to psychology" (p. 224).
It surely is a matter of opinion as to whether Rucci and Tweney's count of less than 5 percent of papers in their chosen journals in 1940 rising to between 15 and 20 percent in 1952 is slow growth, particularly when it is observed that the growth of the use of ANOVA varied considerably across the journals.2 This leads to the possibility that the editorial policy and practice of their chosen journals could have influenced Rucci and Tweney's count.3 It is also surely the case that if a new technique is worth anything at all, then the growth of its use will show as a positively accelerated growth curve, because as more people take it up, journal editors have a wider and deeper pool of potential referees who can give knowledgeable opinions. Rucci and Tweney's remark, quoted earlier, on the length of time it took for ANOVA to become part of experimental psychology, has a tone that suggests that they do not believe that the acceptance can be regarded as "slow."
2 Stanley observes that the Journal of General Psychology was late to include ANOVA and it did not appear at all in the 1948 volume.
3 The author, many years ago, had a paper sent back with a quite kind note indicating that the journal generally did not publish "correlational studies." Younger researchers soon learn which journals are likely to be receptive to their approaches!
THE STATISTICAL TEXTS
The first textbooks that were written for behavioral scientists began to appear in the late 1940s and early 1950s, although it must be noted that the much-cited
text by Lindquist was published in 1940. The impact of the first edition of this book may have been lessened, however, by a great many errors, both typographic and computational. Quinn McNemar (1940b), in his review, observes that the volume, "in the opinion of the reviewer and student readers, suffers from an overdose of wordiness" (p. 746). He also notes that there are several major slips, although acknowledging the difficulty in explaining several of the topics covered, and declares that the book "should be particularly useful to all who are interested in obtaining non-mathematical knowledge of the variance technique" (p. 748).
Snedecor, an agricultural statistician, published a much-used text in 1937, and he is often given credit for making Fisher comprehensible (see chapter 13, p. 194). Rucci and Tweney place George Snedecor of Iowa State College, Harold Hotelling, an economist at Columbia, and Palmer Johnson of the University of Minnesota, all of whom had spent time with Fisher himself, as the founders of statistical training in the United States. But any informal survey of psychologists who were undergraduates in the 1950s and early 1960s would likely reveal that a large number of them were reared on the texts of Edwards (1950) of the University of Washington, Guilford (1950) of the University of Southern California, or McNemar (1949) of Stanford University. Guilford had published a text in 1942, but it was the second edition of 1950 and the third of 1956 that became very popular in undergraduate teaching. In Canada a somewhat later popular text was Ferguson's (1959) Statistical Analysis in Psychology and Education, and in the United Kingdom, Yule's An Introduction to the Theory of Statistics, first published in 1911, had a 14th edition published (with Kendall) in 1950 that includes ANOVA.
EXPECTED MEAN SQUARES
In his well-known book, Scheffé (1959) says:
The origin of the random effects models, like that of the fixed effects models, lies in astronomical problems; statisticians re-invented random-effects models long after they were introduced by astronomers and then developed more complicated ones. (p. 221)
Scheffé cites his own 1956(a) paper, which gives some historical background to the notion of expected mean squares - E(MS). The estimation of variance components using E(MS) was taken up by Daniels (1939) and later by Crump (1946, 1951). Eisenhart's (1947) influential paper on the assumptions underlying ANOVA models also discusses them in this context. Anderson and Bancroft (1952) published one of the earliest texts that examines mathematical expectation - the expected value of a random variable "over the long run" - and early
in this work they state the rules used to operate with expected values. The concept is easily understood in the context of a lottery. Suppose you buy a $1.00 ticket for a draw in which 500 tickets are to be sold. The first and only prize is $400.00. Your "chances" of winning are 1 in 500 and of losing 499 in 500. If Y is the random variable,
When Y = -$1.00, the probability of Y, i.e., p(Y) = 499/500.
When Y = $399.00 (you won!), the probability p(Y) = 1/500.
The expectation of gain over the long run is:
E(Y) = ΣYp(Y) = (-1)(499/500) + (399)(1/500) = -0.998 + 0.798 = -0.20
If you joined in the draw repeatedly you would "in the long run" lose 20 cents per draw. In ANOVA we are concerned with the expected values of the various components - the mean squares - over the long run.
Although largely ignored by the elementary texts, the slightly more advanced approaches do treat their discussions and explanations of ANOVA's model from the standpoint of E(MS). Cornfield and Tukey (1956) examine the statistical basis of E(MS), but Gaito (1960) claims that "there has been no systematic presentation of this approach in a psychological text or journal which will reach most psychologists" (p. 3), and his paper aims to rectify this. The approach, which Gaito rightly refers to as "illuminating," offers some mathematical insight into the basis of ANOVA and frees researchers from the "recipe book" and intuitive methods that are often used. Eisenhart makes the point that these methods have been worthwhile and that readers of the books for the "non-mathematical" researcher have achieved sound and quite complex analyses in using them. But these early commentators saw the need to go further. Gaito's short paper offers a clear statement of the general E(MS) model, and his contention that this approach brings psychology a tool for tackling many statistical problems has been borne out by the fact that now all the well-known intermediate texts that deal with ANOVA models use it.
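Returning for a moment to the lottery illustration of expectation, the arithmetic can be checked in a few lines of Python. This is only an illustrative sketch: the names outcomes and expected_value are invented here and do not come from any of the texts discussed, and exact fractions are used simply to avoid rounding.

```python
from fractions import Fraction

# Net gain in dollars and its probability for the draw described in the text:
# a $1.00 ticket, 500 tickets sold, a single prize of $400.00.
# (Hypothetical names, chosen only for this illustration.)
outcomes = [
    (Fraction(-1), Fraction(499, 500)),   # lose the price of the ticket
    (Fraction(399), Fraction(1, 500)),    # win $400.00 less the $1.00 ticket
]


def expected_value(distribution):
    """Return the sum of value * probability over a discrete distribution."""
    return sum(value * prob for value, prob in distribution)


# The probabilities of a proper distribution must sum to 1.
assert sum(prob for _, prob in outcomes) == 1

e_y = expected_value(outcomes)
print(float(e_y))  # -0.2, i.e. an average loss of 20 cents per draw
```

The same weighted-sum idea, applied to mean squares rather than to dollar gains, is what the E(MS) approach rests on.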
This is not the place to detail the algebra of E(MS) but here is the statement of the general case for a two-factor experiment:
In the fixed effects model - often referred to as Model I - inferences can be made only about the treatments that have been applied. In the random effects model the researcher may make inferences not only about the treatments actually applied but about the range of possible levels. The treatments applied are a random sample of the range of possible treatments that could have been applied. Then inferences may be made about the effects of all levels from the sample of factor levels. This model is often referred to as Model II. In the case where there are two factors A and B there are A potential levels of A but only a levels are included in the experiment. When a = A then factor A is a fixed factor. If the a levels included are a random sample of the potential A levels then factor A is a random factor. Similarly for factor B there are B potential levels of the factor and b levels are used in the experiment. When a = A then a/A = 1, and when B is a random effect, the levels sample size b is usually very, very much smaller than B and b/B becomes vanishingly small, as does n/N where n is sample size and N is population size. Obtaining the variance components allows us to see which components are included in the model, be the factor fixed or random, and to generate the appropriate F ratios to test the effects.
For fixed effects: E(MS_A) = σ_e² + nbσ_α², E(MS_AB) = σ_e² + nσ_αβ², and the error term is σ_e².
For random effects: E(MS_A) =